Thursday, 23 October 2025

Playing with a simple revisitor script for monitoring changes to website content

Introduction

A key part of the analysis workflow in brand monitoring services is often the maintenance of a 'watchlist' of sites. This requirement arises most frequently in services comprising domain monitoring, which detect newly-registered names that contain a brand name of interest but may not yet feature significant or infringing content.

In these cases, enforcement action may not immediately be possible or appropriate, but there might be a concern that higher-threat content may appear in the future. There is therefore often a need to monitor the domains for changes to their content and provide an alert when a significant change is identified. At that point, a decision can be made regarding appropriate follow-up action. Requirements for 'revisitor' functionality along these lines can also arise in other brand-protection contexts, such as when enforcement action has already been taken against an infringing target (such as a website or marketplace listing), and the targeted page is then tracked to verify compliance with the takedown action.

A number of automated tools track content in this way, but key components of a highly effective version include the ability to analyse an appropriate set of characteristics of the websites in question, and options to set the sensitivity appropriately - it is not generally desirable (for example) for an alert to be generated every time any change to website content is identified, since many websites incorporate dynamic features which differ every time the webpage is called. Conversely, a change which is only small, or of a particular type (e.g. the appearance of an explicit brand reference), can sometimes be significant.

In this article, I briefly explore the development and use of a Python-based revisitor script to inspect, and subsequently re-review, a set of domain names of potential interest (using data from a domain monitoring service for a retail brand as a case study). Having a simple, easily deployed script of this nature can be advantageous: it is quick and efficient to roll out, and fully customisable regarding the specific website characteristics analysed and the sensitivity thresholds used. Tools of this type can be highly useful for watchlists featuring many hundreds or thousands of URLs to be reviewed, and can, of course, also be expanded to cover other website features and more complex types of site analysis.

Script specifics

The workflow is built around a 'site visitor' script, which inspects each of the domains in the watchlist and extracts the following features (which are 'dumped' to a readable database file); a minimal code sketch follows the list:

  • HTTP status[1] - a numerical code corresponding to the type of response received when the domain name is queried; a code of '200' indicates a live website response (i.e. potentially an active webpage)
  • Page title[2] (as defined in the HTML source code of the page)
  • Full webpage content[3] (all text, plus formatting features and other content such as embedded scripts - i.e. the full HTML content)
  • Presence / absence of each of a set of pre-defined keywords[4] - applicable keywords for analysis might typically include brand terms or other relevance keywords (e.g. for a retail brand, terms indicating that e-commerce content is present ('buy', 'shop', 'cart', etc.))
  • Final URL[5] - i.e. the destination URL (e.g. after following any site re-direct)
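
By way of illustration, the following is a minimal sketch of how a site visitor of this type might be implemented in Python, using the library calls indicated in references [1]-[5]. The keyword list and output filename are illustrative assumptions, and error handling is omitted for brevity:

    import json
    import re
    import urllib.request

    from bs4 import BeautifulSoup

    KEYWORDS = ["buy", "shop", "cart"]  # illustrative relevance keywords

    def inspect_site(domain):
        """Fetch a domain and extract the features tracked by the revisitor."""
        response = urllib.request.urlopen("http://" + domain, timeout=30)
        html = response.read().decode("utf-8", errors="replace")
        soup = BeautifulSoup(html, "html.parser")
        return {
            "status": response.status,                       # HTTP status code [1]
            "title": soup.title.text if soup.title else "",  # page title [2]
            "content": html,                                 # full HTML content [3]
            "keywords": {k: bool(re.search(re.escape(k), html, re.I))
                         for k in KEYWORDS},                 # keyword presence [4]
            "final_url": response.url,                       # final (post-re-direct) URL [5]
        }

    # 'Dump' the extracted features to a readable database file
    snapshot = {d: inspect_site(d) for d in ["example.com"]}
    with open("watchlist_snapshot.json", "w") as f:
        json.dump(snapshot, f, indent=2)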

The core function of the revisitor is then to inspect the same list of sites at subsequent times, as required (or on a regular basis, if configured to run accordingly), extract the same set of features, and compare these with the corresponding features from the same site from the previous round of analysis (as read from the database file). In an initial simple implementation of the script, the following are deemed to be significant changes (i.e. denoting that the site is now worthy of further (manual) inspection and consideration for follow-up action), as implemented in the sketch after the list:

  • A change to an HTTP status of 200 (i.e. the appearance of a live website response)[6]
  • Any change to the page title
  • Any(*) change to the webpage content
  • Any instance of the appearance of a keyword of interest (where not previously present)
  • Any change to the final URL (e.g. the appearance or disappearance of a re-direct)
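
As a minimal sketch of this comparison logic (assuming snapshot dictionaries of the form produced by the inspect_site() example above):

    def significant_changes(old, new):
        """Return human-readable reasons why a site now warrants manual review."""
        reasons = []
        if new["status"] == 200 and old["status"] != 200:
            reasons.append("appearance of a live (HTTP 200) website response")
        if new["title"] != old["title"]:
            reasons.append("change to the page title")
        if new["content"] != old["content"]:
            reasons.append("change to the webpage content")
        appeared = [k for k, present in new["keywords"].items()
                    if present and not old["keywords"].get(k)]
        if appeared:
            reasons.append("appearance of keyword(s): " + ", ".join(appeared))
        if new["final_url"] != old["final_url"]:
            reasons.append("change to the final URL (re-direct added or removed)")
        return reasons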

Of course, none of these changes guarantees that the website is now definitively of concern or infringing, but they do generate a 'shortlist' of sites generally then requiring manual review for a definitive determination of appropriate next steps (much more efficient than having to review the whole watchlist manually on a regular basis).

Considering content-change thresholds

As discussed above, one of the trickiest features is the determination of an appropriate 'threshold' for alerting to changes in webpage content. The simplest configuration is to trigger a notification for any change(*), but in some cases this option may prove too 'sensitive' and generate too many candidate sites for convenient further manual review (depending on the size of the watchlist and the interval between successive inspections).

As a further exploration, it is instructive to investigate a numerical basis for quantifying degrees of webpage change, and what these differing degrees 'look like' in practice. There are a number of potential algorithms for quantifying the degree of difference between two passages of text (as discussed, for example, in previous work on mark comparison[7]); however, the simple script discussed in this article employs the Python library module difflib.SequenceMatcher[8], applied to the full HTML of the page (split on spaces into individual 'words'), to calculate a difference score. This score is based on the ratio of the number of matching elements (i.e. words in common) between the two versions of the page to the total number of elements (words). The script is also configured to provide a more granular view of the exact nature of the change: a summary of which elements (i.e. words in the HTML) have been removed from the page between the two successive inspections, and which have been added (Figure 1; a code sketch follows the figure).

Figure 1: Examples of identified content changes for specific individual webpages between successive inspections:

  • a) a change to a single dynamically generated string (in this case, JavaScript elements)
  • b) a change from showing an error message to featuring distinct (JavaScript) content
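
For reference, a minimal sketch of the difference calculation described above, including the granular summary of added and removed words (using difflib.SequenceMatcher applied to the space-split HTML):

    import difflib

    def content_diff(old_html, new_html):
        """Return a similarity ratio plus the words added to / removed from the page."""
        old_words, new_words = old_html.split(), new_html.split()
        matcher = difflib.SequenceMatcher(None, old_words, new_words)
        added, removed = [], []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag in ("delete", "replace"):
                removed.extend(old_words[i1:i2])
            if tag in ("insert", "replace"):
                added.extend(new_words[j1:j2])
        # ratio() is the proportion of matching elements; 1.0 means identical pages
        return matcher.ratio(), added, removed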

Discussion and Conclusions

The examples in Figure 1 provide some initial illustration that the nature of an identified change is potentially much more important in any determination of significance than (for example) a numerical quantification of the extent of the change (as a proportion of the website as a whole). The first example (a) - a change to a dynamically generated string - is potentially something which might be seen on every occasion the site is inspected, and might not correspond to any material change to the page (the visible site content may be entirely unaffected, for example). Conversely, the second example (b) - representing a change from a simple error message (which, in this case, comprised essentially the entire content of the website) to the appearance of some sort of live, script-generated content (potentially wholly different website content) - might be much more significant.

However, these differences may not be apparent from an inspection of just the numerical 'size' of the change on the page (i.e. the 'difference score'); a variation in a piece of scripted content (such as in Figure 1a) might, for example, pertain to just a small element on a much larger page, or could constitute the dominant component of the webpage as a whole. In a sample dataset, single changes similar to that shown in Figure 1a were found to be equivalent to anywhere from less than 5% to more than 50% of the whole content of the website in question.

For these reasons, there is always some danger in specifying a fixed threshold below which changes to the page are disregarded. In some senses, it is safer to conduct a more detailed inspection of all pages which show any change in content between successive revisits, so as to avoid missing significant cases. However, depending on the number of sites under review, this may not be feasible. Accordingly, in future or more sophisticated versions of the script, it may be appropriate to refine the scoring algorithm to reflect the nature and/or content of any change, as illustrated below.
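
As one illustrative possibility (an assumption about a future direction, rather than a feature of the current script), the score could weight each changed word according to its apparent nature, so that the appearance or disappearance of a keyword of interest counts far more heavily than churn in dynamically generated script tokens:

    import re

    # Long unbroken tokens are typical of dynamically generated hashes/nonces
    SCRIPT_NOISE = re.compile(r"[A-Za-z0-9+/=_-]{20,}")
    KEYWORDS = {"buy", "shop", "cart"}  # illustrative relevance keywords

    def weighted_change_score(added, removed):
        """Score a content change by the nature of the words added/removed."""
        score = 0.0
        for word in added + removed:
            token = word.lower().strip(".,;:'\"()<>")
            if token in KEYWORDS:
                score += 10.0   # keyword appearance/disappearance: highly significant
            elif SCRIPT_NOISE.fullmatch(token):
                score += 0.1    # likely dynamic script content: near-negligible
            else:
                score += 1.0    # ordinary content change
        return score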

Regardless of the specifics, however, the approach discussed in this article can build considerable efficiency into the review process for sites of possible future concern, potentially filtering large numbers of sites down to much smaller 'shortlists' of candidates identified for deeper inspection and analysis on any given occasion.

References

[1] Using Python library module: urllib.request.urlopen([URL]).status

[2] Using Python library module: bs4.BeautifulSoup([page content], 'html.parser').title.text

[3] Using Python library module: urllib.request.urlopen([URL]).read()

[4] Using Regex matching (Python library module: re.search) as applied to the full webpage (HTML) content

[5] Using Python library module: urllib.request.urlopen([URL]).url

[6] However, care must also be taken to distinguish a 'real' change in site status from an 'apparent' change, which can arise in instances where (for example) the connection speed to the site is slow, and a connectivity time-out may be mistaken for a real case of site inactivity.

[7] https://www.linkedin.com/posts/dnbarnett2001_measuring-the-similarity-of-marks-activity-7331669662260224000-rh-R/

[8] https://www.geeksforgeeks.org/python/compare-sequences-in-python-using-dfflib-module/

This article was first published on 23 October 2025 at:

https://www.iamstobbs.com/insights/playing-with-a-simple-revisitor-script-for-monitoring-changes-to-website-content




Friday, 10 October 2025

How the growth of AI may drive a fundamental step-change in the domain name landscape

by David Barnett and Lars Jensen (ShortDot)

Introduction

The rate of adoption of artificial intelligence (AI) systems over the last few years, particularly in online and technology-related contexts, has been striking. Automated web-based queries now account for over half of all traffic (51% as of 2024)[1], and nearly three-quarters (74%) of webpages now include some AI-generated content[2]. Overall, traffic generated by AI technologies saw growth of over 500% in the five months to May 2025[3], and a 2025 study of 3,000 websites found that 63% of them already receive traffic from AI-generated referrals[4]. Looking forward, it is predicted that, by 2028, AI-powered search and recommendation engines will drive more web traffic than traditional search[5].

Looking more generally at the landscape, it is estimated by Gartner and other sources that, by 2026 or 2028, 20% of online transactions will be carried out by AI agents[6,7,8,9]. Furthermore, by the end of 2026, 40% of enterprise applications may be integrated with task-specific AI agents, potentially generating 30% of enterprise application software revenue by 2035[10]. Additionally, by 2030, there may be in the region of 500 billion to 1 trillion connected devices comprising the wider ecosystem of the 'Internet of Things' (IoT)[11,12,13], and (in the absence of mediating factors[14]) this will almost inevitably result in an enormous growth in the proportion of DNS traffic categorised as 'machine-to-machine' communication.

It is likely that a significant proportion of these connected entities will require unique DNS identifiers, and many industry commentators are increasingly of the opinion that there will be a desire for many - particularly agentic AI systems - to be associated with unique domain names[15]. These names could serve as a 'birth certificate' or 'trusted identity' for the systems in question, helping to establish user confidence and familiarity. Any evolution along these lines would have an enormous impact on the overall size of the domain landscape (currently around 350 million names), and it may not be unreasonable to suggest that, by 2050, there may be of the order of 10 to 50 billion registered domains. This propounded evolution of the landscape echoes previous studies suggesting that, in the future, the growth of agentic AI will demand a new layer of verifiable identity infrastructure[16] and that it may be desirable for each distinct AI agent to be tied to an 'immutable root' (i.e. identifier)[17]. This trend would be in some ways analogous to the transition from the IPv4 to the IPv6 system for allocating IP addresses, which created a step-change in capacity from 2^32 (around 4 billion) to 2^128 (around 3 × 10^38) possible combinations.

Of course, the shape of the AI-related domain name landscape is already changing. Numbers of .ai domains (for example) have massively spiked since the launch of ChatGPT (notably also driving a fundamental boost to the revenues of parent country Anguilla)[18]. Across the full domain name landscape more generally, there are many tens of thousands of examples featuring keywords pertaining to popular and emerging technologies ('ai', 'crypto', etc.), and this demand is only likely to grow. Such trends may emerge in parallel with the forthcoming second phase of the new-gTLD (generic top-level domain) programme, which might see a push towards the availability of much larger numbers of new brand-, industry- or technology-specific domain-name extensions. Other possible evolutions in business behaviour - such as a possible move towards technology entrepreneurs taking advantage of greater opportunities for AI use and automation, so as to establish and run much larger numbers of businesses - may also drive increased demand in the domain-name landscape.

These comments must also be considered against the backdrop of the fact that the current domain landscape is already - in some regards - beginning to run low on capacity. Whilst the total proportion of all possible domain names which are actually registered is still extremely small, there is a relative shortage of short, memorable domain names (particularly those comprising dictionary terms) across popular domain name extensions (TLDs). For example, there are currently essentially no .com domains of 4 characters or fewer available for registration, and very few (short) dictionary terms[19]. These observations are already generating a push towards the use of alternative domain name styles and emerging TLDs, in addition to distinct channels altogether (such as blockchain domains and the Web3 environment)[20].

In terms of the overall landscape of web addresses associated with (agentic) AI systems specifically, what might these trends look like? Two possible directions for development include: (a) the emergence and growth of dedicated domain names for specific AI agents (potentially of the form, for example, [role]AI.[TLD]), with the name signifying the function of the system in question; or (b) the increasing use of AI-specific subdomains (say, AI.[site].[TLD]) within the trusted webspaces (i.e. hosted on the primary domain names) of popular companies, to host agentic systems or other AI functionality. Companies are likely to continue predominantly using popular legacy TLDs such as .com for the foreseeable future but - as part of these evolving trends - may start to branch out into other existing TLDs, or into new extensions emerging from phase two of the new-gTLD programme. Exactly which extensions succeed will ultimately depend on issues of usability and trust (rather than necessarily just the presence of an AI-specific label).

Case studies - the current landscape

As illustrations of the current state of the landscape pertaining to the two specific possibilities discussed above, we consider two datasets, as outlined below.

1. Agentic-AI-style domain names

For this analysis, we consider a list of 100 keywords relating to professions or industry areas (with a specific focus, where possible, on examples where AI applications may be relevant). For each of these, we consider whether a domain name consisting of the keyword, either prefixed or suffixed by the string 'ai', is registered, across each of the top-50 largest existing gTLDs (by size of the domain name zone file, i.e. the data file containing the names and configuration information of all registered domains). Therefore, for 'accountant' (for example), on .com, the analysis looks to determine whether accountantai[.]com or aiaccountant[.]com are registered as domain names. This methodology thereby yields 200 possible (or 'candidate') domain names for consideration, across each of the 50 TLDs, or 10,000 candidate domain names in total.
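
As a minimal sketch of this candidate-generation and lookup step (with truncated keyword and TLD lists, and assuming zone-file data has been downloaded and parsed locally into one-SLD-per-line text files; the file naming convention here is hypothetical):

    keywords = ["accountant", "agent", "analyst"]  # 100 keywords in the full study
    tlds = ["com", "net", "org"]                   # top 50 gTLDs in the full study

    def load_zone_slds(tld):
        """Read a locally cached set of registered SLDs for one TLD."""
        with open(tld + ".slds.txt") as f:
            return {line.strip().lower() for line in f}

    registration_map = {}
    for tld in tlds:
        registered = load_zone_slds(tld)
        for keyword in keywords:
            for sld in (keyword + "ai", "ai" + keyword):
                registration_map[sld + "." + tld] = sld in registered

    total = sum(registration_map.values())
    print(total, "of", len(registration_map), "candidate names are registered")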

The analysis shows that, of the 10,000 possible domain names of this format, 2,053 (20.5%, or just over one in five) are already registered. Figure 1 presents a more granular view: a 'registration map' of which names are already registered (shown in red), versus those which are absent from the zone file and therefore potentially unregistered and available (shown in green).

Figure 1: 'Registration map' for 'agentic-AI-style' domain names (red = registered, green = absent from zone file), where the second-level name (SLD) (i.e. the part of the domain name to the left of the dot) is shown on the vertical axis and the TLD (domain name extension) is shown on the horizontal axis. The dataset is sorted by (vertically, decreasing from top to bottom) the number of TLDs (out of 50) across which the SLD exists as a registered domain, and (horizontally, decreasing from left to right) the total number of SLDs (out of 100) which exist as registered domains across the TLD in question. Results are shown for the top 50 most commonly registered SLDs.

The top five most commonly registered SLD strings in the dataset are aiagent (with 'agent' likely referring to its technical, AI-related definition in most cases), agentai, aiart, aimusic, and aimarketing, existing as registered domains across 47, 41, 38, 38, and 36 (respectively) of the 50 TLDs considered in the analysis. Only three of the 200 strings do not appear as the SLDs of registered domain names across any of the 50 TLDs.

The top TLD in the dataset is .com (for which 197 of the 200 considered strings exist as the SLDs of registered domains), followed by .net (144), .org (139), .xyz (134), and .app (107). Only one TLD of the 50 (.ovh) does not feature any of the considered SLD strings as registered domains.

Examples of registered .com domains which also resolve to live website content are shown in Figure 2. Many of the remainder resolve to lower-threat content such as placeholder and parking pages, suggesting perhaps that they have been proactively registered for future intended use, or are being held as tradable commodities in their own right, given the potential use-cases for these types of name. aiagent[.]com (for example) resolves to a page offering the domain name for sale and requesting offers in excess of $1.5 million, and aibanking[.]com, aibarrister[.]com, aicontroller[.]com, aidesigner[.]com, and aiinvestment[.]com are all explicitly soliciting offers in excess of $100k.

Figure 2: Examples of 'agentic-AI-style' .com domain names resolving to live website content: aiaccountant[.]com, aianalyst[.]com, aidoctor[.]com, aiparalegal[.]com, aiphotographer[.]com, aireceptionist[.]com

2. AI-specific subdomains

The second piece of analysis considers the extent of the existence of AI-related subdomains (taking the specific example of URLs of the form AI.[site].[TLD]), on each of a series of the most popular (i.e. highest traffic) websites across the Internet. In particular, we consider the 47 most highly visited websites generally, derived from data from Similarweb and Semrush[21] (truncated from a top-50 list, but considering only examples comprising full, second-level domain names), and a dataset of the top 20 information technology (IT) company websites (according to Semrush[22]) - i.e. one example of an industry vertical where AI may be particularly relevant (noting that two domains, live.com and office.com, appear in both lists).

The analysis shows that a hostname of the form AI.[site].[TLD] resolves (i.e. is configured with an active DNS entry) for 20 of the top 47 websites globally (i.e. 43%, with 19 of these also generating a live HTTP (i.e. website) response) (Figure 3), and for 8 of the top 20 IT websites (40%, with 6 also showing a live HTTP response). This does not, of course, preclude the existence of other AI-specific areas of these websites which may use alternative naming conventions, such that these figures represent very much a lower limit on the proportion of sites already featuring dedicated AI-related sections. A code sketch of this type of check follows Figure 3.

Figure 3: Examples of AI-specific subdomains (of the form AI.[site].[TLD]) on domains within the top-50 list of most popular websites: ai.google.com (re-directs to ai.google), ai.facebook.com (re-directs to ai.meta.com), ai.baidu.com, ai.microsoft.com (re-directs to microsoft.com/en-us/ai)
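
A minimal sketch of this type of check, using only the Python standard library (the sites listed are drawn from Figure 3; a production version would need rate-limiting and more careful error handling):

    import socket
    import urllib.request

    def check_ai_subdomain(site):
        """Check whether ai.<site> has an active DNS entry and, if so,
        whether it also returns a live HTTP response."""
        host = "ai." + site
        try:
            socket.getaddrinfo(host, 443)  # active DNS entry?
        except socket.gaierror:
            return {"resolves": False, "http_live": False}
        try:
            status = urllib.request.urlopen("https://" + host, timeout=15).status
            return {"resolves": True, "http_live": status == 200}
        except Exception:
            return {"resolves": True, "http_live": False}

    for site in ["google.com", "facebook.com", "baidu.com"]:
        print(site, check_ai_subdomain(site))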

Discussion and conclusions

Many of the points discussed in this article are reminiscent of terminology used in the futurology study 'From Malthus to Mars'[23]; the work describes certain emerging capabilities as '10x technologies', referring to their capacity to be ten times more effective than their predecessors and to expand accessibility to a far wider audience. Some of the predictions referenced in this article are even more significant, potentially having the ability to push growth 'from 10x to 100x' - representing a step-change in capabilities, with the power to drive fundamental evolutions of the online landscape.

As AI continues to evolve in an ever-more-interconnected online ecosystem, it is likely that domain names will remain a foundational component of the overall landscape, comprising a permanent, trusted layer which is able to give every connected entity a unique identifier.

Some of these trends are already being observed, even across the existing legacy infrastructure, with significant growth in the numbers of registered domains with specific relevant name structures and/or containing relevant keywords. It will be interesting to see how near-future developments, such as the forthcoming second phase of the new-gTLD programme, the inevitable continued growth and evolution of AI technologies, the increasing interconnectedness of online channels, and the ongoing emergence of new AI use-cases and other areas of online technology, will contribute to this overall picture.

References

[1] https://www.imperva.com/blog/2025-imperva-bad-bot-report-how-ai-is-supercharging-the-bot-threat/

[2] https://ahrefs.com/blog/what-percentage-of-new-content-is-ai-generated/

[3] https://searchengineland.com/ai-traffic-up-seo-rewritten-459954

[4] https://ahrefs.com/blog/ai-traffic-study/

[5] https://www.semrush.com/blog/ai-search-seo-traffic-study/

[6] https://www.linkedin.com/pulse/2026-one-five-retail-transactions-completed-ai-agent-question-amit-6wl1e/

[7] https://onereach.ai/blog/agentic-ai-adoption-rates-roi-market-trends/

[8] https://www.gartner.com/en/documents/6894066

[9] https://www.pymnts.com/artificial-intelligence-2/2024/ai-to-power-personalized-shopping-experiences-in-2025/

[10] https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025

[11] https://www.cisco.com/c/dam/global/fr_fr/solutions/data-center-virtualization/big-data/solution-cisco-sas-edge-to-entreprise-iot.pdf

[12] https://pmc.ncbi.nlm.nih.gov/articles/PMC11085491/

[13] N. Quadar, A. Chehri, G. Jeon, M.M. Hassan, G. Fortino (2022). Cybersecurity Issues of IoT in Ambient Intelligence (AmI) Environment. IEEE Internet Things Mag., 5, pp. 140-145. doi: 10.1109/IOTM.001.2200009.

[14] https://pdfs.semanticscholar.org/f6fb/3f56f29f23cb8724fce2a7667f08e1641eb4.pdf

[15] For example, from Domain Summit Europe 2025.

[16] https://www.kuppingercole.com/watch/future-of-identity

[17] 'A Novel Zero-Trust Identity Framework for Agentic AI: Decentralized Authentication and Fine-Grained Access Control'; https://arxiv.org/html/2505.19301v2

[18] https://www.imf.org/en/News/Articles/2024/05/15/cf-an-ai-powered-boost-to-anguillas-revenues

[19] 'Patterns in Brand Monitoring' (D.N. Barnett, Business Expert Press, 2025), Chapter 9: 'Domain landscape analysis'

[20] 'Patterns in Brand Monitoring' (D.N. Barnett, Business Expert Press, 2025), Chapter 13: 'Analyzing trends in Web3'

[21] https://en.wikipedia.org/wiki/List_of_most-visited_websites

[22] https://www.semrush.com/website/top/global/information-technology/

[23] https://frommalthustomars.com/

This article was first published on 9 October 2025 at:

https://circleid.com/posts/how-the-growth-of-ai-may-drive-a-fundamental-step-change-in-the-domain-name-landscape
