Tuesday, 17 June 2025

The new new-gTLDs - Part 2: A wider domain of language support

As the build-up to the second round of the new-gTLD programme[1] continues towards its launch in April 2026, we take a look at the issue of non-English-language support within the framework.

The programme itself began in 2012, involving the addition of large numbers of new domain-name extensions (generic top-level domains, or gTLDs) to the Internet infrastructure. It incorporated a process whereby individual entities were able to apply to run (i.e. act as the registry organisation for) their own extension, thereby maintaining control over features such as whether the TLD would be reserved for their own use (e.g. as a 'dot-brand'[2]) or open for registrations by third parties. A second round of applications is set to begin in Q2 of next year.

As the new phase approaches, ICANN (the Internet Corporation for Assigned Names and Numbers, as the organisation overseeing the initiative) has announced a number of improvements to the way multiple languages are supported within the programme[3]. Key points include:

  • The addition of three further non-Latin scripts in which applied-for TLDs can be expressed.
  • The support of a greatly increased number of languages within the programme generally (to 380, up from 23 in the first round) - e.g. in areas surrounding technical provisions (such as compatibility with associated portals) and DNS infrastructure.
  • Improvements to the process for assessing string similarity and potential for name 'collisions' (i.e. the same name existing in different namespaces), including the incorporation of visual and phonetic similarity evaluations. The application process will also feature the ability to specify a 'second-choice' string (which must contain the first-choice version as a sub-string), for cases where the preferred version is deemed unacceptable, in addition to a more transparent process for resolving contentions.

These changes will give greater flexibility for entities operating in the non-English-speaking world, and are another area to consider for organisations assessing their place in the new-gTLD landscape (e.g. those considering an application for their own brand-specific extension). 

How might the implications of these changes manifest themselves as the second phase of the programme comes to fruition? One way of gaining possible insights in this area - e.g. regarding potential use-cases for foreign-language domains - is to consider the current state of the landscape, with an obvious source of relevant 'clean' data being the existing set of internationalised domain names (IDNs)[4] (i.e. those incorporating non-Latin characters). IDNs are a special subset of the full universe of non-English domain names generally, which do (of course) include large numbers of examples written just in Latin characters. The non-English Latin-character domain landscape already includes many whole gTLDs, such as (Chinese) .xin and .weibo; (French) .moi and .maison; (German) .jetzt, .kaufen, .reise and .versicherung; (Hindi) .desi; (Italian) .casa, .immo and .moda; (Portuguese) .bom; and (Spanish) .abogado, .futbol, .gratis, .tienda, .uno and .viajes, all of which (as non-IDNs) do not require any 'special' technical infrastructure.

As of the current time there are, however, around 150 internationalised new-gTLDs (i.e. where the domain-name extension itself is written in (or includes characters written in) a non-Latin script) which have been delegated into the Internet infrastructure[5]. Domain names (or domain-name extensions) of this type are sometimes expressed in an 'encoded' format called Punycode (in which they are converted to a string written wholly in Latin characters, denoted by the characters 'xn--' at the start), which is how they are expressed in zone-file raw data, for example.
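As an illustration of the encoding, the conversion between a non-Latin label and its 'xn--' form can be sketched with Python's built-in codecs. This is a minimal example rather than part of the original analysis: the helper names to_ascii_label and to_unicode_label are hypothetical, and the built-in 'idna' codec follows the older IDNA 2003 rules, which may differ from current registry practice for some characters.

```python
def to_ascii_label(label: str) -> str:
    """Encode a single Unicode label into its ASCII ('xn--') form."""
    return label.encode("idna").decode("ascii")


def to_unicode_label(label: str) -> str:
    """Decode an ASCII ('xn--') label back into Unicode."""
    return label.encode("ascii").decode("idna")


# A Chinese-language label meaning 'online'
encoded = to_ascii_label("在线")
print(encoded)                    # ASCII form, beginning 'xn--'
print(to_unicode_label(encoded))  # round-trips to the original label
```

Labels that are already plain ASCII pass through unchanged, so the same helpers can be applied to every label in a mixed dataset.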

Domain name zone files (containing lists of all registered domains across the extension in question, in addition to other technical configuration information) are published by (and are publicly available from) ICANN, for around 80 of these extensions, providing a ready source of data which can easily be analysed to identify trends and patterns in usage. Many of the remainder of the delegated IDN extensions are country-specific examples (e.g. comprising just a country name written in the local language, or an abbreviation, analogous to familiar non-IDN ccTLDs such as .uk, .fr and .de), or are extensions which may no longer be in active use.

For the approximately 80 IDN gTLDs for which zone-file data is available, it is possible to drill into the datasets to gain an overview of the specific domain names registered. Table 1 shows (for example) the most popular of these extensions by total numbers of registered domain names (for all IDN TLDs associated with more than 250 domains). 34 of the 80 extensions feature only five domains or fewer.
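The per-extension domain counts described above could be derived from raw zone-file data along the following lines. This is a sketch under stated assumptions: it presumes whitespace-separated records in the order owner, TTL, class, type, rdata, and treats any name directly below the TLD with an NS record as a registered domain. The function name and the sample records (on the reserved '.example' suffix) are hypothetical.

```python
def registered_domains(zone_lines, tld):
    """Return the set of second-level names with NS records in one TLD zone."""
    suffix = "." + tld.lower().strip(".") + "."
    domains = set()
    for line in zone_lines:
        fields = line.split()
        # Assumed layout: owner, TTL, class, type, rdata...
        if len(fields) < 5 or fields[3].lower() != "ns":
            continue
        owner = fields[0].lower()
        if not owner.endswith(suffix):
            continue  # skips the TLD apex itself and out-of-zone names
        sld = owner[: -len(suffix)]
        if sld and "." not in sld:  # keep only names directly below the TLD
            domains.add(sld)
    return domains


# Hypothetical sample records for illustration only
sample_zone = [
    "example.\t900\tin\tsoa\ta.nic.example. admin.example. 1 7200 900 1209600 3600",
    "example.\t3600\tin\tns\ta.nic.example.",
    "shop.example.\t3600\tin\tns\tns1.host.test.",
    "shop.example.\t3600\tin\tns\tns2.host.test.",
    "12345.example.\t3600\tin\tns\tns1.host.test.",
    "ns1.shop.example.\t3600\tin\ta\t192.0.2.1",
]
print(len(registered_domains(sample_zone, "example")))
```

Collecting names into a set means that domains with several name-server records are counted once.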

Table 1: The most popular IDN gTLDs currently, by numbers of registered domains (additional information mostly provided by Wikipedia[6])

It is noteworthy that the list of the most popular extensions is dominated by Chinese-language examples, mostly comprising generic terms (but with two brand-specific (China Mobile Communications Corporation, and CITIC Group) and two geographic (Guangdong and Foshan) extensions). 

As an illustrative example, it is informative to consider the list of 31,192 individual domains with the most popular of these extensions (.在线, a Chinese-language extension meaning 'online'). In the vast majority of these cases (29,811, or 95.6% of the total), the second-level domain (SLD) - i.e. the part of the domain name to the left of the dot - is also written in (wholly or partly) non-Latin script (Chinese in most cases), thereby comprising fully internationalised domain names. Of the remainder, 107 of the domain names consist purely of digits as the SLD (i.e. numeric domain names[7], which are often popular in markets such as China, where their use can circumvent language barriers and particular numbers may have specific cultural significance). The rest of the domains feature a range of different types of (Latin-character) terms in their SLD, including transliterations of Chinese words, a range of generic terms, and numerous brand references (including, presumably, both official and non-legitimate (potentially infringing) examples).
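The breakdown described above can be approximated with a simple classifier. The following is a rough sketch, not the method used for the figures quoted: the category names are hypothetical, and 'contains non-ASCII characters or begins xn--' is used as a proxy for 'written wholly or partly in non-Latin script', which would also catch accented Latin characters.

```python
from collections import Counter


def classify_sld(sld: str) -> str:
    """Assign an SLD to one of three broad categories."""
    if sld.startswith("xn--") or not sld.isascii():
        return "idn"      # Punycode-encoded or containing non-ASCII characters
    if sld.isdigit():
        return "numeric"  # digits only
    return "latin"        # everything else (ASCII letters, digits, hyphens)


# Hypothetical sample of SLDs, for illustration only
sample = ["在线", "xn--placeholder", "8888", "example-brand", "shop123"]
print(Counter(classify_sld(s) for s in sample))

# Share of fully internationalised names quoted in the text above
print(round(100 * 29811 / 31192, 1))
```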

Overall, therefore, the landscape of IDNs essentially just comprises a microcosm of the domain landscape generally, but offers additional options and flexibility for brands and consumers in markets where the local language includes non-Latin characters. In light of the increased support for a wider range of languages in the forthcoming phase of the new-gTLD programme, brand owners will once again want to consider the opportunities and risks within the ever-expanding online landscape.

References

[1] https://www.iamstobbs.com/opinion/the-new-new-gtlds

[2] https://www.iamstobbs.com/opinion/a-review-of-the-current-state-of-the-new-gtld-programme-dot-brands

[3] https://www.worldtrademarkreview.com/article/hundreds-of-languages-added-help-internationalise-new-gtlds-in-2026

[4] https://www.iamstobbs.com/opinion/idn-tifying-trends-insights-from-the-set-of-non-latin-domain-names

[5] https://data.iana.org/TLD/tlds-alpha-by-domain.txt

[6] https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

[7] https://www.iamstobbs.com/opinion/the-universe-of-numeric-domain-names

This article was first published on 17 June 2025 at:

https://www.iamstobbs.com/insights/the-new-new-gtlds-a-wider-domain-of-language-support

Thursday, 5 June 2025

An updated view of bad TLDs

A central component of the analysis of results identified through a brand monitoring programme is an assessment of the level of threat associated with each finding. Domain names specifically have a number of associated characteristics which can be used to quantify their potential threat, and much previous work has focused on the frequency of association of these characteristics with malicious content. This type of analysis can serve as a basis for the construction of algorithms to quantify the likely level of potential threat associated with any arbitrary identified website[1]. Threat scoring - as a method for prioritising findings - has a range of uses, including the identification of priority targets for further analysis, content tracking, or enforcement.

One such key feature is the domain-name extension (the top-level domain, or TLD - which includes examples such as .com or .xyz), with differing TLDs having wildly varying rates of popularity with infringers and other bad actors, due to a range of factors including registration costs, existence of IP protection programmes, and ease of enforcement. 

One previous study on the subject[2] compiled overall threat scores for a group of highly affected TLDs, based on an aggregation of data from other sources, including Spamhaus[3], Netcraft[4] and Palo Alto Networks[5], each of which encompassed insights relating to differing aspects of infringing behaviour. 

The release of Spamhaus' latest Domain Reputation Update report[6] - relating to classification of domains as malicious or suspicious based on a range of features, including association with spam, phishing, malware, ransomware and other fraudulent activities - provides an updated view of the highest-threat TLDs covering many relevant areas of infringement, and offers useful insights towards the construction of threat-scoring algorithms.

In considering the main insights, we focus primarily on the subset of Spamhaus' data concerning the rates of infringement within each TLD (i.e. the numbers of malicious domains as a proportion of the total number of domains across the TLD), rather than the absolute numbers, as this offers a more meaningful input into any potential threat-scoring metric. 
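The distinction between rates and absolute numbers can be made concrete with a short sketch. The TLD labels and counts below are invented purely for illustration and are not taken from the Spamhaus data.

```python
def malicious_rate(malicious: int, total: int) -> float:
    """Proportion of a TLD's domains that are marked as malicious."""
    if total <= 0:
        raise ValueError("total must be positive")
    return malicious / total


# Hypothetical (malicious, total) counts per TLD
counts = {
    "tld-a": (820, 1_000),
    "tld-b": (5_000, 100_000),
    "tld-c": (300, 600),
}

by_rate = sorted(counts, key=lambda t: malicious_rate(*counts[t]), reverse=True)
by_volume = sorted(counts, key=lambda t: counts[t][0], reverse=True)
print(by_rate)    # ranking by proportion of malicious domains
print(by_volume)  # ranking by absolute number of malicious domains
```

In this invented example, the TLD with the most malicious domains in absolute terms has the lowest rate, which is why the proportion is the more informative input when scoring an arbitrary domain on a given extension.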

Amongst the main points to take away from the study are the facts that:

  • There are four TLDs (.xin, .qpon, .locker[7], and .lgbt) for which more than half of the total set of domains in the zone file are marked as malicious - with the top example (.xin, popular with a Chinese audience, and usually translated as the word for 'new'[8]) having over 82% of its domains marked as malicious - and 11 TLDs where more than one-in-three of the domains are malicious. All of the top twenty TLDs (by proportion of malicious domains) are mid-size extensions, each with total numbers of domains in a range between 11,000 and 182,000.
  • A significant proportion of the domains marked as malicious have been found to be associated with Chinese gambling sites. Previous research has also found some suggestion that these types of sites may operate in conjunction with other types of infringement, for example by comprising alternative material to the 'primary' content, displayed only to certain users, in certain locations (i.e. geoblocked[9]), or at certain times or days.
  • Eight of the top 20 highest-threat TLDs in the Spamhaus dataset are associated with a single registry, BinkyMoon LLC, whose business model involves the offer of highly competitive prices for domains, a tactic which can often drive high levels of abuse.
  • A large number of the .xin domains - including many of the malicious examples - have names beginning with 'com-', a strategy noted in a recent Stobbs study[10] as being one way to create compelling deceptive infringements. The vast majority of these are registered through Dominet (HK) Limited, the registrar which rebranded from Alibaba.com Singapore E-Commerce Private Limited, following the issuing of a compliance notice by ICANN in March 2024[11].
  • Amongst the set of malicious domains, use of brand-related terms appears to be decreasing - perhaps, in part, due to their relative ease of detection and enforcement through brand-protection programmes - in favour of more generic, industry- or subject-related terms.

As a follow-up piece of analysis, even a direct visual inspection of the raw domain data in the zone files of the highest-threat TLDs (as given in the Spamhaus study) reveals some immediately apparent trends. The domains across the TLDs in question appear to include disproportionately high numbers of numeric domains[12] (i.e. where the SLD - the second-level domain name, or the part to the left of the dot - contains digits only), perhaps indicating a popularity of such domains for infringing use. The .xin file does indeed appear to contain large numbers of 'com-' examples, as well as 'us-' domains, which may have a similar potential use-case (i.e. constructing URLs resembling legitimate domains on the .us domain extension).

Results from a more detailed quantitative analysis of these and other relevant points, carried out using the full zone-file data, are outlined in Table 1 and Figure 1.

Table 1: Top-level statistics for the domains in Spamhaus' top ten highest-risk TLDs

Figure 1: Numbers of numeric domain names, by SLD length, for each of Spamhaus' top ten highest-risk TLDs

Some of the main insights from the overall analysis are as follows:

  • Across the top-ten highest-risk TLDs (from Spamhaus' data), numeric domain names account for a significant proportion of the total (53% of all domains across the full set of ten zone files). For five of the TLDs, numeric domains account for more than 70% of the total. The highest proportions are seen for .loan (87.66% numeric domain names in total) and .locker (84.56%). Of the numeric domain names, the vast majority are 4, 5 or 6 characters in length (4.1%, 70.7% and 23.9% of the total, respectively).
  • There are no obvious patterns in domain name entropy (a mathematical measure of the length and 'randomness' of a domain name) and, despite the fact that a significant proportion of the domains under consideration are (by definition) malicious, this is not reflected by a prevalence of particularly high entropy names[13] (as might typically be associated with automated registrations)[14]. Of the ten TLDs considered, the highest mean entropy was seen for the domains on .xin.
  • A prevalence of (potentially deceptive) 'com-' and 'us-' domains was seen only on .xin. For many of these domains, the portion of the SLD after 'com-' consisted of what appeared to be essentially random characters, but some specific use-patterns were identified, such as groups of domains with SLDs featuring specific keywords. These have a range of potential fraudulent use-cases, such as the construction of deceptive URLs resembling official sites for making payments (e.g. for road tolls) or accessing other financial or technical information.

Example groups included:

    • Toll-related domains - e.g. names of the form: com-highroadXXX, com-roadXXXXXX, com-tollXXXX or com-tollbillXXX
    • (Other) billing- or payment-related domains - e.g. com-lnvoiceXX [sic], and mis-spellings of com-payment and com-statementXX
    • Other classes of keywords: e.g. mis-spellings of com-lucky, com-passX, com-serviceXXX, com-shtmlXXXXX, com-ticketXXXX, com-updateXXX, and us-etcXX (perhaps in reference to the cryptocurrency Ethereum Classic)
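The per-domain characteristics discussed in the points above (numeric SLDs and their lengths, entropy, and 'com-' or 'us-' prefixes) could be extracted with something like the following. This is a sketch only: the function names are hypothetical, and entropy is computed here as Shannon entropy in bits per character, which is an assumption and may not match the definition used in the referenced studies (in particular, it does not by itself reflect the length of the name).

```python
import math
import re
from collections import Counter

PREFIX_RE = re.compile(r"^(com|us)-")


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def is_numeric(sld: str) -> bool:
    """True where the SLD consists of ASCII digits only."""
    return sld.isascii() and sld.isdigit()


def sld_features(sld: str) -> dict:
    """Collect the per-SLD characteristics discussed above."""
    match = PREFIX_RE.match(sld)
    return {
        "numeric": is_numeric(sld),
        "length": len(sld),
        "entropy": shannon_entropy(sld),
        "prefix": match.group(1) if match else None,
    }


def numeric_length_distribution(slds):
    """Count numeric SLDs by length, as might be plotted in a bar chart."""
    return Counter(len(s) for s in slds if is_numeric(s))


# Hypothetical SLDs following the name patterns described above
sample = ["12345", "99999", "123456", "com-tollbill123", "us-etc42", "example"]
for sld in sample:
    print(sld, sld_features(sld))
print(numeric_length_distribution(sample))
```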

Overall, these types of insights into (in this case) the highest-threat TLDs can greatly aid in the construction of metrics for prioritising brand monitoring results, and can thereby build efficiencies into the analysis and enforcement processes. The TLD of a webpage (or other online finding) is just one relevant characteristic, so this type of analysis would generally need to be combined with findings from other studies, in the construction of any overall threat-scoring algorithm. 

References

[1] https://circleid.com/posts/towards-a-generalised-threat-scoring-framework-for-prioritising-results-from-brand-monitoring-programmes

[2] https://circleid.com/posts/20230117-the-highest-threat-tlds-part-2

[3] https://www.spamhaus.org/statistics/tlds/

[4] https://trends.netcraft.com/cybercrime/tlds

[5] https://unit42.paloaltonetworks.com/top-level-domains-cybercrime/

[6] https://www.spamhaus.org/resource-hub/domain-reputation/domain-reputation-update-oct-2024-mar-2025/

[7] https://www.iamstobbs.com/opinion/some-more-new-domains-in-the-.locker

[8] https://tld-list.com/tld/xin

[9] https://circleid.com/posts/20220531-do-you-see-what-i-see-geotargeting-in-brand-infringements

[10] https://www.iamstobbs.com/insights/com-away-with-me-use-of-com-domains-in-the-construction-of-deceptive-url-like-hostnames

[11] https://www.icann.org/uploads/compliance_notice/attachment/1221/hedlund-to-chu-27mar24.pdf

[12] https://www.iamstobbs.com/opinion/the-universe-of-numeric-domain-names

[13] https://www.iamstobbs.com/opinion/un-.zip-ping-and-un-.box-ing-the-risks-associated-with-new-tlds

[14] https://circleid.com/posts/20230703-an-overview-of-the-concept-and-use-of-domain-name-entropy

This article was first published on 5 June 2025 at:

https://www.iamstobbs.com/insights/an-updated-view-of-bad-tlds

Tuesday, 3 June 2025

Using clustering and investigation techniques to connect and identify scam law-firm websites

During the Spring of 2025, a series of reports emerged of a campaign of scams, in which fraudsters have been impersonating trademark law firms and targeting brand owners with allegations of third-party attempts to register the brand name. In some cases, the fraudsters have been observed to create wholly fake organisations; in others, the names of legitimate firms have been used, in instances of brand impersonation (often with the use of a deceptive domain name similar to that of the real company).

In one such set of scams, for example, a group of company names and associated contact details as shown in Table 1 were found to have been used. All of these examples have explicitly been reported as scams on the website of the Solicitors Regulation Authority[1,2,3,4].

Table 1: Company names and credentials used in identified trademark law-firm scams

Whilst there is no reason to suppose that all of these examples are necessarily connected to each other as a single coordinated campaign, there are certainly connections between at least some of the scam sites. For example, four of the five entities use the same (apparently London-based) telephone number. This might be indicative of nothing more than a common website template being used across these scams, but if the telephone numbers are actively being used as a means of contact in the campaigns, there is a strong indication that at least these four are associated with the same underlying fraudsters.

These types of insights are key to the idea of clustering[5], which is a technique used in brand protection and other related areas of analysis, intended to establish links between infringements. An extension of this idea can also be combined with open-source intelligence ('OSINT') investigation techniques to identify other related examples. 

For example, a simple search for the shared phone number referenced above shows that it has also been used in yet another law-firm scam ('Wozi Law Firm', wozilawfirm[.]org)[6], impersonating a legitimate company of the same name (wozilaw[.]com). 

As of the date of analysis (08-May-2025), none of the scam websites in question were found to be live, although all were found previously to have been active for a sufficient length of time to have been indexed by Google. The associated abstracts provide some insights into the content which was formerly present (Figure 1), which can also serve as a basis for searches for other (potentially related) sites featuring similar content, or for further establishing similarities between the sites.

Figure 1: Example of the Google abstract for the scam site formerly present at cromfordlaws[.]com

In one of the above cases, a historical cached view of the site was available from the Archive.org website[7] (Figure 2).

Figure 2: Historical cached screenshots (from 24-Mar-2025) (courtesy of Archive.org) of the scam site formerly present at cromfordlawfirm[.]com

Carrying out a deeper analysis of the domains utilised in the various scams can also serve as a basis for establishing further clusters of associated examples. Table 2 shows the dates of registration, host-IP addresses, named registrants and registrars for the domains in question (for the most recent available whois records).

Table 2: Registration and configuration details for the domains utilised in the scams referenced above

It is also worth noting that, in some cases, the DomainTools website also possesses cached historical views of the sites in question (Figure 3).

Figure 3: Historical cached screenshot (from 13-Mar-2025) (courtesy of DomainTools) of the scam site formerly present at wozilawfirm[.]org

In cases where the registrant details are redacted (which is very common following the introduction of GDPR), information such as the identity of the privacy service provider does not serve as a very effective means of clustering together related results. However, the other details (as shown in Table 2) can be more diagnostic. It is particularly noteworthy that two of the IP addresses appear twice in the table which, whilst not definitive of a link between the co-hosted sites, can be a useful indicator if other commonalities are also present. 
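The idea of grouping results by a shared characteristic, such as the host IP address, can be sketched as follows. The function name, the record structure and the sample data (reserved '.example' names and documentation-range IP addresses) are all hypothetical and do not correspond to the entries in Table 2.

```python
from collections import defaultdict


def cluster_by(records, key):
    """Group domains by a shared attribute value, keeping groups of two or more."""
    groups = defaultdict(list)
    for rec in records:
        value = rec.get(key)
        if value:  # skip missing or redacted values
            groups[value].append(rec["domain"])
    return {v: sorted(d) for v, d in groups.items() if len(d) > 1}


# Hypothetical records for illustration only
records = [
    {"domain": "firm-a.example", "ip": "192.0.2.10", "registrar": "Registrar X"},
    {"domain": "firm-b.example", "ip": "192.0.2.10", "registrar": "Registrar X"},
    {"domain": "firm-c.example", "ip": "192.0.2.20", "registrar": "Registrar Y"},
    {"domain": "firm-d.example", "ip": None, "registrar": "Registrar X"},
]
print(cluster_by(records, "ip"))
print(cluster_by(records, "registrar"))
```

As noted above, a shared value in a single field is not definitive on its own; running the same grouping over several fields and looking for domains which recur together is one way to find the stronger candidate clusters.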

Reverse-IP-address look-ups reveal four further sites which are co-hosted with at least one of the examples shown in Table 2, also feature references to 'law' in the domain name, and show other characteristics such as registrar, name patterns and nearby registration dates in common (Table 3). It is highly likely that these comprise additional clusters of related scam sites and, whilst again none is currently active, a cached screenshot was again available in one case (Figure 4).

Table 3: Additional potential scam domains sharing hosting characteristics with one or more examples from Table 2

Figure 4: Historical cached screenshot (from 30-Apr-2025) (courtesy of DomainTools) of the scam site formerly present at cndlawfirms[.]com

It is also possible to extend these ideas to much broader domain searches. For example, zone-file analysis reveals that there are over 47,000 domains with names ending with (for example) 'lawfirm(s)'. Considering just the .org domains (to provide an easily manageable dataset, and by analogy with the wozilawfirm[.]org example identified previously) and focusing just on the domains registered through Hostinger, PDR, or Namecheap as registrar, since the start of 2025 (i.e. those most likely to be live and associated with the identified campaign(s)), we find 17 further candidate domains, of which seven resolve to additional live sites of potential concern. Two examples are shown in Figure 5 - both are registered via Hostinger Operations, UAB, hosted at the same IP address (34.120.137.41) and registered in a similar timeframe (on 19-Jan and 23-Jan respectively), and feature other suggestions of possible non-legitimacy (such as the use of placeholder content, webmail addresses, inconsistent contact details, etc.). The sites also have a broadly similar appearance, possibly suggestive of the use of a common website template.

Figure 5: Example of a 'mini-cluster' of two further sites of potential concern ('Nazakat' and 'Elite Law Firm')
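The broader search described above amounts to filtering a domain list by name pattern, registrar and registration date. A minimal sketch is given below; the function name, record structure and sample data are hypothetical, and the registrar labels are simplified short forms rather than the full names appearing in whois records.

```python
import re
from datetime import date

SUFFIX_RE = re.compile(r"lawfirms?$")  # SLD ends with 'lawfirm' or 'lawfirms'


def candidate_domains(records, registrars, since):
    """Return domains matching the name pattern, registrar set and date cut-off."""
    return [
        rec["domain"]
        for rec in records
        if SUFFIX_RE.search(rec["domain"].rsplit(".", 1)[0])
        and rec["registrar"] in registrars
        and rec["created"] >= since
    ]


# Hypothetical records for illustration only
records = [
    {"domain": "alphalawfirm.example", "registrar": "Hostinger", "created": date(2025, 1, 19)},
    {"domain": "betalawfirms.example", "registrar": "Namecheap", "created": date(2024, 6, 1)},
    {"domain": "gammalawfirm.example", "registrar": "Other", "created": date(2025, 2, 2)},
    {"domain": "deltalegal.example", "registrar": "PDR", "created": date(2025, 3, 3)},
]
print(candidate_domains(records, {"Hostinger", "PDR", "Namecheap"}, date(2025, 1, 1)))
```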

The ideas presented in this article - namely the use of analysis and investigation techniques to connect infringements and identify additional related examples - are key to a highly significant area of brand protection analysis. These types of approaches can be used to provide early identification of sites which may pose a threat - potentially before they are utilised extensively and subsequently reported online - and can be built into the analysis and prioritisation approaches used for active monitoring services. This is also an area where AI-based analysis can provide a compelling addition to traditional analysis techniques, in the identification of key features from highly rich datasets.

References

[1] https://www.sra.org.uk/consumers/scam-alerts/2025/apr/cromford-law/

[2] https://www.sra.org.uk/consumers/scam-alerts/2025/apr/spectre-law/

[3] https://www.sra.org.uk/consumers/scam-alerts/2025/may/cnd-law-ltd-david-marks-nick-cross/

[4] https://www.sra.org.uk/consumers/scam-alerts/2025/mar/ballard-and-trademark-expressive/

[5] https://circleid.com/posts/braive-new-world-part-1-brand-protection-clustering-as-a-candidate-task-for-the-application-of-ai-capabilities

[6] https://regulationandcomplianceoffice.co.uk/raco-roundup-9/

[7] https://web.archive.org/web/20250324145542/https://www.cromfordlawfirm.com/

This article was first published on 3 June 2025 at:

https://www.iamstobbs.com/insights/using-clustering-and-investigation-techniques-to-connect-and-identify-scam-law-firm-websites
