Friday, 29 December 2023

IDN-tifying trends: Insights from the set of non-Latin domain names

BLOG POST

Internationalized domain names (IDNs) are domain names featuring characters in non-Latin scripts, including examples featuring accented characters (such as münchen.de) and those which are entirely written in alternative character sets (such as яндекс.рф - Yandex Russia). This infrastructure allows brand owners to create domain names in local languages and target content to specific markets, but also provides potential for bad actors to create names which are deceptively similar to the official domain names of trusted brands (e.g. by substituting a character with a non-Latin equivalent appearing visually similar - a so-called 'homoglyph').

In this study, we consider the full set of registered IDNs across all gTLDs (generic top-level domains, or domain extensions) for which zone files are available, covering around 1,000 different extensions, to identify trends and patterns, and indicators of potential abuse.

Overall, there are around 1.3 million gTLD IDNs currently in existence, across 470 distinct domain extensions, with the most popular being .com (853k IDNs), .net (136k), and .在线 (Chinese for 'online') (28k).

267 distinct domain names were found to comprise homoglyph variations of any of the top ten most valuable global brands in 2023, and not to be under the control of the brand owner. A significant proportion of these feature indicators that they have been registered for infringing use, with 79 (30%) found to have active MX (mail exchange) records, indicating that they have been configured to be able to send and receive e-mails and could therefore be associated with phishing activity, and 128 (48%) having privacy-protected whois records. Various examples were identified as explicitly hosting fraudulent or infringing content, including instances of lookalike sites (e.g. ǥoogłe[.]com, googļe[.]com, googłe[.]online (re-directs to googłe[.]co) and ɠoogle[.]com (re-directs to gooqle[.]cm)), misdirection and brand confusion (e.g. gooqłe[.]com, gooġlɵ[.]com, visã[.]com and ɢoogle[.]net).

Across the set of these non-official homoglyph domains, the average number of replaced characters in the SLD name (the part of the domain name to the left of the dot) is 1.62, highlighting the necessity for the use of detection technologies able to analyse strings in full in order to detect visual similarity, rather than just identifying instances which differ from the official string by (say) a single character. Ten examples were identified of domains in which more than half of the characters have been replaced with non-Latin homoglyphs, including (all with 100% non-Latin characters): ᴀᴘᴘʟᴇ[.]com, арріе[.]com, арріе[.]net, арріө[.]com, аррӏе[.]com, ᴍɪᴄʀᴏꜱᴏꜰᴛ[.]com, ᴍᴄᴅᴏɴᴀʟᴅꜱ[.]com and ᴀᴍᴀᴢᴏɴ[.]com. Three of these domains have active MX records.

The following additional points warrant specific consideration by brand owners:

  • The number of homoglyph domains targeting trusted brands - and the significant proportion of these found to be actively infringing or to feature indicators of suspicious intentions - highlights the need for brand owners to monitor activity in this space, combined with tracking examples of concern for content changes and launching enforcement actions when appropriate. 
  • Many top brands incorporate instances of potentially-deceptive IDNs in their defensive domain portfolios; however, this approach in isolation is likely to be of limited effectiveness because of the infinite potential variations available to would-be infringers. Where domains are held for defensive reasons, it may be advisable for them to be configured to re-direct to the official brand website, to maximise traffic and minimise the risk of customer confusion.

Similar trends in potentially fraudulent domain registration activity have also been observed in the landscape of Web3 blockchain domains, which also allow for a wide range of non-Latin characters[1]. This arena is also worthy of careful consideration by brand owners, who may wish to explore brand protection strategies across these emerging technologies. This approach may be particularly valid as the availability of desirable domain names begins to run low across traditionally popular areas of the domain landscape, such as .com[2].

References

[1] https://www.iamstobbs.com/trends-in-web3-ebook

[2] https://www.iamstobbs.com/availability-of-domains-ebook

This article was first published on 29 December 2023 at:

https://www.iamstobbs.com/opinion/idn-tifying-trends-insights-from-the-set-of-non-latin-domain-names

* * * * *

WHITE PAPER

Executive Summary

Internet technology allows users to create domain names with characters in non-Latin scripts, allowing targeting of content to local markets. These so-called internationalized domain names (IDNs) can, however, also be abused by bad actors to create deceptive websites with names which appear visually extremely similar to the official domains of trusted brand websites.

In this study, we consider the set of IDNs currently registered across all (approximately 1,000) gTLDs (generic top-level domains, or domain extensions) which have zone files available from ICANN's Centralized Zone Data Service, to identify trends and patterns, and indicators of potential abuse.

The main findings of the analysis are as follows:

  • Across the set of gTLDs, around 1.3 million IDNs are currently registered, covering 470 distinct domain extensions. The top three TLDs, by numbers of IDNs, are .com (853k IDNs), .net (136k), and .在线 (Chinese for 'online') (28k).
  • The top three languages utilised for IDNs are Chinese (506k IDNs), Korean (136k), and German (113k).
  • Across the dataset, the IDNs range in length from 1 to 57 characters.
  • 388 IDNs were identified with SLD[1] names appearing visually similar to those of the main corporate website of any of the top ten most valuable global brands. In these cases, one or more characters from the brand name have been replaced by a non-Latin character which appears visually similar - these are so-called 'homoglyph domains', and present the potential for deceptive misuse by bad actors.
  • Excluding the domains which appear to be under the ownership of the brand in question (121 instances; presumably defensive registrations, etc.), the following observations can be made about the other homoglyph domains for these top ten brands in the dataset:
    • 73 (27%) return a live website response
    • 79 (30%) have active MX (mail exchange) records, indicating that they have been configured to be able to send and receive emails and could therefore be associated with phishing activity
    • 128 (48%) have registrant information redacted using a privacy-protection service
  • Many of these domains are being actively used to host fraudulent or infringing content, including instances of lookalike sites, misdirection and brand confusion.
  • Across the dataset of non-official homoglyph domains for the top ten brands, the average number of replaced characters in the SLD name is 1.62. There are ten examples of domains in which more than half of the characters have been replaced with non-Latin homoglyphs, including (all with 100% non-Latin characters): ᴀᴘᴘʟᴇ[.]com, арріе[.]com, арріе[.]net, арріө[.]com, аррӏе[.]com, ᴍɪᴄʀᴏꜱᴏꜰᴛ[.]com, ᴍᴄᴅᴏɴᴀʟᴅꜱ[.]com and ᴀᴍᴀᴢᴏɴ[.]com. Three of these domains have active MX records.

Introduction

Modern Internet infrastructure allows for the creation of domain names containing non-Latin characters, such as accented characters and text from wholly distinct character sets (internationalized domain names, or IDNs). Whilst this presents opportunities for brand owners to create domain names in local languages, and engage with target audiences, it does also present opportunities for fraudsters to create names which are deceptively similar to the official domain names of trusted brands, for example by substituting a character with a non-Latin equivalent appearing visually similar (so-called 'homoglyphs').

In this study, we consider the set of IDNs present across the full range of generic top-level domains (gTLDs) or domain extensions, using information from ICANN’s Centralized Zone Data Service[2]. In these zone files (domain configuration data files), IDNs are stored in an encoded form called Punycode, as Latin-character strings beginning 'xn--'. The encoded version displays any Latin characters from the domain name and also represents (as Latin characters) any non-Latin characters and their relative positions within the string (e.g. hermès.com is represented in Punycode as xn--herms-7ra.com). For the analysis, all Punycode domains are translated to their true IDN equivalents, and trends and patterns in the dataset are inspected.
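As a minimal sketch of the translation step described above (not necessarily the tooling used for the study), Python's built-in 'punycode' codec can convert between the two forms one label at a time. The function names to_unicode and to_punycode are illustrative assumptions.

```python
def to_unicode(domain: str) -> str:
    """Decode every 'xn--' label of a domain name to its Unicode form."""
    labels = []
    for label in domain.split("."):
        if label.lower().startswith("xn--"):
            # Strip the 'xn--' prefix and decode the remainder as Punycode.
            labels.append(label[4:].encode("ascii").decode("punycode"))
        else:
            labels.append(label)
    return ".".join(labels)


def to_punycode(domain: str) -> str:
    """Encode every non-ASCII label of a domain name as 'xn--' + Punycode."""
    labels = []
    for label in domain.split("."):
        if label.isascii():
            labels.append(label)
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


print(to_punycode("münchen.de"))        # xn--mnchen-3ya.de
print(to_unicode("xn--mnchen-3ya.de"))  # münchen.de
```

The raw codec applies no validation or case-mapping rules, so a production pipeline would probably add checks on top of it.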

Analysis

1. Top-level statistics for the full dataset

Overall, around 1.3 million IDNs exist across the set of gTLDs for which zone files are available. 470 distinct TLDs have at least one IDN registered. Table 1 and Figure 1 show the top ten most popular TLDs for IDN registrations.

  TLD                                              No. of IDNs
  com                                              853,308
  net                                              135,775
  在线 (xn--3ds443g) (Chinese for 'online')        27,956
  top                                              24,988
  商标 (xn--czr694b) (Chinese for 'trademark')     24,894
  公司 (xn--55qx5d) (Chinese for 'company')        23,972
  org                                              22,470
  info                                             17,365
  网络 (xn--io0a7i) (Chinese for 'network')        16,377
  online                                           15,670

Table 1: Top TLDs by numbers of IDNs

Figure 1: Top TLDs by numbers of IDNs

It is noteworthy that four of the top ten most popular TLDs for non-Latin domain names are themselves internationalized extensions (which can also be alternatively represented in Punycode format). All of the examples in this case are in the Chinese language.
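A tally of this kind can be sketched as follows. This is an illustration only: it assumes a name counts as an IDN when its second-level label begins with 'xn--', and the function name count_idns_by_tld and the sample names are hypothetical.

```python
from collections import Counter


def count_idns_by_tld(domains):
    """Count IDNs per TLD from ASCII-form (zone-file style) domain names.

    Assumption: a name is treated as an IDN when its second-level
    label starts with the 'xn--' prefix.
    """
    counts = Counter()
    for name in domains:
        # Zone files often write names with a trailing dot; drop it.
        labels = name.lower().rstrip(".").split(".")
        if len(labels) < 2:
            continue
        sld, tld = labels[-2], labels[-1]
        if sld.startswith("xn--"):
            counts[tld] += 1
    return counts


sample = [
    "xn--mnchen-3ya.com.",
    "example.com.",
    "xn--herms-7ra.com",
    "xn--fiq228c.xn--3ds443g",
]
counts = count_idns_by_tld(sample)
print(counts.most_common())
```

Keying the counter on the ASCII form of the TLD keeps internationalized extensions distinct without needing to decode them first.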

Table 2 and Figure 2 show the top ten most popular languages[3] represented in the second-level domain names (SLDs) (i.e. the part of the domain name to the left of the dot) of the set of IDNs.

  SLD language code    SLD language[4,5]    No. of IDNs
  zh                   Chinese              505,952
  ko                   Korean               136,248
  de                   German               112,566
  ja                   Japanese             104,861
  th                   Thai                 42,153
  en                   English              38,844
  zh-Hant              Chinese (trad.)      35,952
  tr                   Turkish              34,859
  es                   Spanish              34,576
  fr                   French               32,802

Table 2: Top SLD languages by numbers of IDNs

Figure 2: Top SLD languages by numbers of IDNs

Perhaps unsurprisingly, the set of IDNs is dominated by languages using entirely non-Latin alphabets, with four of the top five languages utilising alternative character sets. In total, 125 different languages are represented within the dataset.

Figure 3 shows the distribution of domain-name (SLD) lengths (in characters) across the full dataset, considering the IDN representations of the names (rather than their Punycode equivalents).

Figure 3: Distribution of IDN (SLD) lengths

The longest domain name (SLD) in the dataset is 57 characters (1 instance). A full list of all IDNs of 56 characters in length or greater is shown in Appendix A. The dataset also includes over 14,000 IDNs with an SLD length of one character.

2. Deceptive homoglyph domain names

In this section we consider homoglyph domain names where the SLD is identical to that of the main official corporate domain name of any of the top ten most valuable global brands in 2023[6], apart from the replacement of one or more characters with a (non-Latin) character which appears visually similar, and with no additional keywords or other terms - in other words, these are the IDNs with the greatest potential for customer confusion and fraudulent use relating to these brands. Table 3 shows the number of such variants identified within the dataset, noting that some of these appear likely (on the basis of registrant details and/or the use of an enterprise-class domain registrar) to be under the control of the official brand owner, presumably being held as defensive registrations or for other purposes (e.g. for use in internal phishing tests).

  Brand string     .com         Other gTLDs    Total
  apple            30 (10)      6 (1)          36 (11)
  google           159 (44)     27 (6)         186 (50)
  microsoft        34 (0)       7 (0)          41 (0)
  amazon           82 (42)      17 (6)         99 (48)
  mcdonalds        4 (0)        0              4 (0)
  visa             5 (2)        0              5 (2)
  tencent          1 (0)        0              1 (0)
  louisvuitton     1 (0)        0              1 (0)
  mastercard       10 (10)      0              10 (10)
  coca(-)cola*     5 (0)        0              5 (0)
  Total            331 (108)    57 (13)        388 (121)

* Hyphen optional

Table 3: Total numbers of homoglyph domain names for each of the top ten most valuable global brands. Shown in brackets are the numbers of these domains which appear to be under the control of the official brand owner.

The visual similarity of some of these domains to the names of the official sites in question is striking; for example, the list of homoglyph domains for Google (the top-ten brand most heavily targeted by this type of infringement) is shown in Appendix B.

Considering only the 267 domains which are apparently not under the control of the official brand owner in question, the following observations are apparent:

  • 73 (27%) return some sort of live website response (i.e. an HTTP status code of 200)
  • 79 (30%) have active MX records, indicating that they have been configured to be able to send and receive e-mails and could therefore be associated with phishing activity. 53 of these have no active website and may be being used for their e-mail functionality only.
  • 128 (48%) explicitly make use of some sort of privacy-protection service in their whois record, as is often the case for domains registered for egregious use.
  • The registrar breakdown is dominated by retail-grade providers, often popular with infringers[7], with the top three within the dataset found to be GoDaddy.com, LLC (102 domains), Squarespace Domains II LLC (38) and NameCheap, Inc. (25).
  • Many of the domains have been long-lived, with the earliest examples found to have creation dates within 2001. Only 46 of the domains were registered during 2023, though activity appears to be ongoing, with the newest example registered on 13-Sep-2023.
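The percentages quoted above follow directly from the counts in Table 3, as this short arithmetic check shows (the variable names are illustrative):

```python
# Figures taken from Table 3 and the observations listed above.
total_homoglyph_domains = 388
brand_owned = 121
non_official = total_homoglyph_domains - brand_owned  # 267

indicators = {
    "live website (HTTP 200)": 73,
    "active MX records": 79,
    "privacy-protected whois": 128,
}


def share(count: int, base: int) -> int:
    """Return count as a whole-number percentage of base."""
    return round(100 * count / base)


shares = {label: share(count, non_official) for label, count in indicators.items()}
for label, pct in shares.items():
    print(f"{label}: {pct}%")
```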

Amongst the homoglyph domains resolving to live content, there are a number of examples of particular concern (Figure 4).

Figure 4: Examples of live sites of concern hosted on homoglyph domains targeting any of the top ten most valuable global brands:
  • Lookalike sites: 
    • (a) ǥoogłe[.]com (xn--ooge-21a88g[.]com); 
    • (b) googļe[.]com (xn--googe-m6a[.]com); 
    • (c) googłe[.]online (xn--googe-n7a[.]online) - re-directs to googłe[.]co (xn--googe-n7a[.]co); 
    • (d) ɠoogle[.]com (xn--oogle-kmc[.]com) - re-directs to gooqle[.]cm
  • Misdirection / brand confusion / other brand misuse: 
    • (e) gooqłe[.]com (xn--gooqe-n7a[.]com); 
    • (f) gooġlɵ[.]com (xn--gool-dxa55r[.]com); 
    • (g) visã[.]com (xn--vis-ola[.]com); 
    • (h) ɢoogle[.]net (xn--oogle-wmc[.]net)
  • Possible piracy site: 
    • (i) äpple[.]online (xn--pple-koa[.]online)
  • Other brand issues: 
    • (j) googľe[.]com (xn--googe-y6a[.]com), gᴑȯgle[.]com (xn--ggle-v0b5042b[.]com), gȯoglɵ[.]com (xn--gogl-v0b73b[.]com) and goȯglɵ[.]com (xn--gogl-w0b63b[.]com)
  • Domain name offered for sale:
    • (k) ɢooɢle[.]com (xn--oole-47bc[.]com)
  • Parking page featuring industry-relevant pay-per-click ads: 
    • (l) amazoñ[.]com (xn--amazo-sta[.]com)

We next further analyse the set of (non-officially owned) homoglyph domains with SLD names similar to any of the top ten most valuable global brands, with a view to identifying the proportion of each string consisting of replaced characters (e.g. for 'ᴀpple[.]com' (xn--pple-k13a[.]com), where the only non-Latin character is the 'ᴀ', there is one replaced character out of five (i.e. 20% of the whole string)). For this analysis, we exclude seven examples where the whole SLD is in a consistent non-Latin script (e.g. 'γοογλε' (Greek) for 'google' and 'амазон' (Cyrillic) for 'amazon', as these domains may be intended for targeting towards a non-English market, rather than being explicitly deceptive), leaving a dataset of 260 domains.
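The calculation in the 'ᴀpple' example can be sketched as below. As a simplifying assumption, every non-ASCII character in the SLD is treated as a replaced character; this would miss Latin-for-Latin swaps such as 'q' for 'g', so it is not necessarily the exact rule used for the study. The function name replaced_fraction is hypothetical.

```python
def replaced_fraction(sld: str) -> tuple[int, float]:
    """Return (number of non-ASCII characters, their share of the SLD).

    Assumption: a 'replaced' character is any character outside ASCII.
    """
    if not sld:
        raise ValueError("SLD must not be empty")
    replaced = sum(1 for ch in sld if not ch.isascii())
    return replaced, replaced / len(sld)


# 'ᴀpple': small-capital A (U+1D00) followed by Latin 'pple'
count, fraction = replaced_fraction("\u1d00pple")
print(count, f"{fraction:.0%}")  # 1 20%
```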

Table 4 shows the total number of domains in the dataset for each of the ten brand strings, and the average number of characters (and proportion of the total) in the brand string which are replaced, calculated across all relevant domains in each case.

  Brand string     No. domains    Mean no. of replaced characters    Mean % of replaced characters
  apple            25             1.92                               38%
  google           135            1.67                               28%
  microsoft        41             1.39                               15%
  amazon           45             1.44                               24%
  mcdonalds        4              3.00                               33%
  visa             3              1.00                               25%
  tencent          1              2.00                               29%
  louisvuitton     1              2.00                               17%
  mastercard       0              -                                  -
  coca(-)cola*     5              1.20                               15%

* Hyphen optional

Table 4: Number of (non-official) homoglyph domain names for each of the top ten most valuable global brands, and the average number and proportion of replaced characters across the set in each case

Across the full dataset, the average number of replaced characters in a homoglyph domain is 1.62 (26% of the whole string), highlighting the necessity for the use of detection technologies able to analyse strings in full in order to detect visual similarity, rather than just identifying instances which differ from the official string by (say) a single character. The dataset includes ten domains in which more than one half of the characters in the string are replaced by non-Latin homoglyphs. These are listed in Table 5.
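One way to compare whole strings, rather than looking for single-character differences, is to reduce each candidate to a Latin 'skeleton' and compare that with the brand string. The sketch below is illustrative only: the CONFUSABLES table is a tiny hand-picked sample (a real system would need a far larger mapping, such as the confusables data published with Unicode Technical Standard #39), and the function names are assumptions.

```python
import unicodedata

# Tiny illustrative mapping of lookalike characters to Latin letters.
# Hand-picked for this sketch; not a complete confusables table.
CONFUSABLES = {
    "\u0142": "l",  # ł  Latin small letter l with stroke
    "\u0262": "g",  # ɢ  Latin letter small capital G
    "\u0261": "g",  # ɡ  Latin small letter script g
    "\u0430": "a",  # а  Cyrillic small letter a
    "\u0435": "e",  # е  Cyrillic small letter ie
    "\u043e": "o",  # о  Cyrillic small letter o
}


def skeleton(text: str) -> str:
    """Strip diacritics and map known lookalikes to Latin letters."""
    out = []
    for ch in unicodedata.normalize("NFKD", text.lower()):
        if unicodedata.combining(ch):
            continue  # drop combining accents left over after decomposition
        out.append(CONFUSABLES.get(ch, ch))
    return "".join(out)


def is_homoglyph_of(candidate: str, brand: str) -> bool:
    """True if candidate differs from brand but shares its skeleton."""
    return candidate != brand and skeleton(candidate) == brand


# Two and three substituted characters respectively are still caught.
print(is_homoglyph_of("\u0262oo\u0262le", "google"))         # ɢooɢle
print(is_homoglyph_of("g\u00f6\u00f6gl\u00e9", "google"))    # gööglé
```

Because the comparison is made after normalisation, the number of substituted characters does not matter, which is the property the 1.62 average calls for.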

  Domain name        Punycode representation             No. of replaced characters    % of replaced characters
  ᴀᴘᴘʟᴇ[.]com        xn--spa916kwa0ea[.]com              5                             100%
  арріе[.]com†       xn--80ak6aa4i[.]com                 5                             100%
  арріе[.]net†       xn--80ak6aa4i[.]net                 5                             100%
  арріө[.]com†       xn--80a6aa2gv8a[.]com               5                             100%
  аррӏе[.]com        xn--80ak6aa92e[.]com                5                             100%
  ᴍɪᴄʀᴏꜱᴏꜰᴛ[.]com    xn--9na8b158j8ana8f5252lha[.]com    9                             100%
  ᴍᴄᴅᴏɴᴀʟᴅꜱ[.]com    xn--koa0gs43goafd4cs67392a[.]com    9                             100%
  ᴀᴍᴀᴢᴏɴ[.]com       xn--koa507ka5cl7i[.]com             6                             100%
  ɡᴏᴏɡle[.]com       xn--le-igba3625aa[.]com             4                             67%
  gᴑᴑgḷɵ[.]com       xn--gg-9hb063ya97o[.]com            4                             67%

Table 5: Domains in which more than half of the characters are replaced by non-Latin homoglyphs

None of these domains resolves to any significant content as of the time of analysis (one resolves to a page featuring pay-per-click links; one (titled 'IDN Homograph Example') links to an article on IDN-based phishing[8]; and one re-directs to the official Yahoo website). However, three of the domains (marked with a dagger (†)) have active MX records, which is an obvious source of potential concern.

Outside the set of top ten brands, a number of additional strings were also found to have been particularly highly represented in the set of homoglyph domains (though potentially comprising a mixture of infringements and officially-owned domains). Some examples are shown in Appendix C. By far the most common strings for which variants were observed in the dataset are 'aresmgmt' (presumably in reference to investment management company Ares Management, aresmgmt.com) and 'united' (referring to United Airlines, united.com). In the latter case, the domains appear generally to be owned by United Airlines (though are not resolving to any significant content); in the former case, however, the domains appear generally not to be under official ownership (registered via GoDaddy / Domains By Proxy LLC) and may consequently be of concern. Overall, these types of homoglyph infringements appear to be much more common on .com than across the other gTLDs - presumably a reflection of the frequency of use of .com for official sites, and the corresponding potential for confusion.

Conclusion

The greatest source of concern from this analysis is the large number of homoglyph domains which appear visually extremely similar to the domain names of the official websites of trusted brands, and thereby present significant potential for customer confusion and corresponding fraudulent activity. For the top ten most valuable global brands, the significant number found to be actively resolving to infringing content, together with the large number of others featuring indicators of risk (active MX records, privacy-protected whois records and/or use of retail-grade registrars), indicates that these types of domain are indeed frequently used for brand attacks.

These observations highlight the importance of brand owners employing proactive programmes of brand monitoring and enforcement, using technologies which are able to detect these types of brand variants, rather than just exact- or substring brand matches. The importance of monitoring dormant domains for subsequent changes to site content is also clear.

Part of the solution may also be a defensive registration policy, though - as we have seen from the examples in this analysis - the infinite scope for homoglyph-type variations means that this approach in isolation will only take a brand owner so far (and may be costly); in cases where brands have been found to be holding portfolios of homoglyph domains for defensive purposes, there are typically at least as many equally convincing other variant domains available for registration, or currently held by third parties. It may also be generally advisable - where domains are held for defensive reasons - for brand owners to ensure that they are configured to re-direct to the official brand website, to maximise traffic and minimise the risk of customer confusion.

Appendix A: List of all IDNs with an SLD length of 56 characters or greater

  Domain name  Language code  SLD length (characters)
  ဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪဪ[.]name  my  57
  abogadoynotariosalvadoreñoenelvalledesanfernandoenviosde[.]com  es  56
  adrianameneghini-psicológapsicoterapeutanaabordagempsica[.]com  pt  56
  alloservicetaxiconventionnévsltransenprovencelucarcsmuyc[.]com  en  56
  amcouvertureétenchéité-nettoyage-hydrofuge-anti-mousse-t[.]com  fr  56
  asociacioncolombianadeconductoresdevehículosparticulares[.]com  es  56
  authenticaspécialistedesbrunchsmariageanniversairecockta[.]com  en  56
  carineinstitutdrainagelymphatiquerenatafrançahydrofacial[.]com  en  56
  christelleprevotarata-agenceimmobilière-parentis-en-born[.]com  fr  56
  cottunettoyageetpréparationesthétiqueautoathouars79100et[.]com  fr  56
  coupeénergétiquevibratoirecoiffeurenconsciencelavillaauc[.]com  fr  56
  dessoyrénovdemoussagetoiturepeinturehydrofugecornichetra[.]com  en  56
  énergétique-traditionnelle-chinoise-acupuncture-bordeaux[.]com  fr  56
  entreprisedepeintureksn91-peintredintérieuretdextérieurr[.]com  fr  56
  fındıklıpvcpencerevekapıtamirotamatikkepenktamirsineklik[.]com  tr  56
  gîte-4personnes-jacuzzi-sauna-privatif-stmalo-mtstmichel[.]com  fr  56
  nuestra-señora-del-rosario-sanfernando-y-santiago-merced[.]com  es  56
  recyclagedemetauxàdomicilecommercialetchantierconstructi[.]com  en  56
  secure-hizmet-24932495ı249u2492-sahibinden-param1guvende[.]com  tr  56
  slovianhair-uslugifryzjerskieprzedluzanieizageszczaniewł[.]com  pl  56
  asociacioncolombianadeconductoresdevehículosparticulares[.]org  es  56
  die-neue-sammlung-museum-für-angewandte-kunst-verwaltung[.]bayern  de  56
  frenteintercontinentalventanadelaluchapopularsalvadoreña[.]org  es  56
  guinchopantaneirocpa-serviçodereboque-autosocorro-maisba[.]dev  pt  56
  ministerialbeauftragter-für-die-gymnasien-in-oberfranken[.]bayern  de  56
  staatliches-bauamt-augsburg-strassenmeisterei-nördlingen[.]bayern  de  56
  staatliches-bauamt-würzburg-strassenmeisterei-ochsenfurt[.]bayern  de  56
  wasserwirtschaftsamt-ansbach-seemeisterstelle-altmühlsee[.]bayern  de  56
  wasserwirtschaftsamt-kempten-flussmeisterstelle-türkheim[.]bayern  de  56
  wasserwirtschaftsamt-münchen-flussmeisterstelle-freising[.]bayern  de  56
  wasserwirtschaftsamt-nürnberg-flussmeisterstelle-rothsee[.]bayern  de  56
  頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂頂[.]top  zh-Hant  56
  顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶顶[.]top  zh  56

Appendix B: List of homoglyph domains for Google

N.B. Examples shown in square brackets are likely to be under the control of the official brand owner.

  • googǀe[.]com
  • googǀɵ[.]com
  • googıe[.]com
  • googíe[.]com
  • googīe[.]com
  • [googɩe[.]com]
  • goọgie[.]com
  • gọọgie[.]com
  • ǥooǥɩe[.]com
  • ɡooɡɩe[.]com
  • googıɵ[.]com
  • googɨɵ[.]com
  • gꝏgle[.]com
  • googʟe[.]com
  • [gooɢle[.]com]
  • goᴏgle[.]com
  • gᴏogle[.]com
  • gᴏᴏgle[.]com
  • [ɢoogle[.]com]
  • ɢooɢle[.]com
  • googlé[.]com
  • googlė[.]com
  • googlê[.]com
  • [googlë[.]com]
  • googlě[.]com
  • googlĕ[.]com
  • googlē[.]com
  • googlę[.]com
  • [googlẹ[.]com]
  • googlǝ[.]com
  • googlə[.]com
  • googlɘ[.]com
  • [googĺe[.]com]
  • googľe[.]com
  • googļe[.]com
  • googƚe[.]com
  • googłe[.]com
  • googłė[.]com
  • googłę[.]com
  • googḷe[.]com
  • googḷė[.]com
  • googḷẹ[.]com
  • googɭe[.]com
  • googḻe[.]com
  • gooġle[.]com
  • [gooģle[.]com]
  • gooǥle[.]com
  • gooǥlė[.]com
  • gooǥƚe[.]com
  • gooǥłe[.]com
  • gooɠle[.]com
  • [gooɡle[.]com]
  • goᴑgle[.]com
  • gᴏᴑgle[.]com
  • goógle[.]com
  • goóglé[.]com
  • [goògle[.]com]
  • goòglè[.]com
  • [goȯgle[.]com]
  • [goôgle[.]com]
  • goôglê[.]com
  • [goögle[.]com]
  • goōgle[.]com
  • goõgle[.]com
  • goøgle[.]com
  • [goơgle[.]com]
  • goọgle[.]com
  • goọglẹ[.]com
  • gᴑogle[.]com
  • gᴑᴑgle[.]com
  • gᴑȯgle[.]com
  • [góogle[.]com]
  • góógle[.]com
  • góóglè[.]com
  • góóglę[.]com
  • góògle[.]com
  • [gòogle[.]com]
  • gòógle[.]com
  • gòóglè[.]com
  • gòògle[.]com
  • [gȯogle[.]com]
  • gȯᴑgle[.]com
  • gȯȯgle[.]com
  • gȯȯglė[.]com
  • gôogle[.]com
  • gôógle[.]com
  • gôôgle[.]com
  • [gôôglè[.]com]
  • gôôglê[.]com
  • [gôōgle[.]com]
  • gôõgle[.]com
  • göogle[.]com
  • göógle[.]com
  • göôgle[.]com
  • göögle[.]com
  • [gööglë[.]com]
  • göõgle[.]com
  • gōogle[.]com
  • gõogle[.]com
  • gõôgle[.]com
  • gõõgle[.]com
  • gøogle[.]com
  • gøøglé[.]com
  • [gơogle[.]com]
  • gơoǥle[.]com
  • gơơgle[.]com
  • [gọogle[.]com]
  • [gọọgle[.]com]
  • ġoogle[.]com
  • ġooglė[.]com
  • ġøøgle[.]com
  • ĝoogle[.]com
  • [ğoogle[.]com]
  • ḡoogle[.]com
  • [ḡooḡle[.]com]
  • [ģoogle[.]com]
  • [ǥoogle[.]com]
  • ǥooglė[.]com
  • ǥoogłe[.]com
  • ǥooġle[.]com
  • ǥooǥle[.]com
  • ɠoogle[.]com
  • [ɡoogle[.]com]
  • ɡooɡle[.]com
  • ɡᴏᴏɡle[.]com
  • googlɞ[.]com
  • gꝏglɵ[.]com
  • googĺɵ[.]com
  • googľɵ[.]com
  • googɫɵ[.]com
  • googłɵ[.]com
  • gooġlɵ[.]com
  • goᴑglɵ[.]com
  • goȯglɵ[.]com
  • gᴑᴑglɵ[.]com
  • gᴑᴑgḷɵ[.]com
  • gȯoglɵ[.]com
  • ǥooglɵ[.]com
  • [googlœ[.]com]
  • [googlе[.]com]
  • gooqłe[.]com
  • gȯoqle[.]com
  • gooqlɵ[.]com
  • [goοgle[.]com]
  • [goοglе[.]com]
  • [goоgle[.]com]
  • [goоglе[.]com]
  • [gοogle[.]com]
  • [gοoglе[.]com]
  • [gοοglе[.]com]
  • [gοоgle[.]com]
  • [gοоglе[.]com]
  • [gоogle[.]com]
  • [gоoglе[.]com]
  • [gоοgle[.]com]
  • [gоοglе[.]com]
  • [gооgle[.]com]
  • [gооglе[.]com]
  • γοογλε[.]com
  • göögle[.]biz
  • [goögle[.]info]
  • [göogle[.]info]
  • [göögle[.]info]
  • ɢoogle[.]net
  • googlé[.]net
  • googlè[.]net
  • googlê[.]net
  • googlë[.]net
  • googlə[.]net
  • googłe[.]net
  • [góogle[.]net]
  • góógle[.]net
  • gòògle[.]net
  • gôôgle[.]net
  • göogle[.]net
  • göögle[.]net
  • gõõgle[.]net
  • gøøgle[.]net
  • [ɡoogle[.]net]
  • ɡooɡle[.]net
  • googłe[.]online
  • gøøgle[.]online
  • googłe[.]org
  • göogle[.]org
  • [göögle[.]org]
  • góógle[.]xyz 

Appendix C: Numbers of homoglyph domain names for a series of other heavily featured brand / keyword strings

  Brand / keyword string          .com     Other gTLDs    Total
  admiral                         70       0              70
  alibaba-inc                     97       97             194
  alipay                          42       43             85
  alipay-inc                      60       60             120
  allgau                          1        30             31
  allstate                        117      0              117
  allstatecorporation             221      0              221
  allstateinsurance               256      0              256
  allstateinvestments             249      0              249
  anthropic                       36       0              36
  aresmgmt                        3,186    0              3,186
  arrow                           55       0              55
  avril                           64       0              64
  bankia                          48       0              48
  bitcoin                         41       12             53
  boursobank                      130      0              130
  brainlab                        131      0              131
  calvinklein                     172      0              172
  canva                           69       0              69
  cignahealthcare                 202      0              202
  coinbase                        166      10             176
  csileasing                      183      92             275
  divvypay                        79       0              79
  facebook                        74       9              83
  getdivvy                        69       0              69
  gmail                           49       1              50
  greentechrenewables             429      0              429
  gulfstream                      82       0              82
  hackerone                       127      0              127
  iledefrance                     73       66             139
  instagram                       66       3              69
  investwithconfidence            64       0              64
  janestreet                      198      0              198
  ledger                          38       8              46
  mailchimp                       133      1              134
  mdrbrand                        73       0              73
  mdrcyber                        73       0              73
  mdrdiscover                     96       0              96
  optelgroup                      390      0              390
  paypal                          140      8              148
  prologis                        72       0              72
  retirewithconfidence            44       0              44
  rogers                          67       0              67
  rolex                           47       0              47
  sailpoint                       40       0              40
  snowflake                       62       0              62
  sustainabilitywithsubstance     66       0              66
  tailwind                        56       56             112
  taitcommunications              46       0              46
  thecignagroup                   127      0              127
  thedebtbox                      44       0              44
  trustwallet                     30       1              31
  twosigma                        59       0              59
  united                          2,342    0              2,342
  verical                         34       0              34
  wakanime                        83       0              83
  wellington                      72       0              72
  williams-int                    87       0              87
  youtube                         46       1              47
  zoom                            35       1              36

References

[1] Second-level domain - the part of the name to the left of the dot

[2] https://czds.icann.org; all data based on the versions of the zone files downloaded on 28-Sep-2023 (1,082 TLDs)

[3] Language recognition is as per the 'DETECTLANGUAGE' function available via Google Sheets: https://support.google.com/docs/answer/3093278?hl=en

[4] https://developers.google.com/admin-sdk/directory/v1/languages

[5] https://www.w3schools.com/tags/ref_language_codes.asp

[6] https://www.kantar.com/inspiration/brands/revealed-the-worlds-most-valuable-brands-of-2023

[7] https://www.iamstobbs.com/opinion/web-dot-coms-but-once-a-year-holiday-shopping-activity-part-1-black-friday-domains

[8] https://www.xudongz.com/blog/2017/idn-phishing/

This article was first published as an e-book on 29 December 2023 at:

https://www.iamstobbs.com/idns-ebook

Thursday, 21 December 2023

The shape of things to .com: An overview of domain availability

BLOG POST

As Internet usage continues to grow, certain aspects of the underlying infrastructure - notably IP addresses and domain names - are beginning to run short of capacity in key areas. In this study, we consider the availability of registerable alphabetic[1] domains, considering the .com extension (or top-level domain, TLD) - still the most popular by a significant margin - and short domain names across the set of around 1,000 other gTLDs.

For .com, over 99.6% of all two-, three- and four-letter alphabetic domain names are already registered, and even many of the remainder are reserved or otherwise unavailable. Just under one-quarter of five-letter names are already taken, though this includes the vast majority of dictionary terms. A similar comment holds true even for domains of greater length. .net and .org also have more than 99% of all possible three-letter domain names already registered.

For two-letter domain names, seven different TLDs are at least 98% unavailable, and for one-letter domain names, there are 27 different TLDs for which all 26 possible options are already taken.

However, across the gTLD landscape in general, significant capacity does remain, with the proportion of registered domains across all ~1,000 extensions sitting at only 13%, 7%, and 3% for one-, two-, and three-letter domain names respectively, and even smaller values for longer domain names, providing a range of possibilities for prospective registrants. 
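For a sense of scale, the number of possible alphabetic names of a given length on a single TLD is 26 raised to that length. The short sketch below (function name illustrative) prints the totals for lengths one to five; on this basis, one-quarter of all five-letter names corresponds to roughly 2.97 million domains.

```python
import string

ALPHABET = string.ascii_lowercase  # the 26 characters a-z


def possible_names(length: int) -> int:
    """Number of distinct alphabetic names of the given length on one TLD."""
    return len(ALPHABET) ** length


for n in range(1, 6):
    print(n, f"{possible_names(n):,}")
```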

Following on from the analysis, the following points may be borne in mind by brand owners:

  • For registrations in accordance with the traditional preference for short, memorable .com domains, available options are significantly limited. Consequently, brand owners may need to resort to brokerage or (where IP protection permits) acquisition processes to secure their preferred domain names. Monitoring of third-party activity across the landscape of pre-registered domains, and new registrations, also remains key.
  • One associated recommendation for potential new brand owners is to select a longer, unusual and/or novel term for their brand name. This not only raises the possibility of the respective domain being available for registration, but also makes it possible to secure stronger intellectual property protection and makes the prospect of brand monitoring more straightforward.
  • Additionally, it may be wise for brand owners to consider TLDs other than .com for their primary website presence. A number of pre-existing TLDs are beginning to be 'repurposed' for alternative use, including .io (primarily for technology-related brands), .ai (for brands relating to artificial intelligence), .tv (relating to television or streaming services), and .co (as an alternative to .com for company websites). We are also seeing a continued series of new TLD launches as part of the new-gTLD programme, with a new round of applications set to launch in Q2 2026. Some brand owners may find it advantageous to consider applying to run a new dot-brand extension, giving them full control over all domains across the TLD in question. Failing this, utilisation of programmes such as the TMCH and registration-alert and blocking schemes can be an effective way of defending IP and receiving early warning of infringements[2]. As an associated point, brand owners should consider registering relevant domain names defensively across key TLDs, where they are available.
  • Finally, it may also be advantageous to secure IP within the emerging Web3 landscape. In particular, the blockchain domain ecosystem provides options across both generic extensions and dot-brands.

References

[1] i.e. those containing only the characters a-z

[2] https://www.iamstobbs.com/opinion/the-new-new-gtlds

This article was first published on 21 December 2023 at:

https://www.iamstobbs.com/opinion/the-shape-of-things-to-.com-an-overview-of-domain-availability

* * * * *

WHITE PAPER

Executive Summary

Key pieces of Internet infrastructure are beginning to run towards full capacity, in response to the rapidly increasing numbers of connected devices and Internet users. One such example is the set of available IP addresses, which is seeing a transition from the old ('IPv4') system, with 4 × 10⁹ available possibilities, to a newer ('IPv6') infrastructure, with 3 × 10³⁸ options.

Similarly, despite the essentially infinite number of possible variations, key areas of the domain name landscape (particularly short domain names across popular top-level domains (TLDs, or domain extensions)) are running short of available options for potential registrants. In this analysis, we consider the availability of alphabetic (i.e. containing only the characters a-z) .com domains generally (still the most popular extension by a significant margin), and of short domain names across the full set of other gTLDs.

The main findings of the study are as follows:

  • For .com domains, over 99.6% of all two-, three- and four-letter alphabetic domain names, and just under one-quarter of five-letter names are already taken, with several of the remainder also reserved or otherwise unavailable. Only two two-letter domains (out of 26² = 676), and only 44 three-letter domains (out of 26³ = 17,576) are not currently present in the .com zone file.
  • After .com, the most 'full' namespaces are .net and .org. All three extensions are more than 99% full for three-letter domain names, with .net actually having even lower three-letter domain name availability than .com.
  • For two-letter domain names, seven different extensions (.com, .net, .org, .biz, .law, .amsterdam and .country) are at least 98% unavailable.
  • There are 27 distinct gTLDs for which all 26 possible one-letter domains are registered.
  • However, across the full gTLD landscape (the 1,078 extensions for which zone files are available), there remains significant capacity, due to the large number of extensions available. Overall, only 13% of all one-letter domain names are already registered, 7% of two-letter domains, and 3% of three-letter domains, with the proportions continuing to drop off as the domain length increases.

Part 1: Availability of domains on the Internet's largest TLD

Introduction

It has long been known that key pieces of Internet infrastructure are beginning to run towards full capacity. IP addresses, for example - the string of four numbers between 0 and 255 representing the location of any server or other device, usually written as XX.XX.XX.XX - have a total universe of 256⁴ (2³² or 4.29 billion) possibilities which, particularly given the growth of the range of connected devices collectively known as the 'Internet of Things', will soon be insufficient for requirements. Accordingly, the intention is ultimately to transition to a new addressing system (IPv6, rather than the older IPv4), in which addresses are most usually represented as X:X:X:X:X:X:X:X, where each 'X' is a 16-bit value between 0 and 2¹⁶ − 1 (written as 'ffff' in hexadecimal), giving 2¹²⁸ (3.4 × 10³⁸, or 340 trillion trillion trillion) possible addresses in total[1].
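The address-space figures quoted above can be checked with a few lines of Python. This is only an illustrative sketch; it uses the standard-library ipaddress module as a cross-check on the raw arithmetic.

```python
import ipaddress

# Raw arithmetic: four 8-bit numbers for IPv4, eight 16-bit groups for IPv6.
ipv4_total = 256 ** 4
ipv6_total = (2 ** 16) ** 8

# Cross-check against the size of the "whole Internet" networks.
assert ipv4_total == ipaddress.ip_network("0.0.0.0/0").num_addresses
assert ipv6_total == ipaddress.ip_network("::/0").num_addresses

print(f"IPv4: {ipv4_total:,} addresses")    # IPv4: 4,294,967,296 addresses
print(f"IPv6: {ipv6_total:.2e} addresses")  # IPv6: 3.40e+38 addresses
```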

For domain names - the alphanumeric strings used as website addresses - there is a similar problem. Whilst the actual number of potential domain names (which can have second-level names (SLDs) - the part of the name to the left of the dot - up to 63 characters in length, and can consist of any alphanumeric characters and hyphens, without even counting non-Latin variants) is extraordinarily large (around 10⁹⁸ just for 63-character names on a single domain-name extension), certain areas of the domain-name landscape are already essentially 'full'.
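As a rough check on the scale of this figure, the sketch below counts the possible names of a given length. It assumes a simplified rule set that is not taken from the study: 36 alphanumeric characters plus the hyphen, with no hyphen in the first or last position. Other restrictions (such as reserved hyphen positions) are ignored, so the result is an order-of-magnitude estimate only.

```python
ALNUM = 36  # a-z and 0-9
ANY = 37    # a-z, 0-9 and the hyphen

def count_names(length: int) -> int:
    """Number of possible SLDs of exactly `length` characters (simplified rules)."""
    if length < 1:
        raise ValueError("length must be at least 1")
    if length == 1:
        return ALNUM
    # First and last characters are alphanumeric; the rest may also be hyphens.
    return ALNUM * ANY ** (length - 2) * ALNUM

total_63 = count_names(63)
print(f"{total_63:.1e}")  # a 99-digit number, i.e. of the order of 10^98
```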

Despite the continued growth in the numbers of available domain-name extensions (top-level domains, or TLDs), .com remains by far the most popular choice for organisations and other entities, and shows the greatest number of live registrations by around an order of magnitude[2], currently sitting at over 160 million. Furthermore, for many use cases there is often a preference for short, memorable domain names. Many of the shortest .com domains are already registered (or otherwise restricted), leaving few options for new users, and meaning that in many cases the existing domains are considered 'premium' and can be traded for extremely high prices.

Only three one-character .com domains are currently in existence (including x.com - recently acquired by Elon Musk following Twitter's rebrand[3] - together with q.com and z.com), and the majority of one-letter names were explicitly reserved by IANA (the Internet Assigned Numbers Authority) in the early 1990s[4]. The vast majority of other short .com domains are currently taken, with many used by major corporations for their public-facing websites and e-mail infrastructure[5]. Many instances of multi-million dollar sales of two-letter domain names have been reported[6]. Even considering domain names up to around 5 characters in length, although not all possible names are taken, it has often been reported that the vast majority of dictionary terms are no longer available[7]. Generally, domains are offered on a 'first come, first served' basis, meaning that (depending on IP protection) brands may often need to resort to acquisition processes in order to secure their preferred domain[8].

In this study, we use zone-file analysis to inspect the .com landscape, in order to consider the availability of domains. All figures are correct based on the version of the .com zone file downloaded from ICANN's Centralized Zone Data Service[9] on 11-Sep-2023. For simplicity in this study, we look only at domains containing (Latin) alphabetic characters (a-z) (sometimes written as 'LLL', for three-letter domains, for example), although numeric domains are also popular in the domain-name industry, particularly in regions such as China, where their use can circumvent language barriers and particular numbers can have special cultural significance.
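The general idea of zone-file analysis can be sketched as follows. This is a minimal illustration, not the method used in the study: it assumes each record is a whitespace-separated line whose first field is a fully qualified owner name ending in '.com.', and the function name and sample records are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, Set

ALPHABETIC = re.compile(r"^[a-z]+$")

def alphabetic_slds(zone_lines: Iterable[str], tld: str = "com") -> Set[str]:
    """Collect the distinct a-z-only SLDs seen in zone-file records."""
    suffix = f".{tld}."
    slds = set()
    for line in zone_lines:
        fields = line.split()
        if not fields:
            continue
        owner = fields[0].lower()
        if not owner.endswith(suffix):
            continue  # comments, the TLD apex, or other names
        # Keep only the label directly to the left of the TLD.
        sld = owner[: -len(suffix)].rsplit(".", 1)[-1]
        if ALPHABETIC.match(sld):
            slds.add(sld)
    return slds

# Made-up sample records, for illustration only.
sample = [
    "example.com.\t172800\tin\tns\tns1.example.net.",
    "example.com.\t172800\tin\tns\tns2.example.net.",
    "ns1.example.com.\t172800\tin\ta\t192.0.2.1",
    "x.com.\t172800\tin\tns\tns1.example.net.",
    "abc123.com.\t172800\tin\tns\tns1.example.net.",
]
names = alphabetic_slds(sample)
by_length = Counter(len(name) for name in names)
print(sorted(names), dict(by_length))
```

Using a set means that a domain with several name-server records is counted once, and names containing digits or hyphens are left out, in line with the alphabetic-only scope described above.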

Analysis

Table 1 and Figures 1 - 3 show the total numbers of registered domains for each domain length, based on their inclusion in the ICANN .com zone file. In each case, these values are also expressed as a proportion of the total 'pool' of available domain names (where the total possible number of n-length domain names is 26ⁿ).

n (SLD length)  Possible no.            No. registered  No. unregistered        % registered
1               26                      3               23                      11.54 %
2               676                     674             2                       99.70 %
3               17,576                  17,532          44                      99.75 %
4               456,976                 455,325         1,651                   99.64 %
5               11,881,376              2,744,780       9,136,596               23.10 %
6               308,915,776             5,779,635       303,136,141             1.87 %
7               8,031,810,176           6,828,656       8,024,981,520           0.085 %
8               208,827,064,576         8,482,121       208,818,582,455         0.0041 %
9               5,429,503,678,976       9,609,092       5,429,494,069,884       0.00018 %
10              141,167,095,653,376     10,658,182      141,167,084,995,194     0.0000076 %
11              3,670,344,486,987,776   10,963,997      3,670,344,476,023,779   0.00000030 %
12              95,428,956,661,682,176  10,843,645      95,428,956,650,838,531  0.000000011 %

Table 1: Statistics for .com domains of SLD length n characters

Figure 1: Total numbers of registered .com domains for each SLD length (n characters)

Figure 2: Total numbers of unregistered .com domains for each SLD length (n characters)

Figure 3: Proportion of all possible domain names registered for each SLD length (n characters)

For two-, three- and four-letter alphabetic domain names, over 99.6% of the available names are already registered. Just under one quarter of the five-letter domain names are taken and beyond this, although the absolute number of registered domains continues to rise up to an SLD length of 11 characters, the proportion of the namespace which is registered drops off rapidly, due to the exponential growth in the number of names available as the SLD length increases.
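The derived columns of Table 1 follow directly from the registered counts. The short sketch below recomputes them for the first few lengths; the registered counts are copied from Table 1, and the function name is an illustrative choice.

```python
# Registered .com counts by SLD length n, copied from Table 1.
registered = {1: 3, 2: 674, 3: 17_532, 4: 455_325, 5: 2_744_780, 6: 5_779_635}

def availability(n: int, n_registered: int):
    """Return (possible, unregistered, percent registered) for length n."""
    possible = 26 ** n  # every a-z string of length n
    return possible, possible - n_registered, 100 * n_registered / possible

for n, count in registered.items():
    possible, free, pct = availability(n, count)
    print(f"n={n}: {possible:,} possible, {free:,} unregistered, {pct:.2f} % registered")
```

The pool grows by a factor of 26 with each extra character, while the registered counts grow far more slowly, which is why the percentage collapses beyond five letters.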

For the two-letter domains, only two (dm.com and jh.com) (out of 26² = 676) are not currently registered. With three letters, all but 44 combinations (out of a possible set of 26³ = 17,576) are registered.

The following is a list of all three-letter strings which are not currently registered as .com domains:

  • baq
  • bfh
  • btz
  • bzg
  • ciz
  • eth
  • exu
  • fkd
  • gdy
  • hfh
  • ilq
  • jig
  • jrx
  • kgr
  • kkk
  • mag
  • ndq
  • njq
  • nnr
  • oys
  • pbq
  • pqk
  • pwe
  • qag
  • qgt
  • qvz
  • qzk
  • rfc
  • ruu
  • sfj
  • soe
  • sok
  • trc
  • ucl
  • wxa
  • xjz
  • xkd
  • xko
  • ykn
  • ykz
  • zig
  • zip
  • zkb
  • zkn

This means that all three-letter combinations beginning with 'a', 'd', 'l' and 'v' are taken. Amongst the unregistered strings, some have particular relevance. 'kkk.com', for example, most recently expired in October 2022 and was subsequently offered for sale via GoDaddy Auctions. By mid-November, the domain had received a high bid of nearly $100,000, before being withdrawn from sale and blocked following concerns about the domain's possible association with the Ku Klux Klan[10].
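A list of this kind can be produced by comparing the set of registered names against every possible string of the given length. The sketch below is illustrative only: the helper name is hypothetical and the two-letter 'registered' set is constructed artificially. The 44 strings are those listed above, and are used to confirm which initial letters are fully taken.

```python
from itertools import product
from string import ascii_lowercase

def unregistered(registered: set, n: int) -> list:
    """All a-z strings of length n that are absent from `registered`."""
    candidates = ("".join(chars) for chars in product(ascii_lowercase, repeat=n))
    return sorted(name for name in candidates if name not in registered)

# Artificial two-letter example: everything registered except dm and jh.
all_two = {"".join(chars) for chars in product(ascii_lowercase, repeat=2)}
print(unregistered(all_two - {"dm", "jh"}, 2))  # ['dm', 'jh']

# The 44 unregistered three-letter strings listed in the text.
missing = (
    "baq bfh btz bzg ciz eth exu fkd gdy hfh ilq jig jrx kgr kkk mag "
    "ndq njq nnr oys pbq pqk pwe qag qgt qvz qzk rfc ruu sfj soe sok "
    "trc ucl wxa xjz xkd xko ykn ykz zig zip zkb zkn"
).split()

# Initial letters with no unregistered string are completely taken.
initials = {name[0] for name in missing}
fully_taken = [c for c in ascii_lowercase if c not in initials]
print(len(missing), fully_taken)  # 44 ['a', 'd', 'l', 'v']
```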

Conclusion

The analysis shows the very low availability of unregistered short .com domain names. This finding is of great significance for organisations looking to launch new brands with a website presence: they may need to purchase pre-existing domains, use longer or unusual brand or domain names, use brand variants or keywords in the domain name rather than just the brand name itself, or look to TLDs other than .com for their primary website presence. The domain-name industry is already seeing growth in the popularity of other domain-name extensions, such as .io, .ai, .tv and .co, as a reflection of this fact. Also relevant is the ongoing new-gTLD programme, which has seen the launch of over 1,000 new extensions since its start in 2012, continues to generate new releases (with around a dozen in 2023, including .kids, .zip, .box and .music)[11], and has a new round of applications scheduled to begin in 2026[12]. These developments have significantly increased the available domain-name space, and we may also see a growth in the popularity of dot-brand extensions.

A related point to consider is that new companies may be wise to select novel or invented terms for their brand names. Not only does this raise the probability that the associated domain names will be available, but it also has the added benefits of being able to secure stronger intellectual property protection, and making brand monitoring more straightforward and less subject to the difficulties associated with the detection of 'false positives'.

Part 2: Availability of short domains across the gTLD landscape

Introduction

In Part 1, we considered the availability of .com domains of various lengths, finding that, for two-, three- and four-character alphabetic domain names, over 99.6% of the total universe of possible domain names are already registered. In Part 2, we extend the same ideas to look at the availability of short domains across the full range of gTLDs (generic top-level domains, or domain extensions), using the 1,078 zone files available via ICANN's Centralized Zone Data Service as of 11-Sep-2023.

Analysis

Figure 4 shows the proportion of the set of all possible domain names which are currently already registered, for each gTLD, for one- to six-character alphabetic domain names (i.e. those consisting only of Latin alphabet characters). The TLDs are sorted by the total number of one- to six-character registered domains.

Figure 4: Proportion of the set of all possible domain names which are already registered, for each of the top 40 gTLDs, as a function of second-level domain name (SLD) length (n characters)

A number of top-level observations are apparent:

  • Overall, .com is by far the most 'full' namespace, with 23.1% of all possible 5-letter domains and 1.9% of 6-letter domains registered (followed next by .net in both cases, with 3.7% and 0.2%, respectively).
  • For 3-letter domain names, .net, .com and .org are all more than 99% full; .net actually has even lower availability than .com (17,537 domains registered out of a possible 17,576, compared with 17,532 for .com).
  • For 2-letter domain names, the .com, .net, .org, and .biz namespaces are all at least 98% taken (in addition to .country (99.26%), .law (98.67%) and .amsterdam (98.22%)).
  • There are 27 gTLDs for which all 26 possible one-letter domains are registered. These extensions are: .biz, .ltd, .icu, .digital, .company, .wtf, .fyi, .cool, .run, .capital, .berlin, .law, .casa, .beer, .fashion, .hamburg, .wales, .srl, .country, .wedding, .cymru, .garden, .luxury, .irish, .esq, .abogado and .prof.
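Per-extension figures such as these can be derived once the alphabetic SLDs for each TLD have been extracted. The following is a minimal sketch with made-up data; the TLD labels 'alpha' and 'beta' and both function names are hypothetical.

```python
from string import ascii_lowercase
from typing import Dict, List, Set

LETTERS = set(ascii_lowercase)

def taken(slds: Set[str], n: int) -> int:
    """Number of registered a-z-only SLDs of length n."""
    return sum(1 for s in slds if len(s) == n and set(s) <= LETTERS)

def fill_rate(slds: Set[str], n: int) -> float:
    """Percentage of the 26**n possible names of length n that are registered."""
    return 100 * taken(slds, n) / 26 ** n

def full_tlds(zones: Dict[str, Set[str]], n: int = 1) -> List[str]:
    """TLDs in which every possible alphabetic name of length n is registered."""
    return sorted(tld for tld, slds in zones.items() if taken(slds, n) == 26 ** n)

# Made-up data: 'alpha' has all 26 one-letter names, 'beta' has only two.
zones = {
    "alpha": set(ascii_lowercase) | {"shop"},
    "beta": {"a", "b", "q1", "shop"},
}
print(full_tlds(zones))                      # ['alpha']
print(f"{fill_rate(zones['beta'], 1):.2f}")  # 7.69
```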

Considering all gTLDs together, it is also possible to calculate the total proportion of all possible domain names of length n which are registered (where the total number of possible names is 26ⁿ × T, where T is the total number of gTLDs - in this case, 1,078) (Table 2 and Figure 5).

n (SLD length)  Possible no.      No. registered  % registered
1               28,028            3,584           12.79 %
2               728,728           51,322          7.04 %
3               18,946,928        622,712         3.29 %
4               492,620,128       2,394,695       0.49 %
5               12,808,123,328    5,667,908       0.044 %
6               333,011,206,528   9,549,608       0.0029 %

Table 2: Statistics for all gTLD domains of SLD length n characters

Figure 5: Proportion of all possible domain names registered for each SLD length (n characters)
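The percentages in Table 2 can be reproduced from the registered counts and the number of gTLDs. In the sketch below the counts and T = 1,078 are copied from the text; the variable and function names are illustrative.

```python
T = 1_078  # number of gTLDs with zone files available

# Registered alphabetic domains across all gTLDs, by SLD length (Table 2).
registered_all = {
    1: 3_584, 2: 51_322, 3: 622_712,
    4: 2_394_695, 5: 5_667_908, 6: 9_549_608,
}

def percent_registered(n: int, count: int, tlds: int = T) -> float:
    """Share of the 26**n x T possible names of length n that are registered."""
    return 100 * count / (26 ** n * tlds)

for n, count in registered_all.items():
    print(f"n={n}: {26 ** n * T:,} possible, {percent_registered(n, count):.4f} % registered")
```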

Conclusion

Although there are some subsets of the domain-name landscape which are nearing capacity (notably .com, .net and .org for two- and three-letter domains), the overall landscape is by no means full. Even for highly desirable three-letter domain names, only around 3% of all possible names are taken, when considering the full set of gTLDs. As discussed in the first part of the study, the likelihood is that brand owners may simply need to reassess their requirements when looking to acquire business domain names, and perhaps set their expectations away from the traditional .com environment.

References

[1] 'Brand Protection in the Online World: A Comprehensive Guide' by David N. Barnett - Box E.2: 'IP addresses'

[2] https://research.domaintools.com/statistics/tld-counts/

[3] https://www.iamstobbs.com/opinion/x-trademarks-the-spot-not-a-textbook-example-of-a-successful-rebranding-exercise

[4] https://www.quora.com/Why-are-there-no-single-letter-domain-names

[5] https://smartbranding.com/ll-type-domains/

[6] https://www.globenewswire.com/en/newsrelease/2019/04/29/1811388/9865/en/Coveted-Two-Letter-Domain-Name-Potentially-Worth-Millions-to-Auction-Exclusively-on-NameJet.html

[7] https://www.quora.com/Have-all-5-character-com-domain-names-been-taken

[8] https://nz.news.yahoo.com/world-running-domain-names-gone-130011203.html

[9] https://czds.icann.org/

[10] https://domaininvesting.com/godaddy-cancels-kkk-com-expiry-auction/

[11] https://www.iamstobbs.com/opinion/music-to-brand-owners-ears

[12] https://www.iamstobbs.com/opinion/the-new-new-gtlds

This article was first published as an e-book on 21 December 2023 at:

https://www.iamstobbs.com/availability-of-domains-ebook
