David Barnett's Brand Protection Articles: The randomest domain names: entropy as an indicator of TLD threat level

by David Barnett and Richard Ferguson

Introduction

Domain registrations and abuse have had something of a renaissance in recent years, with increases in the numbers of people working from home and shopping online giving rise to countless opportunities for scammers. However, with almost 1,600^[1] different top-level domains (TLDs, or domain extensions) to choose from, it can be difficult for brand owners to identify which TLDs to register across - indeed, the annual cost of owning a domain portfolio can soon spiral. Beyond the simple consideration of which TLDs are the 'best fit' for a brand's area of interest based on name alone (e.g. .shop for an online retailer), a statistical analysis of the most extensively abused TLDs can also provide further insights.

This post analyses a wide set of TLDs to assess whether patterns in the length and randomness of domain names show any correlation with other, independent estimates of the level of threat associated with different domain extensions.

Primer

The universe of registered domains includes large numbers in which the domain name consists just of long, apparently random strings of characters. Several previous studies have suggested that these types of domains are often associated with fraudulent or malicious activity, such as phishing (where the domains can be used in the generation of deceptive URLs) or the distribution of malware. In many cases, bad actors create these domain names using automated domain name generation algorithms, and register them automatically^[2,3].

The existence of domains potentially set up for underhand purposes can be analysed through consideration of a parameter known as Shannon entropy, which provides a measure of the amount of information stored in a string of characters - broadly, long domain names, and/or those containing large numbers of distinct characters (such as the random domain names discussed here), will have high entropy^[4].
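As an illustration, the entropy of a single domain name can be sketched as below. This is a minimal sketch rather than the exact calculation used in the study: it assumes the standard character-frequency definition of Shannon entropy with a base-2 logarithm (bits per character), applied to the SLD only. The function names `shannon_entropy` and `sld` are our own.

```python
import math
from collections import Counter


def shannon_entropy(name: str) -> float:
    """Shannon entropy of a string, in bits per character.

    Assumption: H = -sum(p * log2(p)) over the relative frequency p
    of each distinct character in the string.
    """
    if not name:
        return 0.0
    total = len(name)
    counts = Counter(name)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sld(domain: str) -> str:
    """Return the part of the domain name before the first dot."""
    return domain.split(".", 1)[0]


if __name__ == "__main__":
    for domain in ("000.bayern", "0008cp8d8h7jgqmddh0kciot4gousac0.bayern"):
        print(domain, round(shannon_entropy(sld(domain)), 3))
```

Under this definition, a name made up of one repeated character scores zero, whilst a long name drawing on many distinct characters scores highly, which matches the behaviour described above.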

The entropy of domains differs between TLDs, with some showing a markedly greater frequency of long, random domain names than others. For example, in a previous blog post^[5], we discussed how the set of new .zip domains contains many more high-entropy (long, random) names than other TLDs. All other factors being equal, this might suggest that TLDs such as .zip are more prone to abuse by online bad actors.

Analysis

For the study, we consider the set of domain zone files published by ICANN^[6], which covers gTLDs (.com, .net, etc.) and new-gTLDs (.top, .xyz, .online, etc.). In total, the dataset covers approximately 1,050 TLDs. For each TLD, the mean domain name entropy value is calculated across all domains registered with that extension. Small TLDs (those where fewer than 100 domains are registered) have been excluded from the analysis, as the results are deemed to be of lower significance; this leaves a dataset of 576 TLDs. The results are shown in Table 1 and Figures 1 and 2.
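The per-TLD aggregation described above might look something like the following. It is a sketch under stated assumptions, not the pipeline used for the study: the input is assumed to be an iterable of unique registered domain names already extracted from the zone files (in the form `name.tld`, with or without a trailing dot), entropy is assumed to be character-frequency Shannon entropy in bits, and the function names are hypothetical. The 100-domain threshold is the one stated in the text.

```python
import math
from collections import Counter, defaultdict

MIN_DOMAINS = 100  # TLDs with fewer registered domains are excluded


def shannon_entropy(name):
    """Character-frequency Shannon entropy in bits (assumed definition)."""
    if not name:
        return 0.0
    total = len(name)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(name).values())


def mean_entropy_by_tld(domains, min_domains=MIN_DOMAINS):
    """Map each TLD to (mean SLD entropy, number of domains).

    TLDs with fewer than `min_domains` names are left out of the result.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for domain in domains:
        # Split on the last dot: everything before it is treated as the name.
        name, _, tld = domain.lower().rstrip(".").rpartition(".")
        if not name:
            continue  # skip bare TLD entries and empty lines
        totals[tld] += shannon_entropy(name)
        counts[tld] += 1
    return {tld: (totals[tld] / n, n)
            for tld, n in counts.items() if n >= min_domains}


def top_tlds(result, k=30):
    """Return the k TLDs with the greatest mean entropy, highest first."""
    return sorted(result.items(), key=lambda item: item[1][0], reverse=True)[:k]
```

The second element of each result tuple corresponds to the N column in Table 1.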

TLD	Mean entropy	N
bayern	3.578820	60,318
crs	3.556059	1,144
man	3.548192	361
nrw	3.543092	36,313
xn--mgbca7dzdo	3.533396	117
gov	3.524858	19,542
goog	3.470524	543
med	3.461878	69,735
page	3.461800	102,978
eus	3.444771	27,950
mov	3.419044	6,724
esq	3.417947	3,565
amsterdam	3.416103	41,989
rsvp	3.415646	4,572
channel	3.408561	631
swiss	3.404208	37,801
dev	3.396982	769,971
app	3.394302	1,274,223
abudhabi	3.390945	2,060
zip	3.389665	30,223
google	3.380865	318
top	3.362711	4,512,204
komatsu	3.359931	133
day	3.353672	20,345
kyoto	3.326108	2,042
nexus	3.323493	2,250
how	3.320968	7,987
radio	3.319183	5,793
soy	3.317902	3,467
phd	3.312976	2,793

Table 1: Top 30 TLDs with greatest mean domain name entropy (N = no. of domains in dataset)

Figure 1: Top 30 TLDs with greatest mean domain name entropy

Figure 2: Bottom 30 TLDs by mean domain name entropy

The highest-entropy TLDs can indeed be seen through visual inspection to contain disproportionately high numbers of long, random domain names, with significant numbers of 32-character examples (Figure 3). The reason for this exact number (compared with the absolute maximum possible length for an SLD^[7] of 63 characters) is not clear; however, it was the greatest length historically considered to be 'good practice'^[8] for a domain name and can (depending on usage and provider) be a value beyond which functionality limitations may apply. The value may also be related to the type of algorithm(s) used to automatically generate the domain names, or the functionality available through the registrars utilised.

The alphabetical list of .bayern domains (the highest-entropy TLD in the dataset), for example, begins:

000.bayern
0008cp8d8h7jgqmddh0kciot4gousac0.bayern
002s0ldfq8l8uo0qr63fbtnjirgc2058.bayern
003v242nno6b91ppgtfr54rc820dvkqu.bayern
0057tcga35h7en9cro4vtbqr2sual0ju.bayern
0070fq4boldtihbvangusggq5r4jc8u7.bayern
0077bcqmb64p5odoa0pfhedmuv8nrdo9.bayern
007dqkp5jvh8qn7b8m5i3tlrgcm3t5cl.bayern
007dv5edpr3rgpam4lnlq6v6147hdbub.bayern
0081mlfvlec3qj5m508633l9sjvbsiph.bayern
00846bmbh82ovq0n1kr78jc97c3dhh7e.bayern
009a705ptm7dfi1uk37kfmkp5dqec1lo.bayern
00a71os7ja4mrjcg32hvs4tcgephthpr.bayern
00amv24rasudpcoj4ddniqujf4qd00ha.bayern
00b8jv3gs972inad2cipm20gqvohmn0v.bayern
00bu3lvu54afr3egplojrpamqu4onhck.bayern
00clcm817v8sra5aqpcru0u8t5lrcjti.bayern
00dfkkjfmhpqll6ladjs3tqlpaqhuijc.bayern
00espnkvp4ohdq7dm35o7v4po4rpm4bp.bayern
00f2n0s19mqn3s34ij3rpnju85arfth8.bayern

Figure 3: Numbers of .bayern domains, by domain name (SLD) length
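A length distribution of the kind plotted in Figure 3 can be tallied with a few lines of code. The sketch below is illustrative only; it assumes a plain list of domain names as input, uses the first four entries from the listing above as sample data, and the function name `sld_length_counts` is our own.

```python
from collections import Counter


def sld_length_counts(domains):
    """Count domains by SLD length (characters before the first dot)."""
    counts = Counter(len(domain.split(".", 1)[0]) for domain in domains)
    return dict(sorted(counts.items()))


sample = [
    "000.bayern",
    "0008cp8d8h7jgqmddh0kciot4gousac0.bayern",
    "002s0ldfq8l8uo0qr63fbtnjirgc2058.bayern",
    "003v242nno6b91ppgtfr54rc820dvkqu.bayern",
]

print(sld_length_counts(sample))  # {3: 1, 32: 3}
```

Run over a full zone file, a pronounced spike at a single length (such as 32) would stand out immediately in the resulting counts.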

It is also instructive to compare the mean entropy for each TLD with previous estimates of the general level of risk associated with that TLD, considering factors such as the frequency of their use in phishing, spam, and malware. In one such study^[9], TLDs were allocated a normalised 'threat frequency' score (between 0 and 1), based on threat statistics taken from a range of independent datasets. Figure 4 shows a comparison between the mean entropy of the domains for each TLD, and the threat score from this previous study, for all TLDs present in both datasets.

Figure 4: Comparison between mean domain name entropy (this study) and normalised threat frequency score (previous study) for each TLD

Across the full set of TLDs, the correlation between the two datasets is very weak (a positive coefficient of +0.07). However, there is a suggestion that the highest-entropy TLDs (those with a mean entropy value of > 3.2) do tend to sit at the higher end of the risk spectrum (threat score > approx. 0.2). This is at least consistent with the assertion that higher-entropy domain names (and the TLDs with which they are more frequently associated) are more likely to be linked to a range of classes of fraudulent and malicious activity.
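For readers wishing to run a similar comparison, a correlation coefficient between two per-TLD scores can be computed as sketched below. The text does not state which coefficient was used, so this sketch assumes the Pearson (linear) coefficient; the function names and the dictionary-based inputs (TLD mapped to score) are hypothetical, and only TLDs present in both datasets are compared.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def correlate_tlds(entropy_by_tld, threat_by_tld):
    """Correlate mean entropy with threat score over TLDs in both datasets."""
    shared = sorted(entropy_by_tld.keys() & threat_by_tld.keys())
    return pearson([entropy_by_tld[t] for t in shared],
                   [threat_by_tld[t] for t in shared])
```

A rank-based coefficient (such as Spearman's) would be a reasonable alternative if the relationship is expected to be monotonic rather than linear.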

Conclusions

Previous research suggests that long, random (high-entropy) domain names are more likely to be associated with automated algorithmic registrations, and to be used for malicious activity. It is also noteworthy that many of the most suspicious domain names are (exactly) 32 characters in length.

Certain domain extensions are associated with greater proportions of high-entropy domains, and the top 30 TLDs (by mean entropy) include a number of popular extensions such as .top (4.5m domains), .app (1.3m) and .page (103k). The additional finding that many of these same TLDs are generally found to be more frequently associated with phishing, spam, and malware is suggestive of a correspondence between mean domain entropy and overall level of risk for a particular TLD.

Quantitative studies such as this can help inform and validate brand protection strategies, especially when overlaid with qualitative analysis (such as consideration of what string the domain extension itself actually is, in terms of a keyword or description). This assessment provides guidance not just on which domains to register, but also which domain extensions warrant attention when monitoring, and prioritisation when enforcing. The Internet isn’t getting any smaller, but combining metrics can help with zoning in on targets.

References

[1] https://www.iana.org/domains/root/db

[2] https://circleid.com/posts/20230703-an-overview-of-the-concept-and-use-of-domain-name-entropy

[3] https://www.splunk.com/en_us/blog/security/random-words-on-entropy-and-dns.html

[4] https://www.linkedin.com/pulse/investigating-use-domain-name-entropy-clustering-results-barnett/

[5] https://www.iamstobbs.com/opinion/un-.zip-ping-and-un-.box-ing-the-risks-associated-with-new-tlds

[6] https://czds.icann.org/home

[7] The SLD (second-level domain name) is the part of the domain name before the dot

[8] https://docs.oracle.com/cd/E19683-01/806-4077/6jd6blbdi/index.html

[9] https://circleid.com/posts/20230117-the-highest-threat-tlds-part-2

This article was first published on 11 September 2023 at:

https://www.iamstobbs.com/opinion/the-randomest-domain-names-entropy-as-an-indicator-of-tld-threat-level