David Barnett's Brand Protection Articles: Investigating the use of domain-name entropy for clustering results

by Lan Huang and David Barnett

Introduction

The importance of being able to 'cluster' together similar or connected brand infringements has been noted in numerous studies^[1]. Clustering has a number of benefits, including the ability to identify serial infringers for prioritised enforcement action, to reveal instances of bad-faith activity, and to enable efficient bulk enforcement actions.

A key associated idea is the concept of quantifying threat - i.e. determining which domains (or other results) may pose the greatest potential for infringing use in the future, even where no content is currently present - allowing prioritisation of results for initial analysis, enforcement or tracking for content changes.

Expanding on these ideas, previous work^[2] has revealed that large coordinated infringement or attack campaigns (such as the registration of domains for use in spamming activity, malware distribution or botnet creation) are often associated with batches of domains purchased through registrars who offer easy access to bulk registrations using automated algorithms. These registrations can be generated via automated recommendations by the registrar, or through the upload of lists of requested domain names. In many cases, the domain names used for these purposes may contain no meaningful keywords (appearing just as random strings of characters), and may be very long. It is also noteworthy that the use of (pseudo-)random domain names may be beneficial to bad actors, as they are unlikely to contain brand terms and are therefore more difficult to detect using classic brand-monitoring techniques.

In order to explore domain registrations of this type, we utilise the concept of Shannon entropy^[3]. This is a concept from information theory, used to quantify the amount of information (or 'surprise') stored in a string or, equivalently, the average number of bits per character needed to optimally encode the string (i.e. a lower bound). In this study, we apply the idea to domain names by calculating the Shannon entropy of the second-level domain (SLD) name string^[cf. 4] (i.e. the part of the domain name before the dot, excluding the TLD, or domain extension)^[5]. Broadly, this means that domain names which are short and/or contain many repeated characters will have low entropy, and domain names which are longer and/or contain large numbers of distinct characters will have high entropy. Our hypothesis is that a batch of domains registered for a coordinated campaign, using a specific algorithm to generate the domain names, will tend to cluster together, on the basis that they will share a common registration date and registrar, and will have similar entropy values. Overall, long random domain-name strings associated with automated registrations will tend to have high entropy values.
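
As a minimal sketch of the calculation described above (and defined in note [5]), the per-character Shannon entropy of an SLD string can be computed from its character frequencies. The function name `shannon_entropy` is illustrative rather than taken from the study, and the sketch assumes that every character in the string (including digits and hyphens) is counted; the handling of individual characters in the original analysis may differ.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the per-character Shannon entropy of `text`, in bits."""
    if not text:
        return 0.0
    total = len(text)
    # p * log2(1/p) is the same quantity as -p * log2(p), written this way
    # so that a single repeated character yields 0.0 rather than -0.0.
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(text).values()
    )


# SLD strings taken from Tables 1 and 2
for sld in ("abcdefghijklmnopqrstuvwxyz", "000000004", "ooooo"):
    print(f"{sld}: {shannon_entropy(sld):.3f}")
# abcdefghijklmnopqrstuvwxyz: 4.700
# 000000004: 0.503
# ooooo: 0.000
```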

Methodology and analysis

In order to look more closely at these ideas, we consider the case study of all domains registered on a particular day (13-Dec-2022), using zone-file information. This dataset consists of approximately 205,000 domains - however, for simplicity we exclude from the analysis those featuring non-Latin characters (i.e. Punycode, or homoglyph, domains - accounting for 0.6% of the total), and focus on the remainder, consisting of domain names containing the characters a-z, 0-9, and the hyphen ('-').
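
The exclusion step described above might be sketched as follows. This is an illustration under stated assumptions, not the filter actually used in the study: Punycode labels are identified by their standard `xn--` prefix (an explicit check is needed because the encoded form itself uses only a-z, 0-9 and hyphens), and the function names and sample list are hypothetical.

```python
import re

# Assumption: an SLD is retained only if it consists solely of a-z, 0-9
# and '-', and is not a Punycode-encoded ("xn--") label.
ALLOWED_SLD = re.compile(r"[a-z0-9-]+")


def is_plain_latin_sld(sld: str) -> bool:
    """True if the SLD would be kept in the analysis under the rule above."""
    label = sld.lower()
    if label.startswith("xn--"):
        return False
    return ALLOWED_SLD.fullmatch(label) is not None


def split_dataset(slds):
    """Split SLD strings into (kept, excluded) lists."""
    kept = [s for s in slds if is_plain_latin_sld(s)]
    excluded = [s for s in slds if not is_plain_latin_sld(s)]
    return kept, excluded


# Illustrative input: two SLDs from Table 2 plus one Punycode label
sample = ["00-00", "ooooo", "xn--mnchen-3ya"]
kept, excluded = split_dataset(sample)
print(f"kept {len(kept)}, excluded {len(excluded)} "
      f"({100 * len(excluded) / len(sample):.1f}% of the sample)")
```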

Across the dataset, the domains are associated with a range of entropy values, from 0.000 to 4.700, as shown in Figure 1.

Figure 1: Distribution of Shannon entropy values for the set of all (non-Punycode) domains registered on a single day (divided into entropy-value 'bins' of width 0.1)
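
A distribution of the kind summarised in Figure 1 can be produced by assigning each entropy value to a bin of width 0.1 and counting the members of each bin. The following is a sketch only; the function names are hypothetical, and the small tolerance added before flooring is an assumption made to avoid floating-point values such as 0.3 / 0.1 landing in the bin below.

```python
import math
from collections import Counter

BIN_WIDTH = 0.1


def bin_index(entropy: float, width: float = BIN_WIDTH) -> int:
    """Index of the bin [k * width, (k + 1) * width) containing `entropy`."""
    # The tolerance guards against results like 2.9999999999999996.
    return math.floor(entropy / width + 1e-9)


def entropy_histogram(entropies, width: float = BIN_WIDTH) -> dict:
    """Map each occupied bin's lower edge to the number of values in it."""
    counts = Counter(bin_index(h, width) for h in entropies)
    return {round(k * width, 1): counts[k] for k in sorted(counts)}


# Entropy values taken from Tables 1 and 2
values = [0.000, 0.000, 0.258, 0.503, 0.503, 4.642, 4.700]
for lower_edge, n in entropy_histogram(values).items():
    print(f"{lower_edge:.1f}-{lower_edge + BIN_WIDTH:.1f}: {n}")
```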

The top and bottom domain names in the dataset (by entropy values) are shown in Tables 1 and 2.

SLD-name string	TLD (domain extension)	SLD length (chars.)	Shannon entropy
abcdefghijklmnopqrstuvwxyz	space	26	4.700
viqxacb7wo6l3hfujw3agf3stcce6eenl4kovfza3rzri4gwyxg6auid	com	56	4.642
b4su4qo65fkefg3cpd5muxwekbn4vx6fr7ieroavxqwco2xrqmrrwlad	com	56	4.591
oz5winfavnvbmgdspa633wdnpmbjjrp6crwutyt4uxgxkvytbjdmdc	com	54	4.569
hydraclubbioknikokex7njhwuahc2l67lfiz7z36md2jvopda7nchidshop	com	60	4.550
q374uuwdlgtkveh2acqi6ubhic4m3bnwb32kc2yqmxf2ilv36leujnid	com	56	4.539
mekck2mf2uju3ssjl2woyddfrunwcnevfql3imp4tfr3z6wmjmo4jvid	com	56	4.497
facebook-domain-verificationyx7q3wstorn4idf9xqtzdz842q0b6x	com	58	4.475
vh6bjre5lw9iuegs1b9fspitswrdnbtsm1emunvlulbo6uc0	top	48	4.470
skjcd-98729871cnf5bnb8ewr2e-vq438vnjy0mtg1mdcumty2n	xyz	51	4.464
a3n3mq7c3xl7u4mfvhhjyjz2x7lqd7sf5jfm66mhf33fxlyodb5pibyd	com	56	4.455
osli77ygq5myyquqzc2sva7wgnjc2m7yozz67k3kkgkrync4puw3cqyd	com	56	4.439
7tl2qxwot624do6kbkvqwsg6knaz6jnlx5kfktni7bzt3qlo4imk4tqd	com	56	4.412
q6g5o01vsfyw95all7x1krjdki	com	26	4.393
12mnbvcxzasdfghjklpoi	com	21	4.392
qwertyuiop12asdfghjkkl	com	22	4.369
owsyeuxoyy4qtm4bkazrkxjtzhydedxgoxkd2yqddmxgcjevmhnbenyd	com	56	4.327
metanamepdomainverifycontent38656d7bbe27c23f182336255	com	53	4.306
fqskypondteieqxoxgizamgqrwlb	info	28	4.280
zaqwsxcderfvbgtgbnhy12	com	22	4.278
vij8q5xcentralr2hm910v	sbs	22	4.278

Table 1: Top domain names by entropy values

SLD-name string	TLD (domain extension)	SLD length (chars.)	Shannon entropy
n	camp	1	0.000
d	supplies	1	0.000
s	camera	1	0.000
4	flights	1	0.000
n	reise	1	0.000
n	clinic	1	0.000
0	condos	1	0.000
9	events	1	0.000
9	photography	1	0.000
rr	center	2	0.000
cc	degree	2	0.000
999	guide	3	0.000
7777	best	4	0.000
44444	tel	5	0.000
88888	tel	5	0.000
ooooo	events	5	0.000
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv	vip	63	0.000
00-00	co.uk	5	0.258
000000004	xyz	9	0.503
000000002	xyz	9	0.503
000000002	com	9	0.503

Table 2: Bottom domain names by entropy values

In order to look more closely at the characteristics of the highest-entropy domains, we conducted a deeper dive into the top 1,000 domain names within the dataset (those with entropy values of 3.823 and above, incorporating the majority of the examples which appear visually to consist of 'random' strings).

Of these top 1,000 domain names:

847 (84.7%) have active A records (i.e. are associated with a specific IP address, and potentially live website content)
275 (27.5%) have active MX records (i.e. are configured to handle e-mail, and may therefore be associated with phishing activity)
777 (77.7%) use domain privacy services or have redacted registration information (see also Table 3), suggesting an attempt by the domain owners to mask their identity, which could indicate nefarious intentions^[6]

The top registrants and registrars represented within the dataset (for those domains where whois information is available) are shown in Tables 3 and 4.

Registrant	No. domains
Domains By Proxy, LLC	219
REDACTED FOR PRIVACY	197
Privacy service provided by Withheld for Privacy ehf	82
Contact Privacy Inc. Customer 7151571251	75
c/o whoisproxy.com	36
Privacy Protect, LLC (PrivacyProtect.org)	29
Wix.com Ltd.	15
PrivacyGuardian.org llc	11
Data Protected	9
Domain Protection Services, Inc.	7

Table 3: Top registrants

Registrar	No. domains
GoDaddy.com LLC	219
DYNADOT LLC	128
Google LLC	78
NAMECHEAP INC	78
TUCOWS, INC.	53
Key-Systems GmbH	43
Wix.Com Ltd.	30
PDR Ltd. d/b/a PublicDomainRegistry.com	16
Tucows Domains Inc.	13
Atak Domain Hosting	12
Ionos SE	12

Table 4: Top registrars

The top registrars within the dataset are almost exclusively consumer-grade registrars, as would be expected for domains associated with automated registrations for infringing use, mirroring the landscape seen in other studies of infringing and potentially threatening domains^[7,8]. Furthermore, a small number of registrars account for the majority of the high-entropy domain names; the top eleven shown in Table 4 together account for 682 of the top 1,000 domains (i.e. 68.2%).

The dataset also includes the types of domain clusters we might expect to see arising from automated registrations for infringing use. The best example is a set of 125 domains with the following characteristics:

All with SLD names consisting of apparently-random 15-character alphanumeric strings, and with identical entropy values (all 3.907^[9])
All hosted on the .buzz extension (one of the top thirty overall highest-threat TLDs, according to a recent CSC study^[10])
All registered through Dynadot LLC with redacted whois records
Associated with just 5 distinct IP addresses (across the 114 domains with active A records), most of which are in similar netblocks
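
One way to check whether a handful of IP addresses fall in 'similar netblocks' is to group them by a fixed-length network prefix. The sketch below uses Python's standard `ipaddress` module; the /24 prefix length and the function name are assumptions, and the sample addresses are reserved documentation addresses rather than those observed in the cluster.

```python
import ipaddress
from collections import defaultdict


def group_by_netblock(ip_strings, prefix_len: int = 24) -> dict:
    """Group IPv4 addresses by their enclosing network of `prefix_len` bits."""
    blocks = defaultdict(set)
    for ip in ip_strings:
        # strict=False lets a host address stand in for its containing network
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        blocks[str(network)].add(ip)
    return {
        block: sorted(ips, key=ipaddress.ip_address)
        for block, ips in blocks.items()
    }


# Placeholder addresses from the reserved documentation ranges
sample_ips = ["192.0.2.10", "192.0.2.77", "198.51.100.5"]
for block, ips in group_by_netblock(sample_ips).items():
    print(block, ips)
```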

It is highly likely that this cluster represents a coordinated registration event by a single entity, and the domains may well have been registered with the intention of use in threatening or infringing activity.
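
Candidate clusters of this kind can be surfaced by grouping domains that share the same registrar, TLD, SLD length and entropy value. The sketch below uses exact-match grouping as a simple stand-in for a fuller clustering method, and is not the procedure used in the study. The `DomainRecord` type, the `cluster_records` function, the registrar names and the SLD strings are all hypothetical placeholders. Registration date is omitted from the grouping key because every domain in this dataset was registered on the same day.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainRecord:
    sld: str
    tld: str
    registrar: str
    entropy: float


def cluster_records(records, min_size: int = 2, precision: int = 3):
    """Group records sharing registrar, TLD, SLD length and rounded entropy."""
    groups = defaultdict(list)
    for rec in records:
        key = (
            rec.registrar.strip().lower(),
            rec.tld.lower(),
            len(rec.sld),
            round(rec.entropy, precision),
        )
        groups[key].append(rec)
    clusters = [g for g in groups.values() if len(g) >= min_size]
    # Largest groups first, as these are the most likely bulk registrations
    return sorted(clusters, key=len, reverse=True)


# Invented placeholder rows (not taken from the dataset)
SAMPLE_RECORDS = [
    DomainRecord("a1b2c3d4e5f6g7h", "buzz", "Example Registrar A", 3.907),
    DomainRecord("q9w8e7r6t5y4u3i", "buzz", "Example Registrar A", 3.907),
    DomainRecord("ooooo", "events", "Example Registrar B", 0.000),
]

for cluster in cluster_records(SAMPLE_RECORDS):
    print(len(cluster), [rec.sld for rec in cluster])
```

In practice, the grouping key could be loosened (for example, by binning entropy values rather than requiring an exact match) to catch batches whose names are similar but not identical in structure.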

As of the date of analysis (15-Dec-2022; two days after registration), the domains resolved to a mix of Chinese-language gambling-portal pages (see Figure 2) and dead pages. Whilst this may be the ultimate intended content (potentially as part of a revenue-generating affiliate scheme), it could also be just 'placeholder' content, uploaded until the sites are weaponised for higher-threat purposes, or may be material designed to be visible only from certain geographical regions to mask the 'real' content and evade detection (so-called 'geotargeted' content^[11]). In any case, these sites warrant further monitoring for changes.

Figure 2: Examples of webpage content visible within the cluster of related .buzz domain names

Conclusion

This analysis highlights how the determination of entropy values for the SLD-name strings of registered domain names can be a valuable component of algorithms to determine which examples are most likely to be intended for infringing or fraudulent use. The calculation can also help to link together clusters of related domain names, to build up a picture of activity by specific bad actors, even in cases where the individual whois records are redacted.

These ideas can be applied in the development of technology allowing brand owners to identify key threat vectors and areas of risk, and determine where remedial action is most urgently required.

References

[1] https://www.linkedin.com/pulse/holistic-brand-fraud-cyber-protection-using-domain-threat-barnett/

[2] https://interisle.net/sub/CriminalDomainAbuse.pdf

[3] https://arxiv.org/ftp/arxiv/papers/1405/1405.2061.pdf

[4] https://www.farsightsecurity.com/blog/txt-record/automatingdetection-20190517/

[5] The Shannon entropy (H) of the SLD-name string is calculated as:

H = - Σ_i [ p_i × log₂(p_i) ]

where p_i is the proportion of the string made of the i^th character (the 'probability'). The summation is carried out over the pool of possible characters.

[6] https://www.cscdbs.com/en/resources-news/supply-chain-report/

[7] https://www.cscdbs.com/en/resources-news/impact-of-covid-on-internet-security/

[8] https://www.cscdbs.com/en/resources-news/threatening-domains-targeting-top-brands/

[9] This value arises because, rather than actually being truly random, the SLD names in all cases consist of 15 distinct characters. Therefore:

H = - 15 × [ (1/15) × log₂(1/15) ] = log₂(15) = 3.907
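
The same reasoning applies to any string of n characters that are all distinct, and gives the maximum possible entropy for a string of that length. As a worked generalisation (standard information theory rather than a result specific to this study):

```latex
p_i = \frac{1}{n} \quad (i = 1, \dots, n)
\quad\Longrightarrow\quad
H = -\sum_{i=1}^{n} \frac{1}{n} \log_2\!\left(\frac{1}{n}\right) = \log_2 n
```

For n = 26 this gives log₂(26) = 4.700, the value shown in Table 1 for the SLD consisting of the 26 letters of the alphabet.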

[10] https://www.cscdbs.com/blog/the-highest-threat-tlds-part-2/

[11] https://www.cscdbs.com/blog/do-you-see-what-i-see-geotargeting-in-brand-infringements/

This article was first published on 2 February 2023 at:

https://www.linkedin.com/pulse/investigating-use-domain-name-entropy-clustering-results-barnett/