Thursday, 2 February 2023

Investigating the use of domain-name entropy for clustering results

by Lan Huang and David Barnett

Introduction

The importance of being able to 'cluster' together similar or connected brand infringements has been noted in numerous studies[1]. Clustering has a number of benefits, including the ability to identify serial infringers for prioritised enforcement action, the ability to reveal instances of bad-faith activity, and the potential for efficient bulk enforcement actions.

A key associated idea is the concept of quantifying threat - i.e. determining which domains (or other results) may pose the greatest potential for infringing use in the future, even where no content is currently present - allowing prioritisation of results for initial analysis, enforcement or tracking for content changes. 

Expanding on these ideas, previous work[2] has revealed that large coordinated infringement or attack campaigns (such as the registration of domains for use in spamming activity, malware distribution or botnet creation) are often associated with batches of domains purchased through registrars who offer easy access to bulk registrations using automated algorithms. These registrations can be generated via automated recommendations by the registrar, or through the upload of lists of requested domain names. In many cases, the domain names used for these purposes may contain no meaningful keywords (appearing just as random strings of characters), and may be very long. It is also noteworthy that the use of (pseudo-)random domain names may be beneficial to bad actors, as they are unlikely to contain brand terms and are therefore more difficult to detect using classic brand-monitoring techniques.

In order to explore domain registrations of this type, we utilise the concept of Shannon entropy[3]. This is a mathematical concept in information theory, used to quantify the amount of information (or 'surprise') stored in a string or, equivalently, the minimum average number of bits per character needed to encode the string (i.e. a lower bound). In this study, we apply the idea to domain names by calculating the Shannon entropy associated with the second-level domain (SLD) name string[cf. 4] (i.e. the part of the domain name before the dot, excluding the TLD (domain extension))[5]. Broadly, this means that domain names which are short and/or have large numbers of repeated characters will have low entropy, and domain names which are longer and/or contain large numbers of distinct characters will have high entropy. Our hypothesis is that a batch of domains registered for a coordinated campaign, with a specific algorithm used to generate the domain names, will tend to be clusterable on the basis that they will share a common date of registration and a common registrar, and will have similar entropy values. Overall, long random domain-name strings associated with automated registrations will tend to have high entropy values.
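As an illustration of the calculation described above, the following is a minimal Python sketch rather than the code used in the study. The function names shannon_entropy and sld_of are our own, and the SLD extraction simply takes everything before the first dot, which assumes the input is a registered domain name with no subdomain labels.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    # H = sum(p * log2(1/p)), which is the same as -sum(p * log2(p)),
    # taken over the characters that actually occur in the string.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def sld_of(domain: str) -> str:
    """Naive SLD extraction (assumption: no subdomain labels are present)."""
    return domain.lower().split(".", 1)[0]


# Example domain names taken from Tables 1 and 2
for name in ("abcdefghijklmnopqrstuvwxyz.space", "000000004.xyz", "n.camp"):
    print(name, round(shannon_entropy(sld_of(name)), 3))
```

For these three names, the sketch gives values which round to the 4.700, 0.503 and 0.000 shown in the tables. A string of 26 distinct characters gives log2(26), whereas a single-character string gives zero.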

Methodology and analysis

In order to look more closely at these ideas, we consider the case study of all domains registered on a particular day (13-Dec-2022), using zone-file information. This dataset consists of approximately 205,000 domains. For simplicity, however, we exclude from the analysis those featuring non-ASCII characters (i.e. internationalised domain names encoded in Punycode, which include homoglyph domains - accounting for 0.6% of the total), and focus on the remainder, whose names contain only the characters a-z, 0-9, and the hyphen ('-').
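One possible way of applying this filter is sketched below. It is an assumption on our part that the excluded names can be recognised by the 'xn--' prefix used for Punycode-encoded labels; the function name is_plain_ascii_sld is purely illustrative.

```python
import re

# Labels made up only of the characters a-z, 0-9 and the hyphen
PLAIN_LABEL = re.compile(r"[a-z0-9-]+")


def is_plain_ascii_sld(sld: str) -> bool:
    """Return True for SLD strings which would be kept in the analysis."""
    sld = sld.lower()
    if sld.startswith("xn--"):
        # Punycode-encoded (internationalised) label: excluded
        return False
    return PLAIN_LABEL.fullmatch(sld) is not None


candidates = ["00-00", "qwertyuiop12asdfghjkkl", "xn--80ak6aa92e", "bücher"]
kept = [s for s in candidates if is_plain_ascii_sld(s)]
print(kept)
```

In this sketch, only the first two candidates are kept. The third is rejected on the basis of its prefix, and the fourth because it contains a character outside the permitted set.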

Across the dataset, the domains are associated with a range of entropy values, from 0.000 to 4.700, as shown in Figure 1.

Figure 1: Distribution of Shannon entropy values for the set of all (non-Punycode) domains registered on a single day (divided into entropy-value 'bins' of width 0.1)
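The binning used for a distribution of this kind can be sketched as follows. This is an illustrative reconstruction, not the code used to produce the figure; the function name entropy_histogram and the use of integer bin indices are our own choices.

```python
from collections import Counter


def entropy_histogram(values, bin_width=0.1):
    """Count entropy values per bin.

    Keys are integer bin indices; bin k covers the range
    [k * bin_width, (k + 1) * bin_width).
    """
    counts = Counter()
    for value in values:
        # The small offset guards against floating-point results which
        # fall just below a bin boundary (e.g. 46.999... rather than 47).
        counts[int(value / bin_width + 1e-9)] += 1
    return dict(sorted(counts.items()))


# Entropy values taken from Tables 1 and 2
sample = [0.000, 0.000, 0.503, 0.503, 4.642, 4.700]
for index, count in entropy_histogram(sample).items():
    print(f"{index * 0.1:.1f}-{(index + 1) * 0.1:.1f}: {count}")
```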

The top and bottom domain names in the dataset (by entropy values) are shown in Tables 1 and 2.

  SLD-name string   TLD (domain extension)   SLD length (chars.)   Shannon entropy
  abcdefghijklmnopqrstuvwxyz   space 26 4.700
  viqxacb7wo6l3hfujw3agf3stcce6eenl4kovfza3rzri4gwyxg6auid   com 56 4.642
  b4su4qo65fkefg3cpd5muxwekbn4vx6fr7ieroavxqwco2xrqmrrwlad   com 56 4.591
  oz5winfavnvbmgdspa633wdnpmbjjrp6crwutyt4uxgxkvytbjdmdc   com 54 4.569
  hydraclubbioknikokex7njhwuahc2l67lfiz7z36md2jvopda7nchidshop   com 60 4.550
  q374uuwdlgtkveh2acqi6ubhic4m3bnwb32kc2yqmxf2ilv36leujnid   com 56 4.539
  mekck2mf2uju3ssjl2woyddfrunwcnevfql3imp4tfr3z6wmjmo4jvid   com 56 4.497
  facebook-domain-verificationyx7q3wstorn4idf9xqtzdz842q0b6x   com 58 4.475
  vh6bjre5lw9iuegs1b9fspitswrdnbtsm1emunvlulbo6uc0   top 48 4.470
  skjcd-98729871cnf5bnb8ewr2e-vq438vnjy0mtg1mdcumty2n   xyz 51 4.464
  a3n3mq7c3xl7u4mfvhhjyjz2x7lqd7sf5jfm66mhf33fxlyodb5pibyd   com 56 4.455
  osli77ygq5myyquqzc2sva7wgnjc2m7yozz67k3kkgkrync4puw3cqyd   com 56 4.439
  7tl2qxwot624do6kbkvqwsg6knaz6jnlx5kfktni7bzt3qlo4imk4tqd   com 56 4.412
  q6g5o01vsfyw95all7x1krjdki   com 26 4.393
  12mnbvcxzasdfghjklpoi   com 21 4.392
  qwertyuiop12asdfghjkkl   com 22 4.369
  owsyeuxoyy4qtm4bkazrkxjtzhydedxgoxkd2yqddmxgcjevmhnbenyd   com 56 4.327
  metanamepdomainverifycontent38656d7bbe27c23f182336255   com 53 4.306
  fqskypondteieqxoxgizamgqrwlb   info 28 4.280
  zaqwsxcderfvbgtgbnhy12   com 22 4.278
  vij8q5xcentralr2hm910v   sbs 22 4.278

Table 1: Top domain names by entropy values

  SLD-name string   TLD (domain extension)   SLD length (chars.)   Shannon entropy
  n   camp 1 0.000
  d   supplies 1 0.000
  s   camera 1 0.000
  4   flights 1 0.000
  n   reise 1 0.000
  n   clinic 1 0.000
  0   condos 1 0.000
  9   events 1 0.000
  9   photography 1 0.000
  rr   center 2 0.000
  cc   degree 2 0.000
  999   guide 3 0.000
  7777   best 4 0.000
  44444   tel 5 0.000
  88888   tel 5 0.000
  ooooo   events 5 0.000
  vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv   vip 63 0.000
  00-00   co.uk 5 0.258
  000000004   xyz 9 0.503
  000000002   xyz 9 0.503
  000000002   com 9 0.503

Table 2: Bottom domain names by entropy values

In order to look more closely at the characteristics of the highest-entropy domains, we conducted a deeper dive into the top 1,000 domain names within the dataset (those with entropy values of 3.823 and above), a set which incorporates the majority of the examples whose names appear visually to consist of 'random' strings.

Of these top 1,000 domain names:

  • 847 (84.7%) have active A records (i.e. are associated with a specific IP address, and potentially live website content)
  • 275 (27.5%) have active MX records (i.e. are configured to be able to send and receive e-mails, and may therefore be associated with phishing activity)
  • 777 (77.7%) use domain privacy services or have redacted registration information (see also Table 3), showing the domain owners' attempts to mask their identity, which could indicate nefarious intentions[6]

The top registrants and registrars represented within the dataset (for those domains where whois information is available) are shown in Tables 3 and 4.

  Registrant   No. domains
  Domains By Proxy, LLC 219
  REDACTED FOR PRIVACY 197
  Privacy service provided by Withheld for Privacy ehf 82
  Contact Privacy Inc. Customer 7151571251 75
  c/o whoisproxy.com 36
  Privacy Protect, LLC (PrivacyProtect.org) 29
  Wix.com Ltd. 15
  PrivacyGuardian.org llc 11
  Data Protected 9
  Domain Protection Services, Inc. 7

Table 3: Top registrants
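A simple heuristic for flagging privacy-protected or redacted registrant entries such as those in Table 3 is sketched below. The marker substrings are our own assumption, chosen only because they match the entries listed in the table; a production system would need a more carefully maintained list.

```python
# Hypothetical marker substrings, chosen to match the privacy / proxy
# entries appearing in Table 3
PRIVACY_MARKERS = ("privacy", "redacted", "proxy", "protect", "withheld")


def looks_privacy_protected(registrant: str) -> bool:
    """Return True if the registrant string contains a privacy marker."""
    name = registrant.lower()
    return any(marker in name for marker in PRIVACY_MARKERS)


# Registrant strings taken from Table 3
registrants = [
    "Domains By Proxy, LLC",
    "REDACTED FOR PRIVACY",
    "Data Protected",
    "Wix.com Ltd.",
]
for registrant in registrants:
    print(registrant, looks_privacy_protected(registrant))
```

Of these four examples, only 'Wix.com Ltd.' is not flagged, since it contains none of the marker substrings.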

  Registrar   No. domains
  GoDaddy.com LLC 219
  DYNADOT LLC 128
  Google LLC 78
  NAMECHEAP INC 78
  TUCOWS, INC. 53
  Key-Systems GmbH 43
  Wix.Com Ltd. 30
  PDR Ltd. d/b/a PublicDomainRegistry.com 16
  Tucows Domains Inc. 13
  Atak Domain Hosting 12
  Ionos SE 12

Table 4: Top registrars

The top registrars within the dataset are almost exclusively consumer-grade registrars, as would be expected for domains associated with automated registrations for infringing use; this mirrors the landscape seen in other studies of infringing and potentially threatening domains[7,8]. Furthermore, a small number of registrars accounts for the majority of the high-entropy domain names; the top eleven, shown in Table 4, together account for 682 of the top 1,000 domains (i.e. 68.2% of the total).
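The 68.2% figure can be checked directly from the counts in Table 4, as in the short sketch below (the variable names are our own).

```python
# Domain counts per registrar, copied from Table 4
registrar_counts = {
    "GoDaddy.com LLC": 219,
    "DYNADOT LLC": 128,
    "Google LLC": 78,
    "NAMECHEAP INC": 78,
    "TUCOWS, INC.": 53,
    "Key-Systems GmbH": 43,
    "Wix.Com Ltd.": 30,
    "PDR Ltd. d/b/a PublicDomainRegistry.com": 16,
    "Tucows Domains Inc.": 13,
    "Atak Domain Hosting": 12,
    "Ionos SE": 12,
}

TOP_N = 1000  # size of the high-entropy subset under analysis

covered = sum(registrar_counts.values())
share = covered / TOP_N
print(f"{covered} domains, {share:.1%} of the top {TOP_N:,}")
```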

The dataset also incorporates the types of domain clusters we might expect to see arising from automated registrations for infringing use. The best example is a set of 125 domains with the following characteristics:

  • All with SLD names consisting of apparently-random 15-character alphanumeric strings, and with identical entropy values (all 3.907[9])
  • All hosted on the .buzz extension (one of the top thirty overall highest-threat TLDs, according to a recent CSC study[10])
  • All registered through Dynadot LLC with redacted whois records
  • Associated (for the 114 with active A records) with just 5 distinct IP addresses, most of which are in similar netblocks

It is highly likely that this cluster comprises a coordinated registration event by a single entity, and the domains may well have been registered with the intention of use for threatening or infringing activity.
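A cluster of this kind could be surfaced by grouping domains on their shared characteristics, as in the sketch below. This is one possible approach under stated assumptions, not the method used in the study: the DomainRecord structure, the choice of grouping key, and the sample records (whose SLD strings and registrar names are invented for illustration) are all hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import NamedTuple


class DomainRecord(NamedTuple):
    """Hypothetical record structure for one registered domain."""
    sld: str
    tld: str
    registrar: str
    registered: str  # registration date, e.g. "2022-12-13"


def shannon_entropy(text: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    total = len(text)
    counts = Counter(text)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def cluster_domains(records, min_size=2):
    """Group records sharing date, registrar, TLD, SLD length and entropy."""
    groups = defaultdict(list)
    for record in records:
        key = (
            record.registered,
            record.registrar.lower(),
            record.tld,
            len(record.sld),
            round(shannon_entropy(record.sld), 3),
        )
        groups[key].append(record)
    return {k: v for k, v in groups.items() if len(v) >= min_size}


# Invented sample records (not taken from the dataset)
records = [
    DomainRecord("a1b2c3d4e5f6g7h", "buzz", "Registrar A", "2022-12-13"),
    DomainRecord("q9w8e7r6t5y4u3i", "buzz", "Registrar A", "2022-12-13"),
    DomainRecord("example", "com", "Registrar B", "2022-12-13"),
]
clusters = cluster_domains(records)
for key, members in clusters.items():
    print(key, [m.sld for m in members])
```

Grouping on an exact (rounded) entropy value suits batches such as the one above, where every name has the same length and character profile. For batches whose entropy values are similar but not identical, the key could instead use a coarser entropy bin.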

As of the date of analysis (15-Dec-2022; two days after registration), the domains resolved to a mix of Chinese-language gambling-portal pages (see Figure 2) and dead pages. Whilst this may be the ultimate intended content (potentially as part of a revenue-generating affiliate scheme), it could also be just 'placeholder' content, uploaded until the sites are weaponised for higher-threat purposes, or may be material designed to be visible only from certain geographical regions to mask the 'real' content and evade detection (so-called 'geotargeted' content[11]). In any case, these sites warrant further monitoring for changes.

Figure 2: Examples of webpage content visible within the cluster of related .buzz domain names

Conclusion

This analysis highlights how the determination of entropy values for the SLD-name strings of registered domain names can be a valuable component of algorithms to determine which examples are most likely to be intended for infringing or fraudulent use. The calculation can also help to link together clusters of related domain names, to build up a picture of activity by specific bad actors, even in cases where the individual whois records are redacted.

These ideas can be applied in the development of technology allowing brand owners to identify key threat vectors and areas of risk, and determine where remedial action is most urgently required.

References

[1] https://www.linkedin.com/pulse/holistic-brand-fraud-cyber-protection-using-domain-threat-barnett/

[2] https://interisle.net/sub/CriminalDomainAbuse.pdf

[3] https://arxiv.org/ftp/arxiv/papers/1405/1405.2061.pdf

[4] https://www.farsightsecurity.com/blog/txt-record/automatingdetection-20190517/ 

[5]  The Shannon entropy (H) of the SLD-name string is calculated as:

H = - Σ_i [ p_i × log2(p_i) ]

where p_i is the proportion of the string made up of the i-th character (the 'probability'). The summation is carried out over the pool of possible characters; characters which do not appear in the string (p_i = 0) contribute nothing to the sum.

[6] https://www.cscdbs.com/en/resources-news/supply-chain-report/

[7] https://www.cscdbs.com/en/resources-news/impact-of-covid-on-internet-security/

[8] https://www.cscdbs.com/en/resources-news/threatening-domains-targeting-top-brands/

[9]  This value arises because, rather than actually being truly random, the SLD names in all cases consist of 15 distinct characters. Therefore: 

H = - 15 × [ (1/15) × log2(1/15) ] = log2(15) = 3.907

[10] https://www.cscdbs.com/blog/the-highest-threat-tlds-part-2/

[11] https://www.cscdbs.com/blog/do-you-see-what-i-see-geotargeting-in-brand-infringements/

This article was first published on 2 February 2023 at:

https://www.linkedin.com/pulse/investigating-use-domain-name-entropy-clustering-results-barnett/
