Wednesday, 29 May 2024

A new TLD to .ad to the collection

The domain name registry of Andorra, which operates the .ad TLD (top-level domain, or domain extension), has recently announced that registration restrictions are to be lifted later this year. Previously, registration of .ad domains was open only to Andorran brands or organisations looking to register their name as a domain name, directly through the registry. From the 22nd October 2024, however - following a transition phase - the domains will be offered in General Availability, with any entity able to register an available domain, on a first-come, first-served basis, through any accredited registrar[1,2].

The .ad extension thereby has significant potential to be used in a generic capacity, with '.ad' used to refer to advertisements or advertising. Similar trends have been observed with various other ccTLDs (country-code TLDs), such as .io (commonly used for technology brands), .ai (for brands relating to artificial intelligence), .tv (relating to television or streaming services) and .co (as an alternative to .com for company websites)[3]. The potential commercial popularity of .ad will likely also make the extension popular with infringers and cybersquatters.

Currently - and perhaps unsurprisingly given the previous restrictions - the number of registered .ad domains is low. As a ccTLD (for which comprehensive zone-file data is often unavailable), the exact numbers are difficult to quantify. Nevertheless, DomainTools[4] provides an estimate of 2,336 domains (as of the 23rd May 2024), which makes it only the 673rd-largest TLD (out of the 1,554 for which data is available). More granular insights are available from other data sources, such as the zonefiles.io[5] database (which lists the names of 871 of the .ad domains, as of March 2024), and the use of Google "site:" queries (which presents listings for 119 live .ad websites indexed by the search engine as of the 23rd May 2024). Of the 871 specific domain names identified, 630 (72%) return a live website response. The great majority of the domains point at IP addresses which are physically hosted in Europe, with the top five hosting countries found to be Andorra (AD) (28%), France (FR) (19%), Spain (ES) (19%), the United States (US) (18%) and Germany (DE) (5%) (Figure 1).

Figure 1: Hosting countries of the 871 identified .ad domains

Again unsurprisingly, given the restrictions in place, the websites associated with the registered domains are dominated by companies and organisations with presences in Andorra, and with domain names which are directly descriptive of the entity in question. For example, the top three websites returned by Google in response to a search for "site: .ad" are residencialaltavista.ad (Altavista), zoo.ad (Zoo Studio), and aca.ad (Automòbil Club d'Andorra). Amongst the wider dataset of registered domain names, examples pertaining to a number of popular brands and well-known organisations are already registered, including amazon.ad, aol.ad, google.ad, mcdonalds.ad, orange.ad, unesco.ad, and unicef.ad. In all of these cases, the domains resolve or re-direct to an official website for the company in question.

As the restrictions are dropped later in the year, it will be interesting to see how quickly the number of registrations grows, and how the patterns of use, abuse and infringement emerge. It is also possible, as alluded to earlier, that the extension may become extensively associated with advertising fraud (such as abuse of affiliate schemes). A primary recommendation is for brand owners to secure key or defensive registrations as early as possible, and to monitor the space more generally for developments. This is particularly crucial in view of the fact that, as yet, the registry has made no announcements about the nature of any domain dispute procedure which may be put in place.

References

[1] https://iptwins.com/2024/05/23/andorra-registry-announces-major-simplification-in-registration-procedures/

[2] https://www.domini.ad/vullunad/

[3] 'Patterns in Brand Monitoring' by D.N. Barnett, Chapter 9: 'Domain landscape analysis' [awaiting publication]

[4] https://research.domaintools.com/statistics/tld-counts/

[5] https://zonefiles.io/

This article was first published on 29 May 2024 at:

https://www.iamstobbs.com/opinion/a-new-tld-to-.ad-to-the-collection

Exploring the domain of subdomain discovery

BLOG POST

Domain-name monitoring is a well-established component of a holistic brand protection programme. This reflects the central role played by domain names as an element of the IP of any business with an online presence, and the potential for third-party domain names to be utilised as part of abuse or infringement campaigns. A related issue is the use of brand references in the subdomain part of a URL - where the subdomain is the part before the domain name, such as 'play' in play.google.com - which can similarly be used in cases of false affiliation, brand impersonation and traffic misdirection. However, subdomains present an entirely different prospect from a discovery point of view, since there is no mechanism (akin to the process of zone-file analysis which allows for detection of domain names themselves) to comprehensively detect brand-related subdomains.  

In this article, we explore the effectiveness of a range of approaches to identify relevant subdomains. The techniques include the use of search engines to identify indexed content, queries to a range of public databases, and 'brute-force' searches. As a case study, we consider the discovery of subdomains of each of the top 50 most popular global websites. 

Using the full set of discovery methods, over 640,000 unique subdomains of these 50 domains were identified, including instances up to 231 characters in length, with 28 subdomain levels. A detailed analysis was then carried out in order to identify the most frequently occurring keyword patterns, and tie them to popular use-cases. The next stage of analysis considered the presence of potentially abusive subdomains, using the Apple brand as an example, with numerous cases of live potential infringements identified.  

Overall, the study highlights the popularity of subdomain usage, and the significant potential for associated abuse, illustrating the importance of discovery tools able to detect relevant subdomains. This is an area where traditional monitoring methods tend to provide relatively poor coverage. The analysis has shown that the application of a range of discovery techniques working together can achieve a relatively good level of detection. Accordingly, best practice for the most sophisticated brand monitoring solutions going forward would be the inclusion of these relevant data sources. 

This article was first published on 13 June 2024 at:

https://www.iamstobbs.com/opinion/circle-id-blog-exploring-the-domain-of-subdomain-discovery

* * * * *

FULL ARTICLE

Domain name monitoring - that is, the detection of domains with names containing a brand-term (or other string) of interest - is a very well-established element of brand protection services. Branded domain names are of key importance to brand owners (as the basis for business-critical infrastructure (i.e. 'core' domain names), and as part of a 'tactical' portfolio of strategic and defensive registrations), but also to infringers, who can utilise domains as a means of impersonation, passing off, claimed affiliation, or traffic direction and monetisation. These types of third-party registrations are often of great concern by virtue of factors such as their explicit abuse of IP, and their high potential visibility in search engines. However, they are (up to a point) relatively straightforward to detect, through methods such as domain zone-file analysis, which most brand protection service providers will utilise as a standard methodology.

A more complex world is the ecosystem of subdomain names. A subdomain is the part of the URL prior to the dot preceding the domain name (e.g. 'play' in play.google.com). The owner of a domain name can create whatever hierarchy of subdomain names they wish, and can (for example) configure each distinct hostname (i.e. a subdomain plus domain-name combination) to resolve to a different IP address and webpage content. Additionally, some Internet service providers ('private subdomain registries') offer the sale of subdomains of one or more domains under their ownership, as a business model. Subdomains can be used legitimately for a range of different purposes, including the creation of subject- or region-specific microsites, but can be abused by infringers in many of the same ways as domain names[1,2,3,4]. This issue is made more concerning by the fact that there is, in general, no comprehensive way of detecting potentially relevant subdomains (akin to the zone-file methods used for domain names themselves), which is one of the great unsolved issues in brand monitoring[5].

Definitions

Terminologically, 'subdomain monitoring' as a service description is often used in two distinct ways in the context of brand protection and cybersecurity. The first - most usually carried out by the registrar responsible for the management of a brand owner's official domain portfolio, and therefore with a full overview of their domain and subdomain infrastructure - refers to the monitoring of subdomains of the brand owner's official domains, with a view to identifying potential cybersecurity issues. These can take the form of 'dangling' DNS records - i.e. subdomains which are no longer used and which are susceptible to hijacking - or the third-party creation of new subdomains through DNS compromise (i.e. domain 'shadowing'). The second definition - i.e. the identification of relevant subdomains on an arbitrary third-party domain name, a process which may be termed subdomain 'discovery' - is a much more complex prospect. Generally it involves the application of a combination of methods (which even together are not comprehensive), such as analysis of domain-name zone configuration information (e.g. passive DNS analysis), certificate transparency (CT) analysis, or the use of explicit queries for specific subdomain names. This issue of subdomain discovery is the focus of the remainder of this article.

A case study of subdomain discovery - top 50 popular websites

a. Introduction

As a case study, we explore an approach allowing the identification of (as many as possible) subdomains of each of the top 50 most popular website domain names (as of March 2024), according to Similarweb[6], using a combination of monitoring and discovery scripts[7,8,9], open-source databases, and search queries.

i. Methodology

In general, a comprehensive overview of the subdomains of a particular domain name is only possible via inspection of the full DNS zone record, which is generally only accessible by the managing registrar (as for a (true) subdomain monitoring service). However, partial coverage - from a discovery point of view - can be achieved through a combination[10] of:

  • Queries to search engines
  • Queries to public databases of DNS or SSL information, data from Internet scans, or certificate transparency logs (i.e. information pertaining to the issue of digital certificates)
  • Brute-force searches (i.e. generating possible 'candidate' subdomains from large lists of keywords, and testing to determine which ones resolve)
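As a rough illustration of the brute-force approach in the last bullet, the following Python sketch builds candidate hostnames from a keyword list and keeps those that resolve. The function names, the keyword list and the injectable `resolver` parameter are assumptions made for this example; they are not taken from the scripts used in the study.

```python
import socket
from typing import Callable, Iterable, List


def generate_candidates(domain: str, keywords: Iterable[str]) -> List[str]:
    """Build candidate hostnames of the form keyword.domain."""
    return [f"{kw}.{domain}" for kw in keywords]


def find_resolving(candidates: Iterable[str],
                   resolver: Callable[[str], str] = socket.gethostbyname) -> List[str]:
    """Return the candidates for which the resolver returns an address.

    The resolver is expected to raise OSError when a name does not resolve
    (socket.gethostbyname does so via socket.gaierror).
    """
    found = []
    for host in candidates:
        try:
            resolver(host)
        except OSError:
            continue  # name did not resolve; discard the candidate
        found.append(host)
    return found
```

With the default resolver, `find_resolving(generate_candidates("site.com", ["www", "mail", "api"]))` would perform live DNS lookups. Passing a different `resolver` makes it possible to test the logic offline or to swap in another lookup method.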

ii. Terminology

In the description of the identified subdomains, the following terminology is used (in reference to test.mail.site.com as an example):

  • site.com is the domain name ('.com' is the top-level domain (TLD); 'site' is the second-level domain (SLD))
  • test.mail.site.com is the full hostname
  • The full string preceding the domain name (i.e. 'test.mail' in this case) is the subdomain name string - the number of distinct subdomain 'elements' is referred to as the number of 'levels' (i.e. two - with the elements being 'test' and 'mail' - in this case); the total length of this string (in characters) is the sum of the lengths of the individual elements, plus the separating dots ('.')
  • The element preceding the domain name (i.e. 'mail' in this case) is the third-level domain
  • The first element in the subdomain name string (i.e. 'test' in this case) is the lowest-level name
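The terminology above can be expressed as a short Python sketch. The function name and the `domain_labels` parameter are illustrative assumptions; in particular, treating the last two labels as the domain name is a simplification that would not hold for multi-label suffixes such as .co.uk.

```python
from typing import Dict, Union


def describe_hostname(hostname: str, domain_labels: int = 2) -> Dict[str, Union[str, int]]:
    """Split a hostname into the parts named in the terminology list.

    Assumes the domain name is the last `domain_labels` labels
    (2 for site.com), which is a simplification.
    """
    labels = hostname.lower().rstrip(".").split(".")
    sub_labels = labels[:-domain_labels]
    subdomain = ".".join(sub_labels)
    return {
        "domain": ".".join(labels[-domain_labels:]),
        "tld": labels[-1],
        "sld": labels[-domain_labels],
        "subdomain": subdomain,
        "levels": len(sub_labels),
        "length": len(subdomain),  # element lengths plus separating dots
        "third_level": sub_labels[-1] if sub_labels else "",
        "lowest_level": sub_labels[0] if sub_labels else "",
    }
```

For test.mail.site.com this gives a subdomain name string of 'test.mail' with two levels and a length of nine characters, a third-level domain of 'mail' and a lowest-level name of 'test'.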

b. Findings

Using the range of approaches discussed above, over 640,000 unique subdomains were identified, across just the 50 domain names under consideration (Figure 1).

Figure 1: Numbers of identified subdomains for each of the top twenty domains (by number identified)

The subdomain names range in length and number of levels, up to 231 characters and 28 levels (respectively), with the longest subdomain in the dataset (by both measures) found to be news.xinhuanet.comwww.zalando.dewww.google.comhyperboleandahalf.blogspot.comchannel.pixnet.netwww.youtube.comhistory.gmw.cnvk.comwww.bing.comsd.360.cnmarketplace.asos.comstock.sohu.com2kindsofpeople.tumblr.comimgur.comgithub.comwww.xvideos.com. The distribution of lengths (up to 50 characters) and numbers of levels (up to 10) across the whole dataset is shown in Figure 2.

Figure 2: Distribution of subdomain lengths and numbers of levels, across the dataset, by number of instances

For one-level subdomains, there is a peak in number of instances at a length of 7 characters. For two-level subdomains, there is a peak at 15 characters (i.e. a mean of 7.0 characters per element), and for three-level subdomains the peak occurs at length 21 (mean = 6.3 characters per element).
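The per-element means quoted above follow from subtracting the separating dots from the total length and dividing by the number of levels. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def mean_element_length(total_length: int, levels: int) -> float:
    """Mean characters per element, excluding the (levels - 1) separating dots."""
    return (total_length - (levels - 1)) / levels
```

For the two-level peak this is (15 - 1) / 2 = 7.0, and for the three-level peak (21 - 2) / 3, which rounds to 6.3.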

From the overall dataset, it is possible to calculate the statistics for the most frequently occurring subdomain elements, regardless of the level in the subdomain hierarchy at which they appear. This information is shown in Table 1.

  Subdomain element   No. instances
  mail                25,111
  ne1                 18,229
  gq1                 17,753
  bf1                 16,241
  ghs                 15,801
  aa-rt               15,037
  qzone               13,855
  corp                12,552
  afd                 12,436
  clump               11,669

Table 1: Top ten subdomain elements (at any level) by total number of instances

Other key terms appearing in the top 100 include 'www' (5,707 instances), 'teams' (3,897), 'dns' (1,879), 'shop' (1,547), 'cloud' (1,477), 'dev' (1,331), 'extranet' (1,261), 'test' (1,145), 'sandbox' (1,084), 'search' (960) and 'media' (873).

It is also possible to calculate more granular statistics for elements appearing at key locations in the subdomain strings. Tables 2 and 3 show the top third-level domain strings (i.e. the element immediately preceding the domain name) and lowest-level domain strings (i.e. the element at the start of the subdomain name string) identified across the dataset, by total numbers of instances (noting that, for subdomains with one level, the third-level string will - by definition - also be the lowest level).

  Third-level domain   No. instances
  ne1                  18,225
  gq1                  17,745
  bf1                  16,188
  ghs                  15,785
  aa-rt                15,037
  qzone                13,853
  ynwp                 7,307
  corp                 6,910
  sg3                  6,870
  spaces               6,185

Table 2: Top ten third-level domain strings by total number of instances

  Lowest-level domain   No. instances
  lo0                   11,536
  www                   4,086
  ha1                   993
  ha2                   903
  m                     776
  api                   753
  o-o                   698
  crawl                 661
  vl-120                522
  a                     418

Table 3: Top ten lowest-level domain strings by total number of instances

Certain classes of subdomain names also tend to have special use-cases - two-character names, for example, are often used to denote country codes (e.g. for regional subsites) or may have other special meanings (e.g. 'go', 'my' or 'ai'). The top 20 two-character subdomain elements across the whole dataset are shown in Table 4. 

  Subdomain element   No. instances
  98                  3,512
  a1                  2,352
  ke                  1,916
  10                  1,817
  qa                  1,750
  bb                  1,517
  dv                  1,206
  sc                  1,112
  ny                  1,020
  tc                  841
  a0                  690
  a2                  685
  in                  652
  hk                  568
  mp                  561
  cp                  524
  fp                  523
  qq                  522
  my                  502
  db                  491

Table 4: Top 20 two-character subdomain elements (at any level) by total number of instances

Various other common abbreviations also appear highly in the dataset, including the (potential) country codes de (430 instances), ru (343), fr (324), us (252), cn (246), es (246), it (243), kr (243), uk (223), au (200), jp (191), and other terms such as go (393).

It is worth noting, however, that the above statistics may be dominated by the naming style used across just a small number of sites. For example, all of the 'ne1' third-level domains were identified on the yahoo.com site. Potentially a more meaningful insight into the style of names used across the subdomain landscape generally can be gained by determining the numbers of unique sites (within the dataset of 50) across which a specific name string was identified. These statistics - for the features shown in Tables 2 and 3 - are shown in Tables 5 and 6.

  Third-level domain   No. sites (/ 50)
  www                  49
  m                    43
  api                  40
  support              36
  blog                 35
  mail                 34
  help                 33
  dev                  31
  news                 31
  email                30

Table 5: Top ten third-level domain strings by number of unique sites (in the set of 50) on which the name was identified

  Lowest-level domain   No. sites (/ 50)
  www                   49
  m                     44
  api                   43
  dev                   40
  mail                  39
  support               39
  blog                  37
  help                  36
  test                  35
  app                   34

Table 6: Top ten lowest-level domain strings by number of unique sites (in the set of 50) on which the name was identified

Several of these terms have clear use-cases, and appear to be used consistently across multiple popular sites (e.g. the use of 'm.' for the mobile-compatible version of a website).
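The two kinds of statistic used in the tables above (total number of instances of an element, and number of unique sites on which it appears) can be sketched in Python as follows. The function name and the two-label domain assumption are illustrative only, and the sample hostnames in the usage note are hypothetical rather than drawn from the study dataset.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def element_statistics(hostnames: Iterable[str],
                       domain_labels: int = 2) -> Tuple[Counter, Dict[str, int]]:
    """Count subdomain elements by total instances and by unique sites.

    Assumes the domain name is the last `domain_labels` labels of each hostname.
    """
    instances: Counter = Counter()
    sites: Dict[str, set] = {}  # element -> set of domains it was seen on
    for host in hostnames:
        labels = host.lower().split(".")
        domain = ".".join(labels[-domain_labels:])
        for element in labels[:-domain_labels]:
            instances[element] += 1
            sites.setdefault(element, set()).add(domain)
    site_counts = {element: len(domains) for element, domains in sites.items()}
    return instances, site_counts
```

Given the hypothetical hostnames a.ne1.yahoo.com, b.ne1.yahoo.com, www.yahoo.com and www.google.com, 'ne1' and 'www' would each have two instances, but 'ne1' would appear on only one site whereas 'www' would appear on two. This is the distinction drawn between Tables 2 and 3 and Tables 5 and 6.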

Many of these trends mirror those from previous studies. For example, a 2021 analysis[11] of the most popular subdomain ('element') strings overall found that the top three were 'www', 'mail' and 'forum'. Whilst 'www' does not appear in the list of top ten most frequently occurring subdomain elements across the 50 sites considered in this analysis (Table 1), it does appear more than 5,700 times across the dataset. Furthermore, the dataset contains almost 1,000 distinct variants of 'www' being used as subdomain elements, with the list topped by 'www' itself (5,707 instances), followed by 'comwww' (146), 'www2' (42), 'www1' (31) and 'ww' (24).

An additional study[12], looking at the (analogous) use of second-level domain names in conjunction with dot-brand extensions, also found extensive use of many of the strings featured in this study, including 'mail' and 'api'.

It was noted previously that subdomain-related brand abuse can be a popular way of creating infringements or deceptive content. As a proxy for the infringement landscape, we can consider just those examples from the 50-site dataset in which the name of the Apple brand (the most valuable brand in 2024[13], but whose website does not appear in the list of 50 considered) appears anywhere in the subdomain name string. This will represent just a tiny proportion of the potential subdomain infringement landscape, since we are focusing just on a single brand, are considering only those instances where a textual mention appears in the subdomain name, and are focusing only on subdomains on the top 50 sites (where - one might hope - being controlled by large corporations and, in some cases, with IP protection programmes in place, the infringement landscape may be much less pronounced than across the Internet generally). In addition, the searches carried out for this study did not include any explicit brand-related searches; in a formal landscape sweep for (say) Apple, it would be advantageous to include additional search queries of the form: site:[site.com]+apple.

Nevertheless, the study dataset includes 139 examples in which 'apple' is referenced somewhere in the subdomain name, including a small number of live examples of potential infringements (Figure 3).
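A simple way to pull brand references out of a list of discovered hostnames is a substring match against the subdomain name string only. The sketch below is one possible approach under that assumption (the function name, the two-label domain assumption and the sample hostnames are hypothetical); it is not necessarily how the 139 examples were identified.

```python
from typing import Iterable, List


def brand_in_subdomain(hostnames: Iterable[str], brand: str,
                       domain_labels: int = 2) -> List[str]:
    """Return hostnames whose subdomain name string contains the brand term.

    The domain name itself (assumed to be the last `domain_labels` labels)
    is ignored, so the brand owner's own domain does not match on its name alone.
    """
    brand = brand.lower()
    matches = []
    for host in hostnames:
        labels = host.lower().split(".")
        subdomain = ".".join(labels[:-domain_labels])
        if brand in subdomain:
            matches.append(host)
    return matches
```

A plain substring match of this kind will also return unrelated strings that happen to contain the term (for example 'pineapple'), so results would still need manual review.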

Figure 3: Examples of subdomain-based Apple potential infringements from within the dataset

Conclusions

Aside from the specific trends observed in the set of subdomains of the top 50 most popular websites, a significant take-away from this analysis is the effectiveness of the use of a range of discovery techniques to identify relevant content. Using a combination of search-engine queries, information from DNS, SSL and certificate transparency databases, and brute-force keyword-based searches, it has proven possible to identify almost two-thirds of a million subdomains of the 50 websites in question.

Given the risks associated with subdomain-based infringements, monitoring of this space as part of a comprehensive brand protection solution is of key importance, but has always proven difficult to achieve. This initial analysis shows that the range of available approaches can, when used together, provide a successful means of detecting potential threats. Whilst completely comprehensive subdomain detection is unlikely to be possible, these methods certainly provide a significant step in the right direction.

References

[1] 'Patterns in Brand Monitoring' by D.N. Barnett, Chapter 1: 'Overview of online brand protection' [awaiting publication]

[2] 'Patterns in Brand Monitoring' by D.N. Barnett, Chapter 7: 'Creation of deceptive URLs' [awaiting publication]

[3] https://circleid.com/posts/20220504-the-world-of-the-subdomain

[4] https://businessnamegenerator.com/what-is-a-subdomain/

[5] https://circleid.com/posts/20230525-the-millennium-problems-in-brand-protection

[6] https://www.similarweb.com/top-websites/

[7] https://github.com/aboul3la/Sublist3r

[8] https://sekuro.io/blog/snrublist3r3-subdomain-enumeration-tool/

[9] https://github.com/b3n-j4m1n/snrublist3r

[10] https://hackertarget.com/find-dns-host-records/

[11] https://securitytrails.com/blog/most-popular-subdomains-mx-records

[12] https://www.iamstobbs.com/opinion/a-review-of-the-current-state-of-the-new-gtld-programme-dot-brands

[13] https://www.visualcapitalist.com/most-valuable-brands-in-2024/

This article was first published on 28 May 2024 at:

https://circleid.com/posts/20240528-exploring-the-domain-of-subdomain-discovery

Thursday, 16 May 2024

The world of the bitsquat

BLOG POST

Bitsquatting is a way of launching a potential attack against a trusted website. It is reliant on the fact that, in some cases, the binary-string representation of the requested URL can become corrupted in transit, with a '1' being flipped to a '0' (or vice-versa). Ordinarily this would result in an invalid URL being produced, but in cases where the corrupted version is a valid URL in its own right, bad actors can register these variant domain names as a way of intercepting traffic intended for the legitimate site. This is analogous to cybersquatting or typosquatting, with no requirement to compromise the site explicitly. 

In a new study, we consider the bitsquat variants of each of the top 50 most popular websites, as of March 2024. Of the 1,553 valid domain names which could be used to launch bitsquatting attacks, only 125 appear explicitly to be under the ownership of the brand in question (or under other legitimate usage). Only 43 of these have been configured to re-direct to the official website in question. 

Of the remainder, at least 87% appear to be registered by third parties. Although some represent legitimate usage of the brand-name variant in question, many appear to have been registered with malicious intent. One active example of a lookalike site was identified in the dataset, in addition to many more misdirecting web traffic to similar content, or which have been monetised through the inclusion of pay-per-click links or offers to sell the domain names. Many of these present the potential to be 'weaponised' for use in attacks at a later date. 

There are a number of options available to brand owners to mitigate these risks. The first is the defensive registration of the bitsquat variants of their primary domain names, and active monitoring for - and enforcement against - identified infringements. Other possibilities include the use of domain extensions which are not amenable to bitsquat attacks, appropriate use of subdomains, and increased use of relative (rather than absolute) hyperlinks in the HTML of their websites.  

In practice, bitsquatting is not an attack vector which has been extensively exploited by bad actors to date. Nevertheless, it does raise some significant risks in the limited instances where it occurs, and is of particular concern in cases where a carefully selected registration can allow an attacker to target all domains on a specific extension. Realistically, some of the suggested remediative actions will only be appropriate in limited instances, and are often likely to be superseded by other branding considerations. However, the most advanced domain management and registration policies should certainly bear the issue in mind as a potential risk factor, and some simple steps (such as also registering the variant .tk version to accompany a .uk domain) can easily be taken to improve the risk profile for brands.

This article was first published on 16 May 2024 at:

https://www.iamstobbs.com/opinion/redirecting-into-the-world-of-the-bitsquat

* * * * *

WHITE PAPER

Introduction to bitsquatting

Many previous descriptions of techniques used by infringers have discussed ways in which deceptive URLs can be created, to misdirect Internet users and drive web traffic to non-legitimate sites[1]. One related tactic concerns the issue of 'bitsquatting'.

As part of the technical process taking place when a user attempts to access a website, the URL is converted to a string of individual characters, each of which is represented as a unique ASCII code[2] - essentially, a number between 0 and 127 (or up to 255 in extended 8-bit character sets), forming a range covering all allowable characters - which is then converted to its binary equivalent. For example, the character 'a' (lower-case a) is ASCII number 97 which, in binary, is 1100001[3]. This is generally expressed as an eight-binary-digit ('8-bit') string (or one byte, in IT terms), by adding leading zeroes as necessary (i.e. 01100001 in this case)[4]. However, when accessing a website, the string needs to be copied in and out of computer memory multiple times[5], and in some cases a corruption of the binary string can occur, due to factors such as hardware faults or electromagnetic interference[6]. Generally this manifests as one of the '1's being 'flipped' to a '0', or vice-versa. If this occurs, a different character is obtained when the binary string is decoded. In many cases, this would mean that there would be an error in accessing the website (e.g. if a request to access google.com was corrupted to gokgle.com, nothing would be found if this typo-variant domain did not exist). However, infringers can take advantage of this possibility, by proactively registering ('bitsquatting') the specific domain variants which may arise (e.g. by specifically having registered gokgle.com, in the above example), so that any such corrupted requests are instead directed towards their own content.
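The encoding and the effect of a single flipped bit can be reproduced with a few lines of Python. This is a minimal sketch, assuming plain ASCII input; the function names are illustrative.

```python
def to_bits(text: str) -> str:
    """8-bit binary representation of each character, space-separated."""
    return " ".join(format(ord(ch), "08b") for ch in text)


def flip_bit(text: str, char_index: int, bit_value: int) -> str:
    """Return text with one bit of one character flipped.

    bit_value is the place value of the bit to flip (1, 2, 4, ..., 128).
    """
    flipped = chr(ord(text[char_index]) ^ bit_value)
    return text[:char_index] + flipped + text[char_index + 1:]
```

Here `to_bits("a")` gives 01100001, as in the example above, and `flip_bit("google.com", 2, 4)` gives gokgle.com: the second 'o' (01101111) and 'k' (01101011) differ only in the bit with place value 4.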

Note that this route of attack is not reliant on compromising the site in question, but rather takes advantage of hardware errors which can occur naturally. An early analysis by Black Hat[7] considered the targeting of eight legitimate domains using 31 bitsquatted variants, finding that over 52,000 web requests were made to the bitsquat domains over a six-month period.

Furthermore, a 2013 study[8] found a number of instances where bitsquatted domains were actively being utilised for brand abuse. These included a case where variants of huffingtonpost.com were being used to direct users to a page promoting the sale of hardware products, and one where variants of microsoft.com were being used to distribute malware or fake antivirus products.

The technical framework described above means that (considering only instances where a single binary character ('bit') is corrupted), there are up to eight possible bitsquatted variants of any given character (i.e. one for the transposed version of each individual bit). These are listed in Table 1.

Note that in this analysis:

  • We consider only the alphabetic characters (a – z), the numerals (0 – 9), and the characters '-', '.', '/' and '#', which can appear in a domain name / URL.
  • We do not consider case variants, partly because domain names / URLs are (generally) case-insensitive, but also because upper- and lower-case versions of the same character differ by ASCII values of 32 (i.e. differ in a single bit) - e.g. 'a' is ASCII 97 (01100001) and 'A' is ASCII 65 (01000001), so upper- and lower-case versions of the same character will always bitsquat-'match' each other, and their other [alphabetic] bitsquat variants will also differ only in case (e.g. the variants of 'a' will be ('A'), 'q', 'i', 'e' and 'c', and the variants of 'A' will be ('a'), 'Q', 'I', 'E' and 'C').
  • We also exclude other non-Latin characters since, although these can be used in domain names, they are generally treated differently, and usually expressed in Internet technology applications in Punycode format, which uses only Latin characters (e.g. where hermès.com is expressed as xn--herms-7ra.com)[9].

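Character  1 2 3 4 5 6 7 8   (· = no variant listed)
a          · · · q i e c ·
b          · · · r j f · c
c          · # · s k g a b
d          · · · t l · f e
e          · · · u m a g d
f          · · · v n b d g
g          · · · w o c e f
h          · · · x · l j i
i          · · · y a m k h
j          · · · z b n h k
k          · · · · c o i j
l          · · · · d h n m
m          · - · · e i o l
n          · . · · f j l o
o          · / · · g k m n
p          · 0 · · x t r q
q          · 1 · a y u s p
r          · 2 · b z v p s
s          · 3 · c · w q r
t          · 4 · d · p v u
u          · 5 · e · q w t
v          · 6 · f · r t w
w          · 7 · g · s u v
x          · 8 · h p · z y
y          · 9 · i q · · x
z          · · · j r · x ·
0          · p · · 8 4 2 1
1          · q · · 9 5 3 0
2          · r · · · 6 0 3
3          · s · # · 7 1 2
4          · t · · · 0 6 5
5          · u · · · 1 7 4
6          · v · · · 2 4 7
7          · w · · · 3 5 6
8          · x · · 0 · · 9
9          · y · · 1 · · 8
-          · m · · · · / ,
.          · n · · · · · /
/          · o · · · · - .
#          · c · 3 · · · ·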

Table 1: Bitsquatted variants of each character permissible in a standard domain name / URL; columns show the variants based on the transposition of the bit number[10] shown in the column heading
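The rows of Table 1 can be regenerated by flipping each of the eight bits of a character in turn and keeping only the results that fall within the character set described in the notes above. The following Python sketch does this; the `ALLOWED` set and the function name are assumptions made for this example, and characters outside that set are simply filtered out.

```python
import string

# Character set considered in this analysis: a-z, 0-9, '-', '.', '/' and '#'.
ALLOWED = set(string.ascii_lowercase + string.digits + "-./#")


def char_variants(ch: str) -> list:
    """Single-bit-flip variants of ch that fall within ALLOWED.

    Bits are tried from most to least significant, so the results come out
    in the same left-to-right order as the entries in the rows of Table 1.
    """
    variants = []
    for bit in (128, 64, 32, 16, 8, 4, 2, 1):
        candidate = chr(ord(ch) ^ bit)
        if candidate in ALLOWED:
            variants.append(candidate)
    return variants
```

For example, `char_variants("a")` yields q, i, e and c, and `char_variants("c")` yields #, s, k, g, a and b. Upper-case variants are excluded automatically because `ALLOWED` contains only lower-case letters.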

In general, (assuming only one bitsquatted character is present in any given case) this means that the bitsquatted variants of any given domain name will form a subset of the domain names comprising 'fuzzy' variants (where any character can be replaced by any other character). Therefore, a domain monitoring solution which incorporates fuzzy matching will - by definition - detect these bitsquatted variants, without needing to monitor for them explicitly. However, there are some exceptions to this principle, mostly arising from the inclusion of the '.', '/' and '#' characters in the set of variants[11]. The main additional categories of potential bitsquats are as follows:

  • Subdomain-related variants (with a '.') - These can arise because of the potential for 'n' and '.' to substitute for each other. This gives rise to two types of potential bitsquat:
    • Instances where an 'n' is replaced by a '.', such that the bitsquatted domain is a truncated version of the original domain, e.g. where windowsupdate.com could be replaced by wi.dowsupdate.com (such that the bitsquatted variant domain name would be dowsupdate.com).
    • Instances where a '.' (e.g. as used in an active hostname, or subdomain / domain-name combination) is replaced by an 'n', e.g. where s.ytimg.com (a hostname used by YouTube for content delivery) would be replaced by snytimg.com. There are a range of popularly-used hostnames which might be susceptible to this type of attack, including several used for affiliate- or other URL-tracking services.

  • URL-delimiter variants for a '/' - These arise because of the potential for substitution between '/' (which forms bitsquat variants with 'o', '-' and '.') and any of these other characters. Again, this implies two types of potential bitsquat:
    • Instances where a relevant character is replaced by a '/', which can constitute a valid bitsquat if the preceding characters form a valid domain name (e.g. ecampus.phoenix.edu being replaced by ecampus.ph/enix.edu (i.e. with a variant domain name of ecampus.ph), or trading.scottrade.com being bitsquatted using trading.sc). A similar principle can affect a character at the start of a domain name, working on the basis that the resulting string ('http:///' - with three slashes) would generally be 'corrected' by a browser.
    • Instances where a '/' is replaced by a relevant character. This could potentially affect the second slash in a URL (giving 'http:/' - with one slash, which would also usually be 'corrected' by a browser) or the third slash (after the domain name), if this generated a valid alternative domain name.
  • URL-delimiter variants for a '#' - These arise because of the potential for substitution between '#' and 'c' or '3'. The '#' character can be used in URLs to prepend URL fragments (e.g. when specifying anchor tags on a webpage). Examples might include cgportal2.uscg.mil being substituted by cgportal2.us#g.mil or isbc.com.cn being substituted by isbc.com.#n - again, the arising syntactical errors will often be 'corrected' by a browser (so cgportal2.us and isbc.com could potentially be used to launch successful bitsquatting attacks in these cases).
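
As a rough illustration of the categories above, the following Python sketch enumerates the single-bit-flip variants of a hostname and, where a flip introduces a '/' or '#', reports the part of the string which a browser would treat as the host. The function names (bit_flips, variants) and the set of permitted characters are assumptions made for this example, and no check is made that the resulting host is a registrable domain name.

```python
import string

# Assumed character set for this sketch: lowercase letters, digits, hyphen,
# plus the '.', '/' and '#' delimiters discussed in the text.
ALLOWED = set(string.ascii_lowercase + string.digits + "-./#")


def bit_flips(ch):
    """Return the allowed characters differing from ch by exactly one bit,
    ordered from the largest-value bit to the smallest."""
    code = ord(ch)
    found = []
    for bit in range(7, -1, -1):
        flipped = chr(code ^ (1 << bit))
        if flipped in ALLOWED:
            found.append(flipped)
    return found


def variants(hostname):
    """Yield (variant_string, host_reached) pairs for every single-bit flip."""
    for i, ch in enumerate(hostname):
        for new in bit_flips(ch):
            variant = hostname[:i] + new + hostname[i + 1:]
            # A '/' or '#' ends the host part of a URL, so only the text
            # before the first such delimiter would be looked up.
            reached = variant.split("/")[0].split("#")[0]
            yield variant, reached


if __name__ == "__main__":
    for name in ("s.ytimg.com", "ecampus.phoenix.edu", "cgportal2.uscg.mil"):
        for variant, reached in variants(name):
            if variant != reached or variant.count(".") != name.count("."):
                print(name, "->", variant, "(host:", reached + ")")
```

The printed lines are limited to the cases where the structure of the hostname changes (a dot gained or lost, or a delimiter introduced), which are the ones a simple fuzzy match on the original domain name might miss.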

Additionally, bitsquatting attacks can take advantage of the possibility for TLDs (top-level domains, or domain extensions) to substitute for each other (e.g. .uk being substituted by .tk), or for a character in the TLD to be replaced by a '.' or other URL delimiter (in cases where the remaining part of the TLD string is also a TLD in its own right), which could provide the potential to bitsquat all domains hosted on the TLD in question - e.g. .cleaning being replaced by .clea.ing, .photography being replaced by .ph/tography, or .auction being replaced by .au#tion.
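
The TLD-level cases can be sketched in the same way. The short Python example below checks a TLD against a set of known extensions, looking both for whole-TLD substitutions and for cases where a delimiter splits the TLD so that the remaining part is itself a valid TLD. The KNOWN_TLDS set here is a small hand-picked sample used purely for illustration (a real check would load the full IANA list), and the function names are hypothetical.

```python
# Small illustrative sample only; not the full list of delegated TLDs.
KNOWN_TLDS = {"uk", "tk", "com", "bom", "cleaning", "ing",
              "photography", "ph", "auction", "au"}


def one_bit_apart(a, b):
    """True if the two characters differ in exactly one bit."""
    x = ord(a) ^ ord(b)
    return x != 0 and (x & (x - 1)) == 0


def tld_bitsquats(tld, known_tlds=KNOWN_TLDS):
    """Return the TLDs under which a bitsquat of `tld` could be registered."""
    found = set()
    # Case 1: one character flips to give another valid TLD (e.g. uk -> tk).
    for other in known_tlds:
        if len(other) == len(tld):
            diffs = [(a, b) for a, b in zip(tld, other) if a != b]
            if len(diffs) == 1 and one_bit_apart(*diffs[0]):
                found.add(other)
    # Case 2: one character flips to a delimiter, splitting the TLD.
    for i, ch in enumerate(tld):
        # A '.' leaves the text after it as the effective TLD.
        if one_bit_apart(ch, ".") and tld[i + 1:] in known_tlds:
            found.add(tld[i + 1:])
        # A '/' or '#' ends the hostname, leaving the text before it.
        if (one_bit_apart(ch, "/") or one_bit_apart(ch, "#")) \
                and tld[:i] in known_tlds:
            found.add(tld[:i])
    return found


if __name__ == "__main__":
    for tld in ("uk", "com", "cleaning", "photography", "auction"):
        print(tld, "->", sorted(tld_bitsquats(tld)))
```

With the sample set above, each of the five examples returns a single extension; against the full list of TLDs the results may well be longer.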

Bitsquatted variants of the top 50 most popular websites

As an investigation into the extent of bitsquatting as a potential attack vector, we consider the utilisation of the bitsquatted variants of each of the top 50 most popular websites (as of March 2024), according to Similarweb[12]. This list (topped by google.com, youtube.com, facebook.com, instagram.com and twitter.com) features domain names covering a range of TLDs, as shown in Table 2.

  TLD      No. instances
  com      39
  ru       3
  org      2
  co.jp    1
  tv       1
  desi     1
  us       1
  me       1
  ne.jp    1

Table 2: TLDs represented in the set of the top 50 most popular websites

Considering the potential bitsquats of each of these 50 websites yields a dataset of 1,553 valid domain names which could be used to create a bitsquatted variant of one of the domain names in question. Only one domain name, g.com, appears as a duplicate in the list (in the variants bi.g.com (for bing.com) and samsu.g.com (for samsung.com)), although - in practice - a one-character .com domain is in any case unlikely to be available for registration. The 1,553 domain names occupy a wider range of TLDs (Table 3) than the original dataset of 50 websites, arising due to cases where bitsquatting of a character in the domain-name extension produces another valid extension[13] (Table 4).

  TLD      No. instances
  com      1,203
  org      90
  ru       60
  bom      39
  desi     39
  tv       29
  ne.jp    26
  co.jp    20
  us       15
  me       5
  su       3
  vu       3
  rw       3
  re       3
  jp       2
  ee       1
  md       1
  mm       1
  ws       1
  tr       1
  tt       1
  ie       1
  tw       1
  ma       1
  mu       1
  mg       1
  tf       1
  es       1
Table 3: TLDs represented in the set of domain names appearing in bitsquatted variants of the top 50 most popular websites

  TLD      Valid variants
  com      bom
  me       ee, ie, ma, md, mg, mm, mu
  ru       re, rw, su, vu
  tv       tf, tr, tt, tw
  us       es, ws

Table 4: Valid variant TLDs appearing in the bitsquatted dataset

Within the dataset of bitsquats, there are 12 instances of valid hostnames (i.e. subdomain plus domain-name combinations). Of these, only two (li.kedin.com (a variant of linkedin.com) and pi.terest.com (a variant of pinterest.com)) were found to be in active use, both redirecting to pay-per-click parking pages offering the respective domain names for sale.

Of the 1,553 potential bitsquatting domains, only 125 appear explicitly to be under the ownership of the brand in question, or under other legitimate usage (on the basis of the citation of official registrant details, or of an enterprise-class registrar, in the whois records). Of these, 43 have (sensibly) been configured to re-direct to the official website in question.

Of the remaining 1,428 domains, all except 179 (i.e. 1,249, or at least 87%) appear to be actively registered by parties other than the relevant brand owner (on the basis of a whois record being returned and/or the presence of a live website response). Some of these will, of course, also represent legitimate use (in cases where the bitsquatted variant forms a legitimate alternative brand name or website in its own right), but there is clear potential for significant abuse within the dataset, particularly where the variant domain name does not appear to constitute any meaningful string other than as a variant of the brand name in question.

Within the set of third-party registrations of valid bitsquatted variants which return live webpages, there are a number of examples of potential concern. As of the time of analysis, only one example was identified of a variant domain name apparently being actively used to impersonate the website in question (Figure 1), but many others were found to be displaying third-party content in a subject or industry area similar to that of the corresponding top-50 website - particularly in cases of adult-content brands (i.e. comprising instances of brand abuse and traffic misdirection). Many more examples were found to have been monetised through the inclusion of PPC links or pages offering the domain names for sale and, given the nature of the risk, it would be advisable to monitor the dormant domains for any changes to website content.

Figure 1: Example of a lookalike Pinterest website hosted on a bitsquatted variant domain name

Conclusions

The technical issue of potential bit corruption in URLs makes bitsquatting a potentially highly effective attack route for infringers and fraudsters. It would generally be advisable for brand owners to defensively register the (relatively small number of) bitsquatted variants of their primary domain name(s) as a way of mitigating such attacks (including variant TLDs where possible / appropriate - e.g. registering the variant .tk version to accompany a .uk registration). However, for the top 50 most popular websites globally, this defensive approach has seen relatively limited adoption, and significant numbers of the domain names which could be used for attacks of this type have been registered by third parties. In many cases these are in active use for traffic misdirection or as revenue generators. It is also concerning that many restricted domain name-spaces (i.e. those where specific registration requirements or limitations are in place) are susceptible to bitsquatting attacks.

Other tactics which might effectively be employed by brand owners or industry organisations to prevent bitsquatting (as summarised in the previously referenced Cisco white paper) might include:

by brand owners:

  • appropriate selection of suitable TLDs for their primary website (i.e. those where the bitsquatted variants do not produce valid domain extensions) – note that, whilst this is technically a remediative option, in practice it is likely to be overruled by branding or SEO (search-engine optimisation) considerations (e.g. the favourability of .com as a primary domain name extension)
  • careful selection and use of subdomain names
  • use of relative, rather than absolute, hyperlinks in HTML content (to minimise the number of times the domain name needs to be loaded in and out of computer memory)
  • use of capital letters (which generally have fewer bitsquat variants) in the (limited) sections of URLs which are case sensitive

by other organisations:

  • restrictions by registries on the registration of domains featuring keywords (such as 'www') which are conducive to this style of attack, or of registrations by any out-of-territory registrants
  • introduction of mandates to include error checking against bitsquats in hardware devices[14]

The findings presented in this study also highlight the importance of a brand monitoring approach which is able to detect the registration of the candidate domain names, analyse their content, and support rapid and effective enforcement in cases where abuse is identified.

References

[1] 'Patterns in Brand Monitoring' by D.N. Barnett, Chapter 7: 'Creation of deceptive URLs' [awaiting publication]

[2] https://www.ascii-code.com/

[3] Binary (base-2) 1100001 = (1 × 2^6) + (1 × 2^5) + 1 = Decimal (base-10) 97

[4] https://www.totalphase.com/blog/2023/05/binary-ascii-relationship-differences-embedded-applications/

[5] http://dinaburg.org/bitsquatting.html

[6] https://en.wikipedia.org/wiki/Bitsquatting

[7] https://web.archive.org/web/20180713212603/http://media.blackhat.com/bh-us-11/Dinaburg/BH_US_11_Dinaburg_Bitsquatting_WP.pdf

[8] https://www.securitee.org/files/bitsquatting_www2013.pdf

[9] https://www.iamstobbs.com/idns-ebook

[10] Reading from left to right in the byte, i.e. with the largest-value bit first

[11] https://media.defcon.org/DEF%20CON%2021/DEF%20CON%2021%20presentations/DEF%20CON%2021%20-%20Schultz-Examining-the-Bitsquatting-Attack-Surface-WP.pdf

[12] https://similarweb.com/top-websites/

[13] https://data.iana.org/TLD/tlds-alpha-by-domain.txt

[14] https://sec.okta.com/articles/2020/11/why-bitsquatting-attacks-are-here-stay

This article was first published as an e-book on 16 May 2024 at:

https://www.iamstobbs.com/the-world-of-the-bitsquat

Tuesday, 14 May 2024

IP and digital due diligence: constructing a domain policy that Matches brand owner requirements

Following the placing of luxury e-commerce platform Matches[1] (formerly Matchesfashion) into administration by owners Frasers Group[2] in March 2024[3], Mike Ashley's Frasers repurchased the company's intellectual property (domain names, trademarks and store databases) at the end of April. It has not, however, purchased the £80 million worth of stock or taken on the 250 remaining employees[4].

Introduction

Matches was established as a physical store in 1987, before expanding to an online e-commerce platform in 2007[5]. The company was acquired by Frasers for £52 million in 2023 from private equity firm Apax Partners[6,7], which had purchased it around six years earlier for an estimated £400 million[8].

In this article, we take a look at the potential landscape of the Matches domain name portfolio and use it as a case study to consider some points related to a suitable registration and domain-management policy for the brand[9].

Methodology

Given the generic nature of the 'Matches' brand name, it makes sense to consider in an initial landscape analysis only those domain names of greatest potential relevance. The first step is to focus on those where the brand name appears at the start of the domain name. As of the date of analysis (01-May-2024), 480 registered domains with names beginning with 'matches' were identified. Many of these are clearly not relevant, so we next filter the dataset to consider only those where the SLD (the second-level domain name, i.e. the portion to the left of the dot) consists only of the term 'matches', or where other relevant terms are also present (i.e. those related to fashion or commerce, generic terms such as 'online', or geographic terms). This yields a dataset of 184 domain names of potential relevance.
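
A filter of this kind might be sketched in Python as follows. The keyword list is hypothetical (the exact terms used in the analysis are not listed here), as are the function names; a real implementation would also need a more careful treatment of very short keywords, which tend to produce false positives.

```python
# Hypothetical keyword list, for illustration only.
KEYWORDS = {"fashion", "shop", "store", "online", "boutique", "outlet"}


def sld_of(domain):
    """Return the portion of the domain name to the left of the first dot."""
    return domain.lower().split(".", 1)[0]


def is_relevant(domain, brand="matches", keywords=KEYWORDS):
    """Keep domains whose SLD is the brand alone, or brand plus a keyword."""
    sld = sld_of(domain)
    if not sld.startswith(brand):
        return False
    # Ignore hyphens so that 'matches-fashion' is treated like 'matchesfashion'.
    rest = sld[len(brand):].replace("-", "")
    if rest == "":
        return True
    return any(keyword in rest for keyword in keywords)


if __name__ == "__main__":
    sample = ["matches.com", "matchesfashion.com", "matches-fashion.co.uk",
              "matchesplayed.net", "matchmaker.com"]
    print([d for d in sample if is_relevant(d)])
```

The sample domain names in the usage block are for demonstration and are not drawn from the dataset.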

The first point to note is that, given the generic nature of the brand, it is unlikely that the brand owner will want (or easily be able) to acquire all domains of the form matches.[TLD]. Indeed, the company's primary website is matchesfashion.com, and even the brand-name-only .com (the most popular TLD) domain (i.e. matches.com) is under the ownership of a third party (it currently re-directs to world.com, displaying content relating to the sale of 'premium domain names').

The next point of significance is that Matches' official primary website domain is actually registered via a proxy service ('Domains By Proxy, LLC') meaning that it is difficult to definitively verify which domains are under official ownership. However, there are 53 domains within the dataset which re-direct to matchesfashion.com – covering a small number of matches.[TLD] domains, a larger group of domains where the SLD is 'matchesfashion' or 'matches-fashion', and a small number of others (featuring additional keywords such as 'store', 'site' or 'website'). Many of these are also registered via Domains By Proxy and seem likely to comprise (at least part of) the official domain portfolio. Some are explicitly registered with alternative contact details, and may be of concern (and warrant careful monitoring and potential enforcement) if not actually under the control of the brand owner. As a general recommendation, it might be appropriate for brand owners to consolidate all official registrations under a single (ideally enterprise-class) registrar, and standardise the contact details cited in the whois record (usually with official corporate information).

Analysis

From amongst the remainder of the (probably third-party-owned) dataset domains in this sample landscape (53 of the form matches.[TLD] and another 78 featuring additional keywords), a number of relevant insights can be drawn:

  • The dataset includes several examples of additional domains with SLDs of 'matchesfashion' or 'matches-fashion', i.e. the brand owner's apparent preferred domain-name format.
  • There are numerous domains comprising misspellings of 'matchesfashion.com'.
  • There are several domains with names featuring variants of 'matchesboutique', 'matchesonline', 'matchesworld' and other relevant keywords.
  • Some domains (including several of the misspellings) resolve to pages featuring pay-per-click links, indicating an effort to take advantage of the brand name and monetise the content.
  • Some of the registrations are likely to pertain to legitimate third-party use of the 'Matches' name.

In addition, there are a small number of findings of greater concern:

  • Two of the misspellings (matchesashion.com and matchesfshion.com) re-direct via affiliate-tracking URLs to the official matchesfashion.com site. Whilst this is most likely purely a revenue-generation scheme, there is potential for the domains to be used fraudulently, with the re-direction to the official site designed to provide the appearance of legitimacy.
  • Four of the sites resolve to a log-in page (Figure 1), presenting the potential for fraudulent use (e.g. phishing).
  • One domain resolves to a third-party site using the 'Matches' name in a similar industry area (Figure 2).

Figure 1: Screenshot of a log-in page displayed by four of the third-party domains in the dataset

Figure 2: Screenshot of a third-party site using the 'Matches' name in a similar industry area

Approach

Based on these findings, the following general recommendations regarding a domain registration and management policy may be appropriate:

  • In cases where any of the third-party sites feature active infringements, it would be advisable to launch enforcement actions to deactivate the content, and/or consider a dispute procedure if the brand owner wishes to reclaim the domain for their own portfolio.
  • If any of the other third-party-owned domains are required for the official portfolio, it may be appropriate to attempt to acquire them through purchase or dispute.
  • Outside the set of domains which are currently registered, it may make sense (depending on the balance between budget and risk, and on planned future business expansions) to attempt to purchase domains featuring relevant keyword patterns, across relevant TLDs, for inclusion in a core / tactical (i.e. defensive and strategic) portfolio. Some of the key domains of potential interest might include:
    • Those featuring the brand name together (say, with and without hyphens) with relevant keywords, such as 'fashion', product keywords such as 'designer', 'luxury', 'clothing', etc., and potentially terms such as 'shop', 'store', 'outlet' etc. (particularly given the 'Matches Outlet' branding used on the official website), and/or other relevant generic or geographical keywords. Note that this may include domains where the brand name does not necessarily appear at the start.
    • Domains across TLDs relating to the brand's current or future planned geographical areas of business, or which relate specifically to the industry area or e-commerce generally (e.g. .fashion, .boutique, .luxury, .moda, .shop, .store, etc.)
    • It may also be appropriate to register domains featuring other relevant terms, such as brand taglines.

The general principle is usually to achieve coverage across a wider range of TLDs for the higher-relevance keywords / SLDs (e.g. 'matchesfashion' in this case).

  • Beyond the construction of an official portfolio, it is generally advisable to monitor for the registration of relevant domains, and the potential appearance of infringing content, with a view to taking subsequent enforcement where appropriate. This might particularly be relevant for misspellings, where it would be unsustainable to attempt to defensively register all variants pre-emptively.
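
To give a sense of why pre-emptive registration of all misspellings is impractical, the following Python sketch generates three simple classes of typo for an SLD: single-character omissions, substitutions and adjacent transpositions. This is only one possible (and deliberately narrow) definition of a 'misspelling', chosen for illustration, and the function name is an assumption for this example.

```python
import string


def misspellings(sld):
    """Return single-edit misspellings: omissions, substitutions, transpositions."""
    out = set()
    for i in range(len(sld)):
        # Omission of the character at position i.
        out.add(sld[:i] + sld[i + 1:])
        # Substitution of the character at position i by any other letter.
        for letter in string.ascii_lowercase:
            out.add(sld[:i] + letter + sld[i + 1:])
    for i in range(len(sld) - 1):
        # Transposition of the characters at positions i and i + 1.
        out.add(sld[:i] + sld[i + 1] + sld[i] + sld[i + 2:])
    out.discard(sld)  # the original string is not a misspelling of itself
    return out


if __name__ == "__main__":
    candidates = misspellings("matchesfashion")
    print(len(candidates), "candidate misspellings")
    print("matchesashion" in candidates, "matchesfshion" in candidates)
```

Even with this narrow definition, a 14-character SLD yields several hundred candidates per TLD, including the two omission-type misspellings noted in the analysis above, which is why monitoring is generally a more sustainable approach than blanket registration.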

Take-homes

The re-purchase of the Matches IP portfolio, for a greater sum than the purchase of the whole company a year earlier, provides a striking illustration of the extent to which the value of a brand can be dominated by its intangible assets. However, the value and usefulness of a set of domain names is limited by the quality of any domain management policy which underlies it. Policies of this type are a key consideration for brand owners, incorporating insights from analysis of the pre-existing state of official and third-party registrations, and taking into account the balance between registration and renewal costs, and business requirements for core and tactical domains, covering brand variants, relevant keywords, and appropriate domain-name extensions (top-level domains, or TLDs).

Frasers' last acquisition of the Matches brand was short-lived, and the brand has had multiple owners in two decades, potentially giving rise to different approaches and competing commercial interests, such as factors regarding the spend on IP versus commercial operations. Stability of the ownership and management of Matches may assist with the brand’s future IP portfolio rationalisation. After all, a carefully managed and properly executed policy can help brand owners maximise their value for money, control their IP, manage infringements and help to strengthen the brand overall.

References

[1] https://www.matchesfashion.com/

[2] https://frasers.group/

[3] https://www.retailgazette.co.uk/blog/2024/03/matches-administration/

[4] https://www.retailgazette.co.uk/blog/2024/04/frasers-matches-ip/

[5] http://www.managementtoday.co.uk/article/1668609

[6] https://www.independent.co.uk/business/frasers-group-buys-matches-fashion-for-ps52m-b2467208.html

[7] https://www.businessoffashion.com/articles/retail/mike-ashleys-frasers-group-buys-matchesfashion/

[8] https://www.theguardian.com/technology/2017/sep/01/husband-wife-chapman-bank-400m-sale-matches-fashion

[9] https://www.iamstobbs.com/opinion/strategies-for-constructing-a-domain-name-registration-and-management-policy

This article was first published on 14 May 2024 at:

https://www.iamstobbs.com/opinion/ip-and-digital-due-diligence-constructing-a-domain-policy-that-matches-brand-owner-requirements

Unregistered Gems Part 6: Phonemizing strings to find brandable domains

Introduction The UnregisteredGems.com series of articles explores a range of techniques to filter and search through the universe of unregis...