Thursday, 16 May 2024

The world of the bitsquat

BLOG POST

Bitsquatting is a way of launching a potential attack against a trusted website. It is reliant on the fact that, in some cases, the binary-string representation of the requested URL can become corrupted in transit, with a '1' being flipped to a '0' (or vice-versa). Ordinarily this would result in an invalid URL being produced, but in cases where the corrupted version is a valid URL in its own right, bad actors can register these variant domain names as a way of intercepting traffic intended for the legitimate site. This is analogous to cybersquatting or typosquatting, with no requirement to compromise the site explicitly. 

In a new study, we consider the bitsquat variants of each of the top 50 most popular websites, as of March 2024. Of the 1,553 valid domain names which could be used to launch bitsquatting attacks, only 125 appear explicitly to be under the ownership of the brand in question (or under other legitimate usage). Only 43 of these have been configured to re-direct to the official website in question. 

Of the remainder, at least 87% appear to be registered by third parties. Although some represent legitimate usage of the brand-name variant in question, many appear to have been registered with malicious intent. One active example of a lookalike site was identified in the dataset, in addition to many more misdirecting web traffic to similar content, or which have been monetised through the inclusion of pay-per-click links or offers to sell the domain names. Many of these present the potential to be 'weaponised' for use in attacks at a later date. 

There are a number of options available to brand owners to mitigate these risks. The first is the defensive registration of the bitsquat variants of their primary domain names, and active monitoring for - and enforcement against - identified infringements. Other possibilities include the use of domain extensions which are not amenable to bitsquat attacks, appropriate use of subdomains, and increased use of relative (rather than absolute) hyperlinks in the HTML of their websites.  

In practice, bitsquatting is not an attack vector which has been extensively exploited by bad actors to date. Nevertheless, it does raise some significant risks in the limited instances where it occurs, and is of particular concern in cases where a carefully selected registration can allow an attacker to target all domains on a specific extension. Realistically, some of the suggested remediative actions will only be appropriate in limited instances, and are often likely to be superseded by other branding considerations. However, the most advanced domain management and registration policies should certainly bear the issue in mind as a potential risk factor, and some simple steps (such as also registering the variant .tk version to accompany a .uk domain) can easily be made to improve the risk profile for brands.

This article was first published on 16 May 2024 at:

https://www.iamstobbs.com/opinion/redirecting-into-the-world-of-the-bitsquat

* * * * *

WHITE PAPER

Introduction to bitsquatting

Many previous descriptions of techniques used by infringers have discussed ways in which deceptive URLs can be created, to misdirect Internet users and drive web traffic to non-legitimate sites[1]. One related tactic concerns the issue of 'bitsquatting'.

As part of the technical process taking place when a user attempts to access a website, the URL is converted to a string of individual characters, each of which is represented as a unique ASCII code[2] - essentially, a number between 0 and 255, forming a range covering all allowable characters - which is then converted to its binary equivalent. For example, the character 'a' (lower-case a) is ASCII number 97 which, in binary, is 1100001[3]. This is generally expressed as an eight-binary-digit ('8-bit') string (or one byte, in IT terms), by adding leading zeroes as necessary (i.e. 01100001 in this case)[4]. However, when accessing a website, the string needs to be copied in and out of computer memory multiple times[5], and in some cases a corruption of the binary string can occur, due to factors such as hardware faults or electromagnetic interference[6]. Generally this manifests as one of the '1's being 'flipped' to a '0', or vice-versa. If this occurs, a different character is obtained when the binary string is decoded. In many cases, this would mean that there would be an error in accessing the website (e.g. if a request to access google.com was corrupted to gokgle.com, nothing would be found if this typo-variant domain did not exist). However, infringers can take advantage of this possibility, by proactively registering ('bitsquatting') the specific domain variants which may arise (e.g. by specifically having registered gokgle.com, in the above example), so that any such corrupted requests are instead directed towards their own content.
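The encoding and single-bit corruption described above can be illustrated with a minimal Python sketch. The helper name flip_bit is an assumption made for illustration only, and bits are numbered 1 to 8 from left to right (largest value first), matching the convention used for Table 1.

```python
def flip_bit(char: str, bit: int) -> str:
    """Flip one bit of a character's 8-bit code.

    Bits are numbered 1-8 from left to right, so bit 1 has the largest value.
    """
    mask = 1 << (8 - bit)
    return chr(ord(char) ^ mask)


# 'a' is ASCII 97, i.e. 01100001 as an 8-bit string
binary_a = format(ord("a"), "08b")

# Corrupt the second 'o' of google.com (index 2) by flipping its bit 6:
# 'o' = 01101111 becomes 01101011 = 'k'
domain = "google.com"
corrupted = domain[:2] + flip_bit(domain[2], 6) + domain[3:]

print(binary_a)   # 01100001
print(corrupted)  # gokgle.com
```

Only one of the 80 bits in this ten-character string needs to change for the request to resolve to a different domain name.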

Note that this route of attack is not reliant on compromising the site in question, but rather takes advantage of hardware errors which can occur naturally. An early analysis presented at Black Hat[7] considered the targeting of eight legitimate domains using 31 bitsquatted variants, finding that over 52,000 web requests were made to the bitsquat domains over a six-month period.

Furthermore, a 2013 study[8] found a number of instances where bitsquatted domains were actively being utilised for brand abuse. These included a case where variants of huffingtonpost.com were being used to direct users to a page promoting the sale of hardware products, and one where variants of microsoft.com were being used to distribute malware or fake antivirus products.

The technical framework described above means that (considering only instances where a single binary digit ('bit') is corrupted) there are up to eight possible bitsquatted variants of any given character (i.e. one for the transposed version of each individual bit). These are listed in Table 1.

Note that in this analysis:

  • We consider only the alphabetic characters (a – z), the numerals (0 – 9), and the characters '-', '.', '/' and '#', which can appear in a domain name / URL.
  • We do not consider case variants, partly because domain names / URLs are (generally) case-insensitive, but also because upper- and lower-case versions of the same character differ by ASCII values of 32 (i.e. differ in a single bit) - e.g. 'a' is ASCII 97 (01100001) and 'A' is ASCII 65 (01000001), so upper- and lower-case versions of the same character will always bitsquat-'match' each other, and their other [alphabetic] bitsquat variants will also differ only in case (e.g. the variants of 'a' will be ('A'), 'q', 'i', 'e' and 'c', and the variants of 'A' will be ('a'), 'Q', 'I', 'E' and 'C').
  • We also exclude other non-Latin characters since, although these can be used in domain names, they are generally treated differently, and usually expressed in Internet technology applications in Punycode format, which uses only Latin characters (e.g. where hermès.com is expressed as xn--herms-7ra.com)[9].
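The case relationship and the Punycode conversion noted in the points above can be checked with a short Python sketch. The function name letter_variants is an assumption made for illustration; the Punycode conversion uses Python's built-in 'idna' codec.

```python
# Upper- and lower-case ASCII letters differ only in the bit with value 32
case_bit = ord("a") ^ ord("A")


def letter_variants(char: str) -> list:
    """Single-bit variants of `char` that are ASCII letters,
    excluding the character's own case variant."""
    found = []
    for shift in range(7, -1, -1):  # largest-value bit first
        candidate = chr(ord(char) ^ (1 << shift))
        is_letter = candidate.isascii() and candidate.isalpha()
        if is_letter and candidate.lower() != char.lower():
            found.append(candidate)
    return found


print(case_bit)              # 32
print(letter_variants("a"))  # ['q', 'i', 'e', 'c']
print(letter_variants("A"))  # ['Q', 'I', 'E', 'C']

# Punycode form of a domain name containing a non-ASCII character
# ("\u00e8" is the character 'è')
punycode = "herm\u00e8s.com".encode("idna").decode("ascii")
print(punycode)
```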

Character   1   2   3   4   5   6   7   8
a                       q   i   e   c
b                       r   j   f       c
c               #       s   k   g   a   b
d                       t   l       f   e
e                       u   m   a   g   d
f                       v   n   b   d   g
g                       w   o   c   e   f
h                       x       l   j   i
i                       y   a   m   k   h
j                       z   b   n   h   k
k                           c   o   i   j
l                           d   h   n   m
m               -           e   i   o   l
n               .           f   j   l   o
o               /           g   k   m   n
p               0           x   t   r   q
q               1       a   y   u   s   p
r               2       b   z   v   p   s
s               3       c       w   q   r
t               4       d       p   v   u
u               5       e       q   w   t
v               6       f       r   t   w
w               7       g       s   u   v
x               8       h   p       z   y
y               9       i   q           x
z                       j   r       x
0               p           8   4   2   1
1               q           9   5   3   0
2               r               6   0   3
3               s       #       7   1   2
4               t               0   6   5
5               u               1   7   4
6               v               2   4   7
7               w               3   5   6
8               x           0           9
9               y           1           8
-               m                   /   ,
.               n                       /
/               o                   -   .
#               c       3

Table 1: Bitsquatted variants of each character permissible in a standard domain name / URL; columns show the variants based on the transposition of the bit number[10] shown in the column heading
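The rows of Table 1 can be regenerated with a few lines of Python. This is a sketch under the assumptions stated in the notes above (lower-case letters, numerals and the four delimiter characters only); the names ALLOWED and bit_variants are illustrative.

```python
import string

# Characters considered in the analysis: a-z, 0-9, '-', '.', '/' and '#'
ALLOWED = set(string.ascii_lowercase + string.digits + "-./#")


def bit_variants(char: str, allowed=ALLOWED) -> dict:
    """Map each bit number (1-8, largest-value bit first) to the character
    produced by flipping that bit, keeping only allowed characters."""
    variants = {}
    for bit in range(1, 9):
        candidate = chr(ord(char) ^ (1 << (8 - bit)))
        if candidate in allowed:
            variants[bit] = candidate
    return variants


print(bit_variants("c"))  # {2: '#', 4: 's', 5: 'k', 6: 'g', 7: 'a', 8: 'b'}
print(bit_variants("z"))  # {4: 'j', 5: 'r', 7: 'x'}
```

Bits 1 and 3 never yield an entry: flipping bit 1 leaves the 7-bit ASCII range, and flipping bit 3 of a letter gives only its case variant. The ',' listed against '-' in Table 1 lies outside the character set assumed here, so this sketch does not return it.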

In general, (assuming only one bitsquatted character is present in any given case) this means that the bitsquatted variants of any given domain name will form a subset of the domain names comprising 'fuzzy' variants (where any character can be replaced by any other character). Therefore, a domain monitoring solution which incorporates fuzzy matching will - by definition - detect these bitsquatted variants, without needing to monitor for them explicitly. However, there are some exceptions to this principle, mostly arising from the inclusion of the '.', '/' and '#' characters in the set of variants[11]. The main additional categories of potential bitsquats are as follows:

  • Subdomain-related variants (with a '.') - These can arise because of the potential for 'n' and '.' to substitute for each other. This gives rise to two types of potential bitsquat:
    • Instances where an 'n' is replaced by a '.', such that the bitsquatted domain is a truncated version of the original domain, e.g. where windowsupdate.com could be replaced by wi.dowsupdate.com (such that the bitsquatted variant domain name would be dowsupdate.com).
    • Instances where a '.' (e.g. as used in an active hostname, or subdomain / domain-name combination) is replaced by an 'n', e.g. where s.ytimg.com (a hostname used by YouTube for content delivery) would be replaced by snytimg.com. There are a range of popularly-used hostnames which might be susceptible to this type of attack, including several used for affiliate- or other URL-tracking services.

  • URL-delimiter variants for a '/' - These arise because of the potential for substitution between '/' (which forms bitsquat variants with 'o', '-' and '.') and any of these other characters. Again, this implies two types of potential bitsquat:
    • Instances where a relevant character is replaced by a '/', which can constitute a valid bitsquat if the preceding characters form a valid domain name (e.g. ecampus.phoenix.edu being replaced by ecampus.ph/enix.edu (i.e. with a variant domain name of ecampus.ph), or trading.scottrade.com being bitsquatted using trading.sc). A similar principle can affect a character at the start of a domain name, working on the basis that the resulting string ('http:///' - with three slashes) would generally be 'corrected' by a browser.
    • Instances where a '/' is replaced by a relevant character. This could potentially affect the second slash in a URL (giving 'http:/' - with one slash, which would also usually be 'corrected' by a browser) or the third slash (after the domain name), if this generated a valid alternative domain name.
  • URL-delimiter variants for a '#' - These arise because of the potential for substitution between '#' and 'c' or '3'. The '#' character can be used in URLs to prepend URL fragments (e.g. when specifying anchor tags on a webpage). Examples might include cgportal2.uscg.mil being substituted by cgportal2.us#g.mil or isbc.com.cn being substituted by isbc.com.#n - again, the arising syntactical errors will often be 'corrected' by a browser (so cgportal2.us and isbc.com could potentially be used to launch successful bitsquatting attacks in these cases).
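The delimiter-related categories above can be sketched in Python as follows. This is a simplified illustration rather than a definitive implementation: the function names are hypothetical, anything after an injected '/' or '#' is simply discarded (mimicking the browser 'correction' described above), TLD validity is not checked, and the cases affecting the slashes in 'http://' are not covered.

```python
import string

ALLOWED = set(string.ascii_lowercase + string.digits + "-./#")


def char_variants(char: str) -> list:
    """Single-bit variants of `char` within the allowed character set."""
    flipped = (chr(ord(char) ^ (1 << shift)) for shift in range(8))
    return [c for c in flipped if c in ALLOWED]


def is_plausible_hostname(host: str) -> bool:
    """Rough syntax check only: at least two non-empty labels of letters,
    digits and hyphens, with no leading or trailing hyphen."""
    labels = host.split(".")
    if len(labels) < 2:
        return False
    return all(
        label
        and not label.startswith("-")
        and not label.endswith("-")
        and all(ch.isalnum() or ch == "-" for ch in label)
        for label in labels
    )


def bitsquat_hostnames(hostname: str) -> set:
    """Hostnames that might be contacted after a single bit flip."""
    results = set()
    for i, char in enumerate(hostname):
        for variant in char_variants(char):
            corrupted = hostname[:i] + variant + hostname[i + 1:]
            # Keep only the part before the first '/' or '#'
            host = corrupted.split("/")[0].split("#")[0]
            if is_plausible_hostname(host):
                results.add(host)
    return results


print("wi.dowsupdate.com" in bitsquat_hostnames("windowsupdate.com"))  # True
print("snytimg.com" in bitsquat_hostnames("s.ytimg.com"))              # True
print("ecampus.ph" in bitsquat_hostnames("ecampus.phoenix.edu"))       # True
print("cgportal2.us" in bitsquat_hostnames("cgportal2.uscg.mil"))      # True
```

In the first case the name an attacker would need to register is the parent domain (dowsupdate.com), since wi.dowsupdate.com is a subdomain of it.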

Additionally, bitsquatting attacks can also take advantage of the possibility for TLDs (top-level domains, or domain extensions) to substitute for each other (e.g. .uk being substituted by .tk), or for a character in the TLD to be replaced by a '.' or other URL delimiter (in cases where the remaining part of the TLD string is also a TLD in its own right), which could provide the potential to bitsquat all domains hosted on the TLD in question - e.g. .cleaning being replaced by .clea.ing, .photography being replaced by .ph/tography, or .auction being replaced by .au#ction.
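Both TLD-level cases can be checked mechanically, as in the following Python sketch. SAMPLE_TLDS is a small hand-picked set used purely for illustration (a real check would use the full list of delegated TLDs), and the function names are assumptions.

```python
# Hand-picked sample for illustration only, not the full TLD list
SAMPLE_TLDS = {"uk", "tk", "com", "bom", "cleaning", "ing",
               "photography", "ph", "auction", "au"}


def tld_substitutions(tld: str, known=SAMPLE_TLDS) -> list:
    """Known TLDs that differ from `tld` by a single flipped bit."""
    found = set()
    for i, ch in enumerate(tld):
        for shift in range(8):
            candidate = tld[:i] + chr(ord(ch) ^ (1 << shift)) + tld[i + 1:]
            if candidate in known:
                found.add(candidate)
    return sorted(found)


def tld_splits(tld: str, known=SAMPLE_TLDS) -> list:
    """Known TLDs exposed when a flipped bit turns a character of `tld`
    into a delimiter: the part after a '.', or the part before '/' or '#'."""
    found = set()
    for i, ch in enumerate(tld):
        for shift in range(8):
            variant = chr(ord(ch) ^ (1 << shift))
            if variant == "." and tld[i + 1:] in known:
                found.add(tld[i + 1:])
            elif variant in "/#" and tld[:i] in known:
                found.add(tld[:i])
    return sorted(found)


print(tld_substitutions("uk"))    # ['tk']
print(tld_splits("cleaning"))     # ['ing']
print(tld_splits("photography"))  # ['ph']
print(tld_splits("auction"))      # ['au']
```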

Bitsquatted variants of the top 50 most popular websites

As an investigation into the extent of bitsquatting as a potential attack vector, we consider the utilisation of the bitsquatted variants of each of the top 50 most popular websites (as of March 2024), according to Similarweb[12]. This list (topped by google.com, youtube.com, facebook.com, instagram.com and twitter.com) features domain names covering a range of TLDs, as shown in Table 2.

TLD      No. instances
com      39
ru       3
org      2
co.jp    1
tv       1
desi     1
us       1
me       1
ne.jp    1

Table 2: TLDs represented in the set of the top 50 most popular websites

Considering the potential bitsquats of each of these 50 websites yields a dataset of 1,553 valid domain names which could be used to create a bitsquatted variant of one of the domain names in question. Only one domain name, g.com, appears as a duplicate in the list (in the variants bi.g.com (for bing.com) and samsu.g.com (for samsung.com)); in practice, as a one-character .com domain, it is in any case unlikely to be available for registration. The 1,553 domain names occupy a wider range of TLDs (Table 3) than the original dataset of 50 websites, arising due to cases where bitsquatting of a character in the domain-name extension produces another valid extension[13] (Table 4).

TLD      No. instances
com      1,203
org      90
ru       60
bom      39
desi     39
tv       29
ne.jp    26
co.jp    20
us       15
me       5
su       3
vu       3
rw       3
re       3
jp       2
ee       1
md       1
mm       1
ws       1
tr       1
tt       1
ie       1
tw       1
ma       1
mu       1
mg       1
tf       1
es       1

Table 3: TLDs represented in the set of domain names appearing in bitsquatted variants of the top 50 most popular websites

TLD    Valid variants
com    bom
me     ee, ie, ma, md, mg, mm, mu
ru     re, rw, su, vu
tv     tf, tr, tt, tw
us     es, ws

Table 4: Valid variant TLDs appearing in the bitsquatted dataset

Within the dataset of bitsquats, there are 12 instances of valid hostnames (i.e. subdomain plus domain-name combinations). Of these, only two (li.kedin.com (a variant of linkedin.com) and pi.terest.com (a variant of pinterest.com)) were found to be in active use, both redirecting to pay-per-click parking pages offering the respective domain names for sale.

Of the 1,553 potential bitsquatting domains, only 125 appear explicitly to be under the ownership of the brand in question, or under other legitimate usage (on the basis of the citation of official registrant details, or of an enterprise-class registrar, in the whois records). 43 of these have (sensibly) been configured to re-direct to the official website in question.

Of the remainder, all except 179 (i.e. at least 87%) appear to be actively registered by parties other than the relevant brand owner (on the basis of a whois record being returned and/or the presence of a live website response). Some of these will, of course, also represent legitimate use (in cases where the bitsquatted variant forms a legitimate alternative brand name or website in its own right), but there is clear potential for significant abuse within the dataset, particularly where the variant domain name does not appear to constitute any meaningful string other than as a variant of the brand name in question.
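The 87% figure follows directly from the counts quoted above, as the short calculation below shows. The variable names are illustrative, and the 179 is taken to be the part of the remainder not shown to be registered by third parties.

```python
total_candidates = 1553      # valid bitsquat domain names in the dataset
brand_or_legitimate = 125    # under brand ownership or other legitimate usage
not_shown_third_party = 179  # the 179 excluded in the text

remainder = total_candidates - brand_or_legitimate
third_party = remainder - not_shown_third_party
share = third_party / remainder

print(remainder, third_party, f"{share:.1%}")  # 1428 1249 87.5%
```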

Within the set of third-party registrations of valid bitsquatted variants which return live webpages, there are also, however, a number of examples of potential concern. As of the time of analysis, only one example was identified of a variant domain name apparently in active use to impersonate the website in question (Figure 1), but many others were found to be displaying third-party content with a subject or industry area similar to that of the corresponding top-50 website - particularly in cases of adult-content brands (i.e. comprising instances of brand abuse and traffic misdirection). Many more examples were found to have been monetised through the inclusion of PPC links or pages offering the domain names for sale and, given the nature of the risk, it would be advisable to monitor the dormant domains for any changes to website content.

Figure 1: Example of a lookalike Pinterest website hosted on a bitsquatted variant domain name

Conclusions

The technical issue of potential bit corruption in URLs makes bitsquatting a potentially effective attack route for infringers and fraudsters. It would generally be advisable for brand owners to defensively register the (relatively small number of) bitsquatted variants of their primary domain name(s) as a way of mitigating such attacks (including variant TLDs where possible / appropriate - e.g. registering the variant .tk version to accompany a .uk registration). However, for the top 50 most popular websites globally, this defensive approach has seen relatively limited adoption, and significant numbers of the domain names which could be used for attacks of this type have been registered by third parties. In many cases these are in active use for traffic misdirection or as revenue generators. It is also concerning that many restricted domain name-spaces (i.e. those where specific registration requirements or limitations are in place) are susceptible to bitsquatting attacks.

Other tactics which brand owners or industry organisations could employ to prevent bitsquatting (as summarised in the previously referenced Cisco white paper[11]) include:

by brand owners:

  • appropriate selection of suitable TLDs for their primary website (i.e. those where the bitsquatted variants do not produce valid domain extensions) – note that, whilst this is technically a remediative option, in practice it is likely to be overruled by branding or SEO (search-engine optimisation) considerations (e.g. the favourability of .com as a primary domain name extension)
  • careful selection and use of subdomain names
  • use of relative, rather than absolute, hyperlinks in HTML content (to minimise the number of times the domain name needs to be loaded in and out of computer memory)
  • use of capital letters (which generally have fewer bitsquat variants) in the (limited) sections of URLs which are case sensitive

by other organisations:

  • restrictions by registries on the registration of domains featuring keywords (such as 'www') which are conducive to this style of attack, or of registrations by any out-of-territory registrants
  • introduction of mandates to include error checking against bitsquats in hardware devices[14]
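The point made above about capital letters having fewer bitsquat variants can be checked with a short Python sketch. The names URL_CHARS and variants are illustrative, and the character set assumed here is the one used for Table 1 with upper-case letters added.

```python
import string

# Characters assumed for this sketch: letters (both cases), digits, '-', '.', '/', '#'
URL_CHARS = set(string.ascii_letters + string.digits + "-./#")


def variants(char: str) -> list:
    """Single-bit variants of `char`, excluding its own case variant."""
    flipped = (chr(ord(char) ^ (1 << shift)) for shift in range(8))
    return [c for c in flipped
            if c in URL_CHARS and c.lower() != char.lower()]


lower_total = sum(len(variants(c)) for c in string.ascii_lowercase)
upper_total = sum(len(variants(c)) for c in string.ascii_uppercase)

print(variants("n"))  # ['o', 'l', 'j', 'f', '.']
print(variants("N"))  # ['O', 'L', 'J', 'F']
print(upper_total < lower_total)  # True
```

The difference arises because flipping bit 2 of a lower-case letter can produce a numeral or a delimiter (e.g. 'n' to '.'), whereas the same flip on an upper-case letter produces a non-printing control character.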

The findings presented in this study also highlight the importance of a brand monitoring approach which is able to detect the registration of the candidate domain names, analyse their content, and support rapid and effective enforcement in cases where abuse is identified.

References

[1] 'Patterns in Brand Monitoring' by D.N. Barnett, Chapter 7: 'Creation of deceptive URLs' [awaiting publication]

[2] https://www.ascii-code.com/

[3] Binary (base-2) 1100001 = (1 × 2^6) + (1 × 2^5) + 1 = Decimal (base-10) 97

[4] https://www.totalphase.com/blog/2023/05/binary-ascii-relationship-differences-embedded-applications/

[5] http://dinaburg.org/bitsquatting.html

[6] https://en.wikipedia.org/wiki/Bitsquatting

[7] https://web.archive.org/web/20180713212603/http://media.blackhat.com/bh-us-11/Dinaburg/BH_US_11_Dinaburg_Bitsquatting_WP.pdf

[8] https://www.securitee.org/files/bitsquatting_www2013.pdf

[9] https://www.iamstobbs.com/idns-ebook

[10] Reading from left to right in the byte, i.e. with the largest-value bit first

[11] https://media.defcon.org/DEF%20CON%2021/DEF%20CON%2021%20presentations/DEF%20CON%2021%20-%20Schultz-Examining-the-Bitsquatting-Attack-Surface-WP.pdf

[12] https://similarweb.com/top-websites/

[13] https://data.iana.org/TLD/tlds-alpha-by-domain.txt

[14] https://sec.okta.com/articles/2020/11/why-bitsquatting-attacks-are-here-stay

This article was first published as an e-book on 16 May 2024 at:

https://www.iamstobbs.com/the-world-of-the-bitsquat
