David Barnett's Brand Protection Articles: phishing

Monday, 3 July 2023

An overview of the concept and use of domain-name entropy

Introduction

In this article, I present an overview of a series of 'proof-of-concept' studies looking at the application of domain-name entropy as a means of clustering together related domain registrations, and serving as an input into potential metrics to determine the likely level of threat which may be posed by a domain.

In our previous studies, we utilised the mathematical concept of Shannon entropy^[1], providing a measure of the amount of information stored in a string of characters (or, equivalently, the number of bits required to optimally encode the string). The idea was applied to the second-level domain name (SLD) part of each domain (i.e. the portion of the domain name before the dot - such as 'google' in 'google.com'), and broadly means that short domain names, or those with large numbers of repeated characters, will have low entropy values, whereas longer domain names, or those with large numbers of distinct characters, will have higher entropy.
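As a minimal sketch of the calculation described above, the following Python function computes the Shannon entropy (in bits per character) of a string from its character frequencies. The function names and the naive SLD extraction (taking the label before the first dot) are illustrative assumptions, not the code used in the original studies.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text`, in bits per character."""
    total = len(text)
    if total == 0:
        return 0.0
    counts = Counter(text)
    # Sum of p * log2(1/p) over each distinct character, with p = n / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def sld_of(domain: str) -> str:
    """Naive SLD extraction (assumption): the label before the first dot."""
    return domain.lower().split(".")[0]


for name in ("google.com", "aaaa.com", "baidu.com"):
    print(name, round(shannon_entropy(sld_of(name)), 3))
# google.com 1.918
# aaaa.com 0.0
# baidu.com 2.322
```

A string consisting of a single repeated character scores zero, whereas a string of N distinct characters scores log2(N), which is why long, varied strings sit at the top of the range.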

The background to this analysis is the fact that domains registered for egregious purposes (such as spamming, malware distribution, or botnet creation) may be more likely to be registered in bulk by bad actors using automated algorithms^[2]. This typically results in the generation of long, nonsensical (i.e. high-entropy) domain names which, from the infringer's point of view, have the added benefit of not containing brand-related keywords, and are therefore typically harder to detect using classic brand-monitoring techniques. The idea is that domains registered by a particular infringer for a specific campaign are likely all to be generated using the same algorithm, and may therefore have similar or identical entropy values.

Overview of previous studies

In our initial proof of concept^[3], we considered the set of all domains registered on a particular day - a sample of around 205,000 domains. An advantage of considering a set of domains with a common registration date is that it allows for the possibility that one or more groups of automated bulk registrations (which are typically all registered at the same time) will be present.

Within the dataset, a range of domain entropy values was present, from a minimum of 0.000 to a maximum of 4.700, with 92.3% of the dataset having values below 3.500 (see Figure 1). The top 1,000 highest-entropy domains (i.e. the top 0.49%) had entropy values in excess of 3.823, and accounted for the majority of examples which appeared visually to feature 'random' SLD strings. Within this high-entropy subset, a number of additional characteristics indicated that many may have been registered for nefarious purposes, including the prominent use of consumer-grade registrars and privacy-protection services, and the presence of active MX records amongst these new registrations (in 27.5% of the cases - indicating that these domains have been configured to be able to send and receive e-mails and therefore could potentially be associated with phishing activity).

Figure 1: Cumulative proportion of domains with entropy less than the value shown on the horizontal axis, from the dataset in the initial proof-of-concept study

Indeed, at least one apparent 'cluster' of suspicious registrations was found to be present within the dataset, comprising a group of 125 .buzz ('dot-buzz') domains, all with an identical high entropy value (3.907), registered via a common registrar and associated with groups of similar IP addresses. At the time of analysis, many of the domains resolved to Chinese-language, gambling-related websites, likely representing either an affiliate revenue generation scheme, or 'dummy' content serving to 'mask' higher-threat content which may only have been visible in specific geographic regions, or which may have been planned for subsequent upload.
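One simple way to surface candidate clusters of this kind is to group a day's registrations by TLD and entropy value and keep only the larger groups. The sketch below assumes exactly that grouping key. The `min_size` parameter and the sample domain names are hypothetical, and this is not necessarily how the cluster described above was found.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(text: str) -> float:
    """Shannon entropy of `text`, in bits per character."""
    total = len(text)
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in Counter(text).values())


def entropy_clusters(domains, min_size=3):
    """Group domains by (TLD, SLD entropy to 3 d.p.); keep groups of at least `min_size`."""
    groups = defaultdict(list)
    for domain in domains:
        # Split at the first dot: everything before it is treated as the SLD.
        sld, _, tld = domain.lower().partition(".")
        groups[(tld, round(shannon_entropy(sld), 3))].append(domain)
    return {key: names for key, names in groups.items() if len(names) >= min_size}


# Hypothetical sample: three random-looking .buzz names plus two well-known domains.
sample = ["qwhzkvtp.buzz", "mxrldgfy.buzz", "bnjcswue.buzz", "google.com", "facebook.com"]
clusters = entropy_clusters(sample)
print(clusters)
```

An identical entropy value is only weak evidence on its own, since any two names with the same pattern of repeated characters will share it. In practice it would need to be combined with other shared characteristics, such as the registrar and hosting IP addresses mentioned above.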

In a follow-up study^[4], I considered a month's worth of registrations of domains with names containing any of the top ten most valuable brands in 2022. Similarly, the high entropy domain names within this dataset included groups of apparently related, coordinated 'clusters' of domains, several of which appeared intended for fraudulent use and were consistent with registration via automated generation algorithms. For example, seven of the top eight domains in the dataset (by entropy values) had similar names of the form 'google-site-verificationXXXXXX.com' (or .net) (where 'XXXXXX' was a long string of apparently random characters), and a series of groups of 'microsoft' examples was identified, including keywords such as 'cloudworkflow', 'netsuites' and 'cloudroam'.

Comparison with other work

Other studies taking similar approaches to the analysis of domain entropy also reach similar conclusions. For example, an analysis outlined in a blog posting by Tiberium^[5] states that the use of an entropy threshold of >3.1 (as an indicator of potential concern) correctly classifies 80% of NCSC malicious domains, and incorrectly classifies only 8% of the top 1000 most popular (legitimate!) domains overall (cf. Table 1).

Domain name	Entropy value
google.com	1.918
youtube.com	2.522
facebook.com	2.750
twitter.com	2.128
instagram.com	2.948
baidu.com	2.322
wikipedia.org	2.642
yandex.ru	2.585
yahoo.com	1.922
whatsapp.com	2.500

Table 1: Entropy values of the SLDs of the top ten most popular websites according to Similarweb^[6]
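The threshold approach can be illustrated with a short sketch which applies the >3.1 cut-off quoted above to the ten SLDs in Table 1. The function name `flag_rate` and the random-looking test string are illustrative assumptions.

```python
import math
from collections import Counter

THRESHOLD = 3.1  # the cut-off quoted from the analysis cited above


def shannon_entropy(text: str) -> float:
    """Shannon entropy of `text`, in bits per character."""
    total = len(text)
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in Counter(text).values())


def flag_rate(slds, threshold=THRESHOLD):
    """Proportion of the given SLDs whose entropy exceeds the threshold."""
    flagged = [sld for sld in slds if shannon_entropy(sld) > threshold]
    return len(flagged) / len(slds)


# The SLDs listed in Table 1.
TOP_TEN = ["google", "youtube", "facebook", "twitter", "instagram",
           "baidu", "wikipedia", "yandex", "yahoo", "whatsapp"]

print(flag_rate(TOP_TEN))           # 0.0
print(flag_rate(["x7k2q9vb4mzt"]))  # 1.0 (hypothetical random-looking string)
```

None of the ten SLDs in Table 1 exceeds the threshold (the highest value is 2.948), whereas a string of twelve distinct characters has an entropy of log2(12), or approximately 3.585.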

Additionally, an article published by Splunk^[7] - looking at the entropy values of fully qualified domain names, i.e. also including subdomain names - also states that high-entropy examples are consistent with the use of domain generation algorithms, and may be indicative of association with malware (e.g. in 'beaconing') and other web exploits. Comparable approaches and conclusions can also be found in a range of other studies^[8,9,10], with some finding improvements in the reliability of threat determination through the use of alternative measures such as relative entropy (essentially, a comparison against the character distribution observed in a dataset of known legitimate domains, so as to provide a better measure of the randomness arising from automated algorithmic registrations)^[11].
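Relative entropy is commonly formulated as the Kullback-Leibler divergence between the character distribution of the string being tested and a baseline distribution. The sketch below assumes that formulation, with add-one smoothing so that no baseline probability is zero. The tiny baseline (the ten SLDs from Table 1) is purely illustrative, and the cited studies may differ in their details.

```python
import math
import string
from collections import Counter

# Characters permitted in an ASCII domain label.
ALPHABET = string.ascii_lowercase + string.digits + "-"


def baseline_distribution(legit_slds):
    """Character probabilities over known-legitimate SLDs, with add-one smoothing."""
    counts = Counter(ch for sld in legit_slds for ch in sld.lower() if ch in ALPHABET)
    total = sum(counts.values()) + len(ALPHABET)
    return {ch: (counts[ch] + 1) / total for ch in ALPHABET}


def relative_entropy(sld, baseline):
    """Kullback-Leibler divergence (in bits) of the SLD's characters from the baseline."""
    chars = [ch for ch in sld.lower() if ch in baseline]
    if not chars:
        return 0.0
    total = len(chars)
    return sum((n / total) * math.log2((n / total) / baseline[ch])
               for ch, n in Counter(chars).items())


# Illustrative baseline only: a real one would be built from a large corpus.
baseline = baseline_distribution(["google", "youtube", "facebook", "twitter", "instagram",
                                  "baidu", "wikipedia", "yandex", "yahoo", "whatsapp"])
print(relative_entropy("facebook", baseline))
print(relative_entropy("xq7zj9vk", baseline))  # hypothetical random-looking string
```

With this baseline, the random-looking eight-character string scores higher than 'facebook', because it is built from characters which are rare in the legitimate sample.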

Conclusions

Domain-name entropy analysis has applications in at least two key areas of brand protection. The first of these is the ability to 'cluster' together related infringements, which has a number of benefits, including the ability to identify serial infringers and instances of bad-faith activity, for targeted and effective bulk enforcement actions. The second key area is as an input into algorithms to quantify the likely level of threat which may be posed by an online feature such as a new domain registration. Threat determination is essential in allowing prioritisation of results for analysis, enforcement, or content-change tracking.

All other factors being equal, there is some indication that high-threat domains - particularly those associated with automated registrations by domain-name generation algorithms - may have a tendency to sit at the higher-entropy end of the spectrum (and, furthermore, that domain names generated using a particular algorithm may be likely to have similar entropy values). This statement runs alongside the assertion that legitimate domains may (in general) be more likely to have lower entropy values, particularly where there is a desire for legitimate businesses to utilise strongly branded, short, memorable web addresses - as can be seen in many of the globally most popular websites.

References

[1] https://arxiv.org/ftp/arxiv/papers/1405/1405.2061.pdf

[2] https://interisle.net/sub/CriminalDomainAbuse.pdf

[3] https://www.linkedin.com/pulse/investigating-use-domain-name-entropy-clustering-results-barnett/

[4] https://www.linkedin.com/pulse/entropy-analysis-registered-domain-names-relating-top-david-barnett/

[5] https://www.tiberium.io/blog/chapter-2-classifying-domains-through-string-entropy/

[6] https://www.similarweb.com/top-websites/

[7] https://www.splunk.com/en_us/blog/security/random-words-on-entropy-and-dns.html

[8] https://hurricanelabs.com/blog/dns-entropy-hunting-and-you/

[9] https://www.logpoint.com/en/blog/embracing-randomness-to-detect-threats-through-entropy/

[10] https://suleman-qutb.medium.com/use-of-shannon-entropy-estimation-for-dga-detection-9ded275795ca

[11] https://redcanary.com/blog/threat-hunting-entropy/

This article was first published on 3 July 2023 at:

https://circleid.com/posts/20230703-an-overview-of-the-concept-and-use-of-domain-name-entropy

Thursday, 25 May 2023

The 'Millennium Problems' in Brand Protection

As the brand protection industry approaches a quarter of a century in age, following the founding of pioneers Envisional^[1] and MarkMonitor^[2] in 1999, I present an overview of some of the main outstanding issues which are frequently unaddressed or are generally only partially solved by brand protection service providers. I term these the 'Millennium Problems' in reference to the set of unsolved mathematical problems published in 2000 by the Clay Mathematics Institute^[3], and for which significant prizes were offered for solutions. Like their mathematical counterparts, the unsolved problems in brand protection will present significant benefits for any service providers able to develop and offer comprehensive solutions.

Brand protection basics

In their most basic sense, brand protection solutions generally consist of two components: monitoring (or, strictly, detection) of brand-related content on the Internet, and enforcement action to achieve the removal of infringing material. Monitoring is most usually carried out using technological solutions intended to identify relevant material on the Internet, across a range of relevant channels, typically using a combination of methodologies, namely: (i) Internet metasearching (i.e. the submission of relevant query terms to search engines) and web crawling; (ii) analysis of domain-name zone files (see Problem 2), to identify domains with names including brand-related terms (or variants); (iii) direct monitoring / searching on known sites of interest (see Problem 1); and (iv) other techniques, such as the use of spam traps and webserver logs, as used in phishing detection technologies^[4]. Many service providers will also make use of automated analysis tools, which can inspect the content of the identified webpages, and categorise and prioritise these results accordingly.

The 'Millennium Problems'

1. Social media monitoring

Whilst monitoring of content across social media platforms is a well-established element of many brand-protection service providers' product suites, it frequently remains extremely difficult to achieve anything approaching a comprehensive level of coverage. There are a number of reasons why this is the case. In general, social media content is most usually addressed using the 'direct site searching' approach (that is, using the search functionality typically in-built to the platforms themselves as a means of returning results), though some providers also have access to direct data feeds from the platforms (e.g. through an API). In general, a variety of types of content may be of interest, including brand references in usernames (e.g. associated with fake profiles), and the content of postings (e.g. associated with fraud, the sale of counterfeits, the spread of malware, brand disparagement, etc.) and elsewhere (including imagery, sponsored advertisements, and so on).

The main difficulty with the 'direct search' approach is that results presented to a user are often limited (sometimes significantly) unless the user is logged in to the social media platform. This can be circumvented by configuring a brand-protection monitoring tool to present itself to the platform as if it is a real user (with a registered account, handle (username) and password), or simply through the use of manual searches. Both of these approaches typically require the use of 'dummy' accounts and may be in contravention of the terms and conditions of the platforms themselves.

Other technological issues may also be problematic. Many social media platforms return results on an 'infinite scroll' basis (where additional results are continually added to the webpage as the user continues to scroll down through them), often with no indication of the total numbers of results which may be present, and many platforms also have specific access requirements, such as functionality which can be accessed only via a mobile app (see Problem 7). Similarly, monitoring can be further complicated by sites where content is protected via the requirement to enter a CAPTCHA code, for example. It is also typically the case that the exact results returned to a user will be highly personalised, and dependent on their browsing history, interests, location, and personal demographic.

Some of these issues can be addressed through the development of partner relationships by brand-protection service providers with the platforms themselves. However, even in cases where the platforms are amenable to this approach, some of the above technological issues may remain difficult to address.

2. Comprehensive ccTLD monitoring

Another of the core elements of many brand protection service offerings is often a domain monitoring capability; that is, the ability to identify domains whose names include the name of the brand being infringed (and/or other relevant keywords). As a special subset of general Internet content, branded domain names are often of particular interest by virtue of their greater visibility (e.g. higher ranking in search-engine results) and the more explicit nature of the IP abuse (and an associated greater range of enforcement options)^[5]. Branded domain names have been noted in many previous studies as being popular with bad actors in the creation of infringing content of a variety of types, including phishing sites^[6], sites offering the sale of counterfeits, and sites claiming false affiliation or including disparaging content.

The primary source of data for domain monitoring is usually the analysis of zone files, which are data files published by the registry organisations responsible for overseeing the infrastructure of each individual TLD (top-level domain, or domain extension - such as .com), and which contain a list of all existing registered domains across that extension. By comparing the content of a zone file with that from the previous day, it is possible to identify new domain registrations (as well as dropped, or lapsed, domains) and filter this list for those examples containing a brand name or keyword of interest. Domain monitoring solutions can (and, in general, should) also make use of zone-file analysis to allow identification of the full pre-existing 'landscape' of registered domain names of interest, across the TLDs in question, at the commencement of monitoring (so-called 'baseline' analysis). The most sophisticated domain monitoring solutions can also automatically check for variations of the brand strings (such as typos), which are frequently used by infringers to construct deliberately deceptive domain names^[7,8].
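The day-on-day comparison described above amounts to a pair of set differences followed by a keyword filter. The following is a minimal sketch, assuming the zone files have already been parsed into plain lists of domain names. The brand string 'examplebrand' and the sample domains are hypothetical.

```python
def diff_zone_snapshots(previous, current):
    """Return (new, dropped) sets of domains between two daily snapshots."""
    previous, current = set(previous), set(current)
    return current - previous, previous - current


def matching_brand(domains, keywords):
    """Keep domains whose SLD contains any keyword (simple substring match)."""
    hits = []
    for domain in sorted(domains):
        sld = domain.lower().split(".")[0]
        if any(keyword in sld for keyword in keywords):
            hits.append(domain)
    return hits


# Hypothetical snapshots of a single TLD on consecutive days.
yesterday = ["examplebrand.com", "oldshop.com", "unrelated.com"]
today = ["examplebrand.com", "unrelated.com", "examplebrand-login.com", "newsite.com"]

new, dropped = diff_zone_snapshots(yesterday, today)
print(matching_brand(new, ["examplebrand"]))  # ['examplebrand-login.com']
print(sorted(dropped))                        # ['oldshop.com']
```

Running the same filter over a single full snapshot, rather than over the difference, gives the 'baseline' view of pre-existing registrations.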

Zone files are generally available for most gTLDs (generic, or global, TLDs such as .com, .net, etc.) plus the new-gTLDs which have been launched in the period since 2012^[9], but are often not published (or may not be comprehensive) by the registry organisations responsible for other TLDs, particularly the country-specific examples (ccTLDs). For this reason, detection of relevant domains across ccTLD extensions is typically incomplete, and a number of techniques may typically be used in order to fill in the gaps. These might include parallel look-ups (checks for domains with the same second-level domain name - i.e. the part of the domain name to the left of the dot - as examples identified through zone-file analysis), exact-match queries (regular searches for the existence of domains with second-level domain name strings of particular relevance, such as a brand name), and Internet metasearching. However, each of these approaches has its own limitations and, even when all taken together, there can always be domain names of potential concern which are not detected through any of these methods. The next generation of domain monitoring solutions will need to better address these shortcomings, potentially involving factors such as the use of improved algorithms to 'guess' candidate domain names for checking, and/or the use of more comprehensive indexes of Internet content. Additionally, the building of specific relationships with country registries - potentially combined with regulatory changes regarding the availability of zone files - may also be relevant.
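As an illustration of how candidate names for exact-match queries and parallel look-ups might be generated, the sketch below builds simple typo variants of a brand string and combines them with a list of ccTLDs. Both helper functions, the brand 'acme' and the choice of variant types (single-character omissions and adjacent transpositions) are assumptions. Checking whether each candidate is actually registered would require DNS or WHOIS queries, which are not shown.

```python
def typo_variants(brand):
    """Single-character omissions and adjacent transpositions of `brand`."""
    variants = set()
    for i in range(len(brand)):
        variants.add(brand[:i] + brand[i + 1:])  # omit character i
        if i < len(brand) - 1:
            # Swap characters i and i + 1.
            variants.add(brand[:i] + brand[i + 1] + brand[i] + brand[i + 2:])
    variants.discard(brand)  # a swap of two identical letters reproduces the brand
    variants.discard("")     # an omission from a one-letter brand leaves nothing
    return variants


def candidate_domains(slds, tlds):
    """Every combination of SLD and TLD, as a sorted list of domain names."""
    return sorted(f"{sld}.{tld}" for sld in slds for tld in tlds)


slds = {"acme"} | typo_variants("acme")  # hypothetical brand
candidates = candidate_domains(slds, ["de", "fr"])
print(len(candidates), candidates[:4])
```

The number of candidates grows quickly with the length of the brand string and the number of TLDs, which is one reason why 'guessing' approaches of this type can never be fully comprehensive.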

3. Third-party subdomain monitoring

The subdomain is the section of a URL prior to the domain name, from which it is separated by a dot (e.g. 'translate' in 'translate.google.com'). The owner of a domain name can create whatever subdomains they wish, and can point these URLs to associated web content (via the configuration of DNS settings). Accordingly, subdomains can be used to create brand-related URLs, and can be associated with many of the same types of infringements as domain names themselves^[10]. Subdomain-based abuse can also be particularly attractive to infringers, both because it avoids the requirement to register a brand-specific domain name^[11] (which bad actors know can easily be detected by brand owners employing domain-monitoring services) and because there can be a low cost associated with the creation of the URL, particularly where a service provider allowing the free registration of personalised subdomains (such as blogspot.com) is used.

Consequently, the ability to monitor generally for brand references in the subdomain name of arbitrary URLs can be of great value. Note that this is distinct from the (relatively much simpler) problem of monitoring the existence and content of subdomains of official domains under the ownership of the brand owner ('internal' subdomain monitoring), since all of the relevant information is contained in the DNS configuration files held by the brand owner's domain-name management service provider.

Conversely, the identification of brand-related subdomains on third-party ('external') domain names is much more difficult. In many cases, this is achieved purely using Internet metasearching techniques (i.e. finding only content which is indexed by search engines in response to brand-specific query terms). Whilst this does mimic the search techniques used by general Internet users (and thereby identify the 'highest-visibility' content), it will in general not find all potentially threatening content (e.g. URLs to which traffic is driven through other means, such as links in spam e-mails). This problem can be mitigated to some degree through the use of other techniques, such as passive DNS analysis or certificate transparency (CT) analysis, or via explicit queries for the existence of specific subdomain names of interest. However, these techniques require prior identification of the specific domains to be monitored; generalised identification of brand-related subdomains remains a much harder problem to solve.
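Once a list of hostnames has been gathered by whatever means (e.g. from passive DNS or CT data), filtering it for brand references in the subdomain is straightforward. The sketch below assumes that the registered domain consists of the final two labels of the hostname. This does not hold for multi-label suffixes such as .co.uk, for which a resource such as the Public Suffix List would be needed. The hostnames and brand string are hypothetical.

```python
def subdomain_labels(hostname, suffix_labels=1):
    """Labels to the left of the registered domain (assumes a fixed-length suffix)."""
    labels = hostname.lower().rstrip(".").split(".")
    # Drop the suffix labels plus the SLD, leaving only the subdomain labels.
    return labels[: -(suffix_labels + 1)]


def brand_in_subdomain(hostnames, brand):
    """Hostnames with the brand string in any subdomain label."""
    return [host for host in hostnames
            if any(brand in label for label in subdomain_labels(host))]


# Hypothetical hostnames, e.g. as might be gathered from passive DNS or CT data.
hostnames = [
    "examplebrand.blogspot.com",
    "login.examplebrand.secure-portal.net",
    "examplebrand.com",
    "news.unrelated.org",
]
print(brand_in_subdomain(hostnames, "examplebrand"))
# ['examplebrand.blogspot.com', 'login.examplebrand.secure-portal.net']
```

Note that 'examplebrand.com' is deliberately not returned, since the brand appears there in the SLD rather than in a subdomain; that case is covered by domain monitoring.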

4. Circumventing site blocking and geoblocking

Site blocking and geoblocking are two long-established problems in brand monitoring. The former arises when a monitored site becomes aware of repeated search queries from a particular source, and restricts access to the site from the IP address in question. A site owner may choose to do this for a number of reasons, including protection of website performance (e.g. in preventing DDoS attacks), or for compliance with their own terms and conditions (e.g. where they state that information is not to be collected for commercial purposes, such as by brand-protection service providers). Geoblocking (or geotargeting) is a related issue, whereby the visible content of a website may vary depending on the geographical location of the visitor. Again, this may be implemented by a site owner for a range of reasons, including the tailoring of content to a local audience, search-engine optimisation, security, or legal compliance^[12]. However, geoblocking can also be employed by infringers as a means of evading detection, and can also present difficulties in enforcement, where it may be necessary to demonstrate exactly what content is visible from a specific remote location.

The solutions to these issues, from a brand-protection point of view, are relatively simple in principle, generally involving the use of proxies (standalone external machines serving as intermediate 'hops' through which search queries from a brand-protection service provider are routed, so as to 'mask' the originating IP address) in a range of remote locations, and/or (particularly for site blocking) the building of relationships with the sites being monitored, so that the monitoring service provider can gain permission for collecting the data. However, in practice this requires a great deal of investment in building the required infrastructure (such as hosting and maintaining the necessary proxies, and configuring the monitoring software to communicate with them) and establishing the necessary relationships. Furthermore, the construction of appropriate user interfaces to visualise and interpret the relevant information (such as the ability to compare the content of a particular website across a range of different user (i.e. proxy) locations, in cases where geoblocking or geotargeting may be an issue) can also be a complex prospect.
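The comparison of site content across proxy locations can be sketched by hashing the page returned to each location and grouping together the locations which received identical content. The example assumes that the pages have already been retrieved via the relevant proxies (not shown). The function name, location codes and page content are hypothetical.

```python
import hashlib
from collections import defaultdict


def group_by_content(pages_by_location):
    """Group locations which were served byte-identical page content."""
    groups = defaultdict(list)
    for location, html in pages_by_location.items():
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        groups[digest].append(location)
    return sorted(sorted(locations) for locations in groups.values())


# Hypothetical content returned for the same URL via proxies in three countries.
pages = {
    "UK": "<p>Welcome to our store</p>",
    "US": "<p>Welcome to our store</p>",
    "CN": "<p>Discount branded goods</p>",
}
print(group_by_content(pages))  # [['CN'], ['UK', 'US']]
```

More than one group in the output suggests that geotargeting may be in use. In practice, exact hashing is likely to over-report differences, since many pages include dynamic elements (such as timestamps or advertisements), and some form of content normalisation or similarity scoring would be needed.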

5. Clustering and open-source intelligence analysis

The subject areas of clustering and open-source intelligence (OSINT) are generally of greatest relevance for entity investigations, i.e. the process of using Internet searches to build a portfolio of information relating to an identified individual or website of interest. Such information can be used for a range of purposes, including background for on-the-ground investigations or goods seizures, or for legal cases, but can also be useful background for enforcement actions (e.g. in identifying clusters of related infringements for efficient bulk takedowns in a single action).

A number of technological solutions exist for visualising the links behind related entities, on the basis of common shared characteristics (such as e-mail addresses, telephone numbers, web-hosting information such as IP addresses, and so on) - i.e. 'clustering', but it is often the case that the characteristics themselves require identification through manual analysis processes. A great deal of additional efficiency can be built into the process, however, through the use of monitoring and analysis tools which can identify and extract this information automatically. This is relatively more straightforward in cases where the data can be extracted in a consistent manner (e.g. performing an IP-address look-up for any identified website of interest), and/or where the information is contained in a known location on a webpage with a fixed, pre-defined format (e.g. the 'contact details' section of a social-media profile page), such that a web scraper can be configured to pull out the content. It is a considerably more difficult enterprise to extract such information from general webpages where the structure of each page is not known in advance. In these cases, the approach generally needs to be based on the configuration of monitoring tools which are able to extract text-strings with the general format of (say) an e-mail address or telephone number. This then typically requires an element of post-processing to 'clean' and standardise the data. The next generation of clustering tools is likely to make extensive use of artificial intelligence in order to do this, in addition to also then drawing out insights between the clusters thus produced.
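The extraction, cleaning and clustering steps described above can be sketched as follows, using regular expressions to pull out strings with the general format of an e-mail address or telephone number, and an inverted index to find identifiers shared by more than one page. The patterns are deliberately simple, and the site names and contact details are hypothetical.

```python
import re
from collections import defaultdict

# Simplified patterns (assumptions): real-world formats vary far more widely.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_identifiers(text):
    """E-mail addresses (lower-cased) and phone-like strings (digits only) in `text`."""
    emails = {match.lower() for match in EMAIL_RE.findall(text)}
    phones = {re.sub(r"\D", "", match) for match in PHONE_RE.findall(text)}
    return emails | phones


def shared_identifiers(pages):
    """Map each identifier appearing on more than one page to the pages concerned."""
    index = defaultdict(set)
    for url, text in pages.items():
        for identifier in extract_identifiers(text):
            index[identifier].add(url)
    return {ident: sorted(urls) for ident, urls in index.items() if len(urls) > 1}


# Hypothetical page text from four sites.
pages = {
    "site-a.example": "Contact sales@shop-mail.example, tel +44 20 7946 0000",
    "site-b.example": "Email: Sales@Shop-Mail.example",
    "site-c.example": "WhatsApp +44 20 7946 0000",
    "site-d.example": "No contact details given",
}
print(shared_identifiers(pages))
```

Here, sites A and B are linked by a common e-mail address and sites A and C by a common telephone number, so all three would fall into a single cluster. The standardisation step matters: the same telephone number written with and without its country code would not match without further normalisation.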

6. Dark Web monitoring

Dark Web content is the general name given to online material for which there are special access requirements; however, in the context of online brand monitoring, it is usually taken to refer to content which is only accessible via the Tor network (a decentralised network involving the use of encrypted communications, and connections via multiple hops between Tor servers, or proxies - also known as relays or nodes). The Tor network - which is accessed using specially enabled browsers - can be used to view regular ('surface web') Internet content (and is one option open to users for whom anonymity is important), but is more usually used to access websites with the .onion extension, i.e. those which are only accessible from within the network^[13].

The Tor network of .onion websites includes a range of different content types, but is notorious for illegal and infringing content and, as such, can be a key area of interest for brand monitoring. However, many brand protection service providers offer only limited capabilities in this area. This is for a number of different reasons. One significant factor is that the Dark Web is essentially unregulated, frequently with no available links to 'real-world' contact details, and extremely limited enforcement options against infringing content. However, even in cases where takedown is not possible, intelligence on the content can be extremely valuable - one example may be on 'carder' websites, on which stolen financial credentials are traded; if (say) a financial services company can determine that the details for a particular credit card or bank account are being offered for sale, this provides the opportunity for the account to be 'locked' or deactivated.

It can also be extremely difficult to configure monitoring software to search the Dark Web. Whilst it is technically relatively straightforward to configure systems to be Tor-enabled (although connections are typically rather slow), there are generally no robust indexes of Dark Web content (such as the search engines and zone files used to search surface-web content), not least because the .onion addresses for any given website - which usually consist of long, random alphanumeric strings - are generally short-lived and change over time. A number of Dark Web search engines do exist, together with ad-hoc indexes of Dark Web content posted by users on sites such as Pastebin, but the information on these sources typically becomes out-of-date rather quickly.

The nature of the content on the Dark Web also means that security concerns can be an issue for brand-protection service providers wishing to build their capabilities in this area.

7. Mobile-based technologies

As Internet engagement has continued to grow over recent years, an increasing proportion of Internet use is conducted over mobile devices^[14,15], using a wide ecosystem of mobile apps. Many platforms are now almost exclusively mobile-based, often with little or no corresponding web presence - popular examples might include the WeChat / Weixin platforms, public groups on messaging services such as WhatsApp, and e-commerce platforms such as Pinduoduo. Many brand-protection service providers use legacy monitoring technologies which were designed specifically for analysing HTML content on the regular Internet and are often poorly equipped to address mobile technologies. In some cases, the work-around is to make use of standalone mobile devices or emulators - on which significant proportions of the monitoring is conducted manually - and there typically remains significant work to be done in order to fully integrate the relevant technologies into core monitoring capabilities.

8. Addressing the Web3 landscape

Web3 (also known as 'Web 3.0') is a general term referring to decentralised content on the Internet, with a particular focus on blockchain technologies. Blockchains are publicly accessible digital ledgers in which transactions are recorded, and form the basis of many digital currencies (or 'cryptocurrencies') (such as Bitcoin), in addition to a number of other applications, such as supply-chain control by brand owners. From a brand-protection viewpoint, the main related areas of interest are typically NFTs and blockchain domains^[16].

NFTs (non-fungible tokens) are digital files whose ownership is recorded on a blockchain. They are most commonly associated with graphics files (such as artworks and branded imagery) or other types of digital content (such as audio or music files). However, brand owners are increasingly incorporating NFTs into their business models, including areas such as the production and trade of virtual branded items (e.g. items to be worn by avatars in virtual-reality environments within the 'metaverse', the name given to a generalised connected environment of 3D virtual worlds). Consequently, unofficial branded NFTs can be a source of concern for brand owners.

Blockchain domains - which are recorded (together with their ownership details) on a blockchain, rather than using traditional registrars and web hosting - have a number of similarities to 'classic' domain names, and can be utilised in a number of ways. The most common uses are the creation of decentralised websites on peer-to-peer (P2P) platforms, to be accessed via specially-enabled browsers, or as addresses for sending and receiving cryptocurrency. However, the blockchain domain ecosystem is essentially unregulated, and nothing analogous to domain-name zone files is available. The system is made additionally more complicated by the fact that the infrastructure allows for the possibility of domain-name 'clashes' - i.e. the potential for the same name to exist independently on distinct blockchains. As with traditional domain names, blockchain domains with brand-specific names can be a threat to brand owners, and a potential source of confusion for customers.

Both NFTs and blockchain domains can be traded on NFT marketplaces (such as OpenSea), and the monitoring of these sites is typically the primary source of intelligence utilised by those brand-protection service providers offering capabilities in this area. For blockchain domains particularly, this approach is less than satisfactory, and offers nothing approaching the sort of comprehensive coverage that is available for regular gTLD domain names via zone-file analysis. Some additional information on the existence of registered blockchain domains is typically available through direct searches within databases provided by blockchain domain registrars and nameserver providers; however, the problem of more comprehensive detection is much more difficult to solve, potentially involving analysis of the content of the individual blockchains directly.

Another difficulty to be overcome in service offerings relating to NFTs and blockchain domains is the issue of enforcement against infringing content. In some cases, enforcement can be carried out through the submission of a DMCA (Digital Millennium Copyright Act) notice, and some NFT marketplaces have specific takedown procedures for content which infringes protected IP. However, in many cases, this simply involves the item being 'delisted' from the marketplace in question. In the future, we may see a move towards more rigorous enforcement, potentially involving forced transfers of ownership. Part of the problem is that the legal issues surrounding NFTs and blockchain domains are, in many cases, still not well-defined and are rapidly evolving, complicated by factors such as the fact that ownership of an NFT does not necessarily grant ownership of copyright for the embedded content.

Beyond #8: Other emerging technologies

As new Internet technologies continue to emerge and develop, they will bring with them new risks for brand owners and associated challenges for brand-protection service providers, who will need to continue to observe and innovate in order to stay ahead of the curve.

At any given time, it is unclear where the next area of concern will come from. Currently, there is a great deal of buzz and speculation about artificial intelligence (AI) technologies and chatbots such as ChatGPT, but it is less obvious how these may affect brand-protection considerations. In this context, I am referring to content associated with, or produced by, AI applications. (Conversely, however, it seems highly likely that AI capabilities will be increasingly built into technologies used to facilitate the brand-protection process - i.e. tools to assist with monitoring, prioritisation, clustering and enforcement.)

Users are able to communicate via natural language with AI technologies such as ChatGPT, which are then able to construct responses based on the information with which they have been 'trained'. This means that the information available from a chatbot is only as good as the data with which it has been trained (essentially, in the case of ChatGPT, including large volumes of Internet databases^[17,18]), and should really be treated with at least as much caution as the old "I'm Feeling Lucky" button on Google, where the user is just presented with a single response (not necessarily the most reliable one!) to any given query. This point is all the more valid given the ability of chatbots to extrapolate, and provide responses based on incomplete information. What this all means is that chatbots pose the risk of providing information about (say) a company or brand which is misleading or otherwise damaging to corporate reputation. However, since responses are generated dynamically in response to queries (rather than being 'fixed', as in the content of an HTML webpage), it is not clear how these issues might be addressed from a brand-protection point of view. Further complications surround issues such as the ownership of rights to content produced by AI technologies^[19].

Where chatbots may be of particular concern from a brand-protection and cybersecurity point of view is in their ability to rapidly create content of a wide variety of types, in a range of different styles - including the ability to write and de-bug computer code. What this may mean is that the entry barrier for infringers wishing to create compelling phishing e-mails^[20], or write malicious programs ('malware')^[21] may be significantly diminished. The likelihood is - at least in the first generations of AI technologies - that AI will not so much change the types of attack which are possible, but rather the ease with which they can be executed^[22].

Another issue surrounds use-cases in which AI systems are 'trained' with confidential corporate information as part of the process of creation of company materials (such as marketing releases). These scenarios raise the possibility for the information to be accessed by third parties, either directly via hacking, or via content included in the responses provided to other users, depending on the ways in which information is 'shared' within the infrastructure of the AI technology itself^[23].

References

[1] https://www.cst.cam.ac.uk/ring/halloffame

[2] https://www.markmonitor.com/download/ds/MarkMonitor-Corporate-Overview.pdf

[3] https://www.claymath.org/millennium-problems

[4] https://www.linkedin.com/pulse/assessing-mediating-digital-risk-landscape-brand-david-barnett/

[5] https://www.worldtrademarkreview.com/global-guide/anti-counterfeiting-and-online-brand-enforcement/2022/article/creating-cost-effective-domain-name-watching-programme

[6] https://www.cscdbs.com/blog/branded-domains-are-the-focal-point-of-many-phishing-attacks/

[7] https://www.cscdbs.com/en/resources-news/threatening-domains-targeting-top-brands/

[8] https://www.linkedin.com/pulse/hyphenated-domain-infringements-david-barnett/

[9] https://newgtlds.icann.org/en/about/program

[10] https://www.cscdbs.com/blog/the-world-of-the-subdomain/

[11] https://www.linkedin.com/pulse/exploring-domain-hostname-based-infringements-david-barnett/

[12] https://www.cscdbs.com/blog/do-you-see-what-i-see-geotargeting-in-brand-infringements/

[13] 'Brand Protection in the Online World: A Comprehensive Guide' by David Barnett (2016). Chapter 11: ''Deep' and 'Dark' Web'

[14] https://www.statista.com/statistics/617136/digital-population-worldwide/

[15] https://www.linkedin.com/pulse/holistic-brand-fraud-cyber-protection-using-domain-threat-barnett/

[16] https://www.linkedin.com/pulse/rise-nft-david-barnett

[17] https://www.sciencefocus.com/future-technology/gpt-3/

[18] https://techcrunch.com/2023/03/23/openai-connects-chatgpt-to-the-internet/

[19] https://intellectual-property-helpdesk.ec.europa.eu/news-events/news/intellectual-property-chatgpt-2023-02-20_en

[20] https://securityboulevard.com/2023/01/what-does-chat-gpt-imply-for-brand-impersonation-qa-with-dr-salvatore-stolfo/

[21] https://www.digitaltrends.com/computing/chatgpt-created-malware/

[22] https://venturebeat.com/security/security-risks-evolve-with-release-of-gpt-4/

[23] https://blogs.blackberry.com/en/2023/04/is-chatgpt-safe-for-organizations-to-use

This article was first published on 25 May 2023 at:

https://circleid.com/posts/20230525-the-millennium-problems-in-brand-protection

Thursday, 9 February 2023

Calculation of return on investment for brand-protection programmes: Thoughts towards a new paradigm

Pre-existing ideas

Numerous previous studies have considered methodologies for calculating the return on investment (ROI) of brand-protection programmes which incorporate components of monitoring and enforcement. These ideas can be important both to justify the spend on a programme in the first place, and to assess its impact once established. Correspondingly, 'classic' ROI calculations can be categorised into two main types: the first (known as 'a priori' calculations) consider the probable infringement landscape in advance of the implementation of a brand-protection programme; the second aims to quantify the actions taken as part of an active enforcement initiative^[1]. It is the latter category with which we are primarily concerned in this article.

At a very high level, many ROI calculation methodologies use a formulation along the lines of:

R = C × E

where R, the ROI within a given timeframe (i.e. the benefit of the brand-protection programme, to be offset against the associated spend), is equal to the product of C, the 'cost' of a pre-existing infringement being active, and E, the number of infringements removed through enforcement as part of the brand-protection programme (in the same timeframe).
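
As a minimal sketch of the formulation above, the following Python snippet computes R = C × E and offsets it against the programme spend. The function names and all of the figures used are hypothetical, chosen purely for illustration, and are not drawn from any real programme.

```python
def programme_benefit(cost_per_infringement: float, enforcements: int) -> float:
    """R = C x E: benefit attributed to enforcement within a given timeframe."""
    return cost_per_infringement * enforcements


def net_return(benefit: float, programme_spend: float) -> float:
    """Benefit offset against the programme spend over the same timeframe."""
    return benefit - programme_spend


# Illustrative (made-up) inputs: C = 250 per live infringement, E = 40 takedowns,
# and a programme spend of 6,000 in the same period.
benefit = programme_benefit(250.0, 40)
net = net_return(benefit, 6000.0)
print(benefit, net)
```

In practice, most of the difficulty lies in estimating C, which is why the choice of assumptions and proxies matters far more than the arithmetic itself.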

Very many assumptions are typically required in order to estimate these figures. In some methodologies, the assumed 'cost' associated with a live infringement may be reflective of an estimate of its direct financial impact (e.g. the typical loss from a phishing incident); in others it may be calculated as the proportion of lost revenue which is reclaimable following deactivation of the infringement (i.e. the 'cost' in the above formulation essentially reflecting the pre-enforcement impact of not yet having taken the infringement down). In these types of approaches, it is very rare that these figures can be measured directly and therefore a number of assumptions (or 'proxies' for the data) are required. In cases of domain acquisition, for example, it may be appropriate to make use of figures such as web traffic when quantifying impact; for marketplace listings, it is typically necessary to consider factors such as price and quantity of items in the listings removed. In both cases, the methodology needs to consider assumed conversion rates (i.e. the proportion of customers who can be 'monetised' by the brand owner - e.g. those who will make a legitimate purchase once the source of infringements is removed)^[2,3]. Even this part of the process is far from simple; complications include factors such as:

The conversion rate will be (strongly) dependent on the nature and price of the item (e.g. it will be much lower for (say) an obvious counterfeit, such as an item passing off as a high-end luxury brand but with a very low price point)^[4].

The conversion rate for customers knowingly navigating to an official brand website will potentially be different to that for those Internet users intending to visit a third-party standalone e-commerce site (if we are considering the case where this domain may subsequently have been acquired by the brand owner and its traffic re-directed to their official site) - this consideration involves taking account of a principle sometimes referred to as the 'substitution effect'^[5].

Alternative proxies for the above figures may also need to be utilised, depending on the web channel under consideration (e.g. where absolute estimates of web traffic are not available or appropriate). For example, on social media, the 'exposure' or 'reach' of content can be estimated using numbers of 'likes' or followers; for mobile apps, the number of downloads may be relevant; for file sharing, it may be appropriate to consider the number of individuals accessing the content (e.g. 'seeds' and 'leechers' for BitTorrent).

Numerous other approaches can also be taken. The ultimate objective when estimating the 'value' of a website is the identification of a direct measure of the revenue it generates (e.g. via direct sales of products, for an e-commerce site). In practice, this information is almost never publicly available, though it is sometimes possible to make estimations via shipping or logistics information available through third-party databases. Some methodologies will utilise web-analytics tools to estimate value based on factors such as advertising spend by the site owner, or will analyse outgoing site traffic (e.g. to payment service provider platforms) to estimate customer volume and/or conversion rates^[6].

It has also previously been noted that sometimes determination of ROI can reflect more qualitative goals (i.e. the statements of 'what success looks like' for a brand-protection programme). For example, a brand owner may consider a programme 'successful' once there are no infringing results returned on the first page of search-engine results, or in pages of search results on a range of key marketplace sites, in response to brand-specific queries. Similarly, the 'ownership of the buy button' (i.e. being the first vendor listed for a particular product on an e-commerce marketplace site) might be a key aim.

The success of a brand-protection initiative can also be judged based on other (again, more quantitative) metrics which may be only available to the brand owner themselves (as opposed to, say, a brand-protection service provider partner). These might include factors such as increases in the numbers of visitors to physical stores, or in volumes of traffic to official websites (as might be directly measurable using the brand owner's webserver log information).

Beyond this, wholly different methodologies can also be applied. Some will take account of 'intangible' factors such as brand value^[7], considering the spend on brand protection to be a business cost necessary to lower the risk of damage to the brand. This type of approach is also not straightforward - higher levels of abuse can be considered an indicator that the brand is a desirable one, which can actually be reflective of greater brand value. Other factors, such as new product launches, can also affect the visibility of the brand and its likelihood of being targeted, all of which can serve to further complicate the landscape.

However, in this article, we will primarily consider the simpler approaches discussed in previous work, and look at how they can potentially be modified to better account for the overall impact of a brand-protection programme.

Variations over time in the infringement landscape

Part 1: Single-brand analysis

In this section, we consider an extremely simplified model looking at changes in the infringement landscape over time for a brand, considering in the first instance the example of a newly-launched brand. In this case, the growth in the number of infringements over time might look something like that shown in Figure 1.

Figure 1: Mock-up of the changing infringement landscape over time for a newly-launched brand

The above framework is formulated using a timeframe expressed as numbers of months for convenience, though the timescales observed in practice may vary hugely. There is also a deliberate choice to avoid stating any quantitative numbers for the volumes of infringements, as these will also be dependent on any number of different factors - one brand may see tens or hundreds of infringements; others may see many thousands or more. Beyond these points, the construction of the above trend lines is based on the following scenario:

Following the launch of the brand (in month 1), there is a ramp-up in the monthly number of new infringements ('N') appearing online, up to a constant level.

There is also a (slower) ramp-up in the rate of infringements disappearing naturally from the Internet ('natural removal', 'R') even in the absence of any enforcement activity. This will arise through a combination of factors, including: content which is deactivated by the infringer following a period of use; domains expiring after their registration period; older content gradually dropping down search-engine rankings (and potentially therefore eventually ceasing to have any damaging impact), and so on.

There is a resulting growth in the cumulative number of active online infringements ('I'), caused by the difference between the monthly values of 'N' and 'R'.

Finally, it seems reasonable to assume in most cases that 'I' will eventually reach a steady state, rather than continuing to grow indefinitely. This implies that 'R' will eventually 'catch up' with 'N' (possibly in part due to the fact that 'N' may also drop off slightly over time, after an initial peak in infringement activity).

Of course, in practice the exact balance between the above numbers will be dependent on an enormous range of factors, including considerations such as the type of Internet channel. For example, marketplace listings will typically have a shorter 'lifetime' than domain registrations (affecting, for example, the rate at which 'R' catches up with 'N').
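
The scenario above can be sketched as a simple month-by-month simulation. This is an illustration under stated assumptions rather than the construction used for Figure 1: 'N' is assumed to ramp up linearly to a constant level, and 'R' is assumed to be a fixed fraction of the infringements active at the end of the previous month (which is what allows 'R' to catch up with 'N'). The function name, parameter names and default values are all hypothetical.

```python
def simulate_landscape(months=24, n_max=100.0, ramp_months=4, removal_rate=0.25):
    """Return per-month lists (N, R, I) for a brand launched in month 1.

    Assumptions (not taken from the article):
      - N ramps linearly up to n_max over ramp_months, then stays constant
      - R is removal_rate times the number active at the end of the previous month
    """
    new, removed, active = [], [], []
    current = 0.0
    for month in range(1, months + 1):
        n = n_max * min(month / ramp_months, 1.0)  # new infringements this month
        r = removal_rate * current                 # natural removal, lagging N
        current = current + n - r                  # cumulative active infringements
        new.append(n)
        removed.append(r)
        active.append(current)
    return new, removed, active


new, removed, active = simulate_landscape()
for month in (1, 6, 12, 24):
    print(month, round(new[month - 1], 1), round(removed[month - 1], 1), round(active[month - 1], 1))
```

Under these assumptions, 'I' levels off at n_max / removal_rate, so a channel with shorter-lived content (a larger removal_rate) reaches a lower steady state more quickly.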

Let us now consider the case where a brand-protection programme, incorporating the introduction of enforcement actions for the removal of infringing content, is added into the picture (say, after the landscape has reached steady state in month 12) (Figure 2).

Figure 2: Mock-up of the changing infringement landscape over time, with an enforcement programme introduced in month 12

In this case, we use the following formulation:

In month 12, the enforcement programme is introduced, which incorporates a particular level of resource sufficient to action a certain maximum number of takedowns each month. This number will of course need to be greater than the rate at which new infringements appear, if the programme is to be successful.

Following the introduction of the enforcement programme, the rate of natural removal ('R') of infringements will quickly drop off to zero (essentially, the infringements are being removed via enforcement quicker than the rate at which they would otherwise naturally disappear).

As enforcement progresses, the cumulative number of infringements drops off from its pre-existing level, until we reach a steady state (the 'whackamole' phase^[8]) where the monthly number of enforcements ('E') simply needs to 'keep up' with the rate at which new infringements appear ('N'). In other words, each month a certain number of new infringements appear and these are all removed through the actions of the enforcement programme. (N.B. Equivalently, at this point we could express the 'cumulative number of infringements' ('I') as zero, depending on the point in the month at which we carry out the calculation (i.e. whether pre- or post-enforcement).)

In reality, the situation is likely to be far less straightforward, with a number of additional factors complicating the picture, including (but not limited to) the facts that:

The types of infringements actioned over time may change (potentially starting with higher-impact or easier takedowns).

Monitoring will inevitably start to uncover lower visibility and/or lower severity infringements once the initial high-visibility, high-impact infringements have been taken down.

The rate of appearance of online infringements may change in response to the enforcement programme (e.g. infringers turning their attention to easier targets).

The infringers may change their tactics in response to the enforcement programme (e.g. describing goods in different ways) - accordingly, both the monitoring approach and the enforcement methodologies may need to evolve in order to account for this.

Nevertheless, the above very simplistic picture does reflect some of the top-level trends typically seen in a brand-protection programme, with an initial period of 'cleaning up' the pre-existing backlog of infringements followed by a steady-state period of lower required activity, just keeping pace with new infringements as they appear.
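
The 'clean-up' and 'whackamole' phases can be sketched in the same simplified way. The snippet below assumes a constant monthly rate of new infringements, a fixed maximum number of takedowns per month, and zero natural removal once enforcement begins; the function name and the numbers used are hypothetical.

```python
def simulate_enforcement(start_active, new_per_month, capacity, months):
    """Return a list of (month, E, I) tuples once enforcement has started.

    E is the number of takedowns actioned in the month and I is the number
    of infringements still active after that month's enforcement.
    """
    rows = []
    active = start_active
    for month in range(1, months + 1):
        pool = active + new_per_month   # backlog plus this month's new arrivals
        enforced = min(capacity, pool)  # cannot remove more than exists
        active = pool - enforced
        rows.append((month, enforced, active))
    return rows


# Illustrative inputs: a backlog of 400, 100 new infringements per month,
# and capacity for 180 takedowns per month.
for row in simulate_enforcement(400, 100, 180, 8):
    print(row)
```

With these inputs the backlog is cleared in month 5, after which 'E' simply matches 'N' each month. If the capacity were set at or below the rate of new infringements, the backlog would never shrink, which is the point made above about the required level of resource.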

This being the case, we can look to this model to draw insights into how our classic ROI calculation methods could be augmented to provide a fuller picture. In many of the traditional approaches, monthly ROI calculation methodologies make use just of the total monthly numbers of enforcements carried out ('E'). Although the drop-off in the numbers of pre-existing infringements is reflected in the ROI calculations associated with the enforcements carried out during the 'ramp-down' phase itself, it is usually not reflected in the ongoing calculations during the subsequent 'whackamole' phase. Really, it may be preferable to make use of the difference between the ongoing number of infringements ('X') and that observed at the start of the programme ('Y'), if we are to fully assess the impact of the brand-protection programme. In other words, rather than using the number 'X' as the basis of our monthly ROI calculation, it might instead be better to use 'Y – X'. This number instead provides a measure of the value of the ongoing brand-protection programme - essentially, reflecting the difference in the ongoing number of infringements (with the associated 'cost' of them being live) compared with that which would have been observed if the programme were not in place. In practice, determination of these numbers will require the brand-protection initiative to incorporate a comprehensive programme of monitoring (as well as enforcement) throughout, incorporating a full landscape 'audit' at the outset.

Part 2: Benchmarking and the use of controls

To further complicate the situation, what the above approach fails to consider is any changes to the infringement landscape which would have occurred if the brand-protection programme were not being carried out. This is known as the 'attribution' issue in the physical sciences. Of course, once enforcement starts being carried out, we lose the ability to see what would have happened to the numbers of infringements if they were not being actively taken down. It is well established that external factors can significantly change the infringement landscape. For example, numerous previous studies show that real-world events can drive spikes in resulting infringement activity^[9].

One way in which this problem can be addressed is via comparison with another 'control' brand of a similar type, operating in a similar industry area, but for which brand-protection activity is not being carried out. In practice, a brand owner can never be completely sure what any given competitor is doing, so a more realistic scenario is the analysis of a group of industry peers, across which the infringement trends over time can be averaged to create a 'benchmark'. Of course, this requires active monitoring across all these brands, and so may be far from straightforward.

In this case, we may end up with a scenario such as that shown in Figure 3, where the control or benchmark brand (actually ideally an average of the data collected across multiple third-party brands) - which we have to assume reflects external drivers in infringement trends in the absence of enforcement initiatives - shows a change in the infringement landscape since the start of the programme for the brand being protected.

Figure 3: Mock-up of the changing infringement landscape over time, with an enforcement programme introduced for the customer brand in month 12, and compared against a (pre-existing, established) benchmark brand(s)

In the above example, the control brand shows a ramp-up in infringements during the period of the brand-protection programme, perhaps driven by an external event of some sort. Additionally, by using a benchmark comprising data from across numerous brands, we reduce the likelihood that the change is driven by some characteristic specific to one brand (such as a new product launch) and increase the likelihood that the change is representative of the industry landscape in general.

In this case we can assume that, in the absence of a brand-protection programme, the infringement landscape for the customer brand would have increased by the same proportion as that seen for the benchmark brand(s). Therefore, instead of our ROI calculation being a function (' f ') of 'Y – X' (written as 'ROI = f [Y – X]'), we can say that:

ROI = f [ ( (B/A) × Y ) – X ]

Essentially, we are saying that, had the brand-protection programme not been in place, we might have expected the 'background' level of infringements for the customer brand also to have increased by a factor of ('B/A') by the end of the monitoring period, and so the benefit of the programme is in reducing it from this value to the value observed ('X').
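
A short worked example may help to make the adjustment concrete. In the sketch below, 'A' and 'B' are assumed to be the benchmark infringement levels at the start and end of the monitoring period respectively, and the function f is assumed (purely for illustration) to be a simple linear scaling by a cost per infringement. All names and figures are hypothetical.

```python
def avoided_infringements(y, x, benchmark_start=None, benchmark_end=None):
    """Return ((B/A) x Y) - X, falling back to Y - X when no benchmark is given."""
    if benchmark_start is None or benchmark_end is None:
        scale = 1.0  # no benchmark: assume the landscape would have stayed at Y
    else:
        scale = benchmark_end / benchmark_start  # the factor B/A
    return scale * y - x


def roi(avoided, cost_per_infringement):
    """One possible choice of f: linear scaling by the 'cost' of a live infringement."""
    return avoided * cost_per_infringement


# Illustrative inputs: Y = 400 at programme start, X = 100 observed now,
# and a benchmark that rose from A = 200 to B = 300 over the same period.
simple = avoided_infringements(400, 100)
adjusted = avoided_infringements(400, 100, benchmark_start=200, benchmark_end=300)
print(simple, adjusted, roi(adjusted, 250.0))
```

Here the unadjusted figure is 300 avoided infringements, whereas the benchmark-adjusted figure is 500, since the landscape would otherwise have been expected to grow to 600. A benchmark that fell over the period would reduce the figure in the same way.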

Of course, the same approach can also be used if the benchmark shows a decrease in infringements across the monitoring period.

Discussion

The calculation of ROI for brand protection is fiendishly complicated, and no single approach will be applicable in all cases. In any selected methodology, it is necessary to make use of a wide range of assumptions and proxies for the data to which we would ideally like to have access. Nevertheless, there are some general industry-accepted standards for these calculations, many of which utilise metrics around ongoing levels of enforcement activity. In this article, we have considered some approaches which could be taken to modify these methodologies towards a new framework of ideas, involving the following two fundamental changes:

Considering the difference between the ongoing levels of enforcement (as a measure of the ongoing level of infringement activity), and those seen at the outset of the programme, as a measure of the overall impact of the brand-protection programme (rather than just considering the ongoing levels of enforcement in their own right).

Considering the use of one or (ideally) more benchmark brands, to separate out the observed change in infringement levels (for the customer brand) arising from the enforcement activity, from other background or landscape changes applicable to the industry vertical in general.

Even then, there are still other factors to consider - the customer brand may also have experienced (company-specific) issues (such as product launches, changes in sales channels or target markets, etc.) which themselves could have driven changes in the number of infringements, even in the absence of an enforcement programme or industry issues. All of this can further complicate the calculations to be carried out.

Additionally, I anticipate that the general philosophy behind ROI calculations may need to evolve further to reflect other issues more directly tied to cybersecurity, as the importance of this area becomes more widely appreciated. A former colleague of mine recently asked in a LinkedIn posting^[10]:

""So what's the cost?" is a frequent question I hear. Rather than thinking about the budget required, brands need to consider the financial and reputational costs of repairing the damage when they are impacted."

The key point here is thinking about proactive rather than reactive measures. This issue is particularly relevant when it comes to domain security, where a range of products are available to allow corporations to secure their domains from external attack vectors which can be highly damaging (from both financial and reputational points of view)^[11]. The matter is of even greater urgency in a landscape where we still see significant proportions of the world's top companies failing to adequately protect themselves^[12].

The expected financial loss ('L') per year due to (say) cybersecurity issues (an 'attack') is given^[13] by:

L = p_att × C_att

where p_att is the probability of an attack occurring during the year, and C_att is the financial cost (the 'damage') resulting from the attack. From this, we can say that, if the probability of an attack can be reduced (from p_att^{without_security} to p_att^{with_security}) through the implementation of domain security measures, the saving ('S') to the organisation can be written as:

S = ( p_att^{without_security} – p_att^{with_security} ) × C_att

Whilst easy to formulate, this can be much harder to quantify. However, a recent study showed that 88% of organisations were subject to some form of DNS attack in 2021, with each attack costing the enterprise an average of almost $1 million^[14]. If, then, the risk of an attack can be (conservatively) reduced from (say) 10% to 1% through the introduction of security measures, this equates to an equivalent annual saving to the company of the order of $90k. If the cost of implementing the security measures is less than this value, the return on investment will be positive. If we factor in also the implications for access to - and cost of - cyberinsurance cover, the importance of domain security products and services becomes ever clearer.
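
The arithmetic above can be reproduced in a few lines. This sketch uses the illustrative probabilities from the text (10% and 1%) and rounds the cost of an attack to $1 million; the function names and the example cost of the security measures are hypothetical.

```python
def expected_loss(p_attack: float, cost_of_attack: float) -> float:
    """L = p_att x C_att: expected annual loss from an attack."""
    return p_attack * cost_of_attack


def expected_saving(p_without: float, p_with: float, cost_of_attack: float) -> float:
    """S = (p_att without security - p_att with security) x C_att."""
    return (p_without - p_with) * cost_of_attack


saving = expected_saving(0.10, 0.01, 1_000_000)
print(round(saving))  # of the order of 90,000

# The measures pay for themselves if they cost less than the expected saving
# (an annual cost of 50,000 is assumed here purely for illustration).
annual_cost_of_measures = 50_000
print(saving > annual_cost_of_measures)
```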

Acknowledgements

Thanks must go to Angharad Baber, Mark Barrett and David Riley for their feedback and input into this article.

References

[1] https://www.worldtrademarkreview.com/anti-counterfeiting/return-investment-proving-protection-pays

[2] https://www.worldtrademarkreview.com/global-guide/anti-counterfeiting-and-online-brand-enforcement/2022/article/creating-cost-effective-domain-name-watching-programme

[3] https://www.cscdbs.com/blog/four-steps-to-an-effective-brand-protection-program/

[4] https://circleid.com/posts/20220726-calculating-the-return-on-investment-of-online-brand-protection-projects

[5] 'Digital Brand Protection: Investigating Brand Piracy and Intellectual Property Abuse' by Steven Ustel (2019). Chapter 17: 'Accounting and Accountability'

[6] 'Digital Brand Protection: Investigating Brand Piracy and Intellectual Property Abuse' by Steven Ustel (2019). Chapter 9: 'Pivots'

[7] https://www.cscdbs.com/blog/brand-abuse-and-ip-infringements/

[8] By 'whackamole' in this context, I am referring to a consistent state in which infringements are reactively taken down as quickly as they appear (rather than implying a random or disordered approach).

[9] https://www.linkedin.com/pulse/four-new-case-studies-domain-registration-activity-spikes-barnett/

[10] https://www.linkedin.com/posts/stuart-fuller-17a7411_what-cisos-can-do-about-brand-impersonation-activity-7027979839747846144-E6ak

[11] https://www.linkedin.com/pulse/holistic-brand-fraud-cyber-protection-using-domain-threat-barnett/

[12] https://www.cscdbs.com/en/resources-news/domain-security-report/ (2022)

[13] This follows from the fact that, mathematically, the expected value ('E_x') of a variable ('X') is given by:

E_x = Σ_i ( p(X_i) × X_i ), where p(X_i) is the probability of X taking the i^th value

[14] https://www.efficientip.com/wp-content/uploads/2022/05/IDC-EUR149048522-EfficientIP-infobrief_FINAL.pdf

This article was first published on 9 February 2023 at:

https://www.linkedin.com/pulse/calculation-return-investment-brand-protection-thoughts-david-barnett/

Wednesday, 8 February 2023

Hyphenated-domain infringements

Introduction

In this latest study, I consider domain-name infringements consisting of close matches to official brand websites, but differing only in the addition of a hyphen within the domain name. This follows on from previous studies looking at highly-convincing deceptive URLs, such as those utilising exact matches, homoglyphs or fuzzy matches^[1], or hostname-based infringements^[2]. An example of this type of infringement being used for fraudulent purposes was identified in November 2022, for a financial-services brand. The scam comprised a phishing attack utilising an SMS message as the attack vector; a mock-up of the SMS message (represented using the fictitious brand financebrand.com) is shown in Figure 1.

Figure 1: Mock-up of an SMS phishing message utilising a hyphenated-domain infringement

The scam - which utilises the infringing domain name financebran-d.com - has been cleverly designed to take advantage of the tendency of mobile SMS clients to split URLs after the '-' symbol, thereby creating the appearance of the official domain name (financebrand.com) split across a line-break with a breaking-hyphen (as is seen in the other text at the start of the message).

Methodology

To investigate the popularity of this type of infringement, I considered domain registration activity in which the domain name is an exact match to the name of any of the top ten most valuable brands in 2022 according to Interbrand^[3], but including a hyphen between any pair of adjacent characters (e.g. for Google, I searched for 'googl-e', 'goog-le', 'goo-gle', 'go-ogle' and 'g-oogle')^[4]. The analysis encompasses new registrations ('N'), re-registrations ('R') and drops (domain lapses) ('D') (collectively referred to as 'events').
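
The set of search strings described above can be generated programmatically. The following is a minimal sketch (the function name is hypothetical, and it is not necessarily how the searches in this study were constructed) which inserts a single hyphen between each pair of adjacent characters in a brand string.

```python
from typing import List


def hyphenated_variants(brand: str) -> List[str]:
    """Return every variant of `brand` with one hyphen between two adjacent characters."""
    return [brand[:i] + "-" + brand[i:] for i in range(1, len(brand))]


print(hyphenated_variants("google"))
# ['g-oogle', 'go-ogle', 'goo-gle', 'goog-le', 'googl-e']
```

A brand string of n characters yields n - 1 variants, each of which can then be checked across the TLDs of interest.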

In practice, the types of variation considered in this study would be covered by the 'fuzzy' match category included within sophisticated domain monitoring technologies, when simply searching for the brand string itself.

Findings

The dataset included 252 distinct domain registration activity events for the brand variations under consideration, representing 140 distinct domain names (of which 83 were still registered as of the time of analysis^[5] - i.e. those for which the most recent event was not a 'D'). The breakdown of these domains by targeted brand and TLD is shown in Figures 2 and 3.

Figure 2: Breakdown of the 140 distinct hyphenated domain variants by targeted brand

Figure 3: Breakdown of the 140 distinct hyphenated domain variants by TLD
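
The rule used above for identifying domains which are still registered (i.e. those for which the most recent event is not a 'D') can be sketched as follows. The event records shown are invented for illustration, using fictitious domain names, and the tuple layout of (domain, date, event type) is an assumption rather than the format of the underlying dataset.

```python
from datetime import date


def still_registered(events):
    """Return the set of domains whose most recent event is not a drop ('D').

    `events` is an iterable of (domain, event_date, kind) tuples, where kind
    is 'N' (new registration), 'R' (re-registration) or 'D' (drop).
    """
    latest = {}
    for domain, event_date, kind in events:
        if domain not in latest or event_date >= latest[domain][0]:
            latest[domain] = (event_date, kind)
    return {domain for domain, (_, kind) in latest.items() if kind != "D"}


# Hypothetical event log for two fictitious hyphenated variants
events = [
    ("examplebran-d.com", date(2021, 3, 1), "N"),
    ("examplebran-d.com", date(2022, 3, 1), "D"),
    ("examplebran-d.com", date(2022, 6, 15), "R"),
    ("exampl-ebrand.com", date(2020, 1, 10), "N"),
    ("exampl-ebrand.com", date(2021, 1, 10), "D"),
]

distinct_domains = {domain for domain, _, _ in events}
print(len(events), len(distinct_domains), still_registered(events))
```

In this made-up log there are five events across two distinct domains, of which one (the re-registered domain) would be counted as still registered.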

Of the 140 domain names, only 14 (10%) are explicitly registered to the associated brand owner (where the domains are registered and whois information is available), with the remainder registered to third parties and/or utilising privacy-protection services or having redacted information. 11 of the 14 officially owned domains have been configured to re-direct to the main brand website (with the other three not resolving to any live site).

The following is a summary of the characteristics of the 126 remaining sites:

27 (21%) are configured with active MX records, indicating that they have been configured to be able to send and receive e-mails, and could potentially be used for phishing attacks.

One (no longer live) displays a browser warning indicating that dangerous content was formerly present.

Two are configured to re-direct to the corresponding official brand website.

The remainder display a range of content types, as shown in Figure 4.

Figure 4: Overview of content types on the 126 non-official domains in the dataset

Of the 73 possible permutations of .com domains (i.e. those with the greatest potential for confusion with the primary official .com site for the respective brand in question), 30 are present in the dataset, of which only 9 are registered to the brand owner, and 9 are configured with active MX records (of which only one is officially owned).

Figure 5 shows examples of some of the unofficial sites within the overall dataset found to resolve to live content of potential concern.

Figure 5: Examples of live sites hosted on hyphenated domain-name variants targeting the Nike (top), Amazon (middle), and Microsoft (bottom) brands

Summary and recommendations

The analysis shows that the registration of hyphenated domain-name variants targeting the most valuable brand names, by entities other than the brand owners, is a significant issue and may be growing (as 24 of the 71 third-party domains for which creation dates are available were registered in 2022, compared with 17 in 2021, 6 in 2020, and 24 across all earlier years).

Around one in five of the domains are configured with active MX records, and of the domains resolving to live content, a range of site content types was identified. These include examples where web traffic is misdirected to third-party content, and others where the sites are being monetised through the inclusion of pay-per-click links or offers to sell the domain name. This indicates that not only do these domains present the potential for convincing attack vectors in phishing activity, but they may also be taking advantage of misdirected traffic arising from mis-typed search queries or browser requests. It is also noteworthy that the list of top TLDs within the dataset includes a number of new-gTLDs, many of which have previously been noted as being popular with infringers^[6,7,8,9].

These findings highlight the importance of brand owners carrying out proactive and comprehensive programmes of brand monitoring and enforcement, to identify and take down infringing third-party content. Additionally, brand owners may wish to consider proactively registering hyphenated variants of their core domain names as a defensive measure, to prevent them being registered by third parties for fraudulent or infringing use.

References

[1] https://www.cscdbs.com/en/resources-news/threatening-domains-targeting-top-brands/

[2] https://www.linkedin.com/pulse/exploring-domain-hostname-based-infringements-david-barnett/

[3] https://interbrand.com/best-global-brands-2022-download-form/; the brands are: Apple, Microsoft, Amazon, Google, Samsung, Toyota, Coca-Cola, Mercedes-Benz, Disney, Nike

[4] N.B. I exclude from this study any variants where the hyphen appears in the same location as a hyphen or space in the brand name itself (i.e. 'coca-cola' and 'mercedes-benz'), since these are considered exact matches to the brand name, rather than hyphenated variants. I do, however, consider the existence of variants such as 'coca-col-a' and 'cocacol-a'.

[5] All observations correct as of 22-Nov-2022

[6] https://www.cscdbs.com/blog/branded-domains-are-the-focal-point-of-many-phishing-attacks/

[7] https://circleid.com/posts/20210908-credential-hinting-domain-names-a-phishing-lure

[8] https://unit42.paloaltonetworks.com/top-level-domains-cybercrime/

[9] https://www.cscdbs.com/blog/the-highest-threat-tlds-part-2/

This article was first published on 8 February 2023 at:

https://www.linkedin.com/pulse/hyphenated-domain-infringements-david-barnett/

Tuesday, 7 February 2023

Exploring the domain of hostname-based infringements

Introduction

As noted in numerous previous studies, one of the main objectives in the construction of a deceptive infringement (such as a phishing site) may be the use of a URL which appears similar to that of the official site being targeted.

One way in which this can be achieved is by constructing a hostname (consisting of a subdomain and domain name combination) which is identical (apart from an additional dot) to that of the genuine brand site. Active use of this technique has been observed in numerous cases - e.g. considering the case of the fictitious banking brand bankbrand.com, the use of a URL such as ba.nkbrand.com to target the bank's customers with a phishing attack. In order to put this type of attack into practice, the infringer needs to register a domain name which is a truncated form of the official brand site (in the above case, nkbrand.com), allowing them to construct the full hostname by configuring the required subdomain (in this case, 'ba.').

Study methodology

In order to investigate the scale of this practice being used for fraud and other brand infringements, I consider hostname-based variations of each of the top 50 most popular brand websites on the Internet^[1] (see Appendix). For example, for the domain google.com, I investigate whether any live content exists at any of the following hostname-based variations:

g.oogle.com
go.ogle.com
goo.gle.com
goog.le.com
googl.e.com

This approach (i.e. checking the subdomain specifically) is more robust than simply checking whether the truncated versions of the domain names (e.g. oogle.com, ogle.com, etc.) have been registered, since some of these (particularly the shortest domain names) may be in use by unrelated third parties.
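The variants listed above can be generated mechanically. The following Python sketch (the function name and the returned pair structure are illustrative, and are not taken from the study) inserts a dot at each position within the leftmost label of a domain name, and also returns the truncated domain name which an infringer would need to register in order to configure each hostname. It assumes that the dot is only ever inserted into the leftmost label.

```python
def hostname_variants(domain):
    """Return (hostname, truncated_domain) pairs for each dot insertion point."""
    # Split off the leftmost label, e.g. 'google' and 'com' for 'google.com'.
    label, _, suffix = domain.partition(".")
    pairs = []
    for i in range(1, len(label)):
        truncated = f"{label[i:]}.{suffix}"    # domain the infringer must register
        hostname = f"{label[:i]}.{truncated}"  # subdomain plus truncated domain
        pairs.append((hostname, truncated))
    return pairs


for hostname, truncated in hostname_variants("google.com"):
    print(hostname, "requires registration of", truncated)
```

For google.com this yields the five hostnames listed above, each paired with the corresponding truncated domain (oogle.com, ogle.com, and so on).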

Findings

Of the 262 candidate URLs^[2] (i.e. the hostname-based variants of the top 50 domain names), 89 (34%) have active A records (indicating that they point at a live IP address) and 37 (14%) have active MX records (indicating that they have been configured to be able to send and receive e-mails), as shown in Figure 1. Significantly (where whois information is available), only six (2.6%) of the 233^[3] truncated domain-name variants are registered to the brand owner who could be targeted using an associated hostname infringement.

Figure 1: Breakdown of URLs by presence of A and MX records

Of the 89 URLs with active A records, a range of content types was observed, including:

Live third-party content - Pages where the URL resolves or re-directs to content unrelated to the brand in question (i.e. traffic misdirection)

PPC - Sites monetised through the inclusion of pay-per-click links

Domain-for-sale pages - Pages where the domain name is explicitly being offered for sale

A breakdown of the numbers is shown in Figure 2.

Figure 2: Breakdown of URLs with active A records by content type

It is worth noting that some of the instances of URLs resolving to live content may arise through the use of wildcard DNS records^[4] (i.e. where the domain has been configured such that any arbitrary subdomain will resolve, rather than the specific subdomain having been explicitly configured). However, any URL pointing to a live IP address raises the potential for fraudulent or infringing use. At the time of analysis, none of the 262 URLs resolved to live phishing sites targeting the brand in question; however, it has been previously noted that in many cases, sites are left in a dormant state - in some cases, for an extended period of time - before being weaponised^[5,6]. Consequently, many of the sites resolving to parking, holding or inactive pages may be worthy of monitoring for future changes in content. Furthermore, some of the identified instances of URLs resolving to third-party content may be of particular concern to the brand owner, if they misdirect web-users to competitor content or provide an undesirable brand association. Some examples include:

Hostname-based variant of google[.]com → Resolves to a page promoting a VPN product

Hostname-based variant of yandex[.]com → Re-directs to a flight-sales website

Hostname-based variant of xvideos[.]com → Resolves to a third-party adult website

Hostname-based variant of pornhub[.]com → Re-directs to a third-party adult website

Hostname-based variant of linkedin[.]com → Resolves to a gambling-site portal page

Hostname-based variant of ebay[.]com → Re-directs to the Google website

Additionally, the frequency of PPC pages within the dataset indicates the popularity among infringers of monetising domains whilst they are in a dormant state. Furthermore, the fact that many of these examples display content unrelated to the brand in question may also suggest that they have been configured to attract web traffic arising from mistyped browser requests, rather than being intended as explicitly deceptive variants of the brand domain name in question.
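The wildcard DNS caveat noted earlier in this section can be tested for heuristically. The following Python sketch is an assumption about how such a check might be carried out, and is not the method used in this study: if randomly generated subdomains of the same parent domain also resolve, the hostname of interest probably resolves because of a wildcard record, rather than having been explicitly configured. The function names, the number of probes, and the injectable resolver parameter are all illustrative; socket.gethostbyname is used as the default resolver and performs IPv4 lookups only.

```python
import secrets
import socket


def resolves(hostname, resolver=socket.gethostbyname):
    """Return True if the resolver returns an address for the hostname."""
    try:
        resolver(hostname)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def looks_wildcarded(hostname, resolver=socket.gethostbyname, probes=3):
    """Heuristic: do random sibling subdomains of the same parent also resolve?"""
    _, _, parent = hostname.partition(".")
    if not resolves(hostname, resolver):
        return False
    # Random labels are very unlikely to have been explicitly configured.
    return all(
        resolves(f"{secrets.token_hex(8)}.{parent}", resolver)
        for _ in range(probes)
    )


# Demonstration with stub resolvers (no network access is needed).
def stub_wildcard(hostname):
    return "192.0.2.1"  # every name resolves, as with a wildcard record


def stub_explicit(hostname):
    if hostname == "ba.nkbrand.com":
        return "192.0.2.1"
    raise OSError("name not found")


print(looks_wildcarded("ba.nkbrand.com", resolver=stub_wildcard))
print(looks_wildcarded("ba.nkbrand.com", resolver=stub_explicit))
```

A positive result does not make the hostname harmless (as noted above, any URL pointing to a live IP address has the potential for misuse), but it may help to prioritise which hostnames appear to have been deliberately configured.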

As a final observation, we can compare the date of registration with the length (in characters) of the second-level domain (SLD) name string (i.e. the portion of the domain name prior to the TLD, or domain extension), for each of the 233 potentially infringing domain names in the dataset (where these are registered and have whois information available) (Figure 3).

Figure 3: Comparison of date of registration with length of the SLD name, for the domains comprising (right-) truncated versions of the top 50 most popular domain names

The dataset shows that the domains in question have been registered over an extended period, between 1986 and 2022. The shorter domain names - i.e. those which are more likely to have been used for unrelated third-party or generic use - tend to comprise the oldest registrations. However, many of the domains with longer SLD string lengths - i.e. those less likely to be associated with 'accidental' brand collisions, and more likely to have been registered specifically to create hostname-based infringements - tend to have been registered over the last few years, highlighting a potential growth in popularity over time of this particular attack vector.

Summary and recommendations

The proportion of hostname-based infringements resolving to live content, or configured with active A and/or MX records - combined with previous observations of the use of this type of infringement as a phishing attack vector - highlights the scale of this infringement type as a potential source of concern. Consequently, brand owners may wish to consider proactively registering or acquiring domain names comprising truncated versions (where the right-hand end is retained) of their core domain name, to prevent registration and abuse by a third party. In cases where acquisition is not possible, it may be advisable to monitor the hostname-based infringements for future changes in content and - if and when active infringing content is detected - to launch a timely enforcement action for the takedown of the material.

Appendix

Top 50 most popular websites according to Similarweb (October 2022).

Rank	Website	Category
1	google[.]com	Computers Electronics and Technology → Search Engines
2	youtube[.]com	Arts & Entertainment → Streaming & Online TV
3	facebook[.]com	Computers Electronics and Technology → Social Media Networks
4	twitter[.]com	Computers Electronics and Technology → Social Media Networks
5	instagram[.]com	Computers Electronics and Technology → Social Media Networks
6	baidu[.]com	Computers Electronics and Technology → Search Engines
7	wikipedia[.]org	Reference Materials → Dictionaries and Encyclopedias
8	yandex[.]ru	Computers Electronics and Technology → Search Engines
9	yahoo[.]com	News & Media Publishers
10	xvideos[.]com	Adult
11	whatsapp[.]com	Computers Electronics and Technology → Social Media Networks
12	pornhub[.]com	Adult
13	amazon[.]com	eCommerce & Shopping → Marketplace
14	xnxx[.]com	Adult
15	yahoo[.]co[.]jp	News & Media Publishers
16	live[.]com	Computers Electronics and Technology → Email
17	netflix[.]com	Arts & Entertainment → Streaming & Online TV
18	docomo[.]ne[.]jp	Computers Electronics and Technology → Telecommunications
19	tiktok[.]com	Computers Electronics and Technology → Social Media Networks
20	reddit[.]com	Computers Electronics and Technology → Social Media Networks
21	office[.]com	Computers Electronics and Technology → Programming and Developer Software
22	linkedin[.]com	Computers Electronics and Technology → Social Media Networks
23	dzen[.]ru	Community and Society → Faith and Beliefs
24	vk[.]com	Computers Electronics and Technology → Social Media Networks
25	xhamster[.]com	Adult
26	samsung[.]com	Computers Electronics and Technology → Consumer Electronics
27	turbopages[.]org	News & Media Publishers
28	mail[.]ru	Computers Electronics and Technology → Email
29	bing[.]com	Computers Electronics and Technology → Search Engines
30	naver[.]com	News & Media Publishers
31	microsoftonline[.]com	Computers Electronics and Technology → Programming and Developer Software
32	twitch[.]tv	Games → Video Games Consoles and Accessories
33	discord[.]com	Computers Electronics and Technology → Social Media Networks
34	bilibili[.]com	Arts & Entertainment → Animation and Comics
35	pinterest[.]com	Computers Electronics and Technology → Social Media Networks
36	zoom[.]us	Computers Electronics and Technology → Other Computers Electronics and Tech.
37	weather[.]com	Science and Education → Weather
38	qq[.]com	News & Media Publishers
39	microsoft[.]com	Computers Electronics and Technology → Programming and Developer Software
40	globo[.]com	News & Media Publishers
41	roblox[.]com	Games → Video Games Consoles and Accessories
42	duckduckgo[.]com	Computers Electronics and Technology → Search Engines
43	news[.]yahoo[.]co[.]jp	News & Media Publishers
44	quora[.]com	Reference Materials → Dictionaries and Encyclopedias
45	msn[.]com	News & Media Publishers
46	realsrv[.]com	Adult
47	fandom[.]com	Arts & Entertainment → Other Arts and Entertainment
48	ebay[.]com	eCommerce & Shopping → Marketplace
49	aajtak[.]in	News & Media Publishers
50	ok[.]ru	Computers Electronics and Technology → Social Media Networks