As the brand protection industry approaches a quarter of a century in age, following the founding of pioneers Envisional[1] and MarkMonitor[2] in 1999, I present an overview of some of the main outstanding issues which are frequently unaddressed or are generally only partially solved by brand protection service providers. I term these the 'Millennium Problems' in reference to the set of unsolved mathematical problems published in 2000 by the Clay Mathematics Institute[3], and for which significant prizes were offered for solutions. Like their mathematical counterparts, the unsolved problems in brand protection will present significant benefits for any service providers able to develop and offer comprehensive solutions.
Brand protection basics
In their most basic sense, brand protection solutions generally consist of two components: monitoring (or, strictly, detection) of brand-related content on the Internet, and enforcement action to achieve the removal of infringing material. Monitoring is most usually carried out using technological solutions intended to identify relevant material on the Internet, across a range of relevant channels, typically using a combination of methodologies, namely: (i) Internet metasearching (i.e. the submission of relevant query terms to search engines) and web crawling; (ii) analysis of domain-name zone files (see Problem 2), to identify domains with names including brand-related terms (or variants); (iii) direct monitoring / searching on known sites of interest (see Problem 1); and (iv) other techniques, such as the use of spam traps and webserver logs, as used in phishing detection technologies[4]. Many service providers will also make use of automated analysis tools, which can inspect the content of the identified webpages, and categorise and prioritise these results accordingly.
The 'Millennium Problems'
1. Social media monitoring
Whilst monitoring of content across social media platforms is a well-established element of many brand-protection service providers' product suites, it frequently remains extremely difficult to achieve anything approaching a comprehensive level of coverage. There are a number of reasons why this is the case. In general, social media content is most usually addressed using the 'direct site searching' approach (that is, using the search functionality typically in-built to the platforms themselves as a means of returning results), though some providers also have access to direct data feeds from the platforms (e.g. through an API). In general, a variety of types of content may be of interest, including brand references in usernames (e.g. associated with fake profiles), and the content of postings (e.g. associated with fraud, the sale of counterfeits, the spread of malware, brand disparagement, etc.) and elsewhere (including imagery, sponsored advertisements, and so on).
The main difficulty with the 'direct search' approach is that results presented to a user are often limited (sometimes significantly) unless the user is logged in to the social media platform. This can be circumvented by configuring a brand-protection monitoring tool to present itself to the platform as if it is a real user (with a registered account, handle (username) and password), or simply through the use of manual searches. Both of these approaches typically require the use of 'dummy' accounts and may be in contravention of the terms and conditions of the platforms themselves.
Other technological issues may also be problematic. Many social media platforms return results on an 'infinite scroll' basis (where additional results are continually added to the webpage as the user continues to scroll down through them), often with no indication of the total numbers of results which may be present, and many platforms also have specific access requirements, such as functionality only to be accessed via a mobile app (see Problem 7). Similarly, monitoring can be further complicated by sites where content is protected via the requirement to enter a CAPTCHA code, for example. It is also typically the case that the exact results returned to a user will be highly personalised, and dependent on their browsing history, interests, location, and personal demographic.
Some of these issues can be addressed through the development of partner relationships by brand-protection service providers with the platforms themselves. However, even in cases where the platforms are amenable to this approach, some of the above technological issues may remain difficult to address.
2. Comprehensive ccTLD monitoring
Another of the core elements of many brand protection service offerings is often a domain monitoring capability; that is, the ability to identify domains whose names include the name of the brand being infringed (and/or other relevant keywords). As a special subset of general Internet content, branded domain names are often of particular interest by virtue of their greater visibility (e.g. higher ranking in search-engine results) and the more explicit nature of the IP abuse (and an associated greater range of enforcement options)[5]. Branded domain names have been noted in many previous studies as being popular with bad actors in the creation of infringing content of a variety of types, including phishing sites[6], sites offering the sale of counterfeits, and sites claiming false affiliation or including disparaging content.
The primary source of data for domain monitoring is usually the analysis of zone files, which are data files published by the registry organisations responsible for overseeing the infrastructure of each individual TLD (top-level domain, or domain extension - such as .com), and which contain a list of all existing registered domains across that extension. By comparing the content of a zone file with that from the previous day, it is possible to identify new domain registrations (as well as dropped, or lapsed, domains) and filter this list for those examples containing a brand name or keyword of interest. Domain monitoring solutions can (and, in general, should) also make use of zone-file analysis to allow identification of the full pre-existing 'landscape' of registered domain names of interest, across the TLDs in question, at the commencement of monitoring (so-called 'baseline' analysis). The most sophisticated domain monitoring solutions can also automatically check for variations of the brand strings (such as typos), which are frequently used by infringers to construct deliberately deceptive domain names[7,8].
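The day-on-day comparison described above can be sketched in a few lines of Python. This is a minimal illustration rather than a description of any particular provider's implementation: it assumes the zone file is available as plain text in the standard master-file layout with fully-qualified owner names at the start of each record, and the function names (parse_zone_domains, diff_zones, filter_by_brand) are hypothetical.

```python
def parse_zone_domains(zone_text, tld):
    """Collect the registered domain names listed in the text of a zone file.

    Assumption: each record starts with a fully-qualified owner name,
    e.g. 'example.com. 172800 IN NS ns1.example.net.'
    """
    suffix = "." + tld.lower().strip(".")
    domains = set()
    for raw in zone_text.splitlines():
        # Skip blank lines, comments (;), directives ($) and continuation lines.
        if not raw.strip() or raw[0] in " \t;$":
            continue
        owner = raw.split()[0].lower().rstrip(".")
        if owner.endswith(suffix):
            # Keep only the label immediately to the left of the TLD,
            # so that 'ns1.example.com' is recorded as 'example.com'.
            labels = owner[: -len(suffix)].split(".")
            domains.add(labels[-1] + suffix)
    return domains


def diff_zones(previous, current):
    """Return (new registrations, dropped domains) between two snapshots."""
    return current - previous, previous - current


def filter_by_brand(domains, keywords):
    """Keep domains whose second-level label contains any keyword."""
    keywords = [k.lower() for k in keywords]
    return sorted(
        d for d in domains if any(k in d.split(".")[0] for k in keywords)
    )
```

Running parse_zone_domains over a single day's file, without the comparison step, gives the 'baseline' landscape; the same filter_by_brand call can then be applied to that full set.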
Zone files are generally available for most gTLDs (generic, or global, TLDs such as .com, .net, etc.) plus the new-gTLDs which have been launched in the period since 2012[9], but are often not published (or may not be comprehensive) by the registry organisations responsible for other TLDs, particularly the country-specific examples (ccTLDs). For this reason, detection of relevant domains across ccTLD extensions is typically incomplete, and a number of techniques may be used in order to fill in the gaps. These might include parallel look-ups (checks for domains with the same second-level domain name - i.e. the part of the domain name to the left of the dot - as examples identified through zone-file analysis), exact-match queries (regular searches for the existence of domains with second-level domain name strings of particular relevance, such as a brand name), and Internet metasearching. However, each of these approaches has its own limitations and, even when all are taken together, there can always be domain names of potential concern which are not detected through any of these methods. The next generation of domain monitoring solutions will need to better address these shortcomings, potentially through the use of improved algorithms to 'guess' candidate domain names for checking, and/or the use of more comprehensive indexes of Internet content. Additionally, the building of specific relationships with country registries - potentially combined with regulatory changes regarding the availability of zone files - may also be relevant.
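As a rough illustration of how candidate names for exact-match checking might be 'guessed', the sketch below generates simple typo variants of a brand string (character omission, duplication and adjacent transposition) and combines them with a list of ccTLDs. The variant rules, the optional keyword combinations and the function names are all assumptions made for the example; a production tool would use a much richer set of rules, and would then need to test each candidate for existence (e.g. via DNS or WHOIS queries), which is not shown here.

```python
def typo_variants(brand):
    """Generate simple typo variants of a brand string (illustrative rules only)."""
    brand = brand.lower()
    variants = set()
    for i in range(len(brand)):
        variants.add(brand[:i] + brand[i + 1:])          # character omission
        variants.add(brand[:i] + brand[i] + brand[i:])   # character duplication
    for i in range(len(brand) - 1):
        # transposition of two adjacent characters
        variants.add(brand[:i] + brand[i + 1] + brand[i] + brand[i + 2:])
    variants.discard(brand)  # the exact brand string is not a variant
    variants.discard("")     # can arise for a one-character input
    return variants


def candidate_domains(brand, tlds, keywords=()):
    """Build a sorted list of candidate domain names to check for existence."""
    brand = brand.lower()
    labels = {brand} | typo_variants(brand)
    for kw in keywords:
        kw = kw.lower()
        labels.add(brand + kw)
        labels.add(brand + "-" + kw)
    return sorted(
        {label + "." + tld.lower().strip(".") for label in labels for tld in tlds}
    )
```

For example, candidate_domains("acme", ["de", "fr"], keywords=["shop"]) includes the exact matches acme.de and acme.fr alongside variants such as amce.fr and acme-shop.de.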
3. Third-party subdomain monitoring
The subdomain is the section of a URL prior to the domain name, from which it is separated by a dot (e.g. 'translate' in 'translate.google.com'). The owner of a domain name can create whatever subdomains they wish, and can point these URLs to associated web content (via the configuration of DNS settings). Accordingly, subdomains can be used to create brand-related URLs, and can be associated with many of the same types of infringements as domain names themselves[10]. Subdomain-based abuse can also be particularly attractive to infringers, both because it avoids the requirement to register a brand-specific domain name[11] (which bad actors know can easily be detected by brand owners employing domain-monitoring services) and because there can be a low cost associated with the creation of the URL, particularly where a service provider allowing the free registration of personalised subdomains (such as blogspot.com) is used.
Consequently, the ability to monitor generally for brand references in the subdomain name of arbitrary URLs can be of great value. Note that this is distinct from the (relatively much simpler) problem of monitoring the existence and content of subdomains of official domains under the ownership of the brand owner ('internal' subdomain monitoring), since, in that case, all of the relevant information is contained in the DNS configuration files held by the brand owner's domain-name management service provider.
Conversely, the identification of brand-related subdomains on third-party ('external') domain names is much more difficult. In many cases, this is achieved purely using Internet metasearching techniques (i.e. finding only content which is indexed by search engines in response to brand-specific query terms). Whilst this does mimic the search techniques used by general Internet users (and thereby identify the 'highest-visibility' content), it will in general not find all potentially threatening content (e.g. URLs to which traffic is driven through other means, such as links in spam e-mails). This problem can be mitigated to some degree through the use of other techniques, such as passive DNS analysis or certificate transparency (CT) analysis, or via explicit queries for the existence of specific subdomain names of interest. However, these techniques require prior identification of the specific domains to be monitored; generalised identification of brand-related subdomains remains a much harder problem to solve.
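Once a list of hostnames has been gathered from one of these sources, the filtering step itself is simple, as the following sketch suggests. It assumes the hostnames have already been collected (the collection step is the hard part and is not shown), and it makes the deliberately naive assumption that the registered domain is always the last two labels of the hostname; a real tool would consult the Public Suffix List to handle extensions such as .co.uk. The function names are hypothetical.

```python
def split_host(hostname, suffix_labels=2):
    """Split a hostname into (subdomain part, registered domain).

    Naive assumption: the registered domain is the last `suffix_labels`
    labels. This is wrong for multi-label suffixes such as .co.uk.
    """
    labels = hostname.lower().strip(".").split(".")
    if labels and labels[0] == "*":  # wildcard entries, as seen in certificates
        labels = labels[1:]
    return ".".join(labels[:-suffix_labels]), ".".join(labels[-suffix_labels:])


def find_brand_subdomains(hostnames, brand, official_domains=()):
    """Return (hostname, registered domain) pairs where the brand string
    appears in the subdomain part of a third-party domain."""
    brand = brand.lower()
    official = {d.lower() for d in official_domains}
    hits = []
    for host in hostnames:
        subdomain, domain = split_host(host)
        if domain in official:
            continue  # 'internal' subdomains are a separate, simpler problem
        if brand in subdomain:
            hits.append((host, domain))
    return hits
```

Given the hostnames acme-login.blogspot.com, www.acme.com and shop.example.org, with acme.com marked as official, only the first would be flagged.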
4. Circumventing site blocking and geoblocking
Site blocking and geoblocking are two long-established problems in brand monitoring. The former arises when a monitored site becomes aware of repeated search queries from a particular source, and restricts access to the site from the IP address in question. A site owner may choose to do this for a number of reasons, including protection of website performance (e.g. in preventing DDoS attacks), or for compliance with their own terms and conditions (e.g. where they state that information is not to be collected for commercial purposes, such as by brand-protection service providers). Geoblocking (or geotargeting) is a related issue, whereby the visible content of a website may vary depending on the geographical location of the visitor. Again, this may be implemented by a site owner for a range of reasons, including the tailoring of content to a local audience, search-engine optimisation, security, or legal compliance[12]. However, geoblocking can also be employed by infringers as a means of evading detection, and can also present difficulties in enforcement, where it may be necessary to demonstrate exactly what content is visible from a specific remote location.
The solutions to these issues, from a brand-protection point of view, are relatively simple in principle, generally involving the use of proxies (standalone external machines serving as intermediate 'hops' through which search queries from a brand-protection service provider are routed, so as to 'mask' the originating IP address) in a range of remote locations, and/or (particularly for site blocking) the building of relationships with the sites being monitored, so that the monitoring service provider can gain permission for collecting the data. However, in practice this requires a great deal of investment in building the required infrastructure (such as hosting and maintaining the necessary proxies, and configuring the monitoring software to communicate with them) and establishing the necessary relationships. Furthermore, the construction of appropriate user interfaces to visualise and interpret the relevant information (such as the ability to compare the content of a particular website across a range of different user (i.e. proxy) locations, in cases where geoblocking or geotargeting may be an issue) can also be a complex prospect.
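The comparison of content across proxy locations can be illustrated with a short sketch. It assumes that a copy of the page has already been retrieved through a proxy in each location (the retrieval step is not shown) and simply groups the locations according to whether they were served identical content after whitespace and case are normalised. The function names and the choice of an exact-match fingerprint are assumptions for the example; pages with dynamic elements (timestamps, session tokens) would need a more tolerant similarity measure.

```python
import hashlib
import re


def fingerprint(html):
    """Hash the page content after collapsing whitespace and lower-casing."""
    text = re.sub(r"\s+", " ", html).strip().lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def group_by_content(snapshots):
    """Group proxy locations that were served identical content.

    snapshots: dict mapping a location code to the HTML retrieved there.
    Returns a sorted list of sorted location groups.
    """
    groups = {}
    for location, html in snapshots.items():
        groups.setdefault(fingerprint(html), []).append(location)
    return sorted(sorted(locations) for locations in groups.values())
```

If the function returns more than one group for a given URL, the site is presenting different content to different locations, and the groups indicate which locations would need to be evidenced separately in any enforcement action.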
5. Clustering and open-source intelligence analysis
The subject areas of clustering and open-source intelligence (OSINT) are generally of greatest relevance for entity investigations, i.e. the process of using Internet searches to build a portfolio of information relating to an identified individual or website of interest. Such information can be used for a range of purposes, including background for on-the-ground investigations or goods seizures, or for legal cases, but can also be useful background for enforcement actions (e.g. in identifying clusters of related infringements for efficient bulk takedowns in a single action).
A number of technological solutions exist for visualising the links behind related entities, on the basis of common shared characteristics (such as e-mail addresses, telephone numbers, web-hosting information such as IP addresses, and so on) - i.e. 'clustering', but it is often the case that the characteristics themselves require identification through manual analysis processes. A great deal of additional efficiency can be built into the process, however, through the use of monitoring and analysis tools which can identify and extract this information automatically. This is relatively more straightforward in cases where the data can be extracted in a consistent manner (e.g. performing an IP-address look-up for any identified website of interest), and/or where the information is contained in a known location on a webpage with a fixed, pre-defined format (e.g. the 'contact details' section of a social-media profile page), such that a web scraper can be configured to pull out the content. It is a considerably more difficult enterprise to extract such information from general webpages where the structure of each page is not known in advance. In these cases, the approach generally needs to be based on the configuration of monitoring tools which are able to extract text-strings with the general format of (say) an e-mail address or telephone number. This then typically requires an element of post-processing to 'clean' and standardise the data. The next generation of clustering tools is likely to make extensive use of artificial intelligence in order to do this, in addition to then drawing out insights between the clusters thus produced.
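The extract-clean-cluster sequence described above can be sketched as follows. The regular expressions are deliberately loose illustrations of 'the general format' of an e-mail address or telephone number, not validated patterns, and the standardisation step (lower-casing e-mail addresses, stripping telephone numbers down to their digits) is an assumption about what 'cleaning' might involve. Pages are then linked into clusters whenever they share an extracted identifier, using a simple union-find structure. All names are hypothetical.

```python
import re
from collections import defaultdict

# Loose, illustrative patterns; real-world data needs more careful handling.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def extract_identifiers(text):
    """Extract and standardise e-mail addresses and telephone numbers."""
    emails = {m.lower() for m in EMAIL_RE.findall(text)}
    phones = {re.sub(r"\D", "", m) for m in PHONE_RE.findall(text)}  # digits only
    return emails | phones


def cluster_pages(pages):
    """Group pages which share at least one identifier, directly or indirectly.

    pages: dict mapping a URL to the text of that page.
    Returns a sorted list of sorted URL clusters.
    """
    parent = {url: url for url in pages}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # identifier -> first URL on which it was found
    for url, text in pages.items():
        for ident in extract_identifiers(text):
            if ident in first_seen:
                parent[find(url)] = find(first_seen[ident])  # merge clusters
            else:
                first_seen[ident] = url

    clusters = defaultdict(list)
    for url in pages:
        clusters[find(url)].append(url)
    return sorted(sorted(c) for c in clusters.values())
```

One limitation worth noting is that the loose telephone pattern will also match other long digit strings (order numbers, or digits within an e-mail address), which is exactly the kind of noise the post-processing step has to remove.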
6. Dark Web monitoring
Dark Web content is the general name given to online material for which there are special access requirements; however, in the context of online brand monitoring, it is usually taken to refer to content which is only accessible via the Tor network (a decentralised network involving the use of encrypted communications, and connections via multiple hops between Tor servers, or proxies - also known as relays or nodes). The Tor network - which is accessed using specially enabled browsers - can be used to view regular ('surface web') Internet content (and is one option open to users for whom anonymity is important), but is more usually used to access websites with the .onion extension, i.e. those which are only accessible from within the network[13].
The Tor network of .onion websites includes a range of different content types, but is notorious for illegal and infringing content and, as such, can be a key area of interest for brand monitoring. However, many brand protection service providers offer only limited capabilities in this area. This is for a number of different reasons. One significant factor is that the Dark Web is essentially unregulated, frequently with no available links to 'real-world' contact details, and extremely limited enforcement options against infringing content. However, even in cases where takedown is not possible, intelligence on the content can be extremely valuable - one example may be on 'carder' websites, on which stolen financial credentials are traded; if (say) a financial services company can determine that the details for a particular credit card or bank account are being offered for sale, this provides the opportunity for the account to be 'locked' or deactivated.
It can also be extremely difficult to configure monitoring software to search the Dark Web. Whilst it is technically relatively straightforward to configure systems to be Tor-enabled (although connections are typically rather slow), there are generally no robust indexes of Dark Web content (such as the search engines and zone files used to search surface-web content), not least because the .onion addresses for any given website - which usually consist of long, random alphanumeric strings - are generally short-lived and change over time. A number of Dark Web search engines do exist, together with ad-hoc indexes of Dark Web content posted by users on sites such as Pastebin, but the information on these sources typically becomes out-of-date rather quickly.
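One small part of this problem, harvesting candidate .onion addresses from ad-hoc indexes and paste-site posts, can be illustrated with a short sketch. It assumes the text of the post has already been retrieved, and relies on the fact that current (version 3) .onion addresses are 56 characters long and use the base32 alphabet (a-z and 2-7); the older 16-character format is also matched for completeness. The function name is hypothetical, and the sketch makes no attempt to verify the checksum embedded in an address or to test whether the site is still live.

```python
import re

# 56-character (v3) or legacy 16-character (v2) base32 labels ending in .onion
ONION_RE = re.compile(r"\b(?:[a-z2-7]{56}|[a-z2-7]{16})\.onion\b")


def extract_onion_addresses(text):
    """Return the sorted, de-duplicated .onion addresses found in a block of text."""
    return sorted({m.group(0) for m in ONION_RE.finditer(text.lower())})
```

Because the addresses go out of date quickly, any list produced in this way would need to be re-checked regularly over a Tor-enabled connection.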
The nature of the content on the Dark Web also means that security concerns can be an issue for brand-protection service providers wishing to build their capabilities in this area.
7. Mobile-based technologies
As Internet engagement has continued to grow over recent years, an increasing proportion of Internet use is conducted over mobile devices[14,15], using a wide ecosystem of mobile apps. Many platforms are now almost exclusively mobile-based, often with little or no corresponding web presence - popular examples might include the WeChat / Weixin platforms, public groups on messaging services such as WhatsApp, and e-commerce platforms such as Pinduoduo. Many brand-protection service providers use legacy monitoring technologies which were designed specifically for analysing HTML content on the regular Internet and are often poorly equipped to address mobile technologies. In some cases, the work-around is to make use of standalone mobile devices or emulators - on which significant proportions of the monitoring is conducted manually - and there typically remains significant work to be done in order to fully integrate the relevant technologies into core monitoring capabilities.
8. Addressing the Web3 landscape
Web3 (also known as 'Web 3.0') is a general term referring to decentralised content on the Internet, with a particular focus on blockchain technologies. Blockchains are publicly accessible digital ledgers in which transactions are recorded, and form the basis of many digital currencies (or 'cryptocurrencies') (such as Bitcoin), in addition to a number of other applications, such as supply-chain control by brand owners. From a brand-protection viewpoint, the main related areas of interest are typically NFTs and blockchain domains[16].
NFTs (non-fungible tokens) are digital files whose ownership is recorded on a blockchain. They are most commonly associated with graphics files (such as artworks and branded imagery) or other types of digital content (such as audio or music files). However, brand owners are increasingly incorporating NFTs into their business models, including areas such as the production and trade of virtual branded items (e.g. items to be worn by avatars in virtual-reality environments within the 'metaverse', the name given to a generalised connected environment of 3D virtual worlds). Consequently, unofficial branded NFTs can be a source of concern for brand owners.
Blockchain domains - which are recorded (together with their ownership details) on a blockchain, rather than using traditional registrars and web hosting - have a number of similarities to 'classic' domain names, and can be utilised in a number of ways. The most common uses are the creation of decentralised websites on peer-to-peer (P2P) platforms, to be accessed via specially-enabled browsers, or as addresses for sending and receiving cryptocurrency. However, the blockchain domain ecosystem is essentially unregulated, and nothing analogous to domain-name zone files is available. The system is further complicated by the fact that the infrastructure allows for the possibility of domain-name 'clashes' - i.e. the potential for the same name to exist independently on distinct blockchains. As with traditional domain names, blockchain domains with brand-specific names can be a threat to brand owners, and a potential source of confusion for customers.
Both NFTs and blockchain domains can be traded on NFT marketplaces (such as OpenSea), and the monitoring of these sites is typically the primary source of intelligence utilised by those brand-protection service providers offering capabilities in this area. For blockchain domains particularly, this approach is less than satisfactory, and offers nothing approaching the sort of comprehensive coverage as is available for regular gTLD domain names via zone-file analysis. Some additional information on the existence of registered blockchain domains is typically available through direct searches within databases provided by blockchain domain registrars and nameserver providers; however, the problem of more comprehensive detection is much more difficult to solve, potentially involving analysis of the content of the individual blockchains directly.
Another difficulty to be overcome in service offerings relating to NFTs and blockchain domains is the issue of enforcement against infringing content. In some cases, enforcement can be carried out through the submission of a DMCA (Digital Millennium Copyright Act) notice, and some NFT marketplaces have specific takedown procedures for content which infringes protected IP. However, in many cases, this simply involves the item being 'delisted' from the marketplace in question. In the future, we may see a move towards more rigorous enforcement, potentially involving forced transfers of ownership. Part of the problem is that the legal issues surrounding NFTs and blockchain domains are, in many cases, still not well-defined and are rapidly evolving, complicated by factors such as the fact that ownership of an NFT does not necessarily grant ownership of copyright for the embedded content.
Beyond #8: Other emerging technologies
As new Internet technologies continue to emerge and develop, they will bring with them new risks for brand owners and associated challenges for brand-protection service providers, who will need to continue to observe and innovate in order to stay ahead of the curve.
At any given time, it is unclear where the next area of concern will come from. Currently, there is a great deal of buzz and speculation about artificial intelligence (AI) technologies and chatbots such as ChatGPT, but it is less obvious how these may affect brand-protection considerations. In this context, I am referring to content associated with, or produced by, AI applications. (Conversely, however, it seems highly likely that AI capabilities will be increasingly built into technologies used to facilitate the brand-protection process - i.e. tools to assist with monitoring, prioritisation, clustering and enforcement.)
Users are able to communicate with AI technologies such as ChatGPT via natural language, and the technologies are then able to construct responses based on the information with which they have been 'trained'. This means that the information available from a chatbot is only as good as the data with which it has been trained (which, in the case of ChatGPT, essentially includes large volumes of Internet databases[17,18]), and should really be treated with at least as much caution as the old "I'm Feeling Lucky" button on Google, where the user is just presented with a single response (not necessarily the most reliable one!) to any given query. This point is all the more valid given the ability of chatbots to extrapolate, and provide responses based on incomplete information. What this all means is that chatbots pose the risk of providing information about (say) a company or brand which is misleading or otherwise damaging to corporate reputation. However, since responses are generated dynamically in response to queries (rather than being 'fixed', as in the content of an HTML webpage), it is not clear how these issues might be addressed from a brand-protection point of view. Further complications surround issues such as the ownership of rights to content produced by AI technologies[19].
Where chatbots may be of particular concern from a brand-protection and cybersecurity point of view is in their ability to rapidly create content of a wide variety of types, in a range of different styles - including the ability to write and debug computer code. What this may mean is that the entry barrier for infringers wishing to create compelling phishing e-mails[20], or write malicious programs ('malware')[21], may be significantly diminished. The likelihood is - at least in the first generations of AI technologies - that AI will not so much change the types of attack which are possible, but rather the ease with which they can be executed[22].
Another issue surrounds use-cases in which AI systems are 'trained' with confidential corporate information as part of the process of creation of company materials (such as marketing releases). These scenarios raise the possibility for the information to be accessed by third parties, either directly via hacking, or via content included in the responses provided to other users, depending on the ways in which information is 'shared' within the infrastructure of the AI technology itself[23].
References
[1] https://www.cst.cam.ac.uk/ring/halloffame
[2] https://www.markmonitor.com/download/ds/MarkMonitor-Corporate-Overview.pdf
[3] https://www.claymath.org/millennium-problems
[4] https://www.linkedin.com/pulse/assessing-mediating-digital-risk-landscape-brand-david-barnett/
[5] https://www.worldtrademarkreview.com/global-guide/anti-counterfeiting-and-online-brand-enforcement/2022/article/creating-cost-effective-domain-name-watching-programme
[6] https://www.cscdbs.com/blog/branded-domains-are-the-focal-point-of-many-phishing-attacks/
[7] https://www.cscdbs.com/en/resources-news/threatening-domains-targeting-top-brands/
[8] https://www.linkedin.com/pulse/hyphenated-domain-infringements-david-barnett/
[9] https://newgtlds.icann.org/en/about/program
[10] https://www.cscdbs.com/blog/the-world-of-the-subdomain/
[11] https://www.linkedin.com/pulse/exploring-domain-hostname-based-infringements-david-barnett/
[12] https://www.cscdbs.com/blog/do-you-see-what-i-see-geotargeting-in-brand-infringements/
[13] 'Brand Protection in the Online World: A Comprehensive Guide' by David Barnett (2016). Chapter 11: ''Deep' and 'Dark' Web'
[14] https://www.statista.com/statistics/617136/digital-population-worldwide/
[15] https://www.linkedin.com/pulse/holistic-brand-fraud-cyber-protection-using-domain-threat-barnett/
[16] https://www.linkedin.com/pulse/rise-nft-david-barnett
[17] https://www.sciencefocus.com/future-technology/gpt-3/
[18] https://techcrunch.com/2023/03/23/openai-connects-chatgpt-to-the-internet/
[19] https://intellectual-property-helpdesk.ec.europa.eu/news-events/news/intellectual-property-chatgpt-2023-02-20_en
[20] https://securityboulevard.com/2023/01/what-does-chat-gpt-imply-for-brand-impersonation-qa-with-dr-salvatore-stolfo/
[21] https://www.digitaltrends.com/computing/chatgpt-created-malware/
[22] https://venturebeat.com/security/security-risks-evolve-with-release-of-gpt-4/
[23] https://blogs.blackberry.com/en/2023/04/is-chatgpt-safe-for-organizations-to-use
This article was first published on 25 May 2023 at:
https://circleid.com/posts/20230525-the-millennium-problems-in-brand-protection