David Barnett's Brand Protection Articles: April 2026

Introduction

The monitoring component of brand protection services aims to identify infringing web content relating to a particular brand, which can then be considered for subsequent analysis, reporting, and potential enforcement.

Monitoring generally makes use of a number of data-collection techniques, including the analysis of domain-name data files (typically the 'zone files' published by domain registry organisations) and the use of search-engine queries (to identify those webpages returned in response to relevant searches). However, even when these approaches are combined, gaps can be left in detection capability, meaning that significant findings can remain undetected.

One category of findings which may be missed by 'classic' monitoring approaches is material on websites which: (a) are hosted on a domain that does not contain the name of the brand being infringed; and (b) are not indexed by search engines (or are returned only when highly specific search terms are used). This difficulty in detection can be a significant issue if the content poses a high degree of risk, as is the case for phishing sites which impersonate a trusted brand. In such cases, the website configuration (e.g. the use of a non-branded domain, or other measures to prevent the site being linked to from other sources) may be a deliberate choice by the infringer to avoid detection. The infringers may instead rely on methods other than search engines, such as the inclusion of links to the site in scam e-mails, to drive potential victims to the content.

In this article, we explore the use of a new comprehensive domain data-source to identify this otherwise hard-to-find web content.

Background and analysis

The database utilised in this study contains information relating to every single registered domain across the Internet, including technical and configuration information for each domain name (e.g. ownership and hosting details) and, crucially from a brand monitoring point of view, the webpage title and full HTML (i.e. webpage text) content of the homepage of each associated website. This rich dataset can be used for a range of purposes, including 'clustering' analyses which group domains together based on their shared characteristics and, as in this study, the identification of content which may otherwise remain undetected by other monitoring methods.

As an illustration of the monitoring capabilities offered by this dataset, we present our findings from a case study relating to an international banking brand. A simple query of the database reveals around 84k domains for which the brand name is mentioned somewhere in the content of the homepage of the associated website. In order to identify the most significant infringements, it is necessary to filter this set down to a much smaller 'shortlist' of findings of potential interest. In this study, starting from this initial set of results, the following filtering approach was used:

A secondary stage of filtering was carried out to retain only those examples on which a high-relevance keyword also appears in the HTML content (e.g. 'bank', 'login', 'invest', etc.). The use of this type of step is particularly important in cases where the brand name in question is a common or generic term in its own right and therefore may appear across a high volume of results in unrelated contexts.

The remaining results were then filtered again, to retain only those examples where one or both of the following conditions were met: (i) the domain name begins with an abbreviated form of the brand name in question (a common tactic used by fraudsters to create an appearance of authenticity without using the brand name explicitly); (ii) an exact match to the brand name in question appears within the first 100 characters of the webpage content (i.e. near the start of the page), as might be expected if the website is impersonating the brand.

Additionally, the analysis excluded examples where an exact match to the full brand name appears in the domain name, since these types of examples are more easily detectable through a classic domain-monitoring approach, and this study is focused explicitly on 'hard-to-find' content.
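The filtering stages described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the study: the record structure, the brand name 'norvale', the abbreviation 'nvl' and the keyword list are all hypothetical, and simple case-insensitive substring matching is assumed throughout.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass
class DomainRecord:
    """Hypothetical record: one registered domain plus its homepage text."""
    domain: str
    page_text: str


# Hypothetical values, chosen for illustration only.
BRAND = "norvale"
ABBREVIATION = "nvl"
KEYWORDS = ("bank", "login", "invest")
LEAD_CHARS = 100  # "near the start of the page"


def shortlist(
    records: Iterable[DomainRecord],
    brand: str = BRAND,
    abbreviation: str = ABBREVIATION,
    keywords: Sequence[str] = KEYWORDS,
    lead_chars: int = LEAD_CHARS,
) -> List[DomainRecord]:
    """Apply the filtering stages in sequence and return the shortlist."""
    kept = []
    for rec in records:
        domain = rec.domain.lower()
        text = rec.page_text.lower()

        # Initial query: the brand name is mentioned in the homepage content.
        if brand not in text:
            continue

        # Exclusion: the full brand name in the domain name is already
        # detectable through classic domain monitoring.
        if brand in domain:
            continue

        # Secondary stage: at least one high-relevance keyword is present.
        if not any(keyword in text for keyword in keywords):
            continue

        # Final stage: abbreviated brand at the start of the domain name,
        # and/or the brand name near the start of the page content.
        starts_with_abbreviation = domain.startswith(abbreviation)
        brand_near_start = brand in text[:lead_chars]
        if starts_with_abbreviation or brand_near_start:
            kept.append(rec)
    return kept
```

Plain substring tests are deliberately crude. In practice, a brand name containing spaces or punctuation would need normalising before comparison with a domain name, and keyword matching on word boundaries may reduce false positives.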

The simple filtering approach described above yielded a 'shortlist' of only 123 sites, from the initial set of around 84k 'candidate' results. Manual analysis of this shortlist revealed that 58 of them (i.e. almost half) comprised potential infringements of sufficient severity to warrant flagging and reporting from a brand protection perspective. Amongst these, seven were found to resolve to active impersonation and/or phishing websites (Figure 1), and approximately 18 more were inactive sites which appear (on the basis of the content of the page text) to have formerly resolved to similar high-threat content.

Figure 1: Anonymised screenshots of examples of active impersonation and/or phishing websites identified from the dataset

As such, the methodology seems to be highly effective at identifying significant infringements which would not necessarily be detectable through 'classic' monitoring techniques, and is thereby a valuable augmentation to such methods.

Take-aways and further work

The inclusion in the dataset of the full HTML content of the homepage of each registered domain name makes it a highly compelling data source for use in brand monitoring, offering the potential to identify infringing content which is not easily detectable by other means.

Part of the key to identifying relevant results from within large sets of 'candidate' pages of potential interest is the application of suitable filtering techniques. The analysis presented in this article has shown that a simple, keyword-based approach can be highly effective. However, additional work is also underway to identify other techniques which may be even more efficient at pinpointing the highest-priority content. These approaches may involve searches within specific fields of the domain data (such as the page title or the second-level domain name, i.e. the part of the domain name to the left of the dot), and it may also be appropriate to use 'fuzzy' matching to explicitly identify the use of brand variants or abbreviations.
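As an illustration of how a field-specific 'fuzzy' search might look, the sketch below compares a brand name against the second-level part of a domain name using the standard-library difflib module. It is one possible approach and not the technique under development. The function names, the 0.8 threshold and the example domains are assumptions, and the label extraction is naive in that it ignores multi-part suffixes such as .co.uk.

```python
from difflib import SequenceMatcher


def second_level_label(domain: str) -> str:
    """Return the label immediately left of the final dot (naive)."""
    parts = domain.lower().rstrip(".").split(".")
    return parts[-2] if len(parts) >= 2 else parts[0]


def best_similarity(brand: str, label: str) -> float:
    """Highest similarity between the brand and any brand-length window of the label."""
    brand = brand.lower()
    width = len(brand)
    if len(label) <= width:
        return SequenceMatcher(None, brand, label).ratio()
    return max(
        SequenceMatcher(None, brand, label[i:i + width]).ratio()
        for i in range(len(label) - width + 1)
    )


def is_fuzzy_variant(domain: str, brand: str, threshold: float = 0.8) -> bool:
    """Flag domains whose label resembles the brand without containing it exactly."""
    label = second_level_label(domain)
    if brand.lower() in label:
        return False  # exact matches are left to classic domain monitoring
    return best_similarity(brand, label) >= threshold
```

With the hypothetical brand 'norvale', a call such as is_fuzzy_variant("n0rvale-login.example", "norvale") would flag the character-substituted variant, whereas an unrelated domain would fall well below the threshold. The threshold would need tuning against real data, since a lower value catches more variants at the cost of more false positives.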

In addition, these types of approach may also benefit from the incorporation of AI-based methods. In our current research work, such tools are being applied in a range of ways. These include the automatic generation of search configurations (to be used when configuring brand monitoring tools) specifically intended to identify infringement types matching known patterns, the production of written summaries of webpage content to aid with the identification of particular website characteristics, and the automatic tagging and prioritisation of results based on their similarity to previously enforced infringements.

This article was first published on 8 April 2026 at:

https://www.iamstobbs.com/insights/experimenting-with-a-new-domain-data-source-to-identify-hard-to-find-web-content

Friday, 10 April 2026

Experimenting with a new domain data source to identify hard-to-find web content