Introduction
The issue of 'clustering' in brand protection - that is, the ability to flexibly identify the existence of links between disparate findings[1] from a brand monitoring solution - is one of the great unsolved problems in the industry[2].
Clustering has a number of key benefits. It identifies high-volume or serial infringers, who can serve as priority targets for enforcement and whose activity can help demonstrate 'bad faith'. It offers the potential for efficient bulk takedowns of groups of associated results in a single action. It also allows a full profile of the activity associated with a particular entity to be built through an OSINT (open-source intelligence)-style investigative approach[3].
In general, there are several characteristics of any finding/result from a brand monitoring programme which can serve as a basis for clustering, some of which will be dependent on the channel or type of content. Domain names are one of the 'richest' sources of such data points (many of which can be determined through standard look-ups), which can include features of the whois record[4] such as registrant (owner) and registrar contact details, hosting information (e.g. host IP address and hosting service provider), characteristics of the domain name itself (such as name patterns[5] and TLD[6]), and the providers of any MX (mail exchange) record(s) (allowing e-mail functionality) or SSL (secure sockets layer) certificate(s) (i.e. the authentication feature allowing the domain to use an https URL), in addition to features of any associated website. Many of these characteristics can also be relevant to other types of general Internet content, and other features may be applicable to content from other channels (such as seller names in e-commerce marketplace listings).
These features can also serve as the basis for quantifying the level of potential threat posed by an identified result. This can be a key step in prioritising the results (which may, in general, comprise a large dataset), and so in identifying the priority targets for further analysis, enforcement or content tracking[7,8].
Clustering analysis techniques
The simplest type of clustering analysis - and still the only one offered by many brand-protection service providers - is based on a single common characteristic of a particular type (i.e. associated with a specific single 'label', such as the registrant name or host IP address) shared across the set of results in question. For instance, if a group of sites all have the same registrant name, those sites can be taken to be connected to each other (provided that the registrant name is suitably distinctive). This very simple approach achieves nothing more than could be done through manual analysis (essentially, carrying out a series of 'reverse look-ups') and, while it can have value, that value is often limited.
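As a minimal sketch of this single-label approach, the following groups a set of findings by one named field. The findings, field names and the helper `cluster_by_field` are hypothetical and purely illustrative; they are not taken from any particular monitoring product.

```python
from collections import defaultdict

# Hypothetical findings: each is a mapping of field label -> value.
findings = [
    {"domain": "brand-outlet.example", "registrant": "J. Doe Trading"},
    {"domain": "brand-sale.example", "registrant": "J. Doe Trading"},
    {"domain": "cheap-brand.example", "registrant": "Acme Holdings"},
]

def cluster_by_field(findings, field):
    """Group findings that share an identical value in one named field."""
    groups = defaultdict(list)
    for finding in findings:
        value = finding.get(field)
        if value:  # skip findings where the field is missing or empty
            groups[value].append(finding["domain"])
    # Only a value shared by two or more findings constitutes a cluster.
    return {value: domains for value, domains in groups.items() if len(domains) > 1}

clusters = cluster_by_field(findings, "registrant")
```

This is effectively an automated 'reverse look-up' on one field, which is why its value is limited: it cannot see a link unless the shared value sits under the same label in every finding.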
Clustering becomes more insightful and useful if links can be drawn on the basis of identical (or similar) characteristics appearing in different fields (or labels) in the database of information associated with the set of 'candidate' findings to be analysed. For example, if a particular e-mail address appears in the whois record of some domains, but in the website content of a series of others, both groups of findings can reasonably be assumed to be associated with each other. However, these types of insights are generally much harder to obtain, essentially because it is not known in advance where the commonalities may appear. The situation may become even more complex if links must be followed in order to find the common features - e.g. crawling from a marketplace listing to the associated seller information page, to identify company names, addresses, telephone numbers, etc. These are the kinds of cases where artificial intelligence (AI) tools can potentially begin to provide value.
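The e-mail example above can be sketched as follows: addresses are extracted from every field of each finding, regardless of label, and findings sharing any address are merged using a simple union-find structure. The data, the field names (`whois`, `page_text`), the regular expression and the function names are all assumptions made for illustration, and the regular expression is deliberately simplistic.

```python
import re
from collections import defaultdict

# Simplistic e-mail pattern, adequate for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical findings, keyed by domain; each field holds free text.
findings = {
    "shop-a.example": {"whois": "Registrant email: badseller123@gmail.com",
                       "page_text": "Welcome to our store"},
    "shop-b.example": {"whois": "REDACTED FOR PRIVACY",
                       "page_text": "Contact: BadSeller123@gmail.com"},
    "shop-c.example": {"whois": "Registrant email: other@example.org",
                       "page_text": "No contact details"},
}

def extract_emails(record):
    """Collect lower-cased e-mail addresses from every field of a finding."""
    emails = set()
    for text in record.values():
        emails.update(match.lower() for match in EMAIL_RE.findall(text))
    return emails

def cluster_across_fields(findings):
    """Merge findings that share an e-mail address in any field."""
    parent = {name: name for name in findings}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # e-mail -> first finding in which it appeared
    for name, record in findings.items():
        for email in extract_emails(record):
            if email in first_seen:
                parent[find(name)] = find(first_seen[email])
            else:
                first_seen[email] = name

    clusters = defaultdict(set)
    for name in findings:
        clusters[find(name)].add(name)
    return [members for members in clusters.values() if len(members) > 1]

linked = cluster_across_fields(findings)
```

Here `shop-a.example` and `shop-b.example` are linked even though the address appears in the whois record of one and only in the page content of the other. Because the merging is transitive, the same structure extends to chains of links across several feature types, although this sketch does not cover crawling to linked pages.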
Specific requirements of an AI tool to carry out clustering analysis
Beyond even the initial complexity described above in constructing an effective clustering tool, there are a number of additional points to consider:
- Distinctiveness / reliability of the features used as the basis of clustering - Some characteristics of a result will be more reliable than others as a basis for clustering that result with others sharing the same characteristic. Features such as e-mail addresses and telephone numbers are (generally) highly distinctive, unique and diagnostic. Others, such as seller names (especially if relatively generic and identified across different platforms) and host IP addresses (in cases where multiple different web-hosting customers may share the use of a single webserver), may be less so. At the other end of the scale, features such as the use of a common TLD (e.g. a group of sites which share only a common extension such as .com), reference to a common privacy-protection service provider in the whois record, or the observation that a group of domains simply happen to have been registered on the same day may, in isolation (i.e. unless other characteristics suggesting a link are also present), be poor indicators of an actual connection between the findings. Accordingly, any clustering tool will need to take account of the differences between the various possible clustering criteria, and 'weight' their contribution to the overall 'strength' of any asserted potential link.
- Identification of variants - In many cases, even when results are linked, the pieces of information pertaining to that link may be presented in different formats across the various findings, so any clustering tool will need to take a 'fuzzy' approach to its matching. For example, the same telephone number may be presented in a variety of ways (e.g. "01223 435240", "01223435240", "01223 435 240", "+44 (0)1223 435240", "44 1223 435240", etc.). Similarly, a particular company name may be presented differently in distinct contexts (e.g. the registrar / hosting provider GoDaddy might variably be cited as "GoDaddy", "Godaddy.com", "GoDaddy.com, LLC", etc.). In some cases, depending on the nature of the variations, the entities might be better considered to be distinct anyway (e.g. "Alibaba Cloud LLC" vs "Alibaba Cloud (Singapore) Private Limited"). A further complication is that (for example) "badseller123@gmail.com" and "badseller123@qq.com" may or may not relate to the same actual entity.
- Analysis of rich content types - Further difficulties arise from the fact that Internet content is becoming increasingly 'rich' (in terms of the ways in which data can be presented) and any truly comprehensive clustering tool would ideally need to be able to interrogate all of these areas of content. Examples for consideration might include text, imagery or audio content embedded in pictures or videos (say, text displayed as a watermark), potentially requiring features such as image analysis, optical character recognition (OCR), etc.
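The first two points above (weighting and variant matching) can be illustrated with a short sketch. It normalises the telephone-number variants quoted above to a single form, then combines matching features into an overall link strength. Everything here is an assumption made for illustration: the weights are arbitrary placeholders rather than calibrated values, the default country code of 44 is assumed, the 'noisy-OR' combination rule is just one possible choice, and the names `normalise_phone`, `link_strength` and `WEIGHTS` are hypothetical.

```python
import re

def normalise_phone(raw, default_cc="44"):
    """Reduce a phone number to digits with a country code (assumed default: 44)."""
    text = raw.replace("(0)", "")        # drop the optional trunk-prefix marker
    digits = re.sub(r"\D", "", text)     # keep digits only
    if digits.startswith("00"):
        digits = digits[2:]              # strip international dialling prefix
    elif digits.startswith("0"):
        digits = default_cc + digits[1:] # national format -> add country code
    return digits

# Illustrative weights only: higher means a more distinctive feature.
WEIGHTS = {
    "email": 0.9,
    "phone": 0.9,
    "host_ip": 0.4,
    "seller_name": 0.3,
    "registration_date": 0.05,
    "tld": 0.02,
}

def link_strength(a, b, weights=WEIGHTS):
    """Combine matching features as 1 - product(1 - weight) ('noisy-OR')."""
    miss = 1.0
    for field, weight in weights.items():
        value_a, value_b = a.get(field), b.get(field)
        if value_a and value_b and value_a == value_b:
            miss *= 1.0 - weight
    return 1.0 - miss

# Hypothetical findings; IP addresses are from a documentation range.
site_a = {"phone": normalise_phone("01223 435240"), "tld": "com", "host_ip": "192.0.2.10"}
site_b = {"phone": normalise_phone("+44 (0)1223 435240"), "tld": "com", "host_ip": "192.0.2.77"}
site_c = {"tld": "com"}

strong = link_strength(site_a, site_b)  # shared phone and TLD
weak = link_strength(site_a, site_c)    # shared TLD only
```

With these placeholder weights, a shared telephone number dominates the score, while a shared TLD alone contributes almost nothing, mirroring the distinction drawn above between diagnostic and weak features. Company-name variants and the rich-content analysis described in the final point would need considerably more than this kind of rule-based normalisation.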
Conclusion
The construction of a truly effective clustering tool able to take account of all the factors discussed in this article is likely to be an extremely difficult problem to solve. However, appropriate application of AI capabilities may be able to provide a stepwise approach towards addressing the issue.
The benefits of successfully doing so will be enormous, potentially building insights and efficiencies into the processes of brand protection monitoring, analysis and enforcement which are essentially not available through any 'classic' approaches. Any service provider able to put a compelling solution of this nature in place in the short to medium term - particularly if it also offers other attractive AI or machine-learning features, such as the option for automatic 'tuning' of search parameters to identify and categorise the most significant results, being able to be 'trained' based on analyst feedback on the quality of the outputs, or the implementation of semi-automated enforcement notice production and sending - may find themselves a long way ahead of their field of mainstream competitors.
References
[1] In referring to a 'finding', in this context I refer to any single result (such as a website / its associated URL) identified via a brand monitoring product or service configured to search the Internet for material of potential interest or concern.
[2] https://circleid.com/posts/20230525-the-millennium-problems-in-brand-protection
[3] 'Patterns in Brand Monitoring' (D.N. Barnett, Business Expert Press, 2025), Chapter 6: 'Result clustering'
[4] The 'whois' record of a domain gives technical configuration and ownership information for that domain.
[5] https://www.iamstobbs.com/opinion/health-scam-websites-identifying-related-domains-using-clustering-techniques
[6] The TLD (top-level domain) is the domain name extension - i.e. the part of the name after the dot.
[7] 'Patterns in Brand Monitoring' (D.N. Barnett, Business Expert Press, 2025), Chapter 5: 'Prioritization criteria for specific types of content'
[8] 'Patterns in Brand Monitoring' (D.N. Barnett, Business Expert Press, 2025), Chapter 3: 'Brand content scoring'
This article was first published on 5 March 2025 at: