David Barnett's Brand Protection Articles: E-mail address extraction from webpages: a quick case study in result 'clustering'

Introduction

The concept of result 'clustering' - that is, the ability to establish connections between online brand monitoring findings not previously known to be linked - has been discussed previously as a key element of the analysis process in brand protection.

It can allow the identification of key targets for further investigation or enforcement, and assist in building a fuller picture of the identity and activities of the entity(ies) behind the web-content in question, as part of an open-source intelligence (OSINT)-style investigative approach^[1,2,3].

In this article, we focus specifically on the case of e-mail addresses as the data points on which clustering analysis can be based. The presented findings are derived from a process of data analysis involving the automated extraction of contact e-mail addresses from a series of webpages of potential interest, and the associated discussion shows how insights can be derived from the dataset.

Analysis

The dataset used in this case study is a set of domains of potential interest to a fashion brand, as identified through analysis of domain name zone files, which are data files containing the names of all registered domains across each TLD (top-level domain, or domain extension). The search was run using an analysis script configured to identify all domains containing the name of the brand in question, thereby simulating the process of collection of results by a full formal automated domain-monitoring service.
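The zone-file search described above can be sketched as follows. This is a minimal illustration rather than the script actually used: it assumes each zone-file record starts with a fully qualified owner name (e.g. 'examplebrandbag.store.'), and the function name and sample records are hypothetical.

```python
from typing import Iterable, Iterator


def find_brand_domains(zone_lines: Iterable[str], brand: str) -> Iterator[str]:
    """Yield each distinct domain name that contains the brand string."""
    brand = brand.lower()
    seen = set()
    for line in zone_lines:
        line = line.strip()
        # Skip blank lines, comments and directives such as $ORIGIN or $TTL.
        if not line or line.startswith((";", "$")):
            continue
        # Assumption: the owner name is the first whitespace-separated field.
        domain = line.split()[0].rstrip(".").lower()
        if brand in domain and domain not in seen:
            seen.add(domain)
            yield domain


# Hypothetical zone-file excerpt (a domain usually has several records).
sample = [
    "; example records",
    "examplebrandbag.store. 3600 in ns ns1.example.net.",
    "examplebrandbag.store. 3600 in ns ns2.example.net.",
    "unrelated.store. 3600 in ns ns1.example.net.",
]
print(list(find_brand_domains(sample, "examplebrand")))
```

A plain substring match of this kind will also return names in which the brand string appears by coincidence inside a longer word, which is one reason the filtering and prioritisation stage described next is needed.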

For the brand under consideration (the name of which has simply been replaced, for confidentiality, by the string '[brand]' in all examples which follow), the initial searches generated over 16,000 brand-specific domain names of potential interest. Simple analysis techniques (as discussed in previous articles) can be used to carry out an initial stage of filtering and prioritisation of these results, to identify those sites most likely to be of interest. These techniques might typically include the calculation of 'risk scores' based on characteristics of the domains themselves, or of the content of any associated websites (in cases where a live site is present)^[4,5]. This initial analysis allowed the production of a focused sub-dataset of around 4,500 domains most likely to be of greatest interest to the brand owner in question, based on the presence and prominence of the brand name and associated relevance keywords in the domain name itself and/or on the associated website.

The basic step of the subsequent analysis was to inspect the HTML content of the website hosted on each domain in the prioritised subset and, using an automated script, to extract from the page any text strings matching the format of an e-mail address (where present). The aim was to identify any contact addresses cited on each of the sites, and thereby to identify any commonalities or similarities in usage.
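As an illustration of this extraction step, the sketch below pulls e-mail-like strings out of a page's HTML using a regular expression. The pattern and the function name are assumptions made for this example, not the script used in the study, and the pattern is a deliberate approximation rather than a full validator.

```python
import html
import re

# Simple approximation of the format of an e-mail address: a username,
# an '@', one or more host labels, and an alphabetic final label.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def extract_emails(page_html: str) -> set[str]:
    """Return the distinct, lower-cased e-mail-like strings found in a page."""
    # Decode HTML entities first, so that '&#64;' is read as '@'.
    text = html.unescape(page_html)
    return {match.lower() for match in EMAIL_RE.findall(text)}


# Hypothetical page fragment.
page = '<a href="mailto:info@example.org">Contact</a> or sales&#64;example.com'
print(sorted(extract_emails(page)))
```

A pattern this loose will also match some strings which are not e-mail addresses, such as image filenames of the form 'logo@2x.png', so the extracted values still need to be reviewed or filtered before use.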

At least one e-mail address was identified in the content of just over 1,000 of the sites in question (considering only the homepage of each site). The analysis focused on those e-mail addresses in which the 'host' part (i.e. the part after the '@') differed from the domain name of the website on which the address was found; addresses hosted on the site's own domain were deemed to be 'site-specific' contact details.
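The host comparison described above might look like the following sketch. The function name is hypothetical, and the treatment of a leading 'www.' and of subdomains as belonging to the site's own domain is an assumption made for this example.

```python
def is_offsite_address(email: str, site_domain: str) -> bool:
    """Return True if the e-mail host differs from the site's own domain."""
    host = email.rsplit("@", 1)[-1].lower().rstrip(".")
    site = site_domain.lower().rstrip(".")
    if site.startswith("www."):
        site = site[4:]
    # Assumption: the site's own domain, and any subdomain of it, both
    # count as the same host for the purposes of the comparison.
    return not (host == site or host.endswith("." + site))


# Hypothetical examples.
print(is_offsite_address("someone@gmail.com", "examplebrand-shop.com"))
print(is_offsite_address("info@examplebrand-shop.com", "www.examplebrand-shop.com"))
```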

The most obvious links which can be established are those cases in which the same e-mail address was found to be used on more than one distinct site in the dataset, which may otherwise not obviously have been known to be linked.
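Finding addresses shared between sites amounts to inverting the mapping from site to e-mail addresses. A minimal sketch, assuming the extraction results are held in a dictionary keyed by site (the function name and sample data are hypothetical):

```python
from collections import defaultdict


def cluster_by_email(site_emails: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each e-mail address to the sites citing it (two or more sites only)."""
    sites_by_email = defaultdict(set)
    for site, emails in site_emails.items():
        for email in emails:
            sites_by_email[email.lower()].add(site)
    return {
        email: sorted(sites)
        for email, sites in sites_by_email.items()
        if len(sites) > 1
    }


# Hypothetical extraction results.
results = {
    "examplebag.vip": {"shopowner123@gmail.com"},
    "examplebag.store": {"ShopOwner123@gmail.com"},
    "example-lighting.com": {"someoneelse@qq.com"},
}
print(cluster_by_email(results))
```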

In some of these cases, the distinct sites on which a particular e-mail address was found were themselves found to share a common SLD (second-level name, i.e. the part of the domain name to the left of the dot), such that it would have been relatively straightforward to establish a link even in the absence of the common e-mail address. Some such examples from the dataset (with domain names and e-mail addresses obfuscated in each case) include:

[brand]bag.vip and [brand]bag.store - e-mail address: camarendale9XXX[at]gmail.com
[brand]vix.com and [brand]vix.shop - e-mail address: ryanmi0XXX[at]gmail.com
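Links of this SLD-based kind can be found without reference to the e-mail addresses at all, by grouping the domains on their SLD. The sketch below assumes that every input is a registered domain name with a single-label TLD (such as '.vip' or '.store'); names under multi-label suffixes such as '.co.uk' would need a public-suffix list instead. The function name is hypothetical.

```python
from collections import defaultdict


def group_by_sld(domains: list[str]) -> dict[str, list[str]]:
    """Group domains by SLD, keeping only SLDs seen on two or more domains."""
    groups = defaultdict(list)
    for domain in domains:
        # Assumption: the SLD is everything to the left of the first dot.
        sld = domain.lower().split(".", 1)[0]
        groups[sld].append(domain)
    return {sld: names for sld, names in groups.items() if len(names) > 1}


# Hypothetical domain names.
print(group_by_sld(["examplebag.vip", "examplebag.store", "examplevix.com"]))
```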

However, in other cases, the common e-mail address may be the only basis on which a link between the sites in question could easily be established, e.g.:

my[brand]photos.com and omaha[brand].com - e-mail address: whatsyour[brand][at]gmail.com (Figure 1)
i-[brand]lightingonline.com and [brand]malls.com - e-mail address: 2853583XXX[at]qq.com

Figure 1: Screenshots from two sites found to be linked on the basis of the use of a common e-mail address

In other cases, the 'host' part of the common e-mail address may also reveal the identity of an additional domain name which is linked to the first two, e.g.:

art-[brand].com and art[brand]dz.com - e-mail address: contact[at]art[brand].com
XX[brand]nails.eu and XX[brand]usa.com - e-mail addresses: helpdesk[at]XX[brand]nails.com and james[at]XX[brand]nails.com
e-casa[brand].com and casa[brand]contract.gr - e-mail address: info[at]casa[brand].gr
[brand]zeitde.com and [brand]zeitde.shop - e-mail address: info[at][brand]zeit.com
[brand]tailorhk.com and [brand]tailors.com - e-mail address: [brand][at][brand]tailor.com
n-[brand].com and n-[brand].net - e-mail address: care[at]usaglobalXXX.org
ceramica[brand].com and ceramica[brand].it - e-mail address: info[at]gruppobarXXX.com
[brand]movies.com and [brand]sf.com - e-mail address: 94115adam[at]cinemaXXX.com
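One way of surfacing such additional domains is to take the host of each shared e-mail address and keep it if it is neither a shared mail provider nor one of the sites already in the cluster. This is a sketch only: the provider list is a short, non-exhaustive assumption, and the function name and sample data are hypothetical.

```python
# Hypothetical, non-exhaustive list of shared mail providers to ignore.
SHARED_PROVIDERS = {"gmail.com", "yahoo.com", "outlook.com", "qq.com"}


def additional_domains(clusters: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each new domain revealed by an e-mail host to the sites it links."""
    extra: dict[str, set[str]] = {}
    for email, sites in clusters.items():
        host = email.rsplit("@", 1)[-1].lower()
        known_sites = {site.lower() for site in sites}
        if host in SHARED_PROVIDERS or host in known_sites:
            continue
        # Merge, so that several addresses on one host give a single entry.
        extra.setdefault(host, set()).update(sites)
    return {host: sorted(sites) for host, sites in extra.items()}


# Hypothetical clusters of sites keyed by shared e-mail address.
clusters = {
    "contact@art-example.com": ["artexample.com", "art-example-dz.com"],
    "someone@gmail.com": ["examplebag.vip", "examplebag.store"],
}
print(additional_domains(clusters))
```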

It may also then be possible to determine further information on the underlying entity, by carrying out further searches for other online references to the common pieces of information (i.e. OSINT research). It is, however, worth noting that some e-mail addresses appearing on multiple sites may simply relate to (say) a particular service provider which just happens to have been used by the owners of each of the websites in question, but where the sites themselves may be otherwise unrelated. One such example might be the presence of contact details pertaining to the associated domain registrar, such as (from the dataset used) filler[at]godaddy.com or support[at]goldenname.com. This point highlights the importance of reviewing individual findings for relevance and significance, before asserting the presence of a definitive link.
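A simple safeguard against such false links is to set aside any cluster whose e-mail host belongs to a known service provider, so that it can be reviewed separately rather than treated as evidence of common ownership. In the sketch below, the two hosts are those named above; a working list would be longer, and the function name is hypothetical.

```python
# Hosts named in the text as registrar contact details; a real list
# would need to be considerably longer.
SERVICE_PROVIDER_HOSTS = {"godaddy.com", "goldenname.com"}


def split_provider_clusters(clusters: dict[str, list[str]]):
    """Split clusters into (candidate links, probable service-provider matches)."""
    candidates, provider_matches = {}, {}
    for email, sites in clusters.items():
        host = email.rsplit("@", 1)[-1].lower()
        target = provider_matches if host in SERVICE_PROVIDER_HOSTS else candidates
        target[email] = sites
    return candidates, provider_matches


# Hypothetical clusters of sites keyed by shared e-mail address.
clusters = {
    "filler@godaddy.com": ["example-one.com", "example-two.com"],
    "shopowner123@gmail.com": ["examplebag.vip", "examplebag.store"],
}
candidates, provider_matches = split_provider_clusters(clusters)
print(candidates)
print(provider_matches)
```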

In certain cases where an e-mail username (i.e. just the part of the e-mail address to the left of the '@') is particularly distinctive, searches based on this characteristic alone might be sufficient to establish a link.
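The same grouping idea can be applied to the username alone. In the sketch below, 'distinctive' is approximated by a minimum length and by the exclusion of generic usernames; both criteria, and the function name, are assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical list of generic usernames which carry little linking value.
GENERIC_USERNAMES = {"info", "contact", "support", "sales", "admin", "helpdesk"}


def cluster_by_username(
    site_emails: dict[str, set[str]], min_length: int = 8
) -> dict[str, list[str]]:
    """Map each distinctive username to the sites on which it appears (2+ only)."""
    sites_by_user = defaultdict(set)
    for site, emails in site_emails.items():
        for email in emails:
            username = email.split("@", 1)[0].lower()
            if len(username) >= min_length and username not in GENERIC_USERNAMES:
                sites_by_user[username].add(site)
    return {
        user: sorted(sites)
        for user, sites in sites_by_user.items()
        if len(sites) > 1
    }


# Hypothetical extraction results: one username used with two providers.
results = {
    "example-one.com": {"distinctname77@gmail.com"},
    "example-two.com": {"distinctname77@outlook.com"},
    "example-three.com": {"info@example-mail.com"},
}
print(cluster_by_username(results))
```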

Finally, it is also worth noting that the identity of the e-mail address provider can yield its own insights in some cases: addresses from webmail providers such as yahoo.com and outlook.com, or messaging services such as qq.com, are used less frequently by larger legitimate businesses.
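A provider-based indicator of this kind could be derived per site as in the sketch below. The host list contains only providers mentioned in this article and is not exhaustive, and the function name is hypothetical.

```python
# Webmail and messaging hosts mentioned in the article; not exhaustive.
WEBMAIL_OR_MESSAGING_HOSTS = {"gmail.com", "yahoo.com", "outlook.com", "qq.com"}


def webmail_indicator(site_emails: dict[str, set[str]]) -> dict[str, bool]:
    """Flag each site which cites at least one webmail or messaging address."""
    return {
        site: any(
            email.rsplit("@", 1)[-1].lower() in WEBMAIL_OR_MESSAGING_HOSTS
            for email in emails
        )
        for site, emails in site_emails.items()
    }


# Hypothetical extraction results.
results = {
    "example-one.com": {"someone@qq.com"},
    "example-two.com": {"info@example-group.com"},
}
print(webmail_indicator(results))
```

A flag of this sort is best treated as one input among several rather than as a conclusion in itself, since many small legitimate businesses also use webmail addresses.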

Conclusion

This brief case study has highlighted the potential usefulness of e-mail addresses - features which are essentially unique to a particular entity, and which can be extracted directly from the content of a website through the use of a simple script or 'scraper' - as a means of establishing links between results. The identification of connections between findings can be a key part of the process of identifying serial infringers, or entities warranting prioritised analysis, and can serve as a 'start-point' for deeper open-source investigations into entities and their associated activities.

Beyond this, insights drawn from the e-mail addresses themselves can also feed into more general algorithms used for quantifying the overall level of potential risk (e.g. non-authenticity) of a website. Characteristics such as the use of e-mail addresses from webmail providers and instant messaging services, for example, are less commonly associated with mainstream corporate entities, and can be indicators of higher risk.

References

[1] 'Patterns in Brand Monitoring' (D.N. Barnett, Business Expert Press, 2025), Chapter 6: 'Result clustering'

[2] https://circleid.com/posts/braive-new-world-part-1-brand-protection-clustering-as-a-candidate-task-for-the-application-of-ai-capabilities

[3] https://www.iamstobbs.com/insights/using-clustering-and-investigation-techniques-to-connect-and-identify-scam-law-firm-websites

[4] https://circleid.com/posts/towards-a-generalised-threat-scoring-framework-for-prioritising-results-from-brand-monitoring-programmes

[5] https://www.iamstobbs.com/insights/exploring-a-domain-scoring-system-with-tricky-brands

This article was first published on 31 July 2025 at:

https://www.iamstobbs.com/insights/e-mail-address-extraction-from-webpages-a-quick-case-study-in-result-clustering