David Barnett's Brand Protection Articles: 2025

Monday, 4 August 2025

E-mail address extraction from webpages: a quick case study in result 'clustering'

Introduction

The concept of result 'clustering' - that is, the ability to establish connections between online brand monitoring findings not previously known to be linked - has been discussed previously as a key element of the analysis process in brand protection.

It can allow the identification of key targets for further investigation or enforcement, and assist in building a fuller picture of the identity and activities of the entity(ies) behind the web-content in question, as part of an open-source intelligence (OSINT)-style investigative approach^[1,2,3].

In this article, we focus specifically on the case of e-mail addresses as the data points on which clustering analysis can be based. The presented findings are derived from a process of data analysis involving the automated extraction of contact e-mail addresses from a series of webpages of potential interest, and the associated discussion shows how insights can be derived from the dataset.

Analysis

The dataset used in this case study is a set of domains of potential interest to a fashion brand, as identified through analysis of domain name zone files, which are data files containing the names of all registered domains across each TLD (top-level domain, or domain extension). The search was run using an analysis script configured to identify all domains containing the name of the brand in question, thereby simulating the process of collection of results by a full formal automated domain-monitoring service.
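As a rough illustration of the zone-file search described above, the following minimal sketch collects the registered domains whose second-level name contains a given brand string. It assumes (this is not stated in the article) that each zone-file line begins with a fully-qualified owner name and that the TLD is a single label; the function name `brand_domains` is hypothetical.

```python
from typing import Iterable


def brand_domains(zone_lines: Iterable[str], brand: str) -> set[str]:
    """Collect registered domains whose second-level name contains `brand`."""
    needle = brand.lower()
    matches: set[str] = set()
    for line in zone_lines:
        fields = line.split()
        if not fields or fields[0].startswith(";"):
            continue  # skip blank lines and comments
        # Assumption: the first field is the owner name, e.g. "example.com."
        labels = fields[0].rstrip(".").lower().split(".")
        if len(labels) < 2:
            continue  # the TLD apex or a directive, not a registered domain
        sld, tld = labels[-2], labels[-1]
        if needle in sld:
            matches.add(f"{sld}.{tld}")
    return matches
```

A simple sub-string match of this kind will also return unrelated names which merely happen to contain the brand string, which is one reason why a subsequent filtering and prioritisation stage is needed.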

For the brand under consideration (the name of which has simply been replaced, for confidentiality, by the string '[brand]' in all examples which follow), the initial searches generated over 16,000 brand-specific domain names of potential interest. Simple analysis techniques (as discussed in previous articles) can be used to carry out an initial stage of filtering and prioritisation of these results, to identify those sites most likely to be of interest. These techniques might typically include the calculation of 'risk scores' based on characteristics of the domains themselves, or of the content of any associated websites (in cases where a live site is present)^[4,5]. This initial analysis allowed the production of a focused sub-dataset of around 4,500 domains most likely to be of greatest interest to the brand owner in question, based on the presence and prominence of the brand name and associated relevance keywords in the domain name itself and/or on the associated website.

The basic step of the subsequent analysis was to inspect the (HTML) content of each of the domains from the prioritised subset and (using an automated script) extract from the page any text-string(s) matching the format of an e-mail address (where present), with a view to identifying any contact addresses cited on each of the sites, and thereby identify any commonalities or similarities in usage.

At least one e-mail address was identified in the content of just over 1,000 of the sites in question (focusing specifically on the homepages of the sites in each case). The analysis focused on those e-mail addresses in which the 'host' part of the e-mail address (i.e. the part after the '@') was different from the domain name of the particular website on which the e-mail address was identified (deemed to be 'site-specific' contact details).
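The extraction and filtering steps might be sketched along the following lines. This is a minimal illustration rather than the script actually used: the regular expression is a deliberately simplified pattern (an assumption, not a full address-syntax matcher), and the function names are hypothetical.

```python
import re

# Simplified pattern for strings shaped like e-mail addresses. It will miss
# obfuscated forms (e.g. "name[at]host") and may also pick up look-alike
# strings such as image filenames of the form "logo@2x.png".
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(html: str) -> set[str]:
    """Return the distinct e-mail-like strings found in a page, lower-cased."""
    return {match.lower() for match in EMAIL_RE.findall(html)}


def off_domain_emails(site_domain: str, html: str) -> set[str]:
    """Keep only addresses whose host differs from the site's own domain."""
    site = site_domain.lower().removeprefix("www.")
    return {e for e in extract_emails(html) if e.split("@", 1)[1] != site}
```

For a page on example.com containing both info@example.com and shop123@gmail.com, `off_domain_emails` would retain only the latter.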

The most obvious links which can be established are those cases in which the same e-mail address was found to be used on more than one distinct site in the dataset, which may otherwise not obviously have been known to be linked.
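In code terms, this amounts to inverting the site-to-address mapping and retaining only those addresses seen on more than one site. The sketch below assumes the extracted addresses are already held in a dictionary keyed by site; the name `cluster_by_email` is hypothetical.

```python
from collections import defaultdict


def cluster_by_email(site_emails: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each e-mail address to the sites citing it, keeping only shared ones."""
    sites_by_email: dict[str, set[str]] = defaultdict(set)
    for site, emails in site_emails.items():
        for email in emails:
            sites_by_email[email].add(site)
    # An address found on a single site establishes no link, so drop it.
    return {email: sites for email, sites in sites_by_email.items() if len(sites) > 1}
```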

In some of these cases, the distinct sites on which a particular e-mail address was found were themselves found to share a common SLD (second-level name, i.e. the part of the domain name to the left of the dot), such that it would have been relatively straightforward to establish a link even in the absence of the common e-mail address. Some such examples from the dataset (with domain names and e-mail addresses obfuscated in each case) include:

[brand]bag.vip and [brand]bag.store - e-mail address: camarendale9XXX[at]gmail.com
[brand]vix.com and [brand]vix.shop - e-mail address: ryanmi0XXX[at]gmail.com

However, in other cases, the common e-mail address may be the only basis on which a link between the sites in question could easily be established, e.g.:

my[brand]photos.com and omaha[brand].com - e-mail address: whatsyour[brand][at]gmail.com (Figure 1)
i-[brand]lightingonline.com and [brand]malls.com - e-mail address: 2853583XXX[at]qq.com

Figure 1: Screenshots from two sites found to be linked on the basis of the use of a common e-mail address

In other cases, the 'host' part of the common e-mail address may also reveal the identity of an additional domain name which is linked to the first two, e.g.:

art-[brand].com and art[brand]dz.com - e-mail address: contact[at]art[brand].com
XX[brand]nails.eu and XX[brand]usa.com - e-mail addresses: helpdesk[at]XX[brand]nails.com and james[at]XX[brand]nails.com
e-casa[brand].com and casa[brand]contract.gr - e-mail address: info[at]casa[brand].gr
[brand]zeitde.com and [brand]zeitde.shop - e-mail address: info[at][brand]zeit.com
[brand]tailorhk.com and [brand]tailors.com - e-mail address: [brand][at][brand]tailor.com
n-[brand].com and n-[brand].net - e-mail address: care[at]usaglobalXXX.org
ceramica[brand].com and ceramica[brand].it - e-mail address: info[at]gruppobarXXX.com
[brand]movies.com and [brand]sf.com - e-mail address: 94115adam[at]cinemaXXX.com
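A check of this kind could be sketched as follows, assuming a mapping from each shared e-mail address to the (lower-case) sites on which it appears. The function name and the `ignore_hosts` parameter (intended for webmail providers, registrars and similar shared services) are hypothetical.

```python
def host_revealed_domains(
    clusters: dict[str, set[str]], ignore_hosts: set[str]
) -> dict[str, set[str]]:
    """Map each e-mail host not already in its cluster to the sites citing it."""
    revealed: dict[str, set[str]] = {}
    for email, sites in clusters.items():
        host = email.rsplit("@", 1)[-1].lower()
        if host in ignore_hosts or host in sites:
            continue  # shared provider, or a domain already known to be linked
        revealed.setdefault(host, set()).update(sites)
    return revealed
```

Any host returned in this way is only a candidate for a further linked domain, and would still need manual review before a connection is asserted.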

It may also then be possible to determine further information on the underlying entity, by carrying out further searches for other online references to the common pieces of information (i.e. OSINT research). It is, however, worth noting that some e-mail addresses appearing on multiple sites may simply relate to (say) a particular service provider which just happens to have been used by the owners of each of the websites in question, but where the sites themselves may be otherwise unrelated. One such example might be the presence of contact details pertaining to the associated domain registrar, such as (from the dataset used) filler[at]godaddy.com or support[at]goldenname.com. This point highlights the importance of reviewing individual findings for relevance and significance, before asserting the presence of a definitive link.

In certain cases where an e-mail username (i.e. just the part of the e-mail address to the left of the '@') is particularly distinctive, searches based on this characteristic alone might be sufficient to establish a link.

Finally, it is also worth noting that the identity of the e-mail address provider can yield its own insights in some cases, with addresses from webmail providers such as yahoo.com and outlook.com, or messaging services such as qq.com, found less frequently to be utilised by larger legitimate businesses.

Conclusion

This brief case study has highlighted the potential usefulness of e-mail addresses - features which are essentially unique to a particular entity, and which can be extracted directly from the content of a website through the use of a simple script or 'scraper' - as a means of establishing links between results. The identification of connections between findings can be a key part of the process of identifying serial infringers, or entities warranting prioritised analysis, and can serve as 'start-points' for deeper open-source investigations into entities and their associated activities.

Beyond this, insights drawn from the e-mail addresses themselves can also feed into more general algorithms used for quantifying the overall level of potential risk (e.g. non-authenticity) of a website. Characteristics such as the use of e-mail addresses from webmail providers and instant messaging services, for example, are less usually associated with mainstream corporate entities, and can be indicators of higher risk.

References

[1] 'Patterns in Brand Monitoring' (D.N. Barnett, Business Expert Press, 2025), Chapter 6: 'Result clustering'

[2] https://circleid.com/posts/braive-new-world-part-1-brand-protection-clustering-as-a-candidate-task-for-the-application-of-ai-capabilities

[3] https://www.iamstobbs.com/insights/using-clustering-and-investigation-techniques-to-connect-and-identify-scam-law-firm-websites

[4] https://circleid.com/posts/towards-a-generalised-threat-scoring-framework-for-prioritising-results-from-brand-monitoring-programmes

[5] https://www.iamstobbs.com/insights/exploring-a-domain-scoring-system-with-tricky-brands

This article was first published on 31 July 2025 at:

https://www.iamstobbs.com/insights/e-mail-address-extraction-from-webpages-a-quick-case-study-in-result-clustering

Friday, 25 July 2025

The commonest domain features: constructing look-up tables for use as part of a domain risk scoring system

Many previous pieces of research have focused on the desirability of a comprehensive scoring system, to be used for ranking results identified as part of a brand-protection solution, according to their potential level of threat. Such scoring systems offer the capability for identifying prioritised targets for further analysis, content tracking or enforcement actions^[1,2].

In a recent Stobbs study^[3], we considered the case of a basic scoring system for domain-name results - a key category of findings because of the possibility for relatively comprehensive monitoring, the high online visibility of related infringements, the explicit nature of any associated IP abuse and the greater range of options for enforcement^[4]. The algorithm presented in the initial study focused on characteristics of the domain name itself, taking account of factors such as the location and context within the domain name of the brand name of interest, the presence of relevance or non-relevance keywords, and the proportion of the domain name composed of other characters. This technique allows for an initial filtering of the list of candidate domain names of interest, and can be augmented by a second stage of filtering to take account of the content of any associated webpage, considering factors such as the number and prominence of mentions of the brand name, and the presence in the site content of relevance keywords.

What this initial algorithm does not encompass is any consideration of other technical or configuration factors associated with the domain, comprising any of a number of features which can also provide some indication of its likely potential level of risk. Examples of such characteristics considered in previous studies include the TLD (top-level domain, or domain extension) (working on the basis that some TLDs are more popular with infringers than others, due to factors such as cost, ease of registration, the presence of IP protection programmes, and the ease of enforcement)^[5], and the host IP address (based on the assertion that websites hosted at (or near) IP addresses containing a large number of other 'bad' or blacklisted websites are themselves more likely to pose a risk)^[6].

Overall, the set of relevant potential characteristics for assessing possible risk includes the TLD and the identity of associated domain service providers such as the registrar, hosting provider and nameserver host. These types of providers are typically associated with differing levels of 'trust', connected to factors such as compliance with enforcement requests and popularity with infringers^[7]. As such, the use of a provider showing a greater degree of association with previously known 'bad' sites arguably provides an indication that any other arbitrary site associated with the same provider is - other factors being equal - more likely to be associated with a greater degree of risk.

The basic methodology for constructing a threat-score algorithm on this basis thereby involves collating a large database of known bad sites (identified by - for example - comparison of website templates with those used by previously identified infringing sites, or by analysis and verification (as infringing) of results identified through a brand monitoring service), and extracting the features of interest for these known 'bad' sites. This process makes it possible to create 'league tables' of the top features and providers which tend more frequently to be associated with infringing sites.

One key point to note, however, is that merely the association of large numbers of infringing sites with a particular domain characteristic does not necessarily mean that that characteristic conveys higher risk. One particular reason why this may be the case is that certain characteristics are simply more common generally, and would therefore be associated with larger numbers of 'bad' sites even if the rate of association (i.e. the number as a proportion of the total) with such sites was not disproportionate. As an illustration of this point, we can consider the TLD; the .com domain extension, for example, will generally always be associated with large numbers of infringements, due simply to the large total number of domains registered on this extension. Accordingly, there will normally be a requirement to 'normalise' the raw numbers, by dividing the number of observed infringements by the total numbers of registered domains associated with the same instance of the particular feature (i.e. in the case of TLD, the total number of registered .com domains), to generate a measure of infringement frequency or 'hit rate' associated with the instance in question. Domain characteristics with greater infringement frequencies are generally more likely to be associated with higher risk.
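The normalisation step can be illustrated with a short sketch. The totals used are the DomainTools figures for .com and .top quoted in Table 1, but the infringement counts are purely hypothetical values chosen for illustration, and the function name `hit_rates` is an assumption.

```python
def hit_rates(bad_counts: dict[str, int], totals: dict[str, int]) -> dict[str, float]:
    """Infringements per registered domain, for feature values with a known total."""
    return {
        value: bad_counts[value] / totals[value]
        for value in bad_counts
        if totals.get(value)  # skip values with a missing or zero total
    }


# Totals from Table 1; the 'bad' counts are hypothetical.
totals = {".com": 155_728_200, ".top": 5_326_770}
bad_counts = {".com": 12_000, ".top": 3_000}
rates = hit_rates(bad_counts, totals)
```

With these illustrative inputs, .com has four times as many 'bad' sites as .top in absolute terms, but .top has the higher infringement frequency once the counts are normalised.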

In order to be able to carry out this type of analysis, it is necessary to compile 'look-up tables' of the (proportion of the) total numbers of registered domains which are associated with each possible option, for each feature of interest - i.e. ranked lists (by total (or relative) numbers) of the possible domain TLDs, registrars, hosting providers and nameserver hosts. The remainder of this article considers the process of compiling these lists and is illustrated by tables of the top entries (i.e. the most commonly-appearing options within the datasets) in each case. Whilst this has clear applications in threat scoring, it can also provide general insights in its own right, in terms of showing general trends within the domain name landscape.

Individual domain features

1. TLD

The total number of domains by TLD is a relatively simple statistic to obtain, as it can be trivially extracted from analysis of domain name zone files (at least for gTLDs (i.e. generic TLDs), for which the corresponding registries publish the data files and make them publicly accessible). A more comprehensive dataset (with significant additional ccTLD (i.e. country-code TLD) coverage) is that provided by DomainTools^[8], from which the top ten TLDs are shown in Table 1.

TLD	No. domains	% of dataset
.com	155,728,200	43.86%
.de	17,378,724	4.89%
.net	12,346,352	3.48%
.cn	11,975,245	3.37%
.org	11,226,231	3.16%
.uk	9,752,126	2.75%
.nl	5,973,733	1.68%
.ru	5,795,959	1.63%
.top	5,326,770	1.50%
.br	4,989,115	1.41%

Table 1: The top ten TLDs by number of registered domains (N = 355,069,958) (DomainTools, 08-Jul-2025)

2. Registrars

For domain registrars, the ideal statistic would be the total numbers of domains under management by each registrar. One estimate of this total statistic is that provided by DomainNameStat^[9], although some degree of 'post-processing' is required in order to obtain a 'clean' dataset, due to the existence of a range of variations by which some of the individual distinct registrars are referred to (e.g. with or without '.com', 'Inc.', 'Ltd', 'LLC', and the existence of other variations - e.g. there are over 1,200 distinct entries for DropCatch.com in DomainNameStat's list, mostly of the form 'DropCatch.com XXX LLC', where 'XXX' is a three- or four-digit string). The 'cleansed' list consists of over 1,100 distinct entities, of which the top ten are shown in Table 2.
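The 'cleansing' step might look something like the sketch below, which reduces each registrar name to a normalised key and then sums the counts for names sharing a key. The specific rules (which suffixes to strip, how numeric tokens are treated) are assumptions made for illustration and are not the rules actually applied to the DomainNameStat list; the function names are hypothetical.

```python
import re
from collections import Counter

SUFFIX_RE = re.compile(r"\b(inc|ltd|llc)\b\.?", re.IGNORECASE)


def registrar_key(name: str) -> str:
    """Reduce a registrar name to a normalised key for grouping variants."""
    key = name.lower()
    key = re.sub(r"(?<=\s)\d{3,4}(?=\s)", " ", key)  # e.g. the '1234' in 'DropCatch.com 1234 LLC'
    key = SUFFIX_RE.sub(" ", key)                    # drop corporate suffixes
    key = re.sub(r"\.com\b", "", key)                # treat 'GoDaddy.com' and 'GoDaddy' alike
    key = re.sub(r"[^a-z0-9]+", " ", key)            # collapse punctuation and whitespace
    return key.strip()


def merge_counts(raw_counts: dict[str, int]) -> Counter:
    """Sum domain counts across all name variants sharing the same key."""
    merged: Counter = Counter()
    for name, count in raw_counts.items():
        merged[registrar_key(name)] += count
    return merged
```

In practice, a rule set of this kind would need manual review, since over-aggressive normalisation can merge registrars which are genuinely distinct.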

Registrar	No. domains	% of dataset
GoDaddy.com	87,123,338	26.93%
NameCheap	24,389,502	7.54%
Tucows Domains	13,256,889	4.10%
Squarespace Domains	12,352,131	3.82%
Dynadot	9,019,705	2.79%
NameSilo	7,393,306	2.29%
GMO Internet Group, Inc. d/b/a Onamae.com	7,268,309	2.25%
IONOS	6,749,280	2.09%
Gname.com	6,659,851	2.06%
HOSTINGER operations	6,055,416	1.87%

Table 2: The top ten registrars by total number of domains under management (N = 323,498,496) (DomainNameStat, 08-Jul-2025)

As a 'sanity-check', it is informative to compare these statistics with those identified through an explicit look-up process. In order to reduce the number of look-ups required, though still maintaining a representative sample of the overall domain universe, we consider a set of domains taken by extracting every 500th domain from each of the domain name zone files. Broadly, domains are contained within the individual zone files in alphabetical order, so this equally-spaced sample should essentially provide a 'random' representative set of domains, which should not correlate obviously with any other characteristic. The only significant bias is that the zone-file analysis will exclude ccTLD domain results.
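An equally-spaced sample of this type is straightforward to draw from an ordered stream of domain names. The sketch below assumes that the zone file has already been reduced to one entry per registered domain; the function name `every_nth` is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator


def every_nth(domains: Iterable[str], step: int = 500) -> Iterator[str]:
    """Yield every `step`-th item (the 500th, 1,000th, ...) from an ordered stream."""
    return islice(domains, step - 1, None, step)
```

Because `islice` consumes the input lazily, the same approach can be applied to very large zone files without loading them into memory.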

The sampling process described above generates a dataset of just under half a million domains (actually around 484,000), from the total set of around 350 million registered domains. Carrying out a whois look-up on each domain in the sample dataset (where information is available on an automated basis) makes it possible to extract the registrar identity in around 390,000 cases. Following a similar data 'cleansing' process to that described previously, the top ten registrars from this dataset are shown in Table 3.

Registrar	No. domains	% of dataset
GoDaddy.com	116,640	29.80%
NameCheap	28,389	7.25%
Squarespace Domains	17,511	4.47%
Tucows Domains	17,007	4.34%
Network Solutions	8,939	2.28%
IONOS	8,802	2.25%
Gname.com	8,431	2.15%
Dynadot	8,155	2.08%
GMO Internet	7,633	1.95%
HOSTINGER operations	6,474	1.65%

Table 3: The top ten registrars by number of domains under management, based on a zone-file 'sampling' exercise (N = 391,416)

Overall, there is a good degree of similarity between these two lists (i.e. that provided by DomainNameStat and that provided by the sampled zone-file dataset), and the datasets do correlate with each other very well (correlation coefficient = 0.9923) (Figure 1).
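The article does not state which correlation measure was used; assuming the standard Pearson coefficient, the calculation can be sketched as follows, taking as inputs the paired domain counts for each registrar appearing in both lists. The function name `pearson` is hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Paired counts for four registrars from Tables 2 and 3 (GoDaddy.com,
# NameCheap, Tucows Domains, Squarespace Domains), for illustration only.
domainnamestat = [87_123_338, 24_389_502, 13_256_889, 12_352_131]
zone_sample = [116_640, 28_389, 17_007, 17_511]
coefficient = pearson(domainnamestat, zone_sample)
```

A value computed from only the top few entries in this way will not reproduce the published figure, which is based on the full lists.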

Figure 1: Comparison of the numbers of domains under management for each registrar as given by DomainNameStat and the zone-file sampling exercise

In this case, the statistics from DomainNameStat probably constitute a better dataset for use in threat scoring analysis (not least because of the vastly increased number of data points), but the high degree of correlation with the zone-file sample does provide some confidence that the latter dataset constitutes a robust data-source for analysis in extracting alternative domain features, such as those discussed below, in cases where no definitive third-party data overviews are available.

3. Nameserver hosts

The nameserver host (defined as the domain name given as the end-section of the nameserver (NS) record for the domain in question - e.g. 'cloudflare.com' in the case of 'aaden.ns.cloudflare.com') can easily be extracted for any given domain via a simple whois look-up. The statistics given for this feature relate to the first (primary) nameserver record for each domain, based on the dataset obtained from the zone-file sampling exercise (Table 4).
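A minimal sketch of the host extraction is shown below. It simply takes the final two labels of the nameserver name, which is an assumption that holds for the example given but not for nameservers under multi-label suffixes (such as .co.uk), where a public-suffix look-up would be needed. The function name is hypothetical.

```python
def nameserver_host(ns_record: str) -> str:
    """Return the last two labels of a nameserver name, e.g. 'cloudflare.com'."""
    labels = ns_record.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])
```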

Nameserver host	No. domains	% of dataset
domaincontrol.com	88,073	22.61%
cloudflare.com	35,422	9.09%
googledomains.com	16,290	4.18%
registrar-servers.com	14,286	3.67%
wixdns.net	11,344	2.91%
afternic.com	10,278	2.64%
dns-parking.com	9,082	2.33%
hichina.com	7,250	1.86%
share-dns.com	6,324	1.62%
namebrightdns.com	6,251	1.60%

Table 4: The top ten nameserver hosts by number of domains, based on the zone-file sample dataset (N = 389,584)

4. Hosting providers

The hosting provider for a domain is defined as the operator of the webserver associated with the (primary) IP address at which the domain is hosted. In this case, the 'top' hosting providers could be calculated on a per-IP address or a per-domain basis; however, in this analysis, the latter approach is taken (since, in general, different IP addresses will be associated with differing numbers of hosted domains, so a per-domain approach provides a more representative overview), again using the sampled zone file dataset (Table 5).

Hosting provider	No. domains	% of dataset
Amazon	115,151	37.22%
Cloudflare^[10]	33,945	10.97%
Squarespace	13,534	4.37%
Namecheap	11,604	3.75%
Google	8,894	2.87%
Shopify	8,116	2.62%
GoDaddy.com	4,918	1.59%
Unified Layer	4,604	1.49%
PSINet	4,398	1.42%
Newfold Digital	4,298	1.39%

Table 5: The top ten hosting providers by number of domains under management, based on the zone-file sample dataset (N = 309,409)

Conclusion

Whilst the statistics presented in this article provide some insights regarding the sets of top domain service providers in their own right, the most obvious application is (using the full datasets in each case, rather than just the top-tens shown in this overview) as 'look-up' tables, for the purposes of normalisation of statistics of those features most commonly associated with infringing or otherwise 'bad' sites, as part of an overall threat-scoring approach. A fuller formulation of such an approach - which is key to identifying priority targets from (potentially very large) sets of brand-monitoring results - will also require a dataset of known 'bad' sites, which should itself be as large as possible so as to provide the most meaningful statistics. Ultimately, it is likely that other domain characteristics (such as registrant characteristics, SSL providers, etc.), in addition to other features such as the presence of MX records, web traffic, etc., will also feed into the construction of an overall comprehensive algorithm.

References

[1] https://circleid.com/posts/towards-a-generalised-threat-scoring-framework-for-prioritising-results-from-brand-monitoring-programmes

[2] 'Patterns in Brand Monitoring' (D.N. Barnett, Business Expert Press, 2025), Chapter 5: 'Prioritisation criteria for specific types of content'

[3] https://www.iamstobbs.com/insights/exploring-a-domain-scoring-system-with-tricky-brands

[4] https://www.worldtrademarkreview.com/global-guide/anti-counterfeiting-and-online-brand-enforcement/2022/article/creating-cost-effective-domain-name-watching-programme

[5] https://circleid.com/posts/20230117-the-highest-threat-tlds-part-2

[6] https://www.iamstobbs.com/insights/notorious-ip-addresses-and-initial-steps-towards-the-formulation-of-an-overall-threat-score-for-websites

[7] https://circleid.com/posts/notorious-hosting-providers-an-overview-of-the-highest-threat-hosts-from-ip-address-blacklist-analysis

[8] https://research.domaintools.com/statistics/tld-counts/

[9] https://domainnamestat.com/statistics/registrar/others

[10] It is worth noting that Cloudflare offers 'pass-through' services, such that many websites simply utilising Cloudflare services will be associated with Cloudflare as the listed hosting provider. In such cases, the 'true' hosting provider can generally be determined only by contacting Cloudflare directly.

This article was first published on 25 July 2025 at:

https://circleid.com/posts/the-commonest-domain-features-constructing-look-up-tables-for-use-as-part-of-a-domain-risk-scoring-system

Thursday, 24 July 2025

you/talk/fast/med: another forthcoming batch of new gTLDs

Following our previous discussion^[1] of a new batch of domain-name extensions to be launched as part of the ongoing first phase of the new-gTLD programme, we present some follow-up comments relating to the next set of TLDs to be released; namely, .fast, .talk, .you, and .med.

.fast, .talk and .you, all offered through Amazon Registry, are set to enter their sunrise periods on 26 August, before going into general availability most probably in October. The .med extension is to relaunch on an unrestricted basis with a pre-registration phase from 31 August, and open registrations from 2 September^[2].

As with any new TLDs, the launches offer an opportunity for brand owners to review their domain registration policies, with a view potentially to registering relevant names for brand-related use, making defensive registrations or, at the very least, proactively monitoring the landscape and/or considering blocking mechanisms in order to defend against third-party abuse.

Given the nature of the four new extensions, there is potential for a wide range of use-cases. Some of the TLDs in particular could be relevant to specific business areas (particularly for .talk, which may be appropriate for communications service providers or for content relating to marketing or reviews; and .med, which has potential in medical or pharmaceutical applications - an area of specific risk for counterfeit products - or businesses with connections to the Mediterranean). This set of TLDs generally also offers options for utilisation as part of a tagline, or to convey specific brand messaging.

As in our previous study on new TLD launches, we consider the current landscape of similar domain names, as a proxy for the sorts of activity which may manifest themselves following the launches of the new extensions. Specifically, we consider the set of domain names ending with the respective terms.

As of the start of July 2025, zone-file analysis reveals over 252k domains with names ending with 'you', 153k with 'med', 61k with 'talk' and 58k with 'fast' (Figure 1). Table 1 shows the top five (pre-existing) TLDs represented in each of these four datasets.

Figure 1: Total numbers of (pre-existing) domains ending with 'you', 'med', 'talk' and 'fast' (split by (legacy) TLD)

Table 1: Numbers of domains associated with each of the top five (legacy) TLDs in each of the current sets of domains ending with 'you', 'med', 'talk' and 'fast'

Some trends are immediately apparent, such as the apparent popularity of those terms most directly relevant to the English language ('you', 'talk' and 'fast') with the UK market in particular (as evidenced by the extensive use of .co.uk), and the potential relevance of 'fast' to e-commerce (given the frequency of use of the .shop extension).

In some cases, the data is complicated by the presence of 'false positives' (which obscure any insights relating to the likely future use of the more explicitly relevant new-gTLD extensions), particularly for 'med' (which frequently occurs as a sub-string of other terms such as 'armed', 'formed', 'groomed', 'teamed' and a wide range of others, plus names such as 'muhammed') and 'fast' ('breakfast').

Having removed any obvious false positives, some insights can be gained by looking at the types of terms which most frequently appear immediately prior to the terms in question, within the set of domain names. In this analysis, we consider the strings of different lengths preceding the keywords under consideration, to assess any obvious patterns of usage.
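One way in which this type of count might be produced is sketched below: for each domain name ending with the keyword, the fixed-length string immediately preceding it is taken and counted if it appears in a word list. The function name, and the use of a pre-supplied `vocabulary` of English words, are assumptions made for illustration.

```python
from collections import Counter


def preceding_words(
    names: list[str], keyword: str, length: int, vocabulary: set[str]
) -> Counter:
    """Count dictionary words of a given length found immediately before `keyword`."""
    counts: Counter = Counter()
    for name in names:
        name = name.lower()
        if not name.endswith(keyword):
            continue
        stem = name[: -len(keyword)]  # the part of the name before the keyword
        prefix = stem[-length:]       # its final `length` characters
        if len(prefix) == length and prefix in vocabulary:
            counts[prefix] += 1
    return counts
```

Note that, under this approach, a single domain name can contribute to the counts at more than one length (a name ending with 'housefast' counts towards both 'house' and 'use', for example).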

The following lists show the most common English language words^[3] present in these datasets, for each of the terms considered:

Domains ending with 'you':

9-character words preceding 'you':

'beautiful' (256 instances) (i.e. domains end with 'beautifulyou')
'healthier' (151)

8-character words preceding 'you':

'welcomes' (180)

7-character words preceding 'you':

'healthy' (317)
'without' (144)

6-character words preceding 'you':

'better' (532)
'within' (447)
'around' (331)

5-character words preceding 'you':

'loves' (1,427)
'thank' (1,294)
'about' (1,015)
'moves' (443)
'round' (348)
'comes' (276)
'found' (246)
'bless' (157)
'helps' (141)
'power' (122)
'teach' (116)
'happy' (116)

4-character words preceding 'you':

'with' (3,615)
'near' (2,521)
'love' (2,111)
'like' (984)
'help' (765)
'best' (377)
'meet' (358)
'find' (276)
'miss' (237)
'move' (220)
'plus' (190)
'real' (187)
'from' (185)
'need' (177)
'told' (170)
'know' (168)
'true' (164)
'dare' (163)
'hear' (159)
'want' (153)

3-character words preceding 'you':

'for' (40,991)
'and' (4,496)
'are' (1,257)

Domains ending with 'med':

11-character words preceding 'med':

'integrative' (123)

10-character words preceding 'med':

'functional' (171)

9-character words preceding 'med':

'concierge' (73)
'lifestyle' (72)
'precision' (60)

8-character words preceding 'med':

'internal' (238)
'wellness' (42)

7-character words preceding 'med':

'natural' (101)

6-character words preceding 'med':

'sports' (1,220)
'family' (675)
'health' (273)
'global' (129)
'beauty' (110)
'dental' (86)
'mobile' (74)
'travel' (63)
'pharma' (62)
'physio' (58)
'techno' (57)
'cardio' (47)
'future' (41)
'social' (39)
'gastro' (39)

5-character words preceding 'med':

'sport' (177)
'sleep' (97)
'laser' (84)
'smart' (80)
'ortho' (79)

Domains ending with 'talk':

10-character words preceding 'talk':

'realestate' (76)
'webhosting' (34)

9-character words preceding 'talk':

'marketing' (40)

8-character words preceding 'talk':

'straight' (272)
'business' (118)
'football' (64)

7-character words preceding 'talk':

'bitcoin' (40)
'teacher' (39)
'toolbox' (37)
'fashion' (33)
'english' (28)

6-character words preceding 'talk':

'sports' (428)
'coffee' (171)
'health' (166)
'pillow' (143)
'travel' (98)
'street' (67)
'beauty' (63)
'people' (61)
'social' (50)
'family' (48)

5-character words preceding 'talk':

'small' (330)
'table' (319)
'could' (288)
'money' (233)
'trash' (114)
'cross' (98)
'power' (90)

4-character words preceding 'talk':

'tech' (842)
'real' (517)
'lets' (453)
'talk' (270)
'shop' (255)
'body' (242)
'news' (205)
'girl' (180)
'self' (150)

Domains ending with 'fast':

9-character words preceding 'fast':

'insurance' (59)
'solutions' (24)
'marketing' (23)
'followers' (16)
'customers' (15)

8-character words preceding 'fast':

'property' (98)
'business' (67)
'mortgage' (28)
'websites' (20)
'anything' (19)
'approved' (18)
'delivery' (17)
'patients' (15)

7-character words preceding 'fast':

'funding' (54)
'capital' (47)
'clients' (36)
'nowhere' (35)
'blazing' (34)
'service' (33)
'english' (27)
'finance' (27)
'website' (26)
'results' (26)
'digital' (26)
'connect' (21)
'closure' (21)
'forward' (20)
'tickets' (19)
'healthy' (19)
'freedom' (19)

6-character words preceding 'fast':

'houses' (382)
'weight' (136)
'online' (94)
'health' (31)
'crypto' (29)
'repair' (28)
'quotes' (28)
'better' (28)
'travel' (23)
'pounds' (22)
'ticket' (21)
'shirts' (21)
'strong' (21)
'design' (21)

5-character words preceding 'fast':

'house' (715)
'super' (291)
'homes' (238)
'loans' (131)
'parts' (92)
'trade' (74)
'learn' (63)
'stand' (62)
'ultra' (61)
'think' (60)
'smart' (58)
'funds' (51)

4-character words preceding 'fast':

'home' (477)
'cash' (273)
'sold' (166)
'food' (126)
'grow' (98)
'tech' (78)
'shop' (76)
'care' (73)
'ship' (68)
'very' (67)
'read' (67)
'help' (67)
'easy' (67)

3-character words preceding 'fast':

'use' (727)
'and' (426)
'old' (321)
'are' (146)
'buy' (143)
'car' (131)
'pay' (118)

This type of analysis may be able to help inform strategic considerations by brand owners on possible use-cases for the new gTLDs when they launch, and specifically whether there are any good fits for product types, slogans and potential wider marketing initiatives. Similarly, however, the same characteristics of these new domain extensions can make them attractive to infringers, highlighting the importance of a proactive approach to monitoring and enforcement as the landscape continues to develop.

References

[1] https://www.iamstobbs.com/insights/free-hot-spot-an-exploration-of-three-new-gtld-launches

[2] https://iptwins.com/2025/06/19/new-gtlds-med-talk-you-fast-set-to-launch-in-august-september/

[3] Neglecting any expletives, or words which seem to provide no potential for a phrase making grammatical sense

This article was first published on 24 July 2025 at:

https://www.iamstobbs.com/insights/you-talk-fast-med-another-forthcoming-batch-of-new-gtlds

Thursday, 10 July 2025

(Literally) Everything's £1 – The Poundland domain landscape

With the news that UK-based discount retailer Poundland has been sold to US investment company Gordon Brothers for a 'nominal sum' of less than its eponymous £1, amid 'challenging trading conditions'^[1,2,3], we take a look at the domain-name landscape for the brand, following similar previous analyses for other troubled companies^[4,5,6,7,8].

Consideration of the set of registered brand-specific domains is of key importance for any incumbent or incoming brand owner, for a number of reasons. Primary considerations might typically include assessing whether there is enough strength in the set of defensive registrations, determining if the portfolio could be downsized by lapsing low-priority and/or high-cost and obscure domain names in order to save on renewal costs, and assessing whether web traffic is optimised by ensuring that all inactive domains re-direct to the official transactional website^[9].

Brand owners should generally also analyse the landscape of third-party domains for any indications of fraud, brand infringement or traffic misdirection. This type of consideration can be particularly pertinent at times of high-profile news stories - such as this particular development with Poundland - when bad actors are often all too keen to take advantage of heightened public interest to launch their own scams associated with the brand.

In the case of Poundland, analysis of domain zone-file data^[10] showed that, as of 13-Jun-2025 (i.e. one day after the break of the news story), there were 120 registered domains with names containing the brand. Whilst this is a relatively modestly-sized landscape, it certainly still warrants a deeper dive to determine any associated trends and patterns.

Whilst the company's official primary domain (poundland.co.uk) has limited available registrant information (as is usual for .co.uk domains due to data redaction following the introduction of GDPR), it is possible to identify other associated characteristics, such as registrar and MX record hosting provider, to confirm its official status. These details can then be cross-referenced to identify other official domains in the portfolio. Additionally, the associated (also official) .com domain (poundland.com), which can be seen to re-direct to the .co.uk version of the site, does have a somewhat richer associated dataset.
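The cross-referencing step described above can be sketched as a simple comparison of each domain's characteristics against those of the known official domain. This is a minimal illustration only: the record fields (`registrar`, `mx_provider`) and all values are hypothetical placeholders, and a match would flag a candidate for review rather than prove official ownership.

```python
def shares_official_fingerprint(record, reference, fields=("registrar", "mx_provider")):
    """Return True if a domain record matches the reference on every chosen field.

    The field names are assumptions for illustration; any characteristics
    available for the official domain could be used instead.
    """
    return all(
        record.get(field) is not None and record.get(field) == reference.get(field)
        for field in fields
    )


# Hypothetical placeholder data (not taken from the study).
official = {"domain": "brand.example", "registrar": "Registrar A", "mx_provider": "Mail Host X"}
candidates = [
    {"domain": "brand-offers.example", "registrar": "Registrar A", "mx_provider": "Mail Host X"},
    {"domain": "brand-deals.example", "registrar": "Registrar B", "mx_provider": "Mail Host Y"},
    {"domain": "brand-shop.example", "registrar": "Registrar A", "mx_provider": None},
]

likely_official = [
    c["domain"] for c in candidates if shares_official_fingerprint(c, official)
]
```

Requiring a match on several fields at once is a deliberately cautious choice, since a single shared registrar is a weak signal on its own.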

On this basis, at least 46 of the 120 brand-specific domains can be seen definitively to be under Poundland's official ownership. Only 11 of these display official content, with the remainder found to be non-resolving, displaying error pages or blank pages, leaving some room for further portfolio configuration optimisation.

This leaves 74 potential third-party domains to be assessed for potential threats. Of these, 32 produce some sort of live website response, and 40 are configured with active MX (mail exchange) records, indicating the ability to send and receive e-mails - providing a potential risk of phishing activity and/or other types of brand impersonation from these domains.

Amongst the domains resolving to live content, a range of examples hosting various types of content of potential concern were identified. Some examples, all of which are worthy of consideration for enforcement action, are shown in Figure 1.

Figure 1: Examples of websites featuring content of potential concern, associated with Poundland-specific domain names:

(a) e-commerce and utilisation of official branding (lovepoundland[.]store and moban-poundland[.]site)
(b) e-commerce and use of same brand name (poundlandshop[.]store)
(c) e-commerce - re-direction to external third-party sites (i. poundlandfabric[.]com - re-directs to poundametre[.]com; ii. onlinepoundland[.]co[.]uk and onlinepoundland[.]com - both re-direct to mxwholesale[.]co[.]uk)
(d) misdirection to third-party content (poundlandol[.]shop)
(e) log-in page with official branding (poundlandreporting[.]co[.]uk) (possibly official)

Other examples include pages displaying pay-per-click (PPC) links, or domains being offered for sale, highlighting the intention of taking advantage of the renown of the brand to monetise the web traffic being driven to the sites in question.

Some additional data 'clusters' are also apparent, such as a batch of three .shop (dot-shop) names referencing 'poundlandheart', all registered with privacy-protected whois records, through the same registrar on the same day.
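As a rough sketch of how clusters of this kind might be surfaced automatically, the snippet below groups domain records that share the same registrar, registration date and privacy status. The record layout and the sample values are hypothetical and are not taken from the dataset discussed here.

```python
from collections import defaultdict


def cluster_by_registration(records):
    """Group domains sharing registrar, creation date and privacy status.

    Only groups containing more than one domain are returned, since a
    single domain does not constitute a cluster.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["registrar"], rec["created"], rec["privacy"])
        groups[key].append(rec["domain"])
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


# Hypothetical placeholder records.
sample = [
    {"domain": "brandheart-a.shop", "registrar": "Registrar A", "created": "2025-05-01", "privacy": True},
    {"domain": "brandheart-b.shop", "registrar": "Registrar A", "created": "2025-05-01", "privacy": True},
    {"domain": "brandheart-c.shop", "registrar": "Registrar A", "created": "2025-05-01", "privacy": True},
    {"domain": "brand-other.com", "registrar": "Registrar B", "created": "2024-11-20", "privacy": False},
]

clusters = cluster_by_registration(sample)
```

Other shared attributes (name servers, hosting IP address, contact e-mail address) could be added to the grouping key in the same way.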

One further domain (poundlandharlow[.]com) was found to be registered simply to 'Poundland' (rather than the more usual 'Poundland Limited' used for other official domains - and with a non-official registrar), and may represent a 'semi-official domain', perhaps registered by an individual store franchisee, highlighting also some requirement for portfolio consolidation - an additional point which the new brand owners would also be well advised to address.

Given the range of relevant findings from just a small pool of domains for Poundland, we also strongly recommend that other brand owners remain vigilant in their brand protection endeavours. In the eyes of infringers, any brand-related news is good news, as it generally results in increased levels of public interest and volumes of search traffic. At such times, bad actors will find opportunities to take full advantage, and brand owners will generally find that, in those moments, good preparation and a robust brand protection strategy will pay off.

References

[1] https://www.ft.com/content/31c6338d-74c8-4c71-ad20-337beade4c71

[2] https://www.theguardian.com/business/2025/jun/12/poundland-sold-for-1-with-dozens-of-store-closures-expected

[3] https://www.bbc.co.uk/news/articles/c36594lr29ko

[4] https://www.iamstobbs.com/opinion/wilko-a-target-for-scams-following-administration

[5] https://www.iamstobbs.com/opinion/high-steaks-game-hawksmoors-ipo-and-its-domains

[6] https://www.iamstobbs.com/opinion/ip-and-digital-due-diligence-constructing-a-domain-policy-that-matches-brand-owner-requirements

[7] https://www.iamstobbs.com/opinion/no-party-ip-associated-with-the-fallen-tupperware-brand

[8] https://www.iamstobbs.com/insights/alas-smiths-an-exploration-of-wh-smiths-domains-following-their-store-closures

[9] https://www.iamstobbs.com/opinion/strategies-for-constructing-a-domain-name-registration-and-management-policy

[10] The analysis includes direct interrogation of raw domain-name zone files where available, generally thereby giving comprehensive coverage across all gTLDs, and is augmented by the use of additional datasets for ccTLD results, to gain maximum possible (though not completely comprehensive) coverage in these cases.

This article was first published on 10 July 2025 at:

https://www.iamstobbs.com/insights/literally-everythings-ps1-the-poundland-domain-landscape

Thursday, 3 July 2025

Exploring a domain scoring system with 'tricky' brands

by David Barnett and Frankie Cheung

EXECUTIVE SUMMARY

Our new study illustrates how a relatively simple 'domain risk scoring' approach, analysing just the domain name itself and incorporating 'weightings' dependent on the context within the domain name where the brand reference appears, and the presence of relevant and non-relevant keywords, can be used to rank effectively the domains identified through broad searches. In extensions to this idea, it would be possible to extend the scoring formulation to take account of other inherent characteristics of the domain, such as TLD, MX record, or registrant, registrar or hosting-provider characteristics.

Furthermore, by combining this domain risk scoring approach with a 'content risk score' formulation, comprising an analysis of the content of any associated webpage, it is possible to carry out a deeper dive into the set of ranked results, to identify live content of potential interest, to serve as priority targets for further analysis, content tracking, or enforcement.

This article was first published on 3 July 2025 at:

https://www.iamstobbs.com/insights/exploring-a-domain-scoring-system-with-tricky-brands

* * * * *

WHITE PAPER

Introduction

A very significant objective in brand monitoring applications is the ability to rank findings in order of importance, or potential threat level, with a view to identifying priority targets for further analysis, content tracking, or enforcement^[1]. This can be particularly important in the case of monitoring for domains containing brand names which may be short or common words in their own right, and/or which frequently appear as sub-strings of other unrelated terms. A requirement for effective prioritisation arises from the fact that, for these types of 'tricky' (from a monitoring point of view) brand names, searches often generate large numbers of results - many of which are non-related 'false positives' - and it is often difficult to find the results of interest amongst the 'noise'.

For domain monitoring specifically, it is generally necessary to be able to apply an effective filtering and sorting approach even in the absence of any live site content - so as to be able to identify examples which may be 'weaponised' at a later date, which may be in use for other purposes such as for their e-mail functionality, or which may be candidates for acquisition or dispute. In these cases, the analysis therefore needs to take account of inherent features of the domain name itself, rather than necessarily considering the content of any associated webpage.

In this paper, we consider the cases of the following selection of short/common brand names (sometimes referred to as 'generic' terms - though not in the trademark-related sense of the word) (all of which use the .com domain featuring an exact match to their brand name as their primary website domain), taken from the list of top-50 most valuable brands in 2024, as provided by Interbrand^[2]:

Apple (#1, brand value: $488.9B)
IBM (#19, brand value: $37.3B)
SAP (#20, brand value: $36.8B)
Visa (#32, brand value: $21.1B)
UPS (#35, brand value: $20.0B)
Intel (#37, brand value: $19.7B)
GE ('General Electric') (#47, brand value: $17.1B)
AXA (#48, brand value: $16.8B)

For simplicity, the study is based (just) on searches for gTLD (i.e. generic top-level domains, such as .com, .net, etc.) domains containing the brand names of interest, for which comprehensive datasets are available through the analysis of domain-name zone files.

Analysis

The scale of the landscape

Table 1 shows the total raw numbers of domain results returned in response to a search for each of the brand names in question.

Brand-name string	No. gTLD domains
apple	84,556
ibm	25,812
sap	298,759
visa	81,433
ups	202,648
intel	144,323
ge	10,174,156
axa	71,306

Table 1: Numbers of gTLD domains containing the names of each of the brands under consideration

Shown below, for each of the brands, is a sample of the domains returned in the raw data (actually each 5,000th, 10,000th, 25,000th, 50,000th or 1,000,000th result - depending on the numbers of results returned - when sorted into alphabetical order). These examples are intended to give an indication of the types of results picked up by the searches, the extent to which the vast majority of these names reference the brand name in an unrelated context, and the corresponding importance of employing an effective filtering and scoring process to prioritise the results and identify the significant findings.
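The search-and-sample step described above might be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the match is assumed to be a simple sub-string test against the part of the name to the left of the first dot, and the input is assumed to be a plain list of domain names already extracted from the zone files.

```python
def matching_domains(domains, brand):
    """Return, in alphabetical order, the domains whose SLD contains the brand string."""
    brand = brand.lower()
    return sorted(d.lower() for d in domains if brand in d.lower().split(".")[0])


def every_nth(sorted_domains, n):
    """Return each n-th entry (the n-th, 2n-th, 3n-th, ...) of an ordered list."""
    return sorted_domains[n - 1::n]


# Small example input, using names that appear in the samples in this article.
found = matching_domains(
    ["pineapplepods.com", "sapia-ai.com", "0000apple.com", "appleshears.com"],
    "apple",
)
sample = every_nth(found, 2)
```

A pure sub-string match of this kind is what produces the large numbers of unrelated results for short brand names, which is the motivation for the scoring approach that follows.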

apple:

0000apple[.]com
apple-company[.]com
applelens[.]app
appleshears[.]com
applewaysuzuki[.]com
dapplevalleyfarm[.]com
kappler[.]group
pineapplepods[.]com
thehalfeatenapplecompany[.]com

ibm:

001lisn9itt6q5db7uc3ibms2273h9ha[.]shop
aribm78ifopp3r5k0k9ffk3dt5v241v9[.]org
hibmw[.]com
ibmtivoli[.]com
om13g2l2rlg8ibmsvf82hcj2coiu8pco[.]com
vetoj10th2ibmcu9j2kr774uo89kk7l8[.]store

sap:

000webhosapp[.]com
chapaexpresstrainsapa[.]com
hesapliarsa[.]online
myhsapps[.]com
sapia-ai[.]com
supersapphirewins[.]com

visa:

007ukvisas[.]com
childvisas[.]com
expeditevisavietnam[.]org
invisalign-nuernberg[.]info
nohasslevisaonline[.]com
swedenvisa-palestinianterritory[.]com
visabahis717[.]com
visamastersindia[.]com
winwinvisa[.]com

ups:

003oijaviqr4a39nubups221f8nav1lr[.]com
funeralstartups[.]com
p707nllm9pg5igjdf2h1rh581ups0d7p[.]net
tmallups[.]com
www-trackingshipment-ups[.]com

intel:

007intel[.]com
customsintel[.]com
intelibud[.]com
intelligentbusinessoperations[.]com
intelspect[.]com
saintelizabethcalgary[.]com

ge (due to the size of the dataset, showing only examples from the set of .com results, for simplicity):

0-100agency[.]com
brridgewaybentech[.]com
eventgeneratorsandcooling[.]com
getgeniusmindai[.]com
klargehtdas[.]com
numberonepage[.]com
significantsurgery[.]com
vo44digms6age13m2nob75e8743cldqr[.]com

axa:

00axax[.]com
axarn[.]com
energietaxatie[.]com
laxallstars[.]net
mydaxa[.]com
relaxationexpert[.]com
taxandglobal[.]com
xaxasp10[.]xyz

Domain scoring

In order to filter and prioritise the results, we propose as a first step the use of a 'domain risk score', based just on characteristics of the domain name itself, and intended to provide a measure of the degree of relevance of the brand name in question. Note that, in more comprehensive scoring systems, it may be appropriate to consider additional domain features which can provide an overall indication of the potential level of risk, such as the TLD (top-level domain, or domain-name extension), presence of any MX (mail exchange) record, or registrant, registrar or hosting-provider characteristics, but these are not considered in this study.

The proposed basic algorithm incorporates a number of components to the final calculated domain risk score, as follows:

A weighting dependent on where, within the domain name, the brand reference appears, from the following options (from greatest to least significance):

Instances where the SLD (the second-level domain name, or the part of the name to the left of the dot) consists of the brand name only
Instances where the brand name appears at the start of the domain name
Instances where the brand name appears at the end of the domain name
Instances where the brand name appears elsewhere within the domain name

A greater weighting for instances where the brand reference is 'hyphen-separated' from the rest of the domain name (e.g. apple-abc.com would be deemed to be more brand-relevant than appleabc.com, as there is less scope for confusion with cases where the brand name can appear as a sub-string of other terms)

An optional greater weighting for domain names containing a more highly-distinctive variant of the basic brand name

Additional score increments for each reference to any of a pre-determined set of 'relevance keywords' (which can relate to the industry area of the brand in question, or to specific issue types of interest - e.g. phishing-related keywords) (i.e. 'positive filtering'); these keywords can also be assigned into 'tiers', with higher-relevance keywords being assigned larger scores

A negative score increment for any reference to a known non-relevant 'false positive' (e.g. for 'axa', we may choose to explicitly downweight any domain containing the term 'relaxation') (i.e. 'negative filtering')

An additional score component reflecting the proportion of the domain name (in terms of the number of characters) consisting only of the brand name or any of the relevant keywords (with numerical digits disregarded), the rationale being that a domain is more likely to be interesting if it consists only of the brand name plus relevant keywords

Examples of these sorts of keywords (and as also used in the analysis which follows) are shown in Table 2.

Brand name	Relevance keywords ('tier 1')	Relevance keywords ('tier 2')	Known 'false positives'
apple	iphone, ipad, airpod, mac, watch, vision	shop, store, login, verif, secur, auth	grapple, pineapple
ibm	business, cloud, storage, analy, network, secur, software
sap	business, cloud, tech, software, enterprise, system, data		sapien, sapporo, whatsapp
visa	credit, payment, contactless, commerc	login, verif, secur, auth	immigrat, travel, citizen, asylum, passport, student, invisalign, envisage, televisa, visable
ups	deliver, track, ship, logistic, courier, parcel, packag	login, verif, secur, auth	pop(-)ups, start(-)ups, catch(-)ups, check(-)ups, grown(-)ups, hook(-)ups, set(-)ups, touch(-)ups, clean(-)ups, upscale, upside, upstate, upshot, upsanddowns, groups
intel	core, xeon, business, process, system, device, driver, network, software		intelligen, inteligen, intellect
ge	general(-)electric, aerospace, healthcare, vernova, tech
axa	insur, quot, claim, business, health, multicar, breakdown, bank, banq, fund, financ	login, verif, secur, auth	relaxation, taxation, taxadv, taxacc, laxative

Table 2: Groups of keywords used in the scoring algorithm for each of the brands
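The components listed above can be combined as in the following minimal sketch. It is not the scoring formula used in the study: every numerical weight is a hypothetical placeholder, the optional 'distinctive variant' weighting is omitted, and hyphens are assumed to be disregarded alongside digits when calculating the proportion of the name covered by the brand and keywords.

```python
import re

# All weights are hypothetical placeholders, not the values used in the study.
POSITION_WEIGHTS = {"exact": 400, "start": 200, "end": 150, "elsewhere": 50}
HYPHEN_BONUS = 100
TIER_SCORES = {1: 60, 2: 30}
FALSE_POSITIVE_PENALTY = 300
COVERAGE_WEIGHT = 100


def domain_risk_score(domain, brand, tier1=(), tier2=(), false_positives=()):
    """Score a domain name using only features of the SLD."""
    sld = domain.lower().split(".")[0]
    if brand not in sld:
        return 0.0

    # 1. Weighting for where the brand reference appears.
    if sld == brand:
        position = "exact"
    elif sld.startswith(brand):
        position = "start"
    elif sld.endswith(brand):
        position = "end"
    else:
        position = "elsewhere"
    score = float(POSITION_WEIGHTS[position])

    # 2. Bonus where the brand is a hyphen-separated token.
    if sld != brand and brand in sld.split("-"):
        score += HYPHEN_BONUS

    # 3. Increment for each reference to a tiered relevance keyword.
    matched = []
    for tier, keywords in ((1, tier1), (2, tier2)):
        for kw in keywords:
            occurrences = sld.count(kw)
            if occurrences:
                score += TIER_SCORES[tier] * occurrences
                matched.append(kw)

    # 4. Penalty for any known 'false positive' term.
    if any(fp in sld for fp in false_positives):
        score -= FALSE_POSITIVE_PENALTY

    # 5. Proportion of the name (digits and hyphens removed) made up of
    #    the brand name and matched keywords.
    letters = re.sub(r"[0-9-]", "", sld)
    remainder = letters
    for term in sorted([brand] + matched, key=len, reverse=True):
        remainder = remainder.replace(term, "")
    if letters:
        score += COVERAGE_WEIGHT * (1 - len(remainder) / len(letters))
    return score
```

With these placeholder weights, apple-abc.com scores higher than appleabc.com, and a name containing a listed false positive such as 'pineapple' is pushed well down the ranking. The absolute values carry no meaning; only the resulting ordering is of interest.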

Following the analysis, the top-scored (i.e. potentially most relevant) domains for each of the brands are shown in Tables 3 a - h (excluding, for the purposes of illustration, any examples where the SLD is an exact match to the brand name, as these are anyway easily identified and will always be worthy of review). Please note also that, in a live service, any domains under official ownership would likely be excluded on the basis of the use of a whitelist or analysis of registrant / registrar information (not carried out in this study).

Domain name	Domain risk score
applemacipadipodstore[.]com	637
applemacipodipadstore[.]com	637
apple-iphone-ipad-ipod[.]com	611
apple-store-iphone[.]com	603
apple-watch-store[.]com	601
apple-watch-store[.]online	601
apple-ipad-shop[.]com	598
appleiphoneipad[.]com	575
apple-macbook-shop[.]com	558
apple-loginsecure[.]com	551

Table 3a: Top ten results by domain risk score for 'apple'

Domain name	Domain risk score
ibm-business-analytics[.]com	620
cloudsecurity-ibm[.]com	603
ibmbusinesscloud[.]com	575
ibmcloudsoftware[.]com	575
ibmcloudstorage[.]com	575
ibmcloudsecurity[.]com	538
ibmsmartbusinesscloud[.]biz	527
ibmsmartbusinesscloud[.]com	527
ibmsmartbusinesscloud[.]info	527
ibmsmartbusinesscloud[.]net	527
ibmsmartbusinesscloud[.]org	527

Table 3b: Top ten results by domain risk score for 'ibm'

Domain name	Domain risk score
business-data-cloud-sap[.]com	774
sapbusinessdatacloud[.]com	725
sapbusiness1cloud[.]com	575
sapbusinesscloud[.]com	575
sapenterprisecloud[.]com	575
sapbusinessonesoftware[.]com	548
sapbusinessonesoftware[.]info	548
sapbusinessonesoftware[.]net	548
sapbusinessonesoftware[.]org	548
sapbusinessonecloud[.]com	543
sapbusinessonecloud[.]net	543

Table 3c: Top ten results by domain risk score for 'sap'

Domain name	Domain risk score
visasecurepayment[.]com	513
visa-payment[.]com	508
visa-credit[.]com	507
visa-credit[.]net	507
visa-credits[.]com	492
unsecured-visa-credit-cards[.]net	486
payment-visa[.]com	483
securvisapayment[.]com	475
unsecured-visa-credit-card-applications[.]com	452
visa-secure[.]com	439
visa-secure[.]net	439
visa-verify[.]com	439

Table 3d: Top ten results by domain risk score for 'visa'

Domain name	Domain risk score
track-package-rescheduled-delivery-ups[.]com	711
ups-parceltrack[.]org	662
ups-delivery-parcel[.]com	643
ups-packagedelivery[.]com	643
ups-parceltracking[.]com	631
deliveryparcel-ups[.]com	628
trackpackage-ups[.]com	625
ups-deliverytrack-mt[.]com	625
ups-parcell-tracker[.]com	622
ups-parcel-tracking[.]com	622

Table 3e: Top ten results by domain risk score for 'ups'

Domain name	Domain risk score
intelsoftwarenetwork[.]com	575
intellcoresystems[.]com	551
intellicore-network[.]info	543
intellicorenetworks[.]com	543
intellicoresystems[.]com	542
intel-business[.]com	511
intel-software[.]com	511
intel-network[.]com	510
intel-system[.]com	508
intel-core[.]com	505
intel-core[.]net	505
intel-core[.]vip	505

Table 3f: Top ten results by domain risk score for 'intel'

Domain name	Domain risk score
ge-healthcaretech[.]com	663
ge-healthcaretechinc[.]com	635
ge-healthcaretechnology[.]com	614
ge-healthcaretechnologies[.]com	603
ge-healthcaretechnologiesinc[.]net	589
gehealthcaretech[.]com	575
gentechhealthcare[.]com	563
gehealthcaretechinc[.]com	543
geltechealthcare[.]com	525
gentechealthcare[.]com	525

Table 3g: Top ten results by domain risk score for 'ge' (noting that only one example of a result for each unique SLD is shown, due to the large numbers of repeated SLDs in the overall dataset)

Domain name	Domain risk score
axa-banque-finance[.]com	619
axa-health-insurance-slovakia[.]online	572
axafinancebank[.]com	561
axafinancialbank[.]com	538
axainsurancebreakdown[.]com	537
axabusinessinsurance[.]biz	535
axabusinessinsurance[.]com	535
axabusinessinsurance[.]info	535
axabusinessinsurance[.]mobi	535
axabusinessinsurance[.]net	535

Table 3h: Top ten results by domain risk score for 'axa'

The examples show that the algorithm performs well in terms of separating out the relevant examples from the large numbers of other results in the datasets.

Extensions to the approach

i. Use of domain name (SLD) entropy

In some cases, particularly for the shortest brand names, the datasets may include instances of long, pseudo-random domain names (such as several of the examples shown above for 'ibm'). These types of domains are often associated with automated registrations intended for fraudulent use^[3], but will not, in general, be associated with the brand whose name may be contained within them, and should ideally be disregarded (or downweighted) in the types of scoring algorithms described in this paper.

However, the analysis shows that the basic scoring algorithm outlined in this study often does not effectively distinguish between domain names of this type and other 'better' brand matches (i.e. more relevant results). For example, for 'ibm', all of the following examples are assigned a domain risk score of 125:

i03204i8ua9n7sle6sdrm81mri0cibm9[.]net
i0f29td98etcibm9gkc29v4v9j39p5qm[.]top
i0lf99g2t8u92p7ibmlj4tvav849jp1n[.]tel
i216r5835dfoush9k1iibm4vpd669dka[.]top
i2ai773hvhan7l9001its1r8ibm84cav[.]site
i5u5127lfb56iibmj4bfa4c0m03mjt4f[.]motorcycles
i66t9t7vau8of667ibmlho120ab32bbv[.]online
i6crr5n3uqmmsmm5it7874uj099ibm87[.]com
i6t27emh03o11cfm6oa0r73l2ibmeki4[.]com
i76kcibmcu3310epn6lagpp292ivj114[.]top
i967pv1vn4outp103ibm7673diirjp3c[.]top
i98fmfibmcnjnbg2999s402pgem2258s[.]top
ia0n95j263iibmvue4s8v6lhll753a7s[.]com
ibmclassroom[.]com
ibmclienteng[.]com
ibmcognitive[.]com
ibmcognitive[.]org
ibmcomputers[.]asia
ibmcomputers[.]com
ibmcomputing[.]com
ibmcomputing[.]info
ibmconfigure[.]com
ibmcontracts[.]com
ibmcorporate[.]com

This mix of result types is due to the wide range of factors contributing to the final overall calculated score, including the fact that many of the long, random domain names consist of large numbers of digits, meaning that once these are disregarded, the 'ibm' string accounts for a significant proportion of the remainder of the domain name.

One possible way to account for the differences between these types of domain name would be to make use of the concept of domain name (SLD) entropy; essentially, a measure of the length and randomness of the domain name. The categorisation can be achieved by applying a 'correction' to the calculated domain risk score, by reducing it by a factor which is dependent on the domain name entropy (and, in the proposed methodology, applying this only to domains with entropy values above a certain threshold, since some of the visually-relevant domain names are found to have 'mid-range' entropy values).

As a case study, we can consider the dataset of 1,504 'ibm' domains in total which are assigned a (raw) domain risk score of 125. The entropy values of these domains sit in a range between 1.4591 (mibmim[.]com) and 4.6350 (fhibmd96pt2or8745a2cltjj1gu4373e[.]com), with (by inspection) most of the 'random' domain names found to have entropy values above around 3.5 (which can be termed the entropy 'threshold', H_th). As such, a suitable reduction factor (R) for the domain risk score can be defined in terms of the domain entropy (H) as:

R = exp(H) / exp(H_th)	(for H > H_th)
R = 1	(otherwise)

such that the adjusted final domain risk score (D_adj) can be defined in terms of the 'raw' score (D) as:

D_adj = D / R
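The correction defined above can be sketched as follows, assuming that the entropy is the Shannon entropy (in bits per character) of the character distribution of the SLD, and taking H_th = 3.5. This assumption reproduces the value of 1.4591 quoted for mibmim[.]com, although the exact formulation used in the study is not stated here and may differ in detail. The function names are illustrative only.

```python
import math
from collections import Counter

H_TH = 3.5  # entropy threshold; the text quotes "around 3.5", chosen by inspection


def sld_entropy(domain):
    """Shannon entropy (bits per character) of the characters in the SLD."""
    sld = domain.lower().split(".")[0]
    counts = Counter(sld)
    n = len(sld)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def reduction_factor(h, h_th=H_TH):
    """R = exp(H) / exp(H_th) for H > H_th, and 1 otherwise."""
    return math.exp(h - h_th) if h > h_th else 1.0


def adjusted_score(raw_score, domain, h_th=H_TH):
    """D_adj = D / R, where R depends on the entropy of the domain's SLD."""
    return raw_score / reduction_factor(sld_entropy(domain), h_th)
```

Under these assumptions, a raw score of 125 for ibmcomputers[.]com is left unchanged (its entropy is below the threshold), whereas the same raw score for slmibm8epk1u84[.]com is reduced to approximately 122.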

The form of this reduction factor function is as shown in Figure 1.

Figure 1: One possible formulation of a domain risk score reduction factor (R) to be used to 'down-score' high entropy (H) domains

This correction results in a 'down-scoring' of 642 of the 1,504 domains. As an illustration, Table 4 gives a selection of those domains whose final scores have been reduced as a result of the entropy-based correction (actually alphabetically the first domain assigned to each adjusted score value), showing that the correction does, as intended, preferentially affect the 'random' domain names.

Domain name	Adjusted domain risk score (D_adj)
slmibm8epk1u84[.]com	122
ibmpower4saphana[.]com	116
ibmathsworld[.]com	115
4659sib4645muss5msgf5buribm8e1u6[.]top	103
shibmaro323429fjcnrin43rncnr43rvnfuiru448484848484[.]com	94
97bj94io2ibm42fppgqi7n274f73fsji[.]how	92
647d75i7co7mj7b0l7vmmqr4ibmd06qu[.]net	90
kidmi5b71tibm7b0ff560iuq1c5ir477[.]pro	89
ibmknaj5mcimebc3iaqchinml5l3h6ve[.]top	88
413b3ibmlu6n9iq4qa4441cancjm96ap[.]com	87
br74cgrf32bbsgr3rsc7s6ofs94nqibm[.]com	86
v0q7bbtnb0atnqj68l0au0age1a7bibm[.]com	85

Table 4: Examples of domains whose risk scores have been reduced by the entropy-based correction factor

ii. Content risk scoring

As an extension to the above ideas, it is also possible to calculate a second score, based on an analysis of the content of any associated webpage (if present), as an alternative or secondary means of sorting the results (working on the basis that, other factors being equal, a domain will be of greater concern if it is associated with live, brand-related content).

To this end, we can formulate a 'content risk score', which itself is composed of two constituent components:

A 'brand content score', reflecting the number and prominence of mentions of the brand name on the page

An additional metric reflecting the numbers of unique relevance keywords mentioned at least once anywhere in the page content (to take account of the fact that, for common / 'generic' brand terms, the brand name could be mentioned in contexts unrelated to the brand in question, but the presence of relevance keywords will indicate that the subject matter of the page is relevant to the brand in question).
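The two components above might be combined as in the following sketch. The weights are hypothetical, and 'prominence' is represented here simply by counting a mention in the page title more heavily than one in the body text; a fuller implementation could also consider headings, metadata or position on the page.

```python
# Hypothetical weights, for illustration only.
TITLE_MENTION_WEIGHT = 50
BODY_MENTION_WEIGHT = 10
KEYWORD_WEIGHT = 100


def content_risk_score(title, body, brand, keywords):
    """Combine a brand content score with a count of unique relevance keywords."""
    title_l, body_l, brand_l = title.lower(), body.lower(), brand.lower()

    # Component 1: number and prominence of brand mentions.
    brand_content_score = (
        TITLE_MENTION_WEIGHT * title_l.count(brand_l)
        + BODY_MENTION_WEIGHT * body_l.count(brand_l)
    )

    # Component 2: unique relevance keywords appearing at least once anywhere.
    text = title_l + " " + body_l
    unique_keywords = {kw for kw in keywords if kw.lower() in text}

    return brand_content_score + KEYWORD_WEIGHT * len(unique_keywords)


score = content_risk_score(
    "Apple Watch Repair",
    "We repair every Apple Watch and iPhone.",
    "apple",
    ["watch", "iphone", "ipad"],
)
```

Counting each keyword only once means that a page cannot inflate its score simply by repeating a single term, which is in keeping with the 'unique keywords' wording above.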

As an illustration, we can calculate the content risk scores for sets of the domains assigned the highest domain risk scores for each of the brands in question, as a means of identifying live content of interest (e.g. potential infringements).

As an example, Table 5 shows the website details for the examples achieving the highest content risk scores (i.e. potentially the most relevant websites) out of a set of those results for 'apple' which themselves receive the highest domain risk scores (>300) (i.e. potentially the most relevant domain names).

Domain name	Domain risk score	Website page title	Content risk score
applewatchjournal.net	343	Apple Watch Journal - Apple Watch （アップルウォッチ）の総合情報サイト。 Apple Watchの基本的な使い方やWatch アプリの情報、最新ニュースを紹介します！	4,640
applelivingstore.com	300	Apple Living Store – Vente des iphones neufs et occasions	4,150
appleministore.com	318	Shop the Latest Apple Products iPhones; MacBooks; iPads & More	4,020
applewatchcast.com	368	The Apple WatchCast Podcast - A podcast dedicated to the Apple Watch	2,700
applewatchrepairz.com	343	Get Professional Apple Watch Repair Services | Fast & Affordable	2,300
apple-mac.support	503	Apple Spezialist im Rheinland | Mac Support für Kunden in Köln, Bonn, Düsseldorf und Aachen | KLEUTGENS.IT	2,226
apple.watch	500	Apple Watch - Apple	2,150
apple-wholesale-stores.com	366	Apple Wholesale Store - Buy Apple Products at the Best Price	2,145

Table 5: Website details for the examples achieving the highest content risk scores for Apple
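The two-stage ranking behind Table 5 (shortlist by domain risk score, then sort by content risk score) can be sketched as below, using a subset of the rows from the table. The threshold is treated as inclusive here, as an assumption, because the table lists a domain with a domain risk score of 300.

```python
# (domain, domain risk score, content risk score), taken from Table 5.
results = [
    ("apple-mac.support", 503, 2226),
    ("apple.watch", 500, 2150),
    ("applelivingstore.com", 300, 4150),
    ("appleministore.com", 318, 4020),
    ("applewatchjournal.net", 343, 4640),
]

DOMAIN_SCORE_THRESHOLD = 300  # assumed inclusive


def rank_by_content(rows, threshold=DOMAIN_SCORE_THRESHOLD):
    """Keep rows meeting the domain-score threshold, ordered by content score (highest first)."""
    shortlist = [row for row in rows if row[1] >= threshold]
    return sorted(shortlist, key=lambda row: row[2], reverse=True)


ranked = rank_by_content(results)
```

Note that the ordering by content risk score differs from the ordering by domain risk score: apple-mac[.]support has the highest domain risk score of these examples but the fourth-highest content risk score.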

On this basis, Figure 2 shows one example of an identified live website of interest (i.e. brand-related content / potential brand infringement) for each of the brands under consideration.

Figure 2: Examples of an identified live website of interest for each of the brands under consideration: apple-wholesale-stores[.]com, ibmisecurity[.]com, sap-system[.]com, paymentvisanet[.]com, ups17track[.]com, intel-processor[.]com, gevernovatechtraining[.]com, axainsurancebali[.]com

Conclusion

The studies presented in this paper have illustrated how a relatively simple 'domain risk scoring' approach can be used to effectively rank domains identified through broad searches, so as to identify names of particular interest, even in cases where the brand name used as the basis of the search may be a very short or common term.

In extensions to this idea, it would be possible to extend the scoring formulation to take account of other inherent characteristics of the domain, such as TLD, MX record, or registrant, registrar or hosting-provider characteristics, many of which can themselves be assigned into 'tiers' of potential threat level, and scored accordingly.

Finally, by combining this domain risk scoring approach with a 'content risk score' formulation, it is possible to carry out a deeper dive into the set of ranked results, to identify live content of potential interest, to serve as priority targets for further analysis, content tracking, or enforcement.

References

[1] https://circleid.com/posts/towards-a-generalised-threat-scoring-framework-for-prioritising-results-from-brand-monitoring-programmes

[2] https://interbrand.com/best-global-brands/

[3] https://circleid.com/posts/20230703-an-overview-of-the-concept-and-use-of-domain-name-entropy

This article was first published as a white paper on 3 July 2025 at:

https://www.iamstobbs.com/uploads/general/Exploring-a-domain-scoring-system-with-tricky-brands-e-book.pdf