Thursday, 3 July 2025

Exploring a domain scoring system with 'tricky' brands

by David Barnett and Frankie Cheung

EXECUTIVE SUMMARY

A key objective in brand monitoring is the ability to rank findings in order of importance, or potential threat level, so as to identify priority targets for further analysis, content tracking, or enforcement. This is particularly important when monitoring for domains containing brand names which are short or common words in their own right, and/or which frequently appear as sub-strings of other unrelated terms.

Our new study illustrates how a relatively simple 'domain risk scoring' approach can be used to effectively rank domains identified through broad searches. The approach analyses just the domain name itself, and incorporates 'weightings' dependent on the context within the domain name where the brand reference appears and on the presence of relevance and non-relevance keywords. As an extension of this idea, the scoring formulation could also take account of other inherent characteristics of the domain, such as TLD, MX record, or registrant, registrar or hosting-provider characteristics.

Furthermore, by combining this domain risk scoring approach with a 'content risk score' formulation, comprising an analysis of the content of any associated webpage, it is possible to carry out a deeper dive into the set of ranked results, to identify live content of potential interest, to serve as priority targets for further analysis, content tracking, or enforcement.

This article was first published on 3 July 2025 at:

https://www.iamstobbs.com/insights/exploring-a-domain-scoring-system-with-tricky-brands

* * * * *

WHITE PAPER

Introduction

A key objective in brand monitoring is the ability to rank findings in order of importance, or potential threat level, so as to identify priority targets for further analysis, content tracking, or enforcement[1]. This is particularly important when monitoring for domains containing brand names which are short or common words in their own right, and/or which frequently appear as sub-strings of other unrelated terms. The need for effective prioritisation arises because, for these types of brand names (which are 'tricky' from a monitoring point of view), searches often generate large numbers of results, many of which are unrelated 'false positives', making it difficult to find the results of interest amongst the 'noise'.

For domain monitoring specifically, it is generally necessary to be able to apply an effective filtering and sorting approach even in the absence of any live site content - so as to be able to identify examples which may be 'weaponised' at a later date, which may be in use for other purposes such as for their e-mail functionality, or which may be candidates for acquisition or dispute. In these cases, the analysis therefore needs to take account of inherent features of the domain name itself, rather than necessarily considering the content of any associated webpage.

In this paper, we consider the following selection of short/common brand names (sometimes referred to as 'generic' terms, though not in the trademark-related sense of the word), all of which use the .com domain featuring an exact match to their brand name as their primary website domain. The brands are taken from the list of the top-50 most valuable brands in 2024, as provided by Interbrand[2]:

  • Apple (#1, brand value: $488.9B)
  • IBM (#19, brand value: $37.3B)
  • SAP (#20, brand value: $36.8B)
  • Visa (#32, brand value: $21.1B)
  • UPS (#35, brand value: $20.0B)
  • Intel (#37, brand value: $19.7B)
  • GE ('General Electric') (#47, brand value: $17.1B)
  • AXA (#48, brand value: $16.8B)

For simplicity, the study is based (just) on searches for gTLD (i.e. generic top-level domains, such as .com, .net, etc.) domains containing the brand names of interest, for which comprehensive datasets are available through the analysis of domain-name zone files. 

Analysis

The scale of the landscape

Table 1 shows the total raw numbers of domain results returned in response to a search for each of the brand names in question.

Brand-name string | No. gTLD domains
apple | 84,556
ibm | 25,812
sap | 298,759
visa | 81,433
ups | 202,648
intel | 144,323
ge | 10,174,156
axa | 71,306

Table 1: Numbers of gTLD domains containing the names of each of the brands under consideration

Shown below, for each of the brands, is a sample of the domains returned in the raw data (specifically, each 5,000th, 10,000th, 25,000th, 50,000th or 1,000,000th result, depending on the number of results returned, when sorted into alphabetical order). These examples are intended to give an indication of the types of results picked up by the searches, the extent to which the vast majority of these names reference the brand name in an unrelated context, and the corresponding importance of employing an effective filtering and scoring process to prioritise the results and identify the significant findings.

apple:

  • 0000apple[.]com
  • apple-company[.]com
  • applelens[.]app
  • appleshears[.]com
  • applewaysuzuki[.]com
  • dapplevalleyfarm[.]com
  • kappler[.]group
  • pineapplepods[.]com
  • thehalfeatenapplecompany[.]com

ibm:

  • 001lisn9itt6q5db7uc3ibms2273h9ha[.]shop
  • aribm78ifopp3r5k0k9ffk3dt5v241v9[.]org
  • hibmw[.]com
  • ibmtivoli[.]com
  • om13g2l2rlg8ibmsvf82hcj2coiu8pco[.]com
  • vetoj10th2ibmcu9j2kr774uo89kk7l8[.]store

sap:

  • 000webhosapp[.]com
  • chapaexpresstrainsapa[.]com
  • hesapliarsa[.]online
  • myhsapps[.]com
  • sapia-ai[.]com
  • supersapphirewins[.]com

visa:

  • 007ukvisas[.]com
  • childvisas[.]com
  • expeditevisavietnam[.]org
  • invisalign-nuernberg[.]info
  • nohasslevisaonline[.]com
  • swedenvisa-palestinianterritory[.]com
  • visabahis717[.]com
  • visamastersindia[.]com
  • winwinvisa[.]com

ups:

  • 003oijaviqr4a39nubups221f8nav1lr[.]com
  • funeralstartups[.]com
  • p707nllm9pg5igjdf2h1rh581ups0d7p[.]net
  • tmallups[.]com
  • www-trackingshipment-ups[.]com

intel:

  • 007intel[.]com
  • customsintel[.]com
  • intelibud[.]com
  • intelligentbusinessoperations[.]com
  • intelspect[.]com
  • saintelizabethcalgary[.]com

ge (due to the size of the dataset, showing only examples from the set of .com results, for simplicity):

  • 0-100agency[.]com
  • brridgewaybentech[.]com
  • eventgeneratorsandcooling[.]com
  • getgeniusmindai[.]com
  • klargehtdas[.]com
  • numberonepage[.]com
  • significantsurgery[.]com
  • vo44digms6age13m2nob75e8743cldqr[.]com

axa:

  • 00axax[.]com
  • axarn[.]com
  • energietaxatie[.]com
  • laxallstars[.]net
  • mydaxa[.]com
  • relaxationexpert[.]com
  • taxandglobal[.]com
  • xaxasp10[.]xyz
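The search-and-sample step behind the lists above can be sketched as follows. This is a minimal illustration only, assuming the zone-file data has already been reduced to a collection of domain-name strings and that the search is a plain case-insensitive sub-string test on the SLD; the function names `find_matches` and `sample_every_nth` are hypothetical and are not taken from the study.

```python
from typing import Iterable, List


def find_matches(domains: Iterable[str], brand: str) -> List[str]:
    """Return domains whose SLD contains `brand`, sorted alphabetically."""
    brand = brand.lower()
    hits = []
    for domain in domains:
        name = domain.strip().lower()
        sld = name.rsplit(".", 1)[0]  # part of the name to the left of the final dot
        if brand in sld:
            hits.append(name)
    return sorted(hits)


def sample_every_nth(sorted_hits: List[str], n: int) -> List[str]:
    """Return the n-th, 2n-th, 3n-th, ... entries of an already-sorted list."""
    return sorted_hits[n - 1::n]


# Toy input; a real run would stream names from the gTLD zone files.
zone = ["pineapplepods.com", "apple-company.com", "example.net",
        "kappler.group", "0000apple.com"]
matches = find_matches(zone, "apple")
sample = sample_every_nth(matches, 2)
print(matches)
print(sample)
```

With a step of 5,000 rather than 2, the same slice would pick out each 5,000th result, as in the samples listed for the smaller datasets.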

Domain scoring

In order to filter and prioritise the results, we propose as a first step the use of a 'domain risk score', based just on characteristics of the domain name itself, and intended to provide a measure of the degree of relevance of the brand name in question. Note that, in more comprehensive scoring systems, it may be appropriate to consider additional domain features which can provide an overall indication of the potential level of risk, such as the TLD (top-level domain, or domain-name extension), presence of any MX (mail exchange) record, or registrant, registrar or hosting-provider characteristics, but these are not considered in this study.

The proposed basic algorithm incorporates a number of components to the final calculated domain risk score, as follows:

  • A weighting dependent on where, within the domain name, the brand reference appears, from the following options (from greatest to least significance):
    • Instances where the SLD (the second-level domain name, or the part of the name to the left of the dot) consists of the brand name only
    • Instances where the brand name appears at the start of the domain name
    • Instances where the brand name appears at the end of the domain name
    • Instances where the brand name appears elsewhere within the domain name
  • A greater weighting for instances where the brand reference is 'hyphen-separated' from the rest of the domain name (e.g. apple-abc.com would be deemed to be more brand-relevant than appleabc.com, as there is less scope for confusion with cases where the brand name can appear as a sub-string of other terms)
  • An optional greater weighting for domain names containing a more highly-distinctive variant of the basic brand name
  • Additional score increments for each reference to any of a pre-determined set of 'relevance keywords' (which can relate to the industry area of the brand in question, or to specific issue types of interest - e.g. phishing-related keywords) (i.e. 'positive filtering'); these keywords can also be assigned into 'tiers', with higher-relevance keywords being assigned larger scores 
  • A negative score increment for any reference to a known non-relevant 'false positive' (e.g. for 'axa', we may choose to explicitly downweight any domain containing the term 'relaxation') (i.e. 'negative filtering')
  • An additional score component reflecting the proportion of the domain name (in terms of the number of characters) accounted for by the brand name and any of the relevance keywords (with numerical digits disregarded in this calculation), the rationale being that a domain is more likely to be of interest if it consists only of the brand name plus relevance keywords
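The components listed above can be combined along the following lines. This is a rough sketch rather than the formulation used in the study: every numerical weight is an invented placeholder, the function name `domain_risk_score` is hypothetical, the optional 'distinctive variant' weighting is omitted, each keyword is counted once however often it appears, and hyphens are assumed to be disregarded alongside digits when computing the character proportion. The keyword lists are those given for 'apple' in Table 2.

```python
import re
from typing import Dict, Iterable

# Every weight below is an illustrative assumption, not a value from the study.
POSITION_WEIGHTS = {"exact": 500, "start": 100, "end": 75, "elsewhere": 25}
HYPHEN_BONUS = 50
TIER_SCORES = {1: 50, 2: 25}
FALSE_POSITIVE_PENALTY = 200
PROPORTION_WEIGHT = 100


def domain_risk_score(domain: str, brand: str,
                      keywords: Dict[str, int],
                      false_positives: Iterable[str]) -> float:
    """Score one domain name; `keywords` maps each relevance keyword to its tier."""
    sld = domain.lower().rsplit(".", 1)[0]
    if brand not in sld:
        return 0.0

    # Position of the brand reference, from most to least significant.
    if sld == brand:
        position = "exact"
    elif sld.startswith(brand):
        position = "start"
    elif sld.endswith(brand):
        position = "end"
    else:
        position = "elsewhere"
    score = float(POSITION_WEIGHTS[position])

    # Hyphen separation: the brand is a complete hyphen-delimited token.
    if sld != brand and brand in sld.split("-"):
        score += HYPHEN_BONUS

    # Positive filtering: one increment per matched relevance keyword.
    matched = [kw for kw in keywords if kw in sld]
    score += sum(TIER_SCORES[keywords[kw]] for kw in matched)

    # Negative filtering: penalise each known false positive found.
    score -= FALSE_POSITIVE_PENALTY * sum(1 for fp in false_positives if fp in sld)

    # Proportion of the name (digits and hyphens disregarded) covered by
    # the brand name and the matched keywords.
    letters = re.sub(r"[\d-]", "", sld)
    remainder = letters
    for term in [brand, *matched]:
        remainder = remainder.replace(term, " ")
    uncovered = len(remainder.replace(" ", ""))
    if letters:
        score += PROPORTION_WEIGHT * (1 - uncovered / len(letters))
    return score


APPLE_KEYWORDS = {"iphone": 1, "ipad": 1, "airpod": 1, "mac": 1, "watch": 1,
                  "vision": 1, "shop": 2, "store": 2, "login": 2, "verif": 2,
                  "secur": 2, "auth": 2}
APPLE_FALSE_POSITIVES = ["grapple", "pineapple"]

for name in ("apple-watch-store.com", "appleshears.com", "pineapplepods.com"):
    score = domain_risk_score(name, "apple", APPLE_KEYWORDS, APPLE_FALSE_POSITIVES)
    print(name, round(score, 1))
```

Under these placeholder weights the hyphen-separated, keyword-rich name scores highest and the 'pineapple' name is pushed below zero, which is the ordering the approach is designed to produce; the absolute values are not comparable with the scores reported in the tables.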

Examples of these sorts of keywords (and as also used in the analysis which follows) are shown in Table 2. 

Brand name | Relevance keywords ('tier 1') | Relevance keywords ('tier 2') | Known 'false positives'
apple | iphone, ipad, airpod, mac, watch, vision | shop, store, login, verif, secur, auth | grapple, pineapple
ibm | business, cloud, storage, analy, network, secur, software | (none) | (none)
sap | business, cloud, tech, software, enterprise, system, data | (none) | sapien, sapporo, whatsapp
visa | credit, payment, contactless, commerc | login, verif, secur, auth | immigrat, travel, citizen, asylum, passport, student, invisalign, envisage, televisa, visable
ups | deliver, track, ship, logistic, courier, parcel, packag | login, verif, secur, auth | pop(-)ups, start(-)ups, catch(-)ups, check(-)ups, grown(-)ups, hook(-)ups, set(-)ups, touch(-)ups, clean(-)ups, upscale, upside, upstate, upshot, upsanddowns, groups
intel | core, xeon, business, process, system, device, driver, network, software | (none) | intelligen, inteligen, intellect
ge | general(-)electric, aerospace, healthcare, vernova, tech | (none) | (none)
axa | insur, quot, claim, business, health, multicar, breakdown, bank, banq, fund, financ | login, verif, secur, auth | relaxation, taxation, taxadv, taxacc, laxative

Table 2: Groups of keywords used in the scoring algorithm for each of the brands

Following the analysis, the top-scored (i.e. potentially most relevant) domains for each of the brands are shown in Tables 3a-h (excluding, for the purposes of illustration, any examples where the SLD is an exact match to the brand name, as these are in any case easily identified and will always be worthy of review). Note also that, in a live service, any domains under official ownership would likely be excluded through the use of a whitelist or analysis of registrant / registrar information (not carried out in this study).

Domain name | Domain risk score
applemacipadipodstore[.]com | 637
applemacipodipadstore[.]com | 637
apple-iphone-ipad-ipod[.]com | 611
apple-store-iphone[.]com | 603
apple-watch-store[.]com | 601
apple-watch-store[.]online | 601
apple-ipad-shop[.]com | 598
appleiphoneipad[.]com | 575
apple-macbook-shop[.]com | 558
apple-loginsecure[.]com | 551

Table 3a: Top ten results by domain risk score for 'apple'

Domain name | Domain risk score
ibm-business-analytics[.]com | 620
cloudsecurity-ibm[.]com | 603
ibmbusinesscloud[.]com | 575
ibmcloudsoftware[.]com | 575
ibmcloudstorage[.]com | 575
ibmcloudsecurity[.]com | 538
ibmsmartbusinesscloud[.]biz | 527
ibmsmartbusinesscloud[.]com | 527
ibmsmartbusinesscloud[.]info | 527
ibmsmartbusinesscloud[.]net | 527
ibmsmartbusinesscloud[.]org | 527

Table 3b: Top ten results by domain risk score for 'ibm'

Domain name | Domain risk score
business-data-cloud-sap[.]com | 774
sapbusinessdatacloud[.]com | 725
sapbusiness1cloud[.]com | 575
sapbusinesscloud[.]com | 575
sapenterprisecloud[.]com | 575
sapbusinessonesoftware[.]com | 548
sapbusinessonesoftware[.]info | 548
sapbusinessonesoftware[.]net | 548
sapbusinessonesoftware[.]org | 548
sapbusinessonecloud[.]com | 543
sapbusinessonecloud[.]net | 543

Table 3c: Top ten results by domain risk score for 'sap'

Domain name | Domain risk score
visasecurepayment[.]com | 513
visa-payment[.]com | 508
visa-credit[.]com | 507
visa-credit[.]net | 507
visa-credits[.]com | 492
unsecured-visa-credit-cards[.]net | 486
payment-visa[.]com | 483
securvisapayment[.]com | 475
unsecured-visa-credit-card-applications[.]com | 452
visa-secure[.]com | 439
visa-secure[.]net | 439
visa-verify[.]com | 439

Table 3d: Top ten results by domain risk score for 'visa'

Domain name | Domain risk score
track-package-rescheduled-delivery-ups[.]com | 711
ups-parceltrack[.]org | 662
ups-delivery-parcel[.]com | 643
ups-packagedelivery[.]com | 643
ups-parceltracking[.]com | 631
deliveryparcel-ups[.]com | 628
trackpackage-ups[.]com | 625
ups-deliverytrack-mt[.]com | 625
ups-parcell-tracker[.]com | 622
ups-parcel-tracking[.]com | 622

Table 3e: Top ten results by domain risk score for 'ups'

Domain name | Domain risk score
intelsoftwarenetwork[.]com | 575
intellcoresystems[.]com | 551
intellicore-network[.]info | 543
intellicorenetworks[.]com | 543
intellicoresystems[.]com | 542
intel-business[.]com | 511
intel-software[.]com | 511
intel-network[.]com | 510
intel-system[.]com | 508
intel-core[.]com | 505
intel-core[.]net | 505
intel-core[.]vip | 505

Table 3f: Top ten results by domain risk score for 'intel'

Domain name | Domain risk score
ge-healthcaretech[.]com | 663
ge-healthcaretechinc[.]com | 635
ge-healthcaretechnology[.]com | 614
ge-healthcaretechnologies[.]com | 603
ge-healthcaretechnologiesinc[.]net | 589
gehealthcaretech[.]com | 575
gentechhealthcare[.]com | 563
gehealthcaretechinc[.]com | 543
geltechealthcare[.]com | 525
gentechealthcare[.]com | 525

Table 3g: Top ten results by domain risk score for 'ge' (noting that only one example of a result for each unique SLD is shown, due to the large numbers of repeated SLDs in the overall dataset)

Domain name | Domain risk score
axa-banque-finance[.]com | 619
axa-health-insurance-slovakia[.]online | 572
axafinancebank[.]com | 561
axafinancialbank[.]com | 538
axainsurancebreakdown[.]com | 537
axabusinessinsurance[.]biz | 535
axabusinessinsurance[.]com | 535
axabusinessinsurance[.]info | 535
axabusinessinsurance[.]mobi | 535
axabusinessinsurance[.]net | 535

Table 3h: Top ten results by domain risk score for 'axa'

The examples show that the algorithm performs well in terms of separating out the relevant examples from the large numbers of other results in the datasets.

Extensions to the approach

i. Use of domain name (SLD) entropy

In some cases, particularly for the shortest brand names, the datasets may include instances of long, pseudo-random domain names (such as several of the examples shown above for 'ibm'). These types of domains are often associated with automated registrations intended for fraudulent use[3], but will not, in general, be associated with the brand whose name may be contained within them, and should ideally be disregarded (or downweighted) in the types of scoring algorithms described in this paper.

However, the analysis shows that the basic scoring algorithm outlined in this study often does not effectively distinguish between domain names of this type and other 'better' brand matches (i.e. more relevant results). For example, for 'ibm', all of the following examples are assigned a domain risk score of 125:

  • i03204i8ua9n7sle6sdrm81mri0cibm9[.]net
  • i0f29td98etcibm9gkc29v4v9j39p5qm[.]top
  • i0lf99g2t8u92p7ibmlj4tvav849jp1n[.]tel
  • i216r5835dfoush9k1iibm4vpd669dka[.]top
  • i2ai773hvhan7l9001its1r8ibm84cav[.]site
  • i5u5127lfb56iibmj4bfa4c0m03mjt4f[.]motorcycles
  • i66t9t7vau8of667ibmlho120ab32bbv[.]online
  • i6crr5n3uqmmsmm5it7874uj099ibm87[.]com
  • i6t27emh03o11cfm6oa0r73l2ibmeki4[.]com
  • i76kcibmcu3310epn6lagpp292ivj114[.]top
  • i967pv1vn4outp103ibm7673diirjp3c[.]top
  • i98fmfibmcnjnbg2999s402pgem2258s[.]top
  • ia0n95j263iibmvue4s8v6lhll753a7s[.]com
  • ibmclassroom[.]com
  • ibmclienteng[.]com
  • ibmcognitive[.]com
  • ibmcognitive[.]org
  • ibmcomputers[.]asia
  • ibmcomputers[.]com
  • ibmcomputing[.]com
  • ibmcomputing[.]info
  • ibmconfigure[.]com
  • ibmcontracts[.]com
  • ibmcorporate[.]com

This mix of result types is due to the wide range of factors contributing to the final overall calculated score, including the fact that many of the long, random domain names consist of large numbers of digits, meaning that once these are disregarded, the 'ibm' string accounts for a significant proportion of the remainder of the domain name.

One possible way to account for the differences between these types of domain name would be to make use of the concept of domain name (SLD) entropy; essentially, a measure of the length and randomness of the domain name. The distinction can be made by applying a 'correction' to the calculated domain risk score, reducing it by a factor which is dependent on the domain name entropy (and, in the proposed methodology, applying this only to domains with entropy values above a certain threshold, since some of the visually-relevant domain names are found to have 'mid-range' entropy values).
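A minimal sketch of an entropy calculation is given here, assuming that 'entropy' means the Shannon entropy (in bits) of the character distribution of the SLD. That assumption reproduces the value of 1.4591 quoted for mibmim[.]com in the case study which follows, but the precise definition should be checked against reference [3]; the function name `sld_entropy` is hypothetical.

```python
import math
from collections import Counter


def sld_entropy(domain: str) -> float:
    """Shannon entropy (bits) of the characters in the SLD of `domain`."""
    sld = domain.lower().rsplit(".", 1)[0]  # part to the left of the final dot
    if not sld:
        return 0.0
    counts = Counter(sld)
    total = len(sld)
    # Sum of -p * log2(p) over each distinct character's relative frequency p.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(round(sld_entropy("mibmim.com"), 4))  # 1.4591
```

Because the measure depends on how many distinct characters appear and how evenly they are spread, long strings drawn from a wide mix of letters and digits score high, while short or repetitive names score low.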

As a case study, we can consider the dataset of 1,504 'ibm' domains in total which are assigned a (raw) domain risk score of 125. The entropy values of these domains sit in a range between 1.4591 (mibmim[.]com) and 4.6350 (fhibmd96pt2or8745a2cltjj1gu4373e[.]com), with (by inspection) most of the 'random' domain names found to have entropy values above around 3.5 (which can be termed the entropy 'threshold', $H_{th}$). As such, a suitable reduction factor ($R$) for the domain risk score can be defined in terms of the domain entropy ($H$) as:

$$R = \begin{cases} \exp(H) / \exp(H_{th}) & \text{for } H > H_{th} \\ 1 & \text{otherwise} \end{cases}$$

such that the adjusted final domain risk score ($D_{adj}$) can be defined in terms of the 'raw' score ($D$) as:

$$D_{adj} = D / R$$

The form of this reduction factor function is as shown in Figure 1. 

Figure 1: One possible formulation of a domain risk score reduction factor (R) to be used to 'down-score' high entropy (H) domains
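The correction can be sketched in a few lines, assuming Shannon entropy in bits over the characters of the SLD, a threshold of 3.5, and rounding of the adjusted score to the nearest integer (the rounding is an assumption). The function names are hypothetical. Under these assumptions the sketch gives 122, 116 and 115 for the three sample names below that exceed the threshold, matching the adjusted scores listed for them in Table 4.

```python
import math
from collections import Counter

ENTROPY_THRESHOLD = 3.5  # the threshold value chosen by inspection in the case study


def sld_entropy(sld: str) -> float:
    """Shannon entropy (bits) of the characters in an SLD string."""
    if not sld:
        return 0.0
    counts = Counter(sld)
    total = len(sld)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def reduction_factor(h: float, h_th: float = ENTROPY_THRESHOLD) -> float:
    """R = exp(H) / exp(H_th) above the threshold, and 1 otherwise."""
    return math.exp(h - h_th) if h > h_th else 1.0


def adjusted_score(raw_score: float, sld: str,
                   h_th: float = ENTROPY_THRESHOLD) -> float:
    """Adjusted score: the raw score divided by the reduction factor."""
    return raw_score / reduction_factor(sld_entropy(sld), h_th)


for sld in ("ibmcomputers", "slmibm8epk1u84", "ibmpower4saphana", "ibmathsworld"):
    print(sld, round(adjusted_score(125, sld)))
```

Since exp(H) / exp(H_th) equals exp(H - H_th), the factor grows smoothly from 1 at the threshold, so names just above it (such as ibmathsworld) lose only a few points while the long pseudo-random strings are reduced more heavily.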

This correction results in a 'down-scoring' of 642 of the 1,504 domains. As an illustration, Table 4 gives a selection of those domains whose final scores have been reduced as a result of the entropy-based correction (actually alphabetically the first domain assigned to each adjusted score value), showing that the correction does, as intended, preferentially affect the 'random' domain names.


Domain name | Adjusted domain risk score (Dadj)
slmibm8epk1u84[.]com | 122
ibmpower4saphana[.]com | 116
ibmathsworld[.]com | 115
4659sib4645muss5msgf5buribm8e1u6[.]top | 103
shibmaro323429fjcnrin43rncnr43rvnfuiru448484848484[.]com | 94
97bj94io2ibm42fppgqi7n274f73fsji[.]how | 92
647d75i7co7mj7b0l7vmmqr4ibmd06qu[.]net | 90
kidmi5b71tibm7b0ff560iuq1c5ir477[.]pro | 89
ibmknaj5mcimebc3iaqchinml5l3h6ve[.]top | 88
413b3ibmlu6n9iq4qa4441cancjm96ap[.]com | 87
br74cgrf32bbsgr3rsc7s6ofs94nqibm[.]com | 86
v0q7bbtnb0atnqj68l0au0age1a7bibm[.]com | 85

Table 4: Examples of domains whose risk scores have been reduced by the entropy-based correction factor

ii. Content risk scoring

As an extension to the above ideas, it is also possible to calculate a second score, based on an analysis of the content of any associated webpage (if present), as an alternative or secondary means of sorting the results (working on the basis that, other factors being equal, a domain will be of greater concern if it is associated with live, brand-related content). 

To this end, we can formulate a 'content risk score', which itself is composed of two constituent components:

  • A 'brand content score', reflecting the number and prominence of mentions of the brand name on the page
  • An additional metric reflecting the numbers of unique relevance keywords mentioned at least once anywhere in the page content (to take account of the fact that, for common / 'generic' brand terms, the brand name could be mentioned in contexts unrelated to the brand in question, but the presence of relevance keywords will indicate that the subject matter of the page is relevant to the brand in question). 
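One way the two components could be combined is sketched below. The paper does not give the formula, so the weights, the treatment of 'prominence' as a simple title-versus-body distinction, and the function name `content_risk_score` are all assumptions made for illustration. The page title is taken from Table 5; the body text is invented.

```python
import re
from typing import Iterable

# Illustrative placeholder weights, not values from the study.
TITLE_MENTION_WEIGHT = 10  # a brand mention in the page title (more prominent)
BODY_MENTION_WEIGHT = 1    # a brand mention elsewhere on the page
KEYWORD_WEIGHT = 5         # each unique relevance keyword found at least once


def content_risk_score(title: str, body: str, brand: str,
                       keywords: Iterable[str]) -> int:
    """Combine a brand content score with a unique-keyword count."""
    brand_re = re.compile(re.escape(brand), re.IGNORECASE)

    # Brand content score: number of mentions, weighted by prominence.
    title_mentions = len(brand_re.findall(title))
    body_mentions = len(brand_re.findall(body))
    brand_content = (TITLE_MENTION_WEIGHT * title_mentions
                     + BODY_MENTION_WEIGHT * body_mentions)

    # Keyword metric: each keyword counts once, however often it appears.
    text = f"{title} {body}".lower()
    unique_keywords = {kw for kw in keywords if kw.lower() in text}

    return brand_content + KEYWORD_WEIGHT * len(unique_keywords)


APPLE_KEYWORDS = ["iphone", "ipad", "airpod", "mac", "watch", "vision",
                  "shop", "store", "login", "verif", "secur", "auth"]
title = "Apple Wholesale Store - Buy Apple Products at the Best Price"
body = "Apple iPhone and iPad deals. Shop Apple Watch today."  # invented text
score = content_risk_score(title, body, "apple", APPLE_KEYWORDS)
print(score)  # 47
```

Counting each keyword only once keeps a page that repeats a single term many times from outscoring a page that covers a broader range of brand-related subject matter.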

As an illustration, we can calculate the content risk scores for sets of the domains assigned the highest domain risk scores for each of the brands in question, as a means of identifying live content of interest (e.g. potential infringements). 

As an example, Table 5 shows the website details for the examples achieving the highest content risk scores (i.e. potentially the most relevant websites) out of a set of those results for 'apple' which themselves receive the highest domain risk scores (>300) (i.e. potentially the most relevant domain names).

Domain name | Domain risk score | Website page title | Content risk score
applewatchjournal.net | 343 | "Apple Watch Journal - Apple Watch (アップルウォッチ)の総合情報サイト。Apple Watchの基本的な使い方やWatchアプリの情報、最新ニュースを紹介します!" | 4,640
applelivingstore.com | 300 | "Apple Living Store – Vente des iphones neufs et occasions" | 4,150
appleministore.com | 318 | "Shop the Latest Apple Products iPhones; MacBooks; iPads & More" | 4,020
applewatchcast.com | 368 | "The Apple WatchCast Podcast - A podcast dedicated to the Apple Watch" | 2,700
applewatchrepairz.com | 343 | "Get Professional Apple Watch Repair Services | Fast & Affordable" | 2,300
apple-mac.support | 503 | "Apple Spezialist im Rheinland | Mac Support für Kunden in Köln, Bonn, Düsseldorf und Aachen | KLEUTGENS.IT" | 2,226
apple.watch | 500 | "Apple Watch - Apple" | 2,150
apple-wholesale-stores.com | 366 | "Apple Wholesale Store - Buy Apple Products at the Best Price" | 2,145

Table 5: Website details for the examples achieving the highest content risk scores for Apple

On this basis, Figure 2 shows one example of an identified live website of interest (i.e. brand-related content / potential brand infringement) for each of the brands under consideration.

Figure 2: Examples of an identified live website of interest for each of the brands under consideration: apple-wholesale-stores[.]com, ibmisecurity[.]com, sap-system[.]com, paymentvisanet[.]com, ups17track[.]com, intel-processor[.]com, gevernovatechtraining[.]com, axainsurancebali[.]com

Conclusion

The studies presented in this paper have illustrated how a relatively simple 'domain risk scoring' approach can be used to effectively rank domains identified through broad searches, so as to identify names of particular interest, even in cases where the brand name used as the basis of the search may be a very short or common term.

As an extension of this idea, the scoring formulation could also take account of other inherent characteristics of the domain, such as TLD, MX record, or registrant, registrar or hosting-provider characteristics, many of which can themselves be assigned into 'tiers' of potential threat level, and scored accordingly.

Finally, by combining this domain risk scoring approach with a 'content risk score' formulation, it is possible to carry out a deeper dive into the set of ranked results, to identify live content of potential interest, to serve as priority targets for further analysis, content tracking, or enforcement.

References

[1] https://circleid.com/posts/towards-a-generalised-threat-scoring-framework-for-prioritising-results-from-brand-monitoring-programmes

[2] https://interbrand.com/best-global-brands/

[3] https://circleid.com/posts/20230703-an-overview-of-the-concept-and-use-of-domain-name-entropy

This article was first published as a white paper on 3 July 2025 at:

https://www.iamstobbs.com/uploads/general/Exploring-a-domain-scoring-system-with-tricky-brands-e-book.pdf

