by David Barnett and Frankie Cheung
EXECUTIVE SUMMARY
A key objective in brand monitoring applications is the ability to rank findings in order of importance, or potential threat level, with a view to identifying priority targets for further analysis, content tracking, or enforcement. This can be particularly important when monitoring for domains containing brand names which may be short or common words in their own right, and/or which frequently appear as sub-strings of other unrelated terms.
Our new study illustrates how a relatively simple 'domain risk scoring' approach can be used to effectively rank domains identified through broad searches. The approach analyses just the domain name itself and incorporates 'weightings' dependent on the context within the domain name where the brand reference appears, and on the presence of relevant and non-relevant keywords. As an extension of this idea, the scoring formulation could also take account of other inherent characteristics of the domain, such as TLD, MX record, or registrant, registrar or hosting-provider characteristics.
Furthermore, by combining this domain risk scoring approach with a 'content risk score' formulation, comprising an analysis of the content of any associated webpage, it is possible to carry out a deeper dive into the set of ranked results, to identify live content of potential interest, to serve as priority targets for further analysis, content tracking, or enforcement.
This article was first published on 3 July 2024 at:
https://www.iamstobbs.com/insights/exploring-a-domain-scoring-system-with-tricky-brands
* * * * *
WHITE PAPER
Introduction
A key objective in brand monitoring applications is the ability to rank findings in order of importance, or potential threat level, with a view to identifying priority targets for further analysis, content tracking, or enforcement[1]. This can be particularly important when monitoring for domains containing brand names which may be short or common words in their own right, and/or which frequently appear as sub-strings of other unrelated terms. A requirement for effective prioritisation arises from the fact that, for these types of 'tricky' (from a monitoring point of view) brand names, searches often generate large numbers of results - many of which are non-related 'false positives' - and it is often difficult to find the results of interest amongst the 'noise'.
For domain monitoring specifically, it is generally necessary to be able to apply an effective filtering and sorting approach even in the absence of any live site content - so as to be able to identify examples which may be 'weaponised' at a later date, which may be in use for other purposes such as for their e-mail functionality, or which may be candidates for acquisition or dispute. In these cases, the analysis therefore needs to take account of inherent features of the domain name itself, rather than necessarily considering the content of any associated webpage.
In this paper, we consider the following selection of short or common brand names (sometimes referred to as 'generic' terms, though not in the trademark-related sense of the word), taken from the list of the top 50 most valuable brands in 2024, as provided by Interbrand[2]. Each of these brands uses the .com domain featuring an exact match to its brand name as its primary website domain:
- Apple (#1, brand value: $488.9B)
- IBM (#19, brand value: $37.3B)
- SAP (#20, brand value: $36.8B)
- Visa (#32, brand value: $21.1B)
- UPS (#35, brand value: $20.0B)
- Intel (#37, brand value: $19.7B)
- GE ('General Electric') (#47, brand value: $17.1B)
- AXA (#48, brand value: $16.8B)
For simplicity, the study is based only on searches for gTLD domains (i.e. generic top-level domains, such as .com, .net, etc.) containing the brand names of interest, for which comprehensive datasets are available through the analysis of domain-name zone files.
Analysis
The scale of the landscape
Table 1 shows the total raw numbers of domain results returned in response to a search for each of the brand names in question.
Brand-name string | No. gTLD domains |
---|---|
apple | 84,556 |
ibm | 25,812 |
sap | 298,759 |
visa | 81,433 |
ups | 202,648 |
intel | 144,323 |
ge | 10,174,156 |
axa | 71,306 |
Table 1: Numbers of gTLD domains containing the names of each of the brands under consideration
Shown below, for each of the brands, is a sample of the domains returned in the raw data (actually each 5,000th, 10,000th, 25,000th, 50,000th or 1,000,000th result - depending on the numbers of results returned - when sorted into alphabetical order). These examples are intended to give an indication of the types of results picked up by the searches, the extent to which the vast majority of these names reference the brand name in an unrelated context, and the corresponding importance of employing an effective filtering and scoring process to prioritise the results and identify the significant findings.
apple:
- 0000apple[.]com
- apple-company[.]com
- applelens[.]app
- appleshears[.]com
- applewaysuzuki[.]com
- dapplevalleyfarm[.]com
- kappler[.]group
- pineapplepods[.]com
- thehalfeatenapplecompany[.]com
ibm:
- 001lisn9itt6q5db7uc3ibms2273h9ha[.]shop
- aribm78ifopp3r5k0k9ffk3dt5v241v9[.]org
- hibmw[.]com
- ibmtivoli[.]com
- om13g2l2rlg8ibmsvf82hcj2coiu8pco[.]com
- vetoj10th2ibmcu9j2kr774uo89kk7l8[.]store
sap:
- 000webhosapp[.]com
- chapaexpresstrainsapa[.]com
- hesapliarsa[.]online
- myhsapps[.]com
- sapia-ai[.]com
- supersapphirewins[.]com
visa:
- 007ukvisas[.]com
- childvisas[.]com
- expeditevisavietnam[.]org
- invisalign-nuernberg[.]info
- nohasslevisaonline[.]com
- swedenvisa-palestinianterritory[.]com
- visabahis717[.]com
- visamastersindia[.]com
- winwinvisa[.]com
ups:
- 003oijaviqr4a39nubups221f8nav1lr[.]com
- funeralstartups[.]com
- p707nllm9pg5igjdf2h1rh581ups0d7p[.]net
- tmallups[.]com
- www-trackingshipment-ups[.]com
intel:
- 007intel[.]com
- customsintel[.]com
- intelibud[.]com
- intelligentbusinessoperations[.]com
- intelspect[.]com
- saintelizabethcalgary[.]com
ge (due to the size of the dataset, showing only examples from the set of .com results, for simplicity):
- 0-100agency[.]com
- brridgewaybentech[.]com
- eventgeneratorsandcooling[.]com
- getgeniusmindai[.]com
- klargehtdas[.]com
- numberonepage[.]com
- significantsurgery[.]com
- vo44digms6age13m2nob75e8743cldqr[.]com
axa:
- 00axax[.]com
- axarn[.]com
- energietaxatie[.]com
- laxallstars[.]net
- mydaxa[.]com
- relaxationexpert[.]com
- taxandglobal[.]com
- xaxasp10[.]xyz
Domain scoring
In order to filter and prioritise the results, we propose as a first step the use of a 'domain risk score', based just on characteristics of the domain name itself, and intended to provide a measure of the degree of relevance of the brand name in question. Note that, in more comprehensive scoring systems, it may be appropriate to consider additional domain features which can provide an overall indication of the potential level of risk, such as the TLD (top-level domain, or domain-name extension), presence of any MX (mail exchange) record, or registrant, registrar or hosting-provider characteristics, but these are not considered in this study.
The proposed basic algorithm incorporates a number of components to the final calculated domain risk score, as follows:
- A weighting dependent on where, within the domain name, the brand reference appears, from the following options (from greatest to least significance):
  - Instances where the SLD (the second-level domain name, or the part of the name to the left of the dot) consists of the brand name only
  - Instances where the brand name appears at the start of the domain name
  - Instances where the brand name appears at the end of the domain name
  - Instances where the brand name appears elsewhere within the domain name
- A greater weighting for instances where the brand reference is 'hyphen-separated' from the rest of the domain name (e.g. apple-abc.com would be deemed to be more brand-relevant than appleabc.com, as there is less scope for confusion with cases where the brand name can appear as a sub-string of other terms)
- An optional greater weighting for domain names containing a more highly-distinctive variant of the basic brand name
- Additional score increments for each reference to any of a pre-determined set of 'relevance keywords' (which can relate to the industry area of the brand in question, or to specific issue types of interest - e.g. phishing-related keywords) (i.e. 'positive filtering'); these keywords can also be assigned into 'tiers', with higher-relevance keywords being assigned larger scores
- A negative score increment for any reference to a known non-relevant 'false positive' (e.g. for 'axa', we may choose to explicitly downweight any domain containing the term 'relaxation') (i.e. 'negative filtering')
- An additional score component reflecting the proportion of the domain name (in terms of the number of characters) consisting only of the brand name or any of the relevance keywords (numerical digits are disregarded in this calculation), with the rationale being that a domain is more likely to be of interest if it consists only of the brand name plus relevant keywords
Examples of these sorts of keywords (and as also used in the analysis which follows) are shown in Table 2.
Brand name | Relevance keywords ('tier 1') | Relevance keywords ('tier 2') | Known 'false positives' |
---|---|---|---|
apple | iphone, ipad, airpod, mac, watch, vision | shop, store, login, verif, secur, auth | grapple, pineapple |
ibm | business, cloud, storage, analy, network, secur, software | | |
sap | business, cloud, tech, software, enterprise, system, data | | sapien, sapporo, whatsapp |
visa | credit, payment, contactless, commerc | login, verif, secur, auth | immigrat, travel, citizen, asylum, passport, student, invisalign, envisage, televisa, visable |
ups | deliver, track, ship, logistic, courier, parcel, packag | login, verif, secur, auth | pop(-)ups, start(-)ups, catch(-)ups, check(-)ups, grown(-)ups, hook(-)ups, set(-)ups, touch(-)ups, clean(-)ups, upscale, upside, upstate, upshot, upsanddowns, groups |
intel | core, xeon, business, process, system, device, driver, network, software | | intelligen, inteligen, intellect |
ge | general(-)electric, aerospace, healthcare, vernova, tech | | |
axa | insur, quot, claim, business, health, multicar, breakdown, bank, banq, fund, financ | login, verif, secur, auth | relaxation, taxation, taxadv, taxacc, laxative |
Table 2: Groups of keywords used in the scoring algorithm for each of the brands
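As a rough illustration of how the components listed above might be combined, the following Python sketch scores a domain using the 'apple' keyword groups from Table 2. The weightings used in the study are not published, so every numerical weight here (`POSITION_WEIGHTS`, `HYPHEN_BONUS`, `TIER_SCORES`, `FALSE_POSITIVE_PENALTY`, `PROPORTION_WEIGHT`) is an assumption chosen purely for illustration, as are the function names, the treatment of hyphens as disregarded characters, and the simple `sld.tld` parsing. The resulting scores are not expected to reproduce those in the tables that follow, and the optional weighting for highly-distinctive brand variants is omitted.

```python
from dataclasses import dataclass

# Illustrative weights only: the study does not state the values it used.
POSITION_WEIGHTS = {"exact": 500, "start": 100, "end": 75, "elsewhere": 25}
HYPHEN_BONUS = 50              # brand appears as a hyphen-separated token
TIER_SCORES = {1: 100, 2: 50}  # increment per matched relevance keyword
FALSE_POSITIVE_PENALTY = 200   # applied once if a known false positive is present
PROPORTION_WEIGHT = 100        # multiplied by the covered-character proportion


@dataclass(frozen=True)
class BrandConfig:
    brand: str
    tier1: tuple = ()
    tier2: tuple = ()
    false_positives: tuple = ()


# Keyword groups for 'apple', taken from Table 2.
APPLE = BrandConfig(
    brand="apple",
    tier1=("iphone", "ipad", "airpod", "mac", "watch", "vision"),
    tier2=("shop", "store", "login", "verif", "secur", "auth"),
    false_positives=("grapple", "pineapple"),
)


def position_class(sld: str, brand: str) -> str:
    """Classify where the brand string sits within the SLD."""
    if sld == brand:
        return "exact"
    if sld.startswith(brand):
        return "start"
    if sld.endswith(brand):
        return "end"
    return "elsewhere"


def covered_proportion(sld: str, terms) -> float:
    """Share of characters (digits and hyphens disregarded) covered by any term."""
    mask = [False] * len(sld)
    for term in terms:
        start = sld.find(term)
        while start != -1:
            for i in range(start, start + len(term)):
                mask[i] = True
            start = sld.find(term, start + 1)
    counted = [i for i, ch in enumerate(sld) if not (ch.isdigit() or ch == "-")]
    if not counted:
        return 0.0
    return sum(mask[i] for i in counted) / len(counted)


def domain_risk_score(domain: str, cfg: BrandConfig) -> float:
    # Accept 'defanged' names such as apple-abc[.]com; assumes the form sld.tld.
    sld = domain.lower().replace("[.]", ".").split(".")[0]
    if cfg.brand not in sld:
        return 0.0
    score = float(POSITION_WEIGHTS[position_class(sld, cfg.brand)])
    tokens = sld.split("-")
    if len(tokens) > 1 and cfg.brand in tokens:
        score += HYPHEN_BONUS
    matched = [cfg.brand]
    for tier, keywords in ((1, cfg.tier1), (2, cfg.tier2)):
        for keyword in keywords:
            if keyword in sld:
                score += TIER_SCORES[tier]
                matched.append(keyword)
    if any(fp in sld for fp in cfg.false_positives):
        score -= FALSE_POSITIVE_PENALTY
    score += PROPORTION_WEIGHT * covered_proportion(sld, matched)
    return score


if __name__ == "__main__":
    for name in ("apple-watch-store[.]com", "applelens[.]app", "pineapplepods[.]com"):
        print(name, round(domain_risk_score(name, APPLE), 1))
```

With these assumed weights, apple-watch-store[.]com ranks above applelens[.]app, and pineapplepods[.]com is pushed to the bottom by the false-positive penalty. In practice the weights would need to be tuned against reviewed data for each brand.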
Following the analysis, the top-scored (i.e. potentially most relevant) domains for each of the brands are shown in Tables 3a-h (excluding, for the purposes of illustration, any examples where the SLD is an exact match to the brand name, as these are easily identified in any case and will always be worthy of review). Please note also that, in a live service, any domains under official ownership would likely be excluded on the basis of the use of a whitelist or analysis of registrant / registrar information (not carried out in this study).
Domain name | Domain risk score |
---|---|
applemacipadipodstore[.]com | 637 |
applemacipodipadstore[.]com | 637 |
apple-iphone-ipad-ipod[.]com | 611 |
apple-store-iphone[.]com | 603 |
apple-watch-store[.]com | 601 |
apple-watch-store[.]online | 601 |
apple-ipad-shop[.]com | 598 |
appleiphoneipad[.]com | 575 |
apple-macbook-shop[.]com | 558 |
apple-loginsecure[.]com | 551 |
Table 3a: Top ten results by domain risk score for 'apple'
Domain name | Domain risk score |
---|---|
ibm-business-analytics[.]com | 620 |
cloudsecurity-ibm[.]com | 603 |
ibmbusinesscloud[.]com | 575 |
ibmcloudsoftware[.]com | 575 |
ibmcloudstorage[.]com | 575 |
ibmcloudsecurity[.]com | 538 |
ibmsmartbusinesscloud[.]biz | 527 |
ibmsmartbusinesscloud[.]com | 527 |
ibmsmartbusinesscloud[.]info | 527 |
ibmsmartbusinesscloud[.]net | 527 |
ibmsmartbusinesscloud[.]org | 527 |
Table 3b: Top ten results by domain risk score for 'ibm'
Domain name | Domain risk score |
---|---|
business-data-cloud-sap[.]com | 774 |
sapbusinessdatacloud[.]com | 725 |
sapbusiness1cloud[.]com | 575 |
sapbusinesscloud[.]com | 575 |
sapenterprisecloud[.]com | 575 |
sapbusinessonesoftware[.]com | 548 |
sapbusinessonesoftware[.]info | 548 |
sapbusinessonesoftware[.]net | 548 |
sapbusinessonesoftware[.]org | 548 |
sapbusinessonecloud[.]com | 543 |
sapbusinessonecloud[.]net | 543 |
Table 3c: Top ten results by domain risk score for 'sap'
Domain name | Domain risk score |
---|---|
visasecurepayment[.]com | 513 |
visa-payment[.]com | 508 |
visa-credit[.]com | 507 |
visa-credit[.]net | 507 |
visa-credits[.]com | 492 |
unsecured-visa-credit-cards[.]net | 486 |
payment-visa[.]com | 483 |
securvisapayment[.]com | 475 |
unsecured-visa-credit-card-applications[.]com | 452 |
visa-secure[.]com | 439 |
visa-secure[.]net | 439 |
visa-verify[.]com | 439 |
Table 3d: Top ten results by domain risk score for 'visa'
Domain name | Domain risk score |
---|---|
track-package-rescheduled-delivery-ups[.]com | 711 |
ups-parceltrack[.]org | 662 |
ups-delivery-parcel[.]com | 643 |
ups-packagedelivery[.]com | 643 |
ups-parceltracking[.]com | 631 |
deliveryparcel-ups[.]com | 628 |
trackpackage-ups[.]com | 625 |
ups-deliverytrack-mt[.]com | 625 |
ups-parcell-tracker[.]com | 622 |
ups-parcel-tracking[.]com | 622 |
Table 3e: Top ten results by domain risk score for 'ups'
Domain name | Domain risk score |
---|---|
intelsoftwarenetwork[.]com | 575 |
intellcoresystems[.]com | 551 |
intellicore-network[.]info | 543 |
intellicorenetworks[.]com | 543 |
intellicoresystems[.]com | 542 |
intel-business[.]com | 511 |
intel-software[.]com | 511 |
intel-network[.]com | 510 |
intel-system[.]com | 508 |
intel-core[.]com | 505 |
intel-core[.]net | 505 |
intel-core[.]vip | 505 |
Table 3f: Top ten results by domain risk score for 'intel'
Domain name | Domain risk score |
---|---|
ge-healthcaretech[.]com | 663 |
ge-healthcaretechinc[.]com | 635 |
ge-healthcaretechnology[.]com | 614 |
ge-healthcaretechnologies[.]com | 603 |
ge-healthcaretechnologiesinc[.]net | 589 |
gehealthcaretech[.]com | 575 |
gentechhealthcare[.]com | 563 |
gehealthcaretechinc[.]com | 543 |
geltechealthcare[.]com | 525 |
gentechealthcare[.]com | 525 |
Table 3g: Top ten results by domain risk score for 'ge' (noting that only one example of a result for each unique SLD is shown, due to the large numbers of repeated SLDs in the overall dataset)
Domain name | Domain risk score |
---|---|
axa-banque-finance[.]com | 619 |
axa-health-insurance-slovakia[.]online | 572 |
axafinancebank[.]com | 561 |
axafinancialbank[.]com | 538 |
axainsurancebreakdown[.]com | 537 |
axabusinessinsurance[.]biz | 535 |
axabusinessinsurance[.]com | 535 |
axabusinessinsurance[.]info | 535 |
axabusinessinsurance[.]mobi | 535 |
axabusinessinsurance[.]net | 535 |
Table 3h: Top ten results by domain risk score for 'axa'
The examples show that the algorithm performs well in terms of separating out the relevant examples from the large numbers of other results in the datasets.
Extensions to the approach
i. Use of domain name (SLD) entropy
In some cases, particularly for the shortest brand names, the datasets may include instances of long, pseudo-random domain names (such as several of the examples shown above for 'ibm'). These types of domains are often associated with automated registrations intended for fraudulent use[3], but will not, in general, be associated with the brand whose name may be contained within them, and should ideally be disregarded (or downweighted) in the types of scoring algorithms described in this paper.
However, the analysis shows that the basic scoring algorithm outlined in this study often does not effectively distinguish between domain names of this type and other 'better' brand matches (i.e. more relevant results). For example, for 'ibm', all of the following examples are assigned a domain risk score of 125:
- i03204i8ua9n7sle6sdrm81mri0cibm9[.]net
- i0f29td98etcibm9gkc29v4v9j39p5qm[.]top
- i0lf99g2t8u92p7ibmlj4tvav849jp1n[.]tel
- i216r5835dfoush9k1iibm4vpd669dka[.]top
- i2ai773hvhan7l9001its1r8ibm84cav[.]site
- i5u5127lfb56iibmj4bfa4c0m03mjt4f[.]motorcycles
- i66t9t7vau8of667ibmlho120ab32bbv[.]online
- i6crr5n3uqmmsmm5it7874uj099ibm87[.]com
- i6t27emh03o11cfm6oa0r73l2ibmeki4[.]com
- i76kcibmcu3310epn6lagpp292ivj114[.]top
- i967pv1vn4outp103ibm7673diirjp3c[.]top
- i98fmfibmcnjnbg2999s402pgem2258s[.]top
- ia0n95j263iibmvue4s8v6lhll753a7s[.]com
- ibmclassroom[.]com
- ibmclienteng[.]com
- ibmcognitive[.]com
- ibmcognitive[.]org
- ibmcomputers[.]asia
- ibmcomputers[.]com
- ibmcomputing[.]com
- ibmcomputing[.]info
- ibmconfigure[.]com
- ibmcontracts[.]com
- ibmcorporate[.]com
This mix of result types is due to the wide range of factors contributing to the final overall calculated score, including the fact that many of the long, random domain names consist of large numbers of digits, meaning that once these are disregarded, the 'ibm' string accounts for a significant proportion of the remainder of the domain name.
One possible way to account for the differences between these types of domain name would be to make use of the concept of domain name (SLD) entropy; essentially, a measure of the length and randomness of the domain name. The categorisation can be achieved by applying a 'correction' to the calculated domain risk score, by reducing it by a factor which is dependent on the domain name entropy (and, in the proposed methodology, applying this only to domains with entropy values above a certain threshold, since some of the visually-relevant domain names are found to have 'mid-range' entropy values).
As a case study, we can consider the dataset of 1,504 'ibm' domains in total which are assigned a (raw) domain risk score of 125. The entropy values of these domains sit in a range between 1.4591 (mibmim[.]com) and 4.6350 (fhibmd96pt2or8745a2cltjj1gu4373e[.]com), with (by inspection) most of the 'random' domain names found to have entropy values above around 3.5 (which can be termed the entropy 'threshold', Hth). As such, a suitable reduction factor (R) for the domain risk score can be defined in terms of the domain entropy (H) as:
R = exp(H) / exp(Hth) (for H > Hth)
R = 1 (otherwise)
such that the adjusted final domain risk score (Dadj) can be defined in terms of the 'raw' score (D) as:
Dadj = D / R
The form of this reduction factor function is as shown in Figure 1.
Figure 1: One possible formulation of a domain risk score reduction factor (R) to be used to 'down-score' high entropy (H) domains
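The correction can be sketched in a few lines of Python. The paper does not state exactly how the entropy is calculated; the sketch assumes Shannon entropy in bits per character over the characters of the SLD, which reproduces the value of 1.4591 quoted above for mibmim[.]com. The function names are illustrative, and the threshold of 3.5 is the value suggested in the text.

```python
import math
from collections import Counter

H_THRESHOLD = 3.5  # entropy threshold (Hth) suggested in the text


def sld_entropy(sld: str) -> float:
    """Shannon entropy, in bits per character, of the SLD's character distribution.

    Assumption: this is the entropy definition used; it is not stated in the paper.
    """
    n = len(sld)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(sld).values())


def reduction_factor(h: float, h_th: float = H_THRESHOLD) -> float:
    """R = exp(H) / exp(Hth) for H > Hth, otherwise 1."""
    return math.exp(h - h_th) if h > h_th else 1.0


def adjusted_score(raw_score: float, sld: str, h_th: float = H_THRESHOLD) -> float:
    """Dadj = D / R."""
    return raw_score / reduction_factor(sld_entropy(sld), h_th)


if __name__ == "__main__":
    print(round(sld_entropy("mibmim"), 4))               # 1.4591
    print(round(adjusted_score(125, "ibmclassroom")))    # 125 (below threshold)
    print(round(adjusted_score(125, "ibmathsworld")))    # 115
    print(round(adjusted_score(125, "slmibm8epk1u84")))  # 122
```

Because the factor is exactly 1 at the threshold and grows exponentially above it, domains just over the threshold are only lightly penalised, while long random strings are reduced much more strongly.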
This correction results in a 'down-scoring' of 642 of the 1,504 domains. As an illustration, Table 4 gives a selection of those domains whose final scores have been reduced as a result of the entropy-based correction (actually alphabetically the first domain assigned to each adjusted score value), showing that the correction does, as intended, preferentially affect the 'random' domain names.
Domain name | Adjusted domain risk score (Dadj) |
---|---|
slmibm8epk1u84[.]com | 122 |
ibmpower4saphana[.]com | 116 |
ibmathsworld[.]com | 115 |
4659sib4645muss5msgf5buribm8e1u6[.]top | 103 |
shibmaro323429fjcnrin43rncnr43rvnfuiru448484848484[.]com | 94 |
97bj94io2ibm42fppgqi7n274f73fsji[.]how | 92 |
647d75i7co7mj7b0l7vmmqr4ibmd06qu[.]net | 90 |
kidmi5b71tibm7b0ff560iuq1c5ir477[.]pro | 89 |
ibmknaj5mcimebc3iaqchinml5l3h6ve[.]top | 88 |
413b3ibmlu6n9iq4qa4441cancjm96ap[.]com | 87 |
br74cgrf32bbsgr3rsc7s6ofs94nqibm[.]com | 86 |
v0q7bbtnb0atnqj68l0au0age1a7bibm[.]com | 85 |
Table 4: Examples of domains whose risk scores have been reduced by the entropy-based correction factor
ii. Content risk scoring
As an extension to the above ideas, it is also possible to calculate a second score, based on an analysis of the content of any associated webpage (if present), as an alternative or secondary means of sorting the results (working on the basis that, other factors being equal, a domain will be of greater concern if it is associated with live, brand-related content).
To this end, we can formulate a 'content risk score', which itself is composed of two constituent components:
- A 'brand content score', reflecting the number and prominence of mentions of the brand name on the page
- An additional metric reflecting the numbers of unique relevance keywords mentioned at least once anywhere in the page content (to take account of the fact that, for common / 'generic' brand terms, the brand name could be mentioned in contexts unrelated to the brand in question, but the presence of relevance keywords will indicate that the subject matter of the page is relevant to the brand in question).
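A minimal sketch of how these two components might be computed is given below. The paper does not specify how prominence is measured or how the components are weighted, so the `Page` structure, the `PROMINENCE_WEIGHTS` and `KEYWORD_WEIGHT` values, the whole-word matching of the brand name, and the simple addition of the two components are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass

# Assumed weights: mentions in more prominent page elements count for more.
PROMINENCE_WEIGHTS = {"title": 100, "headings": 50, "body": 10}
KEYWORD_WEIGHT = 200  # assumed increment per unique relevance keyword found


@dataclass
class Page:
    """Pre-extracted text of a webpage (HTML fetching and parsing not shown)."""
    title: str = ""
    headings: str = ""
    body: str = ""


def brand_content_score(page: Page, brand: str) -> int:
    """Weighted count of whole-word brand mentions, by page element."""
    pattern = re.compile(rf"\b{re.escape(brand)}\b", re.IGNORECASE)
    return sum(
        weight * len(pattern.findall(getattr(page, part)))
        for part, weight in PROMINENCE_WEIGHTS.items()
    )


def keyword_score(page: Page, keywords) -> int:
    """Score based on how many unique relevance keywords appear at least once."""
    text = " ".join((page.title, page.headings, page.body)).lower()
    return KEYWORD_WEIGHT * sum(1 for kw in set(keywords) if kw in text)


def content_risk_score(page: Page, brand: str, keywords) -> int:
    return brand_content_score(page, brand) + keyword_score(page, keywords)


if __name__ == "__main__":
    page = Page(
        title="Apple Watch Repair",
        headings="Apple Watch",
        body="We repair any apple watch and iphone. Pineapple juice is free.",
    )
    print(content_risk_score(page, "apple", ["iphone", "watch", "mac"]))
```

Matching the brand only as a whole word is one way to avoid counting unrelated strings such as 'pineapple' in page text, although it would also miss run-together forms, so a production system may need a more nuanced rule.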
As an illustration, we can calculate the content risk scores for sets of the domains assigned the highest domain risk scores for each of the brands in question, as a means of identifying live content of interest (e.g. potential infringements).
As an example, Table 5 shows the website details for the examples achieving the highest content risk scores (i.e. potentially the most relevant websites) out of a set of those results for 'apple' which themselves receive the highest domain risk scores (>300) (i.e. potentially the most relevant domain names).
Domain name | Domain risk score | Website page title | Content risk score |
---|---|---|---|
applewatchjournal.net | 343 | Apple Watch Journal - Apple Watch (アップルウォッチ)の総合情報サイト。 Apple Watchの基本的な使い方やWatch アプリの情報、最新ニュースを紹介します! | 4,640 |
applelivingstore.com | 300 | Apple Living Store – Vente des iphones neufs et occasions | 4,150 |
appleministore.com | 318 | Shop the Latest Apple Products iPhones; MacBooks; iPads & More | 4,020 |
applewatchcast.com | 368 | The Apple WatchCast Podcast - A podcast dedicated to the Apple Watch | 2,700 |
applewatchrepairz.com | 343 | Get Professional Apple Watch Repair Services \| Fast & Affordable | 2,300 |
apple-mac.support | 503 | Apple Spezialist im Rheinland \| Mac Support für Kunden in Köln, Bonn, Düsseldorf und Aachen \| KLEUTGENS.IT | 2,226 |
apple.watch | 500 | Apple Watch - Apple | 2,150 |
apple-wholesale-stores.com | 366 | Apple Wholesale Store - Buy Apple Products at the Best Price | 2,145 |
Table 5: Website details for the examples achieving the highest content risk scores for Apple
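The two-stage process (shortlist by domain risk score, then rank by content risk score) can be sketched as follows, using the values from Table 5 as input. The `Finding` structure and `shortlist` function are illustrative names. Since Table 5 includes a domain scored at exactly 300, the sketch assumes that the domain-score threshold is inclusive.

```python
from typing import NamedTuple


class Finding(NamedTuple):
    domain: str
    domain_score: int
    content_score: int


# Values taken from Table 5, listed here in descending domain-score order.
TABLE_5 = [
    Finding("apple-mac.support", 503, 2226),
    Finding("apple.watch", 500, 2150),
    Finding("applewatchcast.com", 368, 2700),
    Finding("apple-wholesale-stores.com", 366, 2145),
    Finding("applewatchjournal.net", 343, 4640),
    Finding("applewatchrepairz.com", 343, 2300),
    Finding("appleministore.com", 318, 4020),
    Finding("applelivingstore.com", 300, 4150),
]


def shortlist(findings, min_domain_score: int = 300):
    """Keep findings at or above the domain-score threshold (assumed inclusive),
    then rank them by content risk score, highest first."""
    kept = [f for f in findings if f.domain_score >= min_domain_score]
    return sorted(kept, key=lambda f: f.content_score, reverse=True)


if __name__ == "__main__":
    for finding in shortlist(TABLE_5):
        print(f"{finding.domain:30} {finding.domain_score:4} {finding.content_score:6}")
```

Run as written, this reproduces the ordering of Table 5; raising `min_domain_score` narrows the review set to the domains with the strongest name-based signal before content is considered.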
On this basis, Figure 2 shows one example of an identified live website of interest (i.e. brand-related content / potential brand infringement) for each of the brands under consideration.
Figure 2: Examples of an identified live website of interest for each of the brands under consideration: apple-wholesale-stores[.]com, ibmisecurity[.]com, sap-system[.]com, paymentvisanet[.]com, ups17track[.]com, intel-processor[.]com, gevernovatechtraining[.]com, axainsurancebali[.]com
Conclusion
The studies presented in this paper have illustrated how a relatively simple 'domain risk scoring' approach can be used to effectively rank domains identified through broad searches, so as to identify names of particular interest, even in cases where the brand name used as the basis of the search may be a very short or common term.
As an extension of this idea, the scoring formulation could also take account of other inherent characteristics of the domain, such as TLD, MX record, or registrant, registrar or hosting-provider characteristics, many of which can themselves be assigned into 'tiers' of potential threat level, and scored accordingly.
Finally, by combining this domain risk scoring approach with a 'content risk score' formulation, it is possible to carry out a deeper dive into the set of ranked results, to identify live content of potential interest, to serve as priority targets for further analysis, content tracking, or enforcement.
References
[2] https://interbrand.com/best-global-brands/
[3] https://circleid.com/posts/20230703-an-overview-of-the-concept-and-use-of-domain-name-entropy
This article was first published as a white paper on 3 July 2024 at: