Monday, 23 June 2025

'Notorious IP Addresses' and initial steps towards the formulation of an overall threat score for websites

Part of the 'Patterns in Brand Monitoring: Brand Protection Data is Beautiful' series of articles[1,2,3]

EXECUTIVE SUMMARY

The ability to rank results according to the level of threat they pose is a key component of many brand protection services, offering the ability to identify priority targets for further analysis, content tracking or enforcement.

Metrics providing the capability to rank results in this way are often based on a range of website characteristics, including webpage content and technical configuration features of the associated domain name. 

This study considers the case of website hosting characteristics, with a specific focus on the IP address at which the website is hosted. The IP address - and, by extension, the associated hosting service provider - can be an important factor to consider, as hosting providers can vary in their level of attractiveness to infringers, based on a range of factors such as their compliance with takedown requests.

The analysis presented in this case utilises data from an IP address 'blacklist', compiled using insights from any identified association of the address in question with content found to be infringing, such as use for spamming or malware distribution. The construction of one possible formulation of a threat-score component based on the host IP-address is then presented, calculated using the proximity of the IP address in question to other addresses explicitly included in the blacklist. The algorithm is based on the subdivision of IP-address space into 'netblocks', across which patterns in the frequency of infringing content are also considered.

This article was first published on 19 June 2025 at:

https://www.iamstobbs.com/insights/notorious-ip-addresses-and-initial-steps-towards-the-formulation-of-an-overall-threat-score-for-websites

* * * * *

WHITE PAPER

Introduction

The identification of those website characteristics which are disproportionately associated with infringing or illicit activity is a key element of the process of threat quantification for brand-protection findings. Quantifying the level of (current or potential future) risk of an identified domain name or website has a number of benefits, including the ability to identify priority targets for further analysis, content tracking or enforcement, amongst potentially large datasets[4,5]. The same datasets can also provide insights into 'clusters' of associated findings[6], and into the likelihood of enforcement success in any particular case.

Two such characteristics are the registrar (the organisation through which the associated domain name was registered) and the hosting provider (the organisation supplying the physical infrastructure - i.e. a webserver - on which a site is hosted) of any given website in question. There are many possible reasons why specific registrars and hosting providers may be disproportionately popular with infringers, including differences in their inherent level of cooperativeness ('compliance') to notifications of IP infringements, their speed of response[7], geographic region(s) of operation, and so on. 

In the case of registrars, various research organisations collate information on those providers which are found to be more commonly associated with infringing activity of a range of types. The most meaningful datasets are those in which the numbers of infringements are expressed as a proportion of the total number of domains registered, to give an overall 'trust' or 'reputation' score for each registrar, rather than just considering the raw numbers of infringing sites (since this will skew the data towards those registrars which are simply more popular generally). One such dataset is that provided by Spamhaus[8], which (as of 29-Jan-2025) gives the top five 'low-trust' registrars (by quantitative 'bad reputation score') as 'Ultahost, Inc.', 'Domain International Services Limited', 'nicenic.net (ZhuHai NaiSiNiKe Information Technology)', '香港翼优有限公司' ('Hong Kong Wingyou Co., Ltd.') and 'Dnsgulf Pte. Ltd.'. Note that this list does not necessarily imply that the registrars in question are non-compliant with enforcement notices, although it has been noted that the frequent or repeated association of a registrar with infringing activity is often an indication of non-compliance[9]. Examples of non-compliant registrars are also discussed in forums between infringers looking for providers to use for their content[10].

Moreover, many brand protection service providers will have collated (in many cases, quantifiable) information on the compliance of individual registrars, based on their previous enforcement experience. This allows for the construction of a risk 'score' for each registrar, which can serve as an input into algorithms for quantifying the overall level of potential threat of any associated website.

Similar comments are also true of hosting providers. Indeed, some providers explicitly bill themselves as 'bulletproof' - implying a lack of compliance with enforcement notices - as a means of attracting business from providers of illicit content (Figure 1).

Figure 1: Examples of websites of self-proclaimed ‘bulletproof’ hosting providers (and/or registrars)

Other websites also exist to serve as resources for content producers looking for recommendations of non-compliant providers (Figure 2).

Figure 2: Website offering recommendations of 'bulletproof' hosting providers

As with 'high-risk' registrars, a number of resources collate information on infringing hosting providers, such as that provided, again, by Spamhaus[11].

In this study, we aim to collate information from a related dataset; namely a 'blacklist' of IP addresses which has been compiled based on reports of associated infringing activity of a variety of types, and from a range of sources. This analysis aims to identify any trends and patterns in the groups of high-risk IP addresses[12] - and, by extension, the hosting providers with which they are associated - as a means of establishing additional datasets which could be used as data inputs into algorithms for assessing overall website potential risk level (i.e. if a website is hosted on a high-risk IP address, it is potentially more likely to be associated with illicit activity). 

Analysis

The dataset used in this case is the IP address blacklist provided by Myip.ms[13], containing around 169,000 listings (0.0039% of the total possible IP-space)[14] as of 29-Jan-2025. The (IPv4) addresses are of the format xx.xx.xx.xx, where each 'xx' is a number between 0 and 255. In this study, we use the terminology 'netblock' to refer to a group of IP addresses with the same initial elements; a group of addresses of the form A.xx.xx.xx (with fixed 'A') would be a 'first-level netblock', A.B.xx.xx a 'second-level netblock' and A.B.C.xx a 'third-level'.
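The netblock terminology above can be made concrete with a short sketch. The function name `netblock` and the `level` parameter are illustrative assumptions rather than anything used in the original analysis; the final line simply re-derives the approximate share of IPv4 space quoted above.

```python
def netblock(ip: str, level: int) -> str:
    """Return the netblock label for an IPv4 address at level 1, 2 or 3.

    Level 1 gives A.xx.xx.xx, level 2 gives A.B.xx.xx, level 3 gives A.B.C.xx.
    """
    octets = ip.split(".")
    if len(octets) != 4 or not all(o.isdigit() and 0 <= int(o) <= 255 for o in octets):
        raise ValueError(f"not a dotted-quad IPv4 address: {ip!r}")
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2 or 3")
    # Keep the first `level` components and mask the remainder with 'xx'
    return ".".join(octets[:level] + ["xx"] * (4 - level))


# Roughly 169,000 listings as a percentage of all 256^4 possible IPv4 addresses
share_percent = 169_000 / 256**4 * 100
```

For example, `netblock("159.138.153.7", 2)` yields the second-level label `159.138.xx.xx`.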

The most obvious initial stage of analysis would simply be to consider the hosting provider and country associated with each of the IP addresses in the dataset. This 'granular' approach in some ways provides more meaningful information than any insights gained by grouping together the individual IP addresses into their respective netblocks, not least because there is not necessarily any reason to believe that all addresses in a particular netblock are associated with each other, or with a common hosting provider (although it is often the case that major providers may control entire netblocks). Nevertheless, a netblock-based analysis can provide some useful insights.

The most obvious observation is that the blacklisted IP addresses are not distributed evenly across IP-space; Figure 3 shows the total number of such addresses within each first-level netblock.

 

Figure 3: Total number of blacklisted IP addresses within each first-level netblock

The 10.xx.xx.xx, 11.xx.xx.xx and 127.xx.xx.xx netblocks, and all blocks from 224.xx.xx.xx onwards, do not contain any blacklisted addresses. The majority of these have special uses, however, such as the 127 netblock, which is reserved for (internal) loopback addresses[15], and the 10 netblock, reserved for private networks[16].

Next, we consider the IP address 'universe' grouped into second-level netblocks (A.B.xx.xx), of which there are 65,536 (i.e. 256²) in total. Using this framework, it is possible to determine how many blacklisted IP addresses appear in each block (which may provide valuable insights, working on the principle that those blocks more highly populated with blacklisted addresses could, all other factors being equal, be deemed 'higher risk' for any arbitrary associated other websites). This dataset is presented graphically in Figure 4.

Figure 4: Number of blacklisted IP addresses (out of a possible maximum of 65,536) in each second-level netblock - first-level address component ('A' in A.B.xx.xx) (from 0 to 255) shown across the horizontal axis; second-level address component ('B' in A.B.xx.xx) (from 0 to 255) shown down the vertical axis
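A minimal sketch of how such a grid of counts might be assembled is given below. It assumes the blacklist is available as an iterable of dotted-quad strings; the function names and the three sample addresses are made up for illustration and are not taken from the dataset.

```python
from collections import Counter


def second_level_counts(blacklist):
    """Count blacklisted addresses per (A, B) pair, i.e. per A.B.xx.xx netblock."""
    counts = Counter()
    for ip in blacklist:
        a, b, _, _ = (int(part) for part in ip.split("."))
        counts[(a, b)] += 1
    return counts


def as_grid(counts):
    """Lay the counts out as a 256 x 256 grid.

    Columns follow the first component (A, horizontal axis) and rows follow
    the second component (B, vertical axis), matching the figure layout.
    """
    grid = [[0] * 256 for _ in range(256)]
    for (a, b), n in counts.items():
        grid[b][a] = n
    return grid


# Made-up sample addresses, for illustration only
sample = ["114.119.1.1", "114.119.200.9", "104.21.3.4"]
counts = second_level_counts(sample)
grid = as_grid(counts)
top = counts.most_common(1)  # the single most populated netblock in the sample
```

The same `Counter` supports a ranking of the most populated blocks via `most_common(n)`.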

The next associated insight is the identification of those individual netblocks which are associated with the greatest numbers of infringements (i.e. the greatest numbers of blacklisted addresses) - i.e. the brightest 'hotspots' in the figure - of which the top ten are shown in Table 1.

Netblock         No. blacklisted addresses
114.119.xx.xx    2,353
159.138.xx.xx    1,606
104.21.xx.xx     1,253
172.67.xx.xx     986
47.251.xx.xx     882
17.241.xx.xx     670
183.130.xx.xx    658
54.36.xx.xx      604
3.145.xx.xx      507
116.2.xx.xx      496

Table 1: Top ten 'high-risk' (second-level) netblocks, by the numbers of blacklisted IP addresses (out of a possible maximum of 65,536)

In addition to the individual 'hotspot' netblocks, a number of vertical 'stripes' are present in the visualisation, indicating groups of adjacent netblocks, all (or many) of which are associated with unusually high levels of infringements (and also more strongly suggesting meaningful links between them). Examples include the first-level netblocks 45.xx.xx.xx (3,279 blacklisted addresses out of a possible 16.7 million (i.e. 256³)), 103.xx.xx.xx (4,210 addresses), 185.xx.xx.xx (4,867 addresses) (red arrows in Figure 5), and the groups of second-level blocks 35.159.xx.xx to 35.243.xx.xx (962 addresses out of a possible 5.5 million), 54.144.xx.xx to 54.246.xx.xx (1,492 / 6.8 million), and 91.190.xx.xx to 91.247.xx.xx (1,026 / 3.8 million) (blue arrows in Figure 5).

Figure 5: Version of Figure 4, but with arrows highlighting 'clusters' of adjacent netblocks all (or many) of which contain high numbers of blacklisted IP addresses

Note that it would also be possible to carry out a similar analysis looking at the third-level netblocks, in which the equivalent of Figure 4 would be a visualisation as a 3D cube. Although a graphical analysis is somewhat more cumbersome, it is a relatively simple matter to identify the highest-risk netblocks (by the number of blacklisted IP addresses - out of a possible maximum of 256 - contained within them), in a way analogous to Table 1. This analysis is shown in Table 2, for all third-level netblocks in which at least half the IP addresses are blacklisted.

Netblock           No. blacklisted addresses
54.36.148.xx       256
195.154.122.xx     255
95.108.213.xx      254
213.180.203.xx     253
87.250.224.xx      252
110.52.235.xx      252
17.241.219.xx      226
17.241.75.xx       225
17.241.227.xx      219
5.255.231.xx       209
113.123.0.xx       200
52.167.144.xx      195
54.36.150.xx       192
20.171.206.xx      179
117.45.252.xx      175
95.163.255.xx      160
185.220.101.xx     159
195.154.123.xx     146
159.138.152.xx     142
13.66.139.xx       141
52.233.106.xx      136
159.138.128.xx     133
159.138.156.xx     132
159.138.157.xx     132
64.124.8.xx        130
159.138.154.xx     129
159.138.155.xx     129
159.138.153.xx     128

Table 2: Top 'high-risk' (third-level) netblocks, by the numbers of blacklisted IP addresses (out of a possible maximum of 256)
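The selection of third-level netblocks in which at least half of the 256 addresses are blacklisted could be sketched as follows. The function name, the `min_count` parameter and the sample addresses (drawn from ranges reserved for documentation) are assumptions made purely for illustration.

```python
from collections import Counter


def dense_third_level_blocks(blacklist, min_count=128):
    """Return (netblock, count) pairs for A.B.C.xx blocks with at least
    `min_count` blacklisted addresses, most populated first."""
    # De-duplicate first so that repeated listings are not double-counted
    counts = Counter(ip.rsplit(".", 1)[0] for ip in set(blacklist))
    dense = [(f"{prefix}.xx", n) for prefix, n in counts.items() if n >= min_count]
    return sorted(dense, key=lambda item: (-item[1], item[0]))


# Illustrative input only: 130 addresses in one block, 10 in another
sample = [f"192.0.2.{i}" for i in range(130)] + [f"198.51.100.{i}" for i in range(10)]
result = dense_third_level_blocks(sample)
```

With this sample, only the `192.0.2.xx` block clears the threshold of 128.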

From this data, we can start to see the possible basis of a threat scoring algorithm for arbitrary websites. A website hosted on an IP address which is actually blacklisted is highly likely to be of concern; however, one hosted in one of the netblocks featured in Table 2 (for example) will still warrant careful analysis (i.e. being assigned a 'secondary' level of concern), even if it is hosted on one of the specific IP addresses within the block which is not explicitly blacklisted.

The next stage of analysis is to consider the hosting provider and geographical country of location associated with each of the blacklisted addresses, in order to determine which providers and countries appear most commonly in the dataset and might therefore be deemed 'highest risk'. This information is generally readily available via an IP address 'whois' look-up in each case. 

From this dataset, some patterns are immediately apparent. For example, the set of 'high-risk' addresses between 35.159.xx.xx and 35.243.xx.xx are all associated with Amazon Technologies Inc. and Google LLC as hosting providers, and the 54.144.xx.xx to 54.246.xx.xx set is also under the management of Amazon Technologies Inc.

As a simple way of post-processing the data (so as to extract a 'clean' version of the name of the hosting provider in each case, and to group together - at a high level - IP addresses pertaining to what is actually the same provider), the name of the hosting provider as given by the whois look-up in each case is truncated at the first instance of a comma - so that, for example, 'GoDaddy.com' and 'GoDaddy.com, LLC' are both treated as the same entity. This yields a set of 8,757 distinct entities.
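The comma-truncation step can be expressed in a couple of lines. The function name is an illustrative assumption, and the stripping of surrounding whitespace is an added convenience rather than something stated in the methodology.

```python
def clean_provider_name(raw: str) -> str:
    """Truncate a whois provider name at the first comma.

    Note: this does not merge variants that differ before the comma
    (e.g. 'GoDaddy' versus 'GoDaddy.com'); those need separate handling.
    """
    return raw.split(",", 1)[0].strip()


names = ["GoDaddy.com", "GoDaddy.com, LLC"]
distinct = {clean_provider_name(name) for name in names}
```

Both variants collapse to the single entity `GoDaddy.com`.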

It is worth pointing out that the whois look-ups required specifically to identify the hosting providers of the IP addresses in question (noting that the original IP address blacklist dataset itself also gives country information) failed in 51,696 cases, which may cause the statistics to be 'skewed' somewhat, if the failures are disproportionately associated with particular providers or geographic regions.

From the available data, Tables 3 and 4 show the top (i.e. 'highest risk') hosting providers and countries most commonly associated with the IP addresses in the blacklist.

Hosting provider                       No. blacklisted IP addresses
Amazon Technologies Inc.               14,030
CHINANET jiangsu province network      7,285
Cloudflare                             3,317
Microsoft Corporation                  2,817
Amazon.com                             2,796
Huawei-Cloud-SG                        2,526
DigitalOcean                           2,329
HostPapa                               2,157
Alibaba Cloud LLC                      1,971
CHINANET SHANDONG PROVINCE NETWORK     1,869
CHINANET Jiangxi province network      1,619
Google LLC                             1,584
Huawei HongKong Clouds                 1,538
CHINANET Anhui province network        1,382
CHINANET Guangdong province network    1,222
PSINet                                 1,206
PT TELKOM INDONESIA                    1,070
CHINANET-ZJ Zhongxin node network      873
CHINANET henan province network        821
Apple Inc.                             756

Table 3: Top (i.e. 'highest risk') hosting providers represented in the IP address blacklist (where data available)

Host country         No. blacklisted IP addresses
US (USA)             53,373
CN (China)           27,189
RU (Russia)          9,669
SG (Singapore)       6,099
DE (Germany)         4,734
ID (Indonesia)       4,264
BR (Brazil)          4,125
GB (UK)              3,607
IN (India)           3,557
FR (France)          2,450
VN (Vietnam)         2,140
UA (Ukraine)         2,065
PL (Poland)          2,061
BD (Bangladesh)      2,012
CA (Canada)          1,989
TH (Thailand)        1,886
NL (Netherlands)     1,604
RO (Romania)         1,558
RS (Serbia)          1,468
ZA (South Africa)    1,407

Table 4: Top (i.e. 'highest risk') host countries represented in the IP address blacklist (where data available)

In order to take a more granular view, it is possible to convert each IP address to a city-level location (and an associated latitude / longitude reference) through a process called 'geolocation', for which a number of standard tools are available[17]. From this analysis, we can also extract the top 'high risk' city locations for hosting blacklisted content (Table 5). 

Host city                No. blacklisted IP addresses
Shanghai, CN             19,981
Columbus, US             7,061
Ashburn, US              6,088
Singapore, SG            4,270
San Francisco, US        3,553
Moscow, RU               3,287
Hong Kong, HK            3,077
Los Angeles, US          3,063
Jiaxing, CN              2,714
Frankfurt am Main, DE    2,497
San Jose, US             2,478
Seattle, US              2,127
New York City, US        1,822
Amsterdam, NL            1,687
Buffalo, US              1,674
London, GB               1,669
Jakarta, ID              1,571
Paris, FR                1,431
Tokyo, JP                1,318
Dallas, US               1,297

Table 5: Top (i.e. 'highest risk') city host locations represented in the IP address blacklist (where data available)

Following on from the above, it is also possible to construct a 'heat map' to visualise the host locations of the blacklisted IP addresses (essentially, aggregating together the geolocation information into grid squares, and shading them according to the number of blacklisted IP addresses within each square). This visualisation is shown in Figures 6 and 7 (where each grid square covers 1° of latitude / longitude).

Figure 6: Global heat map showing the host locations of the blacklisted IP addresses (shading denotes the number of addresses hosted within each grid square)

Figure 7: Detailed views of Figure 6 - top to bottom: Americas; Europe and Middle East; Asia
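The aggregation into 1° grid squares that underlies such a heat map might be sketched as below. It assumes geolocation has already produced (latitude, longitude) pairs; the function names and the approximate sample coordinates are illustrative only.

```python
import math
from collections import Counter


def grid_square(lat: float, lon: float, size: float = 1.0):
    """Return the south-west corner of the `size`-degree square containing the point."""
    return (math.floor(lat / size) * size, math.floor(lon / size) * size)


def heatmap_counts(points, size: float = 1.0):
    """Count how many points fall within each grid square."""
    return Counter(grid_square(lat, lon, size) for lat, lon in points)


# Approximate, illustrative coordinates (two near Shanghai, one near Columbus)
points = [(31.23, 121.47), (31.90, 121.10), (39.96, -82.99)]
cells = heatmap_counts(points)
```

Each key in `cells` identifies one square, and the count gives the shading intensity for that square.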

Whilst the numbers presented in this study are meaningful in their own right (in terms of reflecting where (and with whom) the blacklisted IP addresses are hosted - i.e. the 'dark spots' on the heat map in Figures 6 and 7), they do reflect both the locations of the infringements and the locations where content is most commonly hosted generally. For example, if a particular hosting provider is generally very commonly used, it might not be unreasonable to expect that provider also to be associated with high volumes of infringements (even if the extent of abuse is not disproportionate). For a future piece of analysis, it may be instructive to compare the extent to which locations and hosting providers are associated with high levels of threat (i.e. numbers of blacklisted IP addresses) with the overall numbers of IP addresses associated with those same locations and hosting providers (e.g. the total numbers of IP addresses under their management), so as to get a more meaningful measure of rate of association with infringing activity (i.e. a 'reputation' score).
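As a rough illustration of why such normalisation matters, the sketch below ranks two hypothetical providers by rate rather than by raw count. All names and figures are placeholders invented for the example and do not come from the dataset used in this study.

```python
def abuse_rate(blacklisted: int, total_managed: int) -> float:
    """Blacklisted addresses as a fraction of all addresses under management."""
    if total_managed <= 0:
        raise ValueError("total_managed must be positive")
    return blacklisted / total_managed


# Placeholder figures for two hypothetical providers (not real data)
providers = {
    "Provider A": {"blacklisted": 10_000, "total": 50_000_000},
    "Provider B": {"blacklisted": 500, "total": 100_000},
}
rates = {name: abuse_rate(p["blacklisted"], p["total"]) for name, p in providers.items()}
ranked = sorted(rates, key=rates.get, reverse=True)  # highest rate first
```

Here the smaller provider ranks first by rate, despite having far fewer blacklisted addresses in absolute terms.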

Discussion: Steps towards a threat-scoring framework

The main application of this type of analysis is the determination of factors which are most commonly associated with infringing websites. Once these databases are in place, they can be used as inputs into overall algorithms to quantify the likely level of threat which may be posed by an arbitrary (perhaps newly-identified) website, even in cases where no live website content is yet present (in the cases of characteristics such as registrars and hosting providers and locations, which are inherent to the technical infrastructure of the domain name in question).

Looking at the case of hosting IP address as an example, it may also be appropriate to assign IP addresses, and IP address ranges, into threat 'tiers' (with associated threat-score components) based on the 'closeness' of their association with known infringing content. A host IP address which is actually blacklisted is likely to be associated with the highest level of potential threat, followed by a non-blacklisted IP address within a netblock which itself contains high numbers of blacklisted addresses. Lower tiers of threat may be appropriate for IP addresses in higher-level netblocks which are generally found to be associated with higher-than-average rates of abuse (such as those covered by the vertical 'stripes' in Figures 4 and 5). 

A fuller formulation of a threat-scoring framework along these lines may also be a topic for future research, but it is instructive to test an initial prototype version based on the characteristics (high-risk IP addresses, hosting providers and registrars[18]) discussed in this study. For this analysis, we consider a sample set of arbitrary domain names registered on a particular day, based on zone-file analysis[19].

For this dataset of around 11,000 domain names, whois look-ups were run to determine the host IP address, the associated hosting provider and the registrar in each case. For each of these three characteristics, a threat-score component (nominally between 0 and 100) was calculated (based on comparison with the datasets outlined in this study, pertaining to the frequency of each of these characteristics with infringing content) for each domain in question. Details of the methodology are given in Appendix A. 

These components were then aggregated together to yield an overall potential threat score for the domain; the simplest implementation of the threat score is that given by simply adding the three components together. In this case, this yields 398 jointly top-scored domains, all with a score of 171, all of which are hosted on an IP address which is explicitly blacklisted (score component = 100), with the dominant remaining component of the score being a contribution of 70, caused by the fact that the sites in question are hosted with Amazon Technologies Inc., which appears extensively in the IP address blacklist. However, the score for this provider is probably artificially rather too high, appearing as an artefact of the fact that Amazon is a very popular hosting provider generally, and highlighting the requirement for some kind of normalisation according to the total number of websites / IP addresses under management. 
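The aggregation step can be sketched as a weighted sum, in which equal weights reproduce the simple addition described above. The function name and the `weights` parameter are assumptions for illustration, and the component values in the example are chosen to resemble the case discussed rather than taken from any specific domain.

```python
def overall_threat_score(ip_score, provider_score, registrar_score,
                         weights=(1.0, 1.0, 1.0)):
    """Combine the three components; the default weights give a plain sum."""
    w_ip, w_provider, w_registrar = weights
    return w_ip * ip_score + w_provider * provider_score + w_registrar * registrar_score


# Blacklisted host IP (100) on a heavily represented provider (70.15),
# with no registrar contribution, under the simple unweighted sum
simple = overall_threat_score(100, 70.15, 0)

# The same function with a hypothetical weighting that triples the registrar component
weighted = overall_threat_score(100, 70.15, 20, weights=(1.0, 1.0, 3.0))
```

Keeping the weights as a parameter makes it straightforward to experiment with different emphases without changing the component calculations.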

The use of a high-threat registrar (according to the Spamhaus list) is probably a better indication of potential infringing activity than either of the other two domain characteristics being considered, so it may be appropriate to increase (by some factor) the weight of the contribution of the registrar score to the overall threat score. In so doing, we gain (apparently) a much more meaningful assessment of the level of potential threat posed by the domains, as verified in many cases by an inspection of site content (where present), or a simple analysis of the types of keywords present in the domain names (suggesting that, even where no live site is yet present, several of the most highly-scored names are likely to have been registered for use in conjunction with the types of content which are frequently of concern, making them worthy of future monitoring). The most highly-scored domain registrations by this weighted threat score are shown in Table 6 (noting that some of these may, of course, actually be legitimate).

* 'NiceNIC' = NiceNIC International Group Co., Limited

Table 6: Top-ranked domains in the dataset by potential threat score

Indeed, of the top twenty domains with the greatest potential threat scores (shown in Table 6), several feature characteristics of particular concern:

  • Two (eflowtollsystem[.]com and kraken2trfqodidvlh4aa337cpzfrhdlfldhve5nf7ujhnmwr7instad[.]com*) generate browser warning pages advising of 'dangerous' content
  • Some are blocked from viewing in certain geographic locations
  • Some resolve or re-direct to apparently innocuous content, but which may also be a means of 'masking' infringing content, which might only be visible at certain times or from certain locations (i.e. 'geoblocking')[20]
  • Some pertain to content which is commonly associated with scams or other types of abuse, such as blockchain technology or cryptocurrency (e.g. claim-pinlink[.]com - re-directs to claims-realios[.]net/main*, proposai-soniclabs[.]com*, resasfinance[.]com*)
  • Others are soliciting for the input of personal details and may be impersonating trusted brands (e.g. 1298245[.]com*)
  • Of the domains which do not resolve to live content (or where the content is not visible as of the date of analysis), several have domain names which are highly suggestive of suspicious or fraudulent use (e.g. unlock-e-trade[.]com, netbotrade[.]com, contactlloydsonline[.]com, secure-coinb[.]com)

Those examples marked with an asterisk are shown in Figure 8.

Figure 8: Examples of live site content of potential concern hosted on domains listed in Table 6.

In cases where this type of threat-scoring approach is applied to sets of domain registrations pertaining to a specific brand (or other issue of interest), the ranking is likely to offer an efficient way of determining which of the names in the dataset are most worthy of initial prioritised analysis or enforcement.

A final point to note is that insights regarding the geographical focuses of infringing activity, as presented in this study, can also help inform wider policies on intellectual property protection, such as identifying key territories in which additional trade mark protection would be advisable.

Appendix A: Methodology for calculating the prototype threat score components

i. Score component based on host IP address / third-level netblock

If the host IP address is explicitly one of the blacklisted addresses, it is automatically assigned a score of 100. If this is not the case, but if the IP address appears in the same third-level netblock as at least one blacklisted address, the score component is calculated as the ratio between the number of blacklisted addresses within the netblock, and 256 (i.e. the total number of possible addresses in the block), multiplied by 100. 

For example, if a domain was found to be hosted on a non-blacklisted IP address in the 159.138.153.xx netblock (which contains 128 blacklisted addresses in total), the threat score component is calculated as (128 / 256) × 100 = 50.
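A minimal sketch of this component is given below. The function names are illustrative, the score of zero for a netblock with no blacklisted addresses is an assumption implied by (rather than stated in) the description, and the placeholder blacklist simply supplies 128 addresses in the block to mirror the worked example.

```python
def third_level_prefix(ip: str) -> str:
    """Return the A.B.C part of a dotted-quad IPv4 address."""
    return ip.rsplit(".", 1)[0]


def ip_score_component(ip: str, blacklist) -> float:
    """Score 100 for a blacklisted address; otherwise the percentage of the
    256 addresses in the same third-level netblock that are blacklisted."""
    blacklist = set(blacklist)
    if ip in blacklist:
        return 100.0
    prefix = third_level_prefix(ip)
    in_block = sum(1 for listed in blacklist if third_level_prefix(listed) == prefix)
    # Assumption: a block with no blacklisted addresses contributes a score of 0
    return in_block / 256 * 100


# Placeholder set: 128 addresses in 159.138.153.xx (the actual listed addresses are not given)
placeholder_blacklist = {f"159.138.153.{i}" for i in range(128)}
score = ip_score_component("159.138.153.200", placeholder_blacklist)
```

The non-blacklisted address `159.138.153.200` receives a score of 50, matching the calculation above.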

ii. Score component based on hosting provider

The score component assigned to each hosting provider is based on the frequency of association of each provider with blacklisted IP addresses contained within the dataset utilised in this study. The individual providers thereby fall into a range between 0 and 14,030 blacklisted addresses (Amazon Technologies Inc.). The score component assigned to a website associated with any given hosting provider is calculated as the ratio between the number of blacklisted addresses and (arbitrarily) 20,000, multiplied by 100 (giving a final value between 0 and 70.15). 

Note that the score as defined is therefore unnormalised relative to the total number of IP addresses under management for that hosting provider.

N.B. It is generally necessary to apply an element of data 'cleansing' before carrying out the matching of hosting provider names (and also to aggregate together entries in the blacklist, as necessary), as the same provider may be referenced differently by distinct whois look-ups - e.g. GoDaddy may variably be referenced as 'GoDaddy', 'GoDaddy.com', 'GoDaddy.com, LLC', etc.
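The hosting-provider component can be sketched as follows, using two of the counts from Table 3. The function name and the `scale` parameter name are illustrative; the dictionary is assumed to be keyed by already-cleansed provider names.

```python
def provider_score_component(provider: str, blacklisted_counts: dict,
                             scale: int = 20_000) -> float:
    """Blacklisted-address count for the provider, relative to the
    (arbitrary) scale of 20,000, expressed as a percentage."""
    return blacklisted_counts.get(provider, 0) / scale * 100


# Counts taken from Table 3; unknown providers default to zero
counts = {"Amazon Technologies Inc.": 14_030, "Cloudflare": 3_317}
amazon = provider_score_component("Amazon Technologies Inc.", counts)
unknown = provider_score_component("Some Other Host", counts)  # hypothetical name
```

As noted above, this value is not normalised by the total number of addresses each provider manages.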

iii. Score component based on registrar

The score component associated with the domain registrar is simply based on the dataset provided by Spamhaus, as referenced in the Introduction section of this study (which itself already incorporates an element of 'normalisation', based on the total numbers of domains under management).

The registrars in the Spamhaus database are assigned scores which sit in a range from 0 to 7.6. Wherever a registrar for a domain in the analysed dataset appears in the Spamhaus list, the associated threat-score component is calculated just as the Spamhaus score multiplied by ten (to give a score in the range from 0 to 76, for the dataset provided as of the date of analysis). 

N.B. (1) As for the hosting providers, it is generally necessary to apply an element of data 'cleansing' before matching the registrar given by a whois look-up against the contents of the Spamhaus list (rather than simply carrying out a straight look-up), since the same registrar may be referenced differently across the lists (e.g. 'CSL Computer Service Langenbach GmbH d/b/a joker.com' is referenced by Spamhaus as 'Joker (CSL Computer Service)').

N.B. (2) In cases where the same registrar appears more than once in the Spamhaus list with a variant name, but with different scores (e.g. 'Turkticaret.net Yazilim Hizmetleri Sanayi ve Ticaret A.S.' (0.0355) and 'Turkticaret.net Yazılım Hizmetleri Sanayi ve Ticaret A.Ş.' (0.0595)), the score used in this analysis is taken simply as the mean of the relevant Spamhaus scores (i.e. 0.0475 in the above case).  
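The registrar component, including the averaging of variant entries, might be sketched as below. The function name is an assumption, and the input is taken to be the list of Spamhaus scores already matched (after cleansing) to a single registrar.

```python
from statistics import mean


def registrar_score_component(spamhaus_scores) -> float:
    """Mean of the matched Spamhaus scores, multiplied by ten.

    An empty list (registrar not present in the list) gives a score of 0.
    """
    if not spamhaus_scores:
        return 0.0
    return mean(spamhaus_scores) * 10


# The two variant entries quoted above for the same registrar
variant_scores = [0.0355, 0.0595]
component = registrar_score_component(variant_scores)
```

For the two variant entries quoted above, the mean of 0.0475 gives a component of 0.475.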

References

[1] https://www.linkedin.com/pulse/brand-protection-data-beautiful-david-barnett-c66be/

[2] https://www.linkedin.com/pulse/brand-protection-data-still-beautiful-part-1-year-domains-barnett-juwhe/

[3] https://www.linkedin.com/pulse/brand-monitoring-data-niblet-5-law-firm-scam-websites-david-barnett-ap5de/

[4] 'Patterns in Brand Monitoring' (D.N. Barnett, Business Expert Press, 2025), Chapter 3: 'Brand content scoring'

[5] 'Patterns in Brand Monitoring' (D.N. Barnett, Business Expert Press, 2025), Chapter 5: 'Prioritization criteria for specific types of content'

[6] 'Patterns in Brand Monitoring' (D.N. Barnett, Business Expert Press, 2025), Chapter 6: 'Result clustering'

[7] https://brandsec.com.au/phishing-malicious-domain-names/

[8] https://www.spamhaus.org/reputation-statistics/registrars/domains/

[9] https://bfore.ai/navigating-domain-takedowns-with-non-cooperative-registrars/

[10] e.g. https://www.blackhatworld.com/seo/question-looking-for-bulletproof-domain-registrar.1412558/

[11] https://www.spamhaus.org/resource-hub/bulletproof-hosting/bulletproof-hosting-theres-a-new-kid-in-town/

[12] This study uses the terminology of 'notorious' IP addresses, in reference to the USTR 'Notorious Markets List', which is published annually to reflect those high-risk platforms most commonly associated with facilitating counterfeiting and piracy - see https://ustr.gov/about-us/policy-offices/press-office/press-releases/2025/january/ustr-releases-2024-review-notorious-markets-counterfeiting-and-piracy

[13] https://myip.ms/browse/blacklist/Blacklist_IP_Blacklist_IP_Addresses_Live_Database_Real-time

[14] Note that the analysis focuses only on the 'old format' (IPv4) IP addresses (of the form xx.xx.xx.xx, where each 'xx' is a number between 0 and 255) in the blacklist; this type of analysis is likely to become much more complex as IP address usage transitions to the IPv6 format (yyyy:yyyy:yyyy:yyyy:yyyy:yyyy:yyyy:yyyy, where each 'yyyy' is a four-digit hexadecimal (base-16) number) in the future.

[15] https://www.cronj.com/blog/localhost-127001-a-special-address/

[16] https://en.wikipedia.org/wiki/List_of_assigned_/8_IPv4_address_blocks

[17] In this study, we utilise the Python-based library tool 'IPinfo' (https://pypi.org/project/ipinfo/), which references the dataset available from IPinfo.io. In order to limit the number of geolocation look-ups required in this study, we perform a query only for one IP address in any range where (a) the second-level netblock, (b) the hosting provider name, and (c) the hosting provider country are all the same (i.e. for point (a), where the first- and second-level IP address components are the same). The latitude and longitude of the physical location of all other IP addresses in the range sharing these characteristics is then assumed to be identical.

[18] This approach allows us to incorporate more information than would be available by (say) just considering the host IP address as a means of identifying the associated hosting provider - this is appropriate given (for example) the fact that, just because an IP address under the management of a particular provider may be blacklisted, it does not necessarily follow that all of that provider's addresses will be higher risk.

[19] The dataset is taken from one day's worth of registrations (117,456) of .com domain names - a TLD for which registration information is generally readily available - as provided by the zonefiles.io website on 01-Feb-2025, relating to the previous day's registrations. The sample analysed in this study consists of every tenth domain name (when sorted into alphabetical order), yielding a dataset of 11,745 domains. Analysis of site content was carried out on 03-Feb-2025.

[20] https://circleid.com/posts/20220531-do-you-see-what-i-see-geotargeting-in-brand-infringements
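The look-up reduction described in note [17] can be illustrated with a minimal Python sketch. It assumes each record is a tuple of (IP address, provider name, provider country), and it takes the geolocation call as a parameter (`lookup`) so that any service could be plugged in; the function names, record layout and stub coordinates are all hypothetical and are not taken from the study itself.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Tuple[str, str, str]   # hypothetical layout: (ip, provider, country)
LatLon = Tuple[float, float]


def second_level_netblock(ip: str) -> str:
    """Return the first two components of an IPv4 address, e.g. '192.0'."""
    return ".".join(ip.split(".")[:2])


def geolocate_with_reuse(
    records: Iterable[Record],
    lookup: Callable[[str], LatLon],
) -> Dict[str, LatLon]:
    """Geolocate every IP, querying only once per (netblock, provider, country)."""
    cache: Dict[Tuple[str, str, str], LatLon] = {}
    result: Dict[str, LatLon] = {}
    for ip, provider, country in records:
        key = (second_level_netblock(ip), provider, country)
        if key not in cache:
            cache[key] = lookup(ip)   # only the first IP in each group is queried
        result[ip] = cache[key]       # the rest are assumed to share its location
    return result


# Demonstration with a stub look-up (placeholder coordinates, documentation IPs).
calls: List[str] = []


def stub_lookup(ip: str) -> LatLon:
    calls.append(ip)
    return (0.0, 0.0)


sample = [
    ("192.0.2.1", "HostA", "US"),
    ("192.0.77.9", "HostA", "US"),    # same netblock, provider and country
    ("198.51.100.4", "HostB", "DE"),
]
locations = geolocate_with_reuse(sample, stub_lookup)
print(len(calls), "look-ups for", len(locations), "addresses")
```

In practice the stub would be replaced by a call to whichever geolocation service is in use.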

This article was first published as a white paper on 19 June 2025 at:

https://www.iamstobbs.com/uploads/general/Notorious-IP-addresses-e-book.pdf

Tuesday, 17 June 2025

The new new-gTLDs - Part 2: A wider domain of language support

As the build-up to the second round of the new-gTLD programme[1] continues towards its launch in April 2026, we take a look at the issue of non-English-language support within the framework.

The programme itself began in 2012, involving the addition of large numbers of new domain-name extensions (generic top-level domains, or gTLDs) to the Internet infrastructure. It incorporated a process whereby individual entities were able to apply to run (i.e. act as registry organisation for) their own extension, thereby maintaining control over features such as whether the TLD would be reserved for their own use (e.g. as a 'dot-brand'[2]) or open for registrations by third parties. A second round of applications is set to begin in Q2 of next year.

As the new phase approaches, ICANN (the Internet Corporation for Assigned Names and Numbers, as the organisation overseeing the initiative) has announced a number of improvements to the way multiple languages are supported within the programme[3]. Key points include:

  • The addition of three further non-Latin scripts in which applied-for TLDs can be expressed.
  • The support of a greatly increased number of languages within the programme generally (to 380, up from 23 in the first round) - e.g. in areas surrounding technical provisions (such as compatibility with associated portals) and DNS infrastructure.
  • Improvements to the process for assessing string similarity and potential for name 'collisions' (i.e. the same name existing in different namespaces), including the incorporation of visual and phonetic similarity evaluations. The application process will also feature the ability to specify a 'second-choice' string (which must contain the first-choice version as a sub-string), for cases where the preferred version is deemed unacceptable, in addition to a more transparent process for resolving contentions.

These changes will give greater flexibility for entities operating in the non-English-speaking world, and are another area to consider for organisations assessing their place in the new-gTLD landscape (e.g. those considering an application for their own brand-specific extension). 

How might the implications of these changes manifest themselves as the second phase of the programme comes to fruition? One way of gaining possible insights in this area - e.g. regarding potential use-cases for foreign-language domains - is to consider the current state of the landscape, with an obvious source of relevant 'clean' data being the existing set of internationalised domain names (IDNs)[4] (i.e. those incorporating non-Latin characters). The IDNs specifically are a special subset of the full universe of non-English domain names generally, which do (of course) include large numbers of examples written just in Latin characters. The non-English Latin-character domain landscape already includes many whole gTLDs, such as (Chinese) .xin and .weibo; (French) .moi and .maison; (German) .jetzt, .kaufen, .reise and .versicherung; (Hindi) .desi; (Italian) .casa, .immo and .moda; (Portuguese) .bom; (Spanish) .abogado, .futbol, .gratis, .tienda, .uno and .viajes, all of which (as non-IDNs) do not require any 'special' technical infrastructure.

As of the current time there are, however, around 150 internationalised new-gTLDs (i.e. where the domain-name extension itself is written in (or includes characters written in) a non-Latin script) which have been delegated into the Internet infrastructure[5]. Domain names (or domain-name extensions) of this type are sometimes expressed in an 'encoded' format called Punycode (in which they are converted to a string written wholly in Latin characters, denoted by the characters 'xn--' at the start), which is how they are expressed in zone-file raw data, for example.
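As a brief illustration of the Punycode conversion described above, the following Python sketch encodes a non-Latin label to its 'xn--' form and back again. It uses Python's built-in 'idna' codec, which implements the older IDNA 2003 rules; registries may apply newer rules, so this should be read as an approximation rather than a definitive conversion tool.

```python
# Convert an internationalised label to its Punycode ('xn--') form and back,
# using the 'idna' codec from the Python standard library.
label = "在线"  # the Chinese-language 'online' extension

encoded = label.encode("idna").decode("ascii")    # ASCII form, as found in zone files
decoded = encoded.encode("ascii").decode("idna")  # back to the native script

print(encoded)  # an ASCII string beginning with 'xn--'
print(decoded)  # the original label
```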

Domain name zone files (containing lists of all registered domains across the extension in question, in addition to other technical configuration information) are published by (and are publicly available from) ICANN, for around 80 of these extensions, providing a ready source of data which can easily be analysed to identify trends and patterns in usage. Many of the remainder of the delegated IDN extensions are country-specific examples (e.g. comprising just a country name written in local language, or an abbreviation (analogous to familiar non-IDN ccTLDs such as .uk, .fr, .de, etc.)), or are extensions which may no longer be in active use.

For the approximately 80 IDN gTLDs for which zone-file data is available, it is possible to drill into the datasets to gain an overview of the specific domain names registered. Table 1 shows (for example) the most popular of these extensions by total numbers of registered domain names (for all IDN TLDs associated with more than 250 domains). 34 of the 80 extensions feature only five domains or fewer.

Table 1: The most popular IDN gTLDs currently, by numbers of registered domains (additional information mostly provided by Wikipedia[6])

It is noteworthy that the list of the most popular extensions is dominated by Chinese-language examples, mostly comprising generic terms (but with two brand-specific (China Mobile Communications Corporation, and CITIC Group) and two geographic (Guangdong and Foshan) extensions). 

As an illustrative example, it is informative to consider the list of 31,192 individual domains with the most popular of these extensions (.在线, a Chinese-language extension meaning 'online'). In the vast majority of these cases (29,811, or 95.6% of the total), the second-level domain (SLD) - i.e. the part of the domain name to the left of the dot - is also written in (wholly or partly) non-Latin script (Chinese in most cases), thereby comprising fully internationalised domain names. Of the remainder, 107 of the domain names consist purely of digits as the SLD (i.e. numeric domain names[7], which are often popular in markets such as China, where their use can circumvent language barriers and particular numbers may have specific cultural significance). The rest of the domains feature a range of different types of (Latin-character) terms in their SLD, including transliterations of Chinese words, a range of generic terms, and numerous brand references (including, presumably, both official and non-legitimate (potentially infringing) examples).
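The three-way breakdown of SLDs described above could be approximated along the following lines. This is a minimal sketch only: the function name and category labels are hypothetical, and the test for 'non-Latin' is simplified to 'contains any non-ASCII character', which would also catch accented Latin letters.

```python
from collections import Counter


def classify_sld(sld: str) -> str:
    """Assign an SLD to one of a few illustrative categories."""
    # Zone files store IDN labels in Punycode, so decode them first.
    if sld.lower().startswith("xn--"):
        try:
            sld = sld.encode("ascii").decode("idna")
        except UnicodeError:
            return "undecodable"
    # Simplifying assumption: any non-ASCII character counts as non-Latin script.
    if any(ord(ch) > 127 for ch in sld):
        return "non-ASCII"
    if sld.isdigit():
        return "numeric"
    return "Latin-character"


# Illustrative labels only (not taken from the zone file analysed in the article).
sample_slds = [
    "中文".encode("idna").decode("ascii"),  # a Punycode-encoded Chinese label
    "12345",
    "888",
    "example-shop",
]
breakdown = Counter(classify_sld(s) for s in sample_slds)
print(breakdown)
```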

Overall, therefore, the landscape of IDNs essentially just comprises a microcosm of the domain landscape generally, but offers additional options and flexibility for brands and consumers in markets where the local language includes non-Latin characters. In light of the increased support for a wider range of languages in the forthcoming phase of the new-gTLD programme, brand owners will once again want to consider the opportunities and risks within the ever-expanding online landscape.

References

[1] https://www.iamstobbs.com/opinion/the-new-new-gtlds

[2] https://www.iamstobbs.com/opinion/a-review-of-the-current-state-of-the-new-gtld-programme-dot-brands

[3] https://www.worldtrademarkreview.com/article/hundreds-of-languages-added-help-internationalise-new-gtlds-in-2026

[4] https://www.iamstobbs.com/opinion/idn-tifying-trends-insights-from-the-set-of-non-latin-domain-names

[5] https://data.iana.org/TLD/tlds-alpha-by-domain.txt

[6] https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

[7] https://www.iamstobbs.com/opinion/the-universe-of-numeric-domain-names

This article was first published on 17 June 2025 at:

https://www.iamstobbs.com/insights/the-new-new-gtlds-a-wider-domain-of-language-support

Thursday, 5 June 2025

An updated view of bad TLDs

A central component of the analysis of results identified through a brand monitoring programme is an assessment of the level of threat associated with each finding. Domain names specifically have a number of associated characteristics which can be used to quantify their potential threat, and much previous work has focused on the frequency of association of these characteristics with malicious content. This type of analysis can serve as a basis for the construction of algorithms to quantify the likely level of potential threat associated with any arbitrary identified website[1]. Threat scoring - as a method for prioritising findings - has a range of uses, including the identification of priority targets for further analysis, content tracking, or enforcement.

One such key feature is the domain-name extension (the top-level domain, or TLD - which includes examples such as .com or .xyz), with differing TLDs having wildly varying rates of popularity with infringers and other bad actors, due to a range of factors including registration costs, existence of IP protection programmes, and ease of enforcement. 

One previous study on the subject[2] compiled overall threat scores for a group of highly affected TLDs, based on an aggregation of data from other sources, including Spamhaus[3], Netcraft[4] and Palo Alto Networks[5], each of which encompassed insights relating to differing aspects of infringing behaviour. 

The release of Spamhaus' latest Domain Reputation Update report[6] - relating to classification of domains as malicious or suspicious based on a range of features, including association with spam, phishing, malware, ransomware and other fraudulent activities - provides an updated view of the highest-threat TLDs covering many relevant areas of infringement, and offers useful insights towards the construction of threat-scoring algorithms.

In considering the main insights, we focus primarily on the subset of Spamhaus' data concerning the rates of infringement within each TLD (i.e. the numbers of malicious domains as a proportion of the total number of domains across the TLD), rather than the absolute numbers, as this offers a more meaningful input into any potential threat-scoring metric. 

Amongst the main points to take away from the study are the facts that:

  • There are four TLDs (.xin, .qpon, .locker[7], and .lgbt) for which more than half of the total set of domains in the zone file are marked as malicious - with the top example (.xin, popular with a Chinese audience, and usually translated as the word for 'new'[8]) having over 82% of its domains marked as malicious - and 11 TLDs where more than one-in-three of the domains are malicious. All of the top twenty TLDs (by proportion of malicious domains) are mid-size extensions, each with total numbers of domains in a range between 11,000 and 182,000.
  • A significant proportion of the domains marked as malicious have been found to be associated with Chinese gambling sites. Previous research has also found some suggestion that these types of sites may operate in conjunction with other types of infringement, such as comprising material which is an alternative to the 'primary' content displayed to certain users, but which may only be visible in certain locations (i.e. geoblocked[9]) or at certain times or days. 
  • Eight of the top 20 highest-threat TLDs in the Spamhaus dataset are associated with a single registry, BinkyMoon LLC, whose business model involves the offer of highly competitive prices for domains, a tactic which can often drive high levels of abuse.
  • A large number of the .xin domains - including many of the malicious examples - have names beginning with 'com-', a strategy noted in a recent Stobbs study[10] as being one way to create compelling deceptive infringements. The vast majority of these are registered through Dominet (HK) Limited, the registrar which rebranded from Alibaba.com Singapore E-Commerce Private Limited, following the issuing of a compliance notice by ICANN in March 2024[11].
  • Amongst the set of malicious domains, use of brand-related terms appears to be decreasing - perhaps, in part, due to their relative ease of detection and enforcement through brand-protection programmes - in favour of more generic, industry- or subject-related terms.

As a follow-up piece of analysis, even a direct visual inspection of the raw domain data in the zone files of the highest-threat TLDs (as given in the Spamhaus study) makes some trends immediately apparent. First, the domains across the TLDs in question appear to include disproportionately high numbers of numeric domains[12] (i.e. where the SLD - or second-level domain name; the part to the left of the dot - contains digits only), perhaps indicating a popularity of such domains for infringing use. Second, the .xin file does indeed appear to contain large numbers of 'com-' examples, in addition to 'us-' domains, which may have a similar potential use-case (i.e. constructing URLs resembling legitimate domains on the .us domain extension).
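The two patterns noted above (numeric SLDs, and SLDs beginning 'com-' or 'us-') lend themselves to simple automated counting. The following Python sketch shows one possible approach, assuming the SLDs have already been extracted from a zone file into a list; the function name, output layout and sample labels are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

NUMERIC_SLD = re.compile(r"[0-9]+")
DECEPTIVE_PREFIXES = ("com-", "us-")  # the prefixes highlighted in the text


def summarise_slds(slds: Iterable[str]) -> dict:
    """Count numeric SLDs (by length) and SLDs with 'com-' / 'us-' prefixes."""
    total = 0
    numeric_by_length: Counter = Counter()
    prefix_counts: Counter = Counter()
    for sld in slds:
        total += 1
        sld = sld.lower()
        if NUMERIC_SLD.fullmatch(sld):          # digits only
            numeric_by_length[len(sld)] += 1
        for prefix in DECEPTIVE_PREFIXES:
            if sld.startswith(prefix):
                prefix_counts[prefix] += 1
    numeric_total = sum(numeric_by_length.values())
    return {
        "total": total,
        "numeric_share": numeric_total / total if total else 0.0,
        "numeric_by_length": dict(numeric_by_length),
        "prefix_counts": dict(prefix_counts),
    }


# Made-up labels for illustration only.
summary = summarise_slds(["12345", "88888", "com-toll123", "us-etc12", "example"])
print(summary)
```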

Results from a more detailed quantitative analysis of these and other relevant points, carried out using the full zone-file data, are outlined in Table 1 and Figure 1.

Table 1: Top-level statistics for the domains in Spamhaus' top ten highest-risk TLDs

Figure 1: Numbers of numeric domain names, by SLD length, for each of Spamhaus' top ten highest-risk TLDs

Some of the main insights from the overall analysis are as follows:

  • Across the top-ten highest-risk TLDs (from Spamhaus' data), numeric domain names account for a significant proportion of the total (53% of all domains across the full set of ten zone files). For five of the TLDs, numeric domains account for more than 70% of the total. The highest proportions are seen for .loan (87.66% numeric domain names in total) and .locker (84.56%). Of the numeric domain names, the vast majority are 4, 5 or 6 characters in length (4.1%, 70.7% and 23.9% of the total, respectively).
  • There are no obvious patterns in domain name entropy (a mathematical measure of the length and 'randomness' of a domain name) and, despite the fact that a significant proportion of the domains under consideration are (by definition) malicious, this is not reflected by a prevalence of particularly high entropy names[13] (as might typically be associated with automated registrations)[14]. Of the ten TLDs considered, the highest mean entropy was seen for the domains on .xin.
  • A prevalence of (potentially deceptive) 'com-' and 'us-' domains was seen only on .xin. For many of these domains, the portion of the SLD after 'com-' consisted of what appeared to be essentially random characters, but some specific use-patterns were identified, such as groups of domains with SLDs featuring specific keywords. These have a range of potential fraudulent use-cases, such as the construction of deceptive URLs resembling official sites for making payments (e.g. for road tolls) or accessing other financial or technical information.

Example groups included:

    • Toll-related domains - e.g. names of the form: com-highroadXXX, com-roadXXXXXX, com-tollXXXX or com-tollbillXXX
    • (Other) billing- or payment-related domains - e.g. com-lnvoiceXX [sic], and mis-spellings of com-payment and com-statementXX
    • Other classes of keywords: e.g. mis-spellings of com-lucky, com-passX, com-serviceXXX, com-shtmlXXXXX, com-ticketXXXX, com-updateXXX, and us-etcXX (perhaps in reference to the cryptocurrency Ethereum Classic)
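For readers unfamiliar with the entropy measure mentioned in the insights above, the following is a minimal Python sketch of one common formulation: the Shannon entropy of the character distribution within the SLD. The exact definition used in the cited work[14] is not reproduced here, so this should be treated as an assumption; the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(sld: str) -> float:
    """Shannon entropy, in bits per character, of the characters in an SLD."""
    if not sld:
        return 0.0
    n = len(sld)
    counts = Counter(sld)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def total_entropy(sld: str) -> float:
    """One possible length-sensitive variant: per-character entropy times length."""
    return shannon_entropy(sld) * len(sld)


for name in ("aaaa", "abab", "abcd"):
    print(name, shannon_entropy(name), total_entropy(name))
```

Under this formulation, a name made of a single repeated character scores zero, and names using many distinct characters score higher. The mean value across a zone file could then be compared between TLDs.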

Overall, these types of insights into (in this case) the highest-threat TLDs can greatly aid in the construction of metrics for prioritising brand monitoring results, and can thereby build efficiencies into the analysis and enforcement processes. The TLD of a webpage (or other online finding) is just one relevant characteristic, so this type of analysis would generally need to be combined with findings from other studies, in the construction of any overall threat-scoring algorithm. 

References

[1] https://circleid.com/posts/towards-a-generalised-threat-scoring-framework-for-prioritising-results-from-brand-monitoring-programmes

[2] https://circleid.com/posts/20230117-the-highest-threat-tlds-part-2

[3] https://www.spamhaus.org/statistics/tlds/

[4] https://trends.netcraft.com/cybercrime/tlds

[5] https://unit42.paloaltonetworks.com/top-level-domains-cybercrime/

[6] https://www.spamhaus.org/resource-hub/domain-reputation/domain-reputation-update-oct-2024-mar-2025/

[7] https://www.iamstobbs.com/opinion/some-more-new-domains-in-the-.locker

[8] https://tld-list.com/tld/xin

[9] https://circleid.com/posts/20220531-do-you-see-what-i-see-geotargeting-in-brand-infringements

[10] https://www.iamstobbs.com/insights/com-away-with-me-use-of-com-domains-in-the-construction-of-deceptive-url-like-hostnames

[11] https://www.icann.org/uploads/compliance_notice/attachment/1221/hedlund-to-chu-27mar24.pdf

[12] https://www.iamstobbs.com/opinion/the-universe-of-numeric-domain-names

[13] https://www.iamstobbs.com/opinion/un-.zip-ping-and-un-.box-ing-the-risks-associated-with-new-tlds

[14] https://circleid.com/posts/20230703-an-overview-of-the-concept-and-use-of-domain-name-entropy

This article was first published on 5 June 2025 at:

https://www.iamstobbs.com/insights/an-updated-view-of-bad-tlds

Tuesday, 3 June 2025

Using clustering and investigation techniques to connect and identify scam law-firm websites

During the Spring of 2025, a series of reports emerged of a campaign of scams, in which fraudsters have been impersonating trademark law firms and targeting brand owners with allegations of third-party attempts to register the brand name. In some cases, the fraudsters have been observed to create wholly fake organisations; in others, the names of legitimate firms have been used, in instances of brand impersonation (often with the use of a deceptive domain name similar to that of the real company).

In one such set of scams, for example, a group of company names and associated contact details as shown in Table 1 were found to have been used. All of these examples have explicitly been reported as scams on the website of the Solicitors Regulation Authority[1,2,3,4].

Table 1: Company names and credentials used in identified trademark law-firm scams

Whilst there is no reason to suppose that all of these examples are necessarily connected to each other as a single coordinated campaign, there are certainly connections between at least some of the scam sites. For example, four of the five entities use the same (apparently London-based) telephone number. This might be indicative of nothing more than a common website template being used across these scams, but if the telephone numbers are actively being used as a means of contact in the campaigns, there is a strong indication that at least these four are associated with the same underlying fraudsters.

These types of insights are key to the idea of clustering[5], which is a technique used in brand protection and other related areas of analysis, intended to establish links between infringements. An extension of this idea can also be combined with open-source intelligence ('OSINT') investigation techniques to identify other related examples. 

For example, a simple search for the shared phone number referenced above shows that it has also been used in yet another law-firm scam ('Wozi Law Firm', wozilawfirm[.]org)[6], impersonating a legitimate company of the same name (wozilaw[.]com). 

As of the date of analysis (08-May-2025), none of the scam websites in question were found to be live, although all were found previously to have been active for a sufficient length of time to have been indexed by Google. The associated abstracts provide some insights into the content which was formerly present (Figure 1), which can also serve as a basis for searches for other (potentially related) sites featuring similar content, or for further establishing similarities between the sites.

Figure 1: Example of the Google abstract for the scam site formerly present at cromfordlaws[.]com

In one of the above cases, a historical cached view of the site was available from the Archive.org website[7] (Figure 2).

Figure 2: Historical cached screenshots (from 24-Mar-2025) (courtesy of Archive.org) of the scam site formerly present at cromfordlawfirm[.]com

Carrying out a deeper analysis of the domains utilised in the various scams can also serve as a basis for establishing further clusters of associated examples. Table 2 shows the dates of registration, host-IP addresses, named registrants and registrars for the domains in question (for the most recent available whois records).

Table 2: Registration and configuration details for the domains utilised in the scams referenced above

It is also worth noting that, in some cases, the DomainTools website possesses cached historical views of the sites in question (Figure 3).

Figure 3: Historical cached screenshot (from 13-Mar-2025) (courtesy of DomainTools) of the scam site formerly present at wozilawfirm[.]org

In cases where the registrant details are redacted (which is very common following the introduction of GDPR), information such as the identity of the privacy service provider does not serve as a very effective means of clustering together related results. However, the other details (as shown in Table 2) can be more diagnostic. It is particularly noteworthy that two of the IP addresses appear twice in the table which, whilst not definitive of a link between the co-hosted sites, can be a useful indicator if other commonalities are also present. 

Reverse-IP-address look-ups reveal four further sites which are co-hosted with at least one of the examples shown in Table 2, which also feature references to 'law' in the domain name, and which share other characteristics such as registrar, name patterns and nearby registration dates (Table 3). It is highly likely that these comprise additional clusters of related scam sites and, whilst again none is currently active, a cached screenshot was available in one case (Figure 4).
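The clustering logic described above, in which sites are linked if they share a host IP address or a contact telephone number, can be sketched in Python as follows. This is an illustrative approach (a simple union-find over shared attribute values) and not necessarily the method used in the analysis; all of the data shown is fictitious, using reserved placeholder domains, documentation-range IP addresses and made-up telephone numbers.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Fictitious findings for illustration only.
findings = [
    {"domain": "alpha-law.example", "ip": "192.0.2.10", "phone": "+44 20 7946 0000"},
    {"domain": "beta-law.example", "ip": "192.0.2.10", "phone": "+44 20 7946 0001"},
    {"domain": "gamma-law.example", "ip": "198.51.100.7", "phone": "+44 20 7946 0001"},
    {"domain": "delta-law.example", "ip": "203.0.113.5", "phone": "+44 20 7946 0002"},
]


def cluster_by_shared_attributes(records: List[dict], keys: List[str]) -> List[Set[str]]:
    """Group domains linked, directly or indirectly, by any shared attribute value."""
    parent: Dict[str, str] = {r["domain"]: r["domain"] for r in records}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    first_seen: Dict[Tuple[str, str], str] = {}
    for r in records:
        for key in keys:
            value = r.get(key)
            if not value:
                continue
            marker = (key, value)
            if marker in first_seen:
                # Same IP or phone as an earlier domain: merge the two groups.
                parent[find(r["domain"])] = find(first_seen[marker])
            else:
                first_seen[marker] = r["domain"]

    clusters: Dict[str, Set[str]] = defaultdict(set)
    for domain in parent:
        clusters[find(domain)].add(domain)
    return list(clusters.values())


clusters = cluster_by_shared_attributes(findings, ["ip", "phone"])
for c in clusters:
    print(sorted(c))
```

As noted in the text, a shared IP address alone is not definitive of a link, so in practice each attribute might be weighted, or a link accepted only where two or more commonalities are present.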

Table 3: Additional potential scam domains sharing hosting characteristics with one or more examples from Table 2

Figure 4: Historical cached screenshot (from 30-Apr-2025) (courtesy of DomainTools) of the scam site formerly present at cndlawfirms[.]com

It is also possible to extend these ideas to much broader domain searches. For example, zone-file analysis reveals that there are over 47,000 domains with names ending in (for example) 'lawfirm(s)'. Considering just the .org domains (to provide an easily manageable dataset, and by analogy with the wozilawfirm[.]org example identified previously) and focusing just on the domains registered through Hostinger, PDR, or Namecheap as registrar since the start of 2025 (i.e. those most likely to be live and associated with the identified campaign(s)), we find 17 further candidate domains, of which seven resolve to additional live sites of potential concern. Two examples are shown in Figure 5 - both are registered via Hostinger Operations, UAB, hosted at the same IP address (34.120.137.41) and registered in a similar timeframe (on 19-Jan and 23-Jan respectively), and feature other suggestions of possible non-legitimacy (such as the use of placeholder content, webmail addresses, inconsistent contact details, etc.). The sites also have a broadly similar appearance, possibly suggestive of the use of a common website template.

Figure 5: Example of a 'mini-cluster' of two further sites of potential concern ('Nazakat' and 'Elite Law Firm')
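The filtering steps used in the broader search described above (name ending, TLD, registrar and registration date) can be sketched in Python as follows. The sketch assumes that the zone-file entries have already been enriched with registrar and creation-date information (e.g. from whois look-ups), since zone files do not themselves contain these fields. The record layout and function name are hypothetical, and the demonstration uses made-up names on the reserved '.example' TLD rather than real .org domains.

```python
import re
from datetime import date
from typing import Iterable, List

LAWFIRM_SUFFIX = re.compile(r"lawfirms?$")               # 'lawfirm' or 'lawfirms'
REGISTRARS_OF_INTEREST = ("hostinger", "pdr", "namecheap")
CUTOFF = date(2025, 1, 1)                                # start of 2025


def candidate_domains(records: Iterable[dict], tld: str = "org") -> List[str]:
    """Return domains matching the name, TLD, registrar and date criteria."""
    matches = []
    for r in records:
        sld, _, record_tld = r["domain"].lower().rpartition(".")
        if record_tld != tld:
            continue
        if not LAWFIRM_SUFFIX.search(sld):
            continue
        registrar = r["registrar"].lower()
        if not any(name in registrar for name in REGISTRARS_OF_INTEREST):
            continue
        if r["created"] < CUTOFF:
            continue
        matches.append(r["domain"])
    return matches


# Made-up records for illustration only.
records = [
    {"domain": "acmelawfirm.example", "registrar": "Hostinger Operations, UAB",
     "created": date(2025, 1, 19)},
    {"domain": "oldlawfirms.example", "registrar": "NameCheap, Inc.",
     "created": date(2024, 6, 1)},       # registered before the cutoff
    {"domain": "acmelawfirm.test", "registrar": "PDR Ltd.",
     "created": date(2025, 2, 1)},       # wrong TLD
    {"domain": "bakery.example", "registrar": "PDR Ltd.",
     "created": date(2025, 2, 1)},       # name does not match
]
print(candidate_domains(records, tld="example"))
```

The resulting shortlist would then still need manual or automated review of site content, as only some of the candidates identified in the study resolved to live sites of concern.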

The ideas presented in this article - namely the use of analysis and investigation techniques to connect infringements and identify additional related examples - are key to a highly significant area of brand protection analysis. These types of approaches can be used to provide early identification of sites which may pose a threat - potentially before they are utilised extensively and subsequently reported online - and can be built into the analysis and prioritisation approaches used for active monitoring services. This is also an area where AI-based analysis can provide a compelling addition to traditional analysis techniques, in the identification of key features from highly rich datasets.

References

[1] https://www.sra.org.uk/consumers/scam-alerts/2025/apr/cromford-law/

[2] https://www.sra.org.uk/consumers/scam-alerts/2025/apr/spectre-law/

[3] https://www.sra.org.uk/consumers/scam-alerts/2025/may/cnd-law-ltd-david-marks-nick-cross/

[4] https://www.sra.org.uk/consumers/scam-alerts/2025/mar/ballard-and-trademark-expressive/

[5] https://circleid.com/posts/braive-new-world-part-1-brand-protection-clustering-as-a-candidate-task-for-the-application-of-ai-capabilities

[6] https://regulationandcomplianceoffice.co.uk/raco-roundup-9/

[7] https://web.archive.org/web/20250324145542/https://www.cromfordlawfirm.com/

This article was first published on 3 June 2025 at:

https://www.iamstobbs.com/insights/using-clustering-and-investigation-techniques-to-connect-and-identify-scam-law-firm-websites
