Thursday, 29 February 2024

Health scam websites: identifying related domains using clustering techniques

Introduction

A recent study reported the emerging trend of the use of large numbers of cheap domain registrations to promote bogus health products such as 'keto'-related dietary supplements. Frequently, the referring sites spoof the appearance of popular news websites to build credibility, whilst actually presenting fake news articles or featuring false endorsements. In some related cases, the sites may instead direct users to legitimate product pages, and are intended to generate click-through revenue via affiliate schemes[1].

It is often the case that these scams make use of new-gTLDs (generic top-level domains) such as .sbs and .cloud, as domains on these extensions can be purchased at very low cost (typically around $1). In one reported cluster, the scam involved the registration of very large numbers of randomly-generated domain names between around March and June 2023. The domain names all began with 'keto', followed by a string of (typically six or seven) random alphabetical characters, followed by three random digits[2].

With this trend in mind, we investigate the use of domain 'clustering' analysis techniques to identify examples of the domains in question - these types of approach can potentially be used in 'real time' to alert brand owners of new registrations relating to similar scams as they arise.

Analysis

For the initial analysis, we consider the full set of .cloud domain names, based on zone-file data from ICANN’s CZDS service (as of 12-Jan-2024). Currently there are around 364,000 domains registered on this TLD. Of the .cloud domains, there are over 3,400 with names beginning 'keto'. Only around 1% of these resolve to any live website content as of the time of analysis (between 18 and 19-Jan-2024); a handful of these were found to be promoting dietary supplement products, though did not appear to constitute part of the 'fake news article' cluster referenced above.

In total, 1,611 of the ‘keto’ domains were found to end with a three-digit string. All of these were between 11 and 15 characters in total SLD length[3] and follow the format of the domain names reported above, and accordingly are all potentially part of the set of registrations carried out for the original scam campaign (although none resolved to live sites as of the date of analysis).
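The name format described above can be expressed as a simple pattern match. The following is a minimal sketch only, assuming the format is 'keto', then one or more lower-case letters, then exactly three digits, with a total SLD length of 11 to 15 characters; the names `KETO_PATTERN`, `sld` and `matches_keto_format` are hypothetical, and the list contains one domain from the cluster plus two made-up counter-examples.

```python
import re

# Assumed format: 'keto' + letters + exactly three trailing digits.
KETO_PATTERN = re.compile(r"^keto[a-z]+[0-9]{3}$")


def sld(domain: str) -> str:
    """Return the part of the domain name to the left of the first dot."""
    return domain.lower().split(".", 1)[0]


def matches_keto_format(domain: str) -> bool:
    """True if the SLD is 11-15 characters long and follows the assumed format."""
    name = sld(domain)
    return 11 <= len(name) <= 15 and KETO_PATTERN.match(name) is not None


# One domain from the cluster, plus two invented counter-examples.
candidates = ["ketoekezat333.cloud", "ketodiet.cloud", "example123.cloud"]
print([d for d in candidates if matches_keto_format(d)])
```

In practice the input would be the list of names extracted from the zone file, rather than a hard-coded list.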

Significantly, however, this set of 1,611 domains also share a number of other characteristics which could be used to 'cluster' them together (and potentially form part of an 'early warning' algorithm to alert to the appearance of new associated registrations of interest):

  • All domains sit within a narrow range of domain-name entropy values (a measure of the length and amount of 'randomness' of the domain-name string)[4,5], as a consequence of their similar name structures. The domains within the cluster all have entropy values between 2.6 and 3.9, compared with wider distributions for the set of all 'keto' .cloud domain names and for the total set of all .cloud domains (Figure 1).

Figure 1: Distributions of domain-name entropy values for the 1,611 domains in the cluster, compared with the set of all 'keto' .cloud domain names (right-hand axis) and the total set of all .cloud domains (left-hand axis)

  • The vast majority (98.3%) of the domains in the cluster were registered in a ten-day period between 19 and 28-May-2023 (Figure 2).

Figure 2: Daily numbers of registrations for the domains in the cluster

  • Almost all (99.7%) of the domains in the cluster were registered through the same registrar (a retail-grade provider previously noted as being popular with infringers)[6] and using the same privacy-protection service provider (1,606 instances).
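As an illustration of the first of these characteristics, the sketch below computes a character-level Shannon entropy for an SLD string and tests whether it falls within the range observed for the cluster. It assumes that the entropy measure used in this study is the standard Shannon entropy over character frequencies (in bits); the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(name: str) -> float:
    """Shannon entropy (in bits) of the character distribution of a string."""
    if not name:
        return 0.0
    counts = Counter(name)
    total = len(name)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def in_cluster_entropy_range(name: str, low: float = 2.6, high: float = 3.9) -> bool:
    """True if the entropy lies within the range reported for the cluster."""
    return low <= shannon_entropy(name) <= high


print(in_cluster_entropy_range("ketoekezat333"))
```

A filter of this kind would most plausibly be combined with the other shared characteristics (registration date, registrar and privacy-protection provider) rather than used on its own, since many unrelated names will share similar entropy values.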

For certain examples of domains within the cluster for which historical (cached) copies of the former website content are available, it is possible to verify that these sites were indeed previously associated with the 'fake news' health scams (Figure 3).

Figure 3: Cached screenshot (from DomainTools[7]) of content from an example of one of the sites in the cluster (ketoekezat333.cloud)

Within the .cloud zone-file dataset, we also find what appear to be other clusters of related domains, probably also groups of automated registrations associated with other former scams or affiliate revenue generation schemes. Examples include sets of domains of the form acv-ketomirrorXXX.cloud (100 instances), am-sXXX.cloud (71), guangyaoXXX.cloud (62) and videomediaseoXXX.cloud (443) (where 'XXX' are strings of three digits in each case). Widening out the search, we also find a similar cluster of domains on the .cyou extension (another new-gTLD often linked to infringing content)[8,9], comprising 90 domains with names of the form ketoAAAXXX.cyou (where 'AAA' is a string of alphabetical characters of variable lengths), all between 7 and 17 characters in SLD length and entropy values between 2.1 and 3.8. Other clusters feature slightly different patterns, such as a group of 200 ketoAAA.bar domains of 9 or 10 characters in SLD length, and 194 ketoXXXXmeto.buzz, 518 ketoXXXXdark.buzz and 178 ketoXXXXdark.today domains.
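One simple way of surfacing groups of this type is to reduce each domain name to a 'template' and count how many names share it. The sketch below is an assumption about how this might be done rather than the method used in the study: it replaces every digit with 'X' and counts the resulting templates. The function names are hypothetical, and the input names are invented examples constructed to follow the formats listed above.

```python
import re
from collections import Counter


def template(domain: str) -> str:
    """Replace each digit with 'X', e.g. am-s123.cloud -> am-sXXX.cloud."""
    return re.sub(r"[0-9]", "X", domain.lower())


def largest_templates(domains, min_size=2):
    """Return (template, count) pairs for templates shared by at least min_size names."""
    counts = Counter(template(d) for d in domains)
    return [(t, n) for t, n in counts.most_common() if n >= min_size]


# Invented names following formats reported above.
names = ["am-s123.cloud", "am-s456.cloud", "guangyao001.cloud", "example.cloud"]
print(largest_templates(names))
```

This approach only groups names which differ in their digits; patterns with variable alphabetical portions (such as ketoAAAXXX.cyou) would need a broader template, for example one which also collapses runs of letters after a fixed prefix.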

Conclusion

The findings illustrate how scammers make use of groups of low-cost domains with shared characteristics (often resulting from automated registrations), as part of high-volume scam campaigns. The large numbers are probably an indication that the scam sites are often short-lived (as borne out by the fact that none of the domains in the identified clusters currently resolve to live sites), presumably as part of an intention to generate revenue quickly and then deactivate the sites before they can be found and shut down through enforcement actions. However, the very fact that these domains do feature characteristics in common means that, when a new scam campaign is identified, it should be possible to design algorithms which, when combined with standard domain-monitoring techniques, can quickly identify additional associated registrations so that timely enforcements can be launched.

References

[1] https://www.techradar.com/pro/security/scammers-are-buying-up-cheap-domain-names-to-host-sites-that-sell-dodgy-health-products

[2] https://www.netcraft.com/blog/health-product-scam-campaigns-abusing-cheap-tlds/

[3] All domain lengths in this study are specified as that of the SLD (second-level domain) name; the part of the domain name to the left of the dot

[4] https://www.linkedin.com/pulse/investigating-use-domain-name-entropy-clustering-results-barnett/

[5] https://www.iamstobbs.com/opinion/the-randomest-domain-names-entropy-as-an-indicator-of-tld-threat-level

[6] https://www.iamstobbs.com/opinion/website-impersonations-a-case-study-of-domain-names-targeting-the-uk-government

[7] https://research.domaintools.com/research/screenshot-history/ketoekezat333.cloud/#1

[8] https://www.iamstobbs.com/opinion/expert-.watches-.new-.online-.website-.news-.lol-a-review-of-the-current-state-of-the-new-gtld-programme

[9] https://circleid.com/posts/20230117-the-highest-threat-tlds-part-2

This article was first published on 29 February 2024 at:

https://www.iamstobbs.com/opinion/health-scam-websites-identifying-related-domains-using-clustering-techniques

Tuesday, 27 February 2024

Website impersonations: a case study of domain names targeting the UK government

Large numbers of previous studies have considered the use of domains intended to cause confusion with the names of official and trusted sites, as a means of launching fraudulent attacks or creating other kinds of infringement. Indeed, this is one of the primary factors behind the importance of online brand protection.

In many cases, these types of attack are successful because of a lack of understanding by general Internet users of the structures of URLs (web addresses) and the significance of the contexts in which particular characters (such as dots / periods ('.'), slashes ('/') and hyphens ('-')) can be used. With this in mind, previous articles have considered a number of ways in which domain-based fraud can be conducted, including the use of subdomain names together with unbranded hyphenated domain names (e.g. a URL such as bankbrand.co.uk-account.help to impersonate bankbrand.co.uk/account-help)[1], use of subdomain names together with truncated brand domains (e.g. g.oogle.com in place of google.com)[2], hyphenated branded domain names intended to resemble official URLs broken across line-breaks - particularly when displayed on mobile devices (e.g. account.financebran-d.com)[3], and deceptive domains with names beginning 'www' or 'http', to create confusion with full URLs[4].

In this study, we consider domains intended to produce confusion with official UK government websites, which are generally hosted on .gov.uk domains, inspired by a screenshot of a light-hearted text-exchange with a scammer, posted on LinkedIn[5] (Figure 1). Government websites can be an attractive target for bad actors, as they are generally high-profile, well-used sites, and frequently incorporate a transactional (financial) element. Many members of the public will be familiar with these types of scam through communications received by e-mail or on mobile devices.

Figure 1: Posting including a screenshot of an SMS-based scam utilising a domain name impersonating a UK government website

Based on zone-file information available through ICANN’s CZDS service as of 12-Jan-2024, there are 321 registered gTLD domains with names containing 'uk?gov' or 'gov?uk' (where '?' is an optional extra character, in each case). Of these, 152 are considered to be 'high-risk' in terms of potential intent for confusion with official .gov.uk sites (i.e. neglecting nonsensical domain names or instances where the terms appear in other contexts such as the explicit string 'uk(-)government' or 'gov(...)ukraine').
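The search described above can be approximated with a regular expression. This is a minimal sketch, assuming that the optional extra character can be any single character; the names `GOV_UK_PATTERN` and `contains_gov_uk` are hypothetical. The matching SLDs shown are those listed under Figure 2, and 'example' is an invented non-matching name.

```python
import re

# 'uk' and 'gov' in either order, separated by at most one arbitrary character.
GOV_UK_PATTERN = re.compile(r"uk.?gov|gov.?uk")


def contains_gov_uk(sld: str) -> bool:
    """True if the SLD contains 'uk?gov' or 'gov?uk' ('?' = optional character)."""
    return GOV_UK_PATTERN.search(sld.lower()) is not None


slds = ["pay-uk-gov", "govukvisa", "homeofficegovuk", "vww-dvla-gov-uk", "example"]
print([s for s in slds if contains_gov_uk(s)])
```

A pattern this broad will also match strings such as 'govukraine', which is why the subsequent review to discard non-relevant contexts is still needed.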

14 of the 152 domains appear to be registered to official government departments, potentially as defensive registrations, many of which (in accordance with good domain 'hygiene' practice) are configured to re-direct to official websites, though this still leaves a significant majority of the domains registered to third parties.

Within the set of 138 third-party, high-risk domains, a number of recurring themes and patterns are present. There are several keywords suggestive of intended use in scams which appear within the SLDs (second-level domain names; the part of the domain name to the left of the dot) multiple times, including 'HMRC' (6 instances), 'DVLA' (6), 'debt-relief' (4), 'visa' (3), 'homeoffice' (3), and 'rebate' (2). It is also notable that several of the domains utilise new-gTLD extensions which have been commonly linked with non-legitimate activity[6,7], including .top (8 domains), .site (6), .online (6), .shop (6), .xyz (4), .cloud (2), .digital (2), and .live (2). A number of other concerning TLDs also feature just once within the dataset, namely .tax, .works, .chat, .date, .wtf, .agency, and .lol. Also of significance is the prevalence of use of retail-class registrars in the domain registration dataset, also previously noted as being popular with infringers[8], with a list topped by GoDaddy.com, LLC (18 domains), Namesilo, LLC (9), and Amazon Registrar, Inc. (9). Finally, of the 118 domains for which whois information was available via an automated look-up, 97 (82%) explicitly make use of privacy-protection service providers, as is common for infringers wishing to hide their identity.

Considering the content of the websites in question, the domains resolve to a range of page types. Several resolve only to parking pages or no live content, but may warrant further monitoring for changes to content, or might potentially be being used for their e-mail functionality (e.g. in phishing attacks) - in fact, 57 of these sites (41%) have active MX (mail exchange) records, indicating that they have been configured to be able to send and receive e-mails. Even more concerningly, four of the domains generate browser warning pages stating that fraudulent or otherwise dangerous content is (or was formerly) present, and a small number of examples were found to resolve to fraudulent or other concerning live content as of the date of analysis (13-Jan-2024) (Figure 2).

(a)

(b)

(c)

(d)

Figure 2: Examples of live sites of concern, with SLDs as follows:

a) pay-uk-gov (a site offering 'credit' services to a French audience);

b) govukvisa (a Chinese-language site displaying 'UK Visas and Immigration' branding);

c) homeofficegovuk (a partially-constructed site purporting to relate to 'UK Visa and Immigration');

d) vww-dvla-gov-uk (a partially-constructed site currently with non-relevant content, but featuring a 'donate' link)

The significant number of infringements - combined with the presence of a number of scam sites which are still live as of the time of analysis - shows the extent to which the government is targeted by these types of scam. The findings highlight the importance of official organisations, trusted entities and other brand owners putting in place proactive programmes of brand protection - consisting of both monitoring for the appearance of new infringements, and enforcement actions to take down threatening content, combined with ongoing monitoring of dormant content, to detect any subsequent appearance of material of concern - in addition to other measures such as defensive registrations and (where appropriate) the maintenance of relevant portfolios of protected IP. These measures can help protect the reputations of the organisations in question, and defend customers from the effects of fraud and other attacks.

References

[1] https://circleid.com/posts/20220504-the-world-of-the-subdomain

[2] https://www.linkedin.com/pulse/exploring-domain-hostname-based-infringements-david-barnett/

[3] https://www.linkedin.com/pulse/hyphenated-domain-infringements-david-barnett/

[4] https://circleid.com/posts/20220913-registration-patterns-of-deceptive-domains

[5] https://www.linkedin.com/feed/update/urn:li:activity:7151327407482847233/

[6] https://www.iamstobbs.com/opinion/expert-.watches-.new-.online-.website-.news-.lol-a-review-of-the-current-state-of-the-new-gtld-programme

[7] https://circleid.com/posts/20230117-the-highest-threat-tlds-part-2

[8] https://www.iamstobbs.com/opinion/web-dot-coms-but-once-a-year-holiday-shopping-activity-part-1-black-friday-domains

This article was first published on 27 February 2024 at:

https://www.iamstobbs.com/opinion/website-impersonations-a-case-study-of-domain-names-targeting-the-uk-government

Tuesday, 20 February 2024

The crossover: two recent developments in Web2/Web3 interaction

by David Barnett and Tom Ambridge

Interest in 'crossover' between the worlds of Web2 (the 'classic', DNS-based, regulated Internet) and Web3 (the emerging, decentralised, unregulated, blockchain-based Internet)[1] is increasing amongst many stakeholders involved in the digital environment.

Two recent stories have highlighted the evolving interactions between these formerly relatively distinct areas of the Internet, and raise a number of issues of which brand owners may be advised to be aware.

1. The development of .box

The nature of .box ('dot-box'), one of the new domain-name extensions recently launched[2] as part of the ongoing (Web2) gTLD programme[3], is becoming clearer following initial announcements that the extension would be associated with an innovative dual Web2/Web3 offering.

The extension is now fully live and domains are available for registration by the general public. A .box purchase from the provider[4] - made on the basis of a payment of $120 in cryptocurrency per year, up to a maximum period of 10 years - now grants the user a 'classic' Web2 domain (including options for website and e-mail functionality), an identically-named Web3 domain (including the associated potential functionality of being able to create a decentralised website and be able to accept transfers of cryptocurrency and other blockchain assets), and access to the .box app, used to access and make changes to the domain properties. As such, this is the first credible Web2/Web3 crossover, and the first gTLD under ICANN regulation to offer Web3 functionality. The My.Box website states that the scheme operates in partnership with ICANN, Intercap Registry, and Web3 providers ENS, 3DNS, and Optimism.

Numbers of registrations are already ramping up rapidly and - in view of the potential threats of abuse and cybersquatting - brand owners are advised to be mindful of registrations which may infringe their IP, and to consider their own defensive strategies. The provider of the new domains even offers a search function[5] to check for (exact string-match) registrations, which links to a profile page for the owner.

Additionally, enforcement is possible in a similar way as for a regular domain name. As an ICANN TLD, the registrar / registry operator has an abuse policy and a registration agreement prohibiting trademark infringement. As such, it is possible to file UDRP / URS disputes against bad-faith registrations. If successful, the Web2 domain is transferred to the brand owner if requested, and the Web3 domain is cancelled and may be re-issued to the brand owner.

As of 08-Feb-2024, there are 1,910 .box domains listed in the zone-file, with almost all of the one-character alphanumeric names already taken (with the exceptions of 1.box, 3.box, a.box, e.box, and i.box). Many well-known brand terms have already been registered as SLD (second-level domain; the part of the domain name to the left of the dot) strings, including 'adidas', 'blackrock', 'bmw' (and 'bmwgroup'), 'chatgpt', 'coach', 'cocacola', 'ferrari', 'google', 'ibm', 'mastercard', 'mercedes-benz', 'o2', 'openai', 'oracle', 'orange', 'paypal', 'pokemon', 'pornhub', 'porsche', 'postoffice', 'prada', 'reddit', 'taobao', 'target', 'taylorswift', 'telegram', 'tencent', 'tesla', 'tiffany', 'toyota', 'tysonfury', 'uber', 'ups', 'visa', 'xbox', and 'youtube', in addition to a range of industry and product-related terms.

2. GoDaddy partners with ENS

It was announced on 5 February that domain name registrar GoDaddy would be partnering with ENS ('Ethereum Name Service', a Web3 provider and registrar, offering naming services for assets on the blockchain associated with the Ethereum cryptocurrency)[6,7].

The partnership will allow owners of Web2 domains registered through GoDaddy to link their domain with an Ethereum address (which must have been registered through a Web3 provider), allowing access to a range of Web3 services, and offering the option for the domain name to be used as a human-readable address to be used for sending and receiving assets such as cryptocurrency and NFTs. Users who do not require this functionality will retain the option to keep their domain name unintegrated. 

* * *

These two stories highlight the increasing options for interconnectivity between the Web2 and Web3 ecosystems, and it is likely that the boundaries between these portions of the Internet will become increasingly blurred, particularly if general compatibility of Web3 content with classic Web2 components (such as mainstream browsers supporting access to decentralised (blockchain) domains) improves. It is likely, however, that a number of changes to regulation and legislation also will be required in order to create a truly seamless landscape, such as initiatives to prevent naming collisions[8], and further improvements to enforcement routes for infringements in the Web3 environment.

Furthermore, connectivity between (hexadecimal) wallet addresses, blockchain domain names and familiar Web2 names opens a lucrative on-ramp for traditional brands and their customers to benefit from Web3 technology and asset ownership, depending on what is in store for tradable goods following the demise of jpeg NFTs[9].

However, with broadening capabilities comes increased potential for bad faith in the form of phishing and fraud. Impersonators will be drawn to the addition of non-custodial wallets to heavyweight domain extensions such as .com. Rights owners must see these technological developments as not just opportunities for brand evolution, but also as vectors for detailed investigation and target tracing within online brand protection.

References

[1] https://www.iamstobbs.com/trends-in-web3-ebook

[2] https://www.iamstobbs.com/opinion/un-.zip-ping-and-un-.box-ing-the-risks-associated-with-new-tlds

[3] https://www.iamstobbs.com/opinion/expert-.watches-.new-.online-.website-.news-.lol-a-review-of-the-current-state-of-the-new-gtld-programme

[4] https://www.my.box/

[5] https://my.box/search

[6] https://blog.ens.domains/post/godaddy-partners-with-ens

[7] https://uk.godaddy.com/help/what-is-ens-41952

[8] https://www.iamstobbs.com/opinion/the-iotex-case-domain-naming-collisions-and-other-emerging-risks-in-the-blockchain-ecosystem

[9] https://dune.com/queries/47101/92814

This article was first published on 20 February 2024 at:

https://www.iamstobbs.com/opinion/the-crossover-two-recent-developments-in-web2/web3-interaction

Michelin most prominent and best sentiment tyre brand in 2024

by Chris Anthony and David Barnett

In 2005, Tyres & Accessories published the first in what would become a 12-year series of online brand prominence and sentiment analyses relating to the best-known tyre brands. That research was conducted in association with Envisional and latterly NetNames. Now, one of the brains behind the original data has developed "a new and improved methodology for quantifying brand prominence and sentiment"[1]. Here, David Barnett, now Brand Protection Strategist at intangible asset management specialists Stobbs, presents the results of a new study looking at the online prominence and sentiment of tyre brands.

The latest research looks at a set of 150 tyre brands drawn from various sources, including Tyres & Accessories' list of leading tyre companies[2] and Brand Finance's list of most valuable tyre brands[3,4]. The research also incorporates all brands considered in the previous NetNames/Envisional studies. The methodology involves the use of a series of generic tyre-related search queries to bring back a list of pages for analysis, resulting in a dataset of 4,318 distinct webpage URLs. Findings are based on searches and analysis carried out between 8 and 9 January 2024, utilising results returned on the first page of Google.com, browsing from a UK-based IP address.

Overall, Michelin achieves the position of being both the most prominent and the most positively-referenced brand. It is also striking that seven of the top 10 most prominent brands appear in Brand Finance's list of the top 15 most valuable tyre brands. Additionally, the top five most positively referenced brands all appear in Brand Finance's top six. Broadly, there is a moderate positive correlation between brand value and both online prominence (correlation coefficient = +0.67) and online sentiment (correlation = +0.79) - see Figures 3 and 4.
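For reference, a correlation coefficient of the kind quoted above can be calculated as follows. This is a sketch only, assuming that the figures quoted are Pearson correlation coefficients (the type is not stated); the function name is hypothetical and the sample values are invented purely to show the calculation, not taken from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented values: brand value against online prominence score.
brand_values = [1.0, 2.0, 3.0, 4.0]
prominences = [0.2, 0.1, 0.4, 0.5]
print(round(pearson(brand_values, prominences), 2))
```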

Figure 1: While the full Stobbs research ranks 150 brands, this figure shows the brands with the highest prominence scores (Source: Stobbs)

Overall it was generally good news for tyre brands. However, nine of the analysed brands achieved negative sentiment scores, of which the bottom three were Titan (-3.52), Venezia (-1.84) and Interstate (-1.39). In all of these cases, though, the scores result from what appear to be non-relevant references to the brand names on a small number of pages. Essentially, the low prominences of these brands (prominence scores of 0.011, 0.001 and 0.025, respectively), reflecting the small number of pages on which references appear, mean that the sentiment scores can be 'skewed' by a small number of references.

Figure 2: The brands with the highest positive sentiment scores (i.e. the most favourably-referenced brands) (Source: Stobbs)

In addition, a quick glance at the unfiltered data reveals a surprising find - Kormoran in what would have been second place after Michelin. Of course, Kormoran is a Michelin group brand. However, there is no way that Kormoran receives the same level of attention and marketing investment as Group Michelin's eponymous flagship brand. In this study, the high ranking of Kormoran (which is positioned as a budget brand and currently offered by a number of UK suppliers) - achieving second place in the unfiltered prominence rankings, and 12th in the sentiment rankings - was something of an anomaly. That anomaly largely results from the fact that one of the search queries used (for "performance tyres") returned a set of approximately 30 distinct, but very similar, webpages. Those pages shared URLs in the format [site].co.uk/tyre/details/kormoran/road-performance and featured extensive references to Kormoran products when the searches were carried out. Since Kormoran's scores had been inflated in this way, the values were recalculated by removing all but one of these pages from the analysis in each case. On this basis, the prominence score for the Kormoran brand drops to 0.194 (10th place in the rankings), and the sentiment score drops to 2.74 (56th place).

Figure 3: Online prominence score compared with brand value, for the top 10 most valuable brands (Source: Stobbs; Brand Finance)

Figure 4: Online brand sentiment score compared with brand value, for the top 10 most valuable brands (Source: Stobbs; Brand Finance)

It is also informative to compare the performances of the surveyed tyre brands with those from the previous NetNames/Envisional studies (for those brands which were included in these earlier analyses). For comparisons of prominence, the scores from the earlier studies were renormalised (scaled), so that the mean score across all brands featured in the most recent study (from 2017) was the same as that for the same group of brands from the current (2024) study. The resulting trends over time in the relative prominences of the set of brands analysed previously are shown in Figure 5.
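The renormalisation step can be sketched as a single scale factor applied to the earlier scores. This is one reading of the description above, assuming that the factor is the ratio of the two mean scores taken over the brands common to both studies; the function name, brand labels and values are hypothetical.

```python
def rescale_to_current(earlier: dict, current: dict) -> dict:
    """Scale earlier scores so that their mean over the common brands
    matches the mean of the current scores for the same brands."""
    common = [brand for brand in earlier if brand in current]
    if not common:
        raise ValueError("no brands in common between the two studies")
    earlier_mean = sum(earlier[b] for b in common) / len(common)
    current_mean = sum(current[b] for b in common) / len(common)
    factor = current_mean / earlier_mean
    return {brand: score * factor for brand, score in earlier.items()}


# Invented scores for two placeholder brands.
scores_2017 = {"Brand A": 2.0, "Brand B": 4.0}
scores_2024 = {"Brand A": 0.5, "Brand B": 1.0}
print(rescale_to_current(scores_2017, scores_2024))
```

Applying the same factor to the scores from each of the earlier studies keeps the relative positions of the brands within each year unchanged, whilst putting all years on the same scale as the 2024 results.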

And the result is that, almost seven years since the last research of its kind, Michelin retains the top spot it has consistently held in all previous studies, and the previous 'big six' (Michelin, Goodyear, Continental, Pirelli, Bridgestone, Dunlop) remain largely unchanged. The news that Falken has overtaken Dunlop in terms of online prominence is one departure from business as usual, although that race remains very close, with Dunlop and Falken achieving prominence scores of 0.49 and 0.50, respectively.

Similarly, we can also compare the relative sentiment rankings of the set of 18 brands which have been considered in some or all of the previous studies (see Figure 6).

Here, all 18 of these brands continue to receive positive sentiment scores overall (ranging from +2.08 for Uniroyal to +27.66 for Michelin), indicating that commentary is generally favourable overall for this set of brands. In terms of degree of positive sentiment, Michelin regains the top position it held in 2016, and every year between 2007 and 2014. As with the measure of prominence, the top six brands within this group remain generally unchanged. However, the most noticeable difference once again relates to Dunlop (+11.48), which actually drops below Nexen (rising 7 places; +15.68) and Falken (+16.07).

Outside the top eight of the group, significant increases in ranking were seen for Toyo (up 3 places in the ranking), Hankook (up 3) and Maxxis (up 4), and decreases for Kumho (down 5), Avon (down 7) and Yokohama (down 4).

Figure 5: Trends over time in (normalised) prominence score, for the set of brands analysed previously (source: Stobbs; Envisional; NetNames)

Figure 6: Trends over time in sentiment ranking, for the set of brands analysed previously (source: Stobbs; Envisional; NetNames)

References

[1] https://www.iamstobbs.com/online-brand-prominence-and-sentiment-ebook

[2] https://www.tyrepress.com/leading-tyre-manufacturers/

[3] https://brandirectory.com/rankings/tyres/

[4] https://www.tyrepress.com/tag/brand-finance/

This article was first published on 13 February 2024 at:

https://www.tyrepress.com/2024/02/michelin-most-prominent-and-best-sentiment-tyre-brand-in-2024/

Monday, 12 February 2024

A tasty new TLD: the launch of .food

by David Barnett and Richard Ferguson

As the initial phase of the new gTLD programme - the initiative to add a large selection of new domain-name extensions to the Internet's root zone - continues well into its second decade, the launch of five new extensions (.food, .diy, .lifestyle, .living and .vana)[1] by registry operator Internet Naming Co. was announced on 18 January[2], to enter their sunrise periods on 24 January[3].

All five are unrestricted TLDs, with registrations to be available on a first-come, first-served basis once they enter their general availability phase (scheduled for 6 March). The stated plan is for .vana to integrate with Web3 functionality.

Of this set of extensions, .food is perhaps the most significant, in terms of its potential for mainstream appeal, but also for its potential attractiveness to infringers. The food sector has long been one of the industries most heavily targeted by counterfeiting, and was the specific subject of an EUIPO study on the issue in 2016[4]. Furthermore, as consumers increasingly move their purchasing strategies online, the e-commerce channel for food is predicted to climb to more than $250 billion by 2025[5].

As of 29 January (five days on from the start of sunrise), the .food domain-name zone file contained just a handful of registrations, but we would expect numbers to ramp up over the coming weeks. The initial batch of registered domains was as follows:

  • 0cfepd3bofk26cfdlh21d3rgtmd5ojdj.food (no whois record available)
  • a08hpeuisnljaqabffddkvivom6801e9.food (no whois record available)
  • gea.food* (registered 26-Jan-2024 by GEA Group Aktiengesellschaft)
  • nic.food (registered 11-Oct-2023 by Uniregistry, Corp)
  • sodexo.food* (registered 26-Jan-2024 by SODEXO, Societe anonyme)
  • wood.food* (registered 24-Jan-2024 by Wood Info Inc.)
  • zomato.food* (registered 25-Jan-2024 with a redacted whois record)

Some of these appear to be registrations made for technical infrastructure purposes, and only four (asterisked) appear to be brand-related registrations (generally official). As of the date of analysis, none of these domains displayed any content other than placeholder pages.

Overall, recent years have seen the launch of a significant number of new domain name extensions comprising terms of direct relevance to the industry areas of many brands. It remains to be seen how adoption of these new-gTLDs compares with that of new dot-brand extensions, as the new round of applications launches in the next couple of years[6]. This will be particularly relevant as we start to see more cases where organisations are rebranding in ways where their brand identity and full domain name are aligned (such as 'Go.Compare', whose website is currently branded as such, and who are now using the go.compare domain name - even if this currently re-directs to their legacy domain, at gocompare.com). 

Brand owners operating in the food industry would be well advised to consider their strategy in response to the launch of .food, including consideration of the value of defensive registrations, and monitoring for (and enforcing against) infringing activity on any new domains as they appear. Use of extensions of this type may also be appropriate for corporates with divisions in multiple industry verticals. It will be informative to see what trends emerge as the initial desirable domain names become taken.

References

[1] https://newgtlds.icann.org/en/program-status/sunrise-claims-periods

[2] https://www.linkedin.com/feed/update/urn:li:activity:7153909212203417601/

[3] https://comlaude.com/tm-pricing-changes-and-1-year-registrations-and-vegas-pricing-promotion-2/

[4] https://euipo.europa.eu/tunnel-web/secure/webdav/guest/document_library/observatory/documents/Knowledge-building-events/Counterfeiting_of_foodstuff_en.pdf

[5] https://www.theconsumergoodsforum.com/blog/2021/03/30/food-fraud-impact-of-counterfeits-on-consumers-safety-and-brand-reputation/

[6] https://www.iamstobbs.com/opinion/a-review-of-the-current-state-of-the-new-gtld-programme-dot-brands

This article was first published on 12 February 2024 at:

https://www.iamstobbs.com/opinion/a-tasty-new-tld-the-launch-of-.food

Tuesday, 6 February 2024

Utilisation of relevance keywords for prioritising results in brand monitoring

BLOG POST

The use of keyword-based matching to identify the most significant findings within a larger set of candidate webpages is a key element of many brand-protection technologies. The approach can build efficiencies into the overall analysis process and can help to identify priority targets for content tracking or enforcement; it is therefore an essential component of effective tools used for brand-protection programmes.

In our latest study, we outline a new methodology for analysing the proximity of 'relevance keywords' to brand terms, in order to calculate a metric for the potential level of relevance of a webpage. The framework is based on the methodology previously outlined for calculating the sense of the sentiment of brand references in our 'top 100 brands' study[1], and is a flexible approach which can be tailored to a range of different contexts. The methodology is illustrated using a series of short case studies.

The concept of filtering using relevance keywords is central to many areas of brand monitoring, all broadly covered under the description of 'issue monitoring'. These ideas set the scene for a number of additional applications in the brand-protection arena, such as identifying instances of 'high-risk' e-commerce keywords in order to prioritise websites based on the likelihood of their association with the sale of counterfeit goods.

Reference

[1] https://www.iamstobbs.com/online-brand-prominence-and-sentiment-ebook

This article was first published on 6 February 2024 at:

https://www.iamstobbs.com/opinion/utilisation-of-relevance-keywords-for-prioritising-results-in-brand-monitoring

* * * * *

WHITE PAPER

Introduction

One of the primary requirements of brand-monitoring technologies is the ability to prioritise the identified results (i.e. detected webpages) by the likelihood that their content will be of relevance to the categories of material of interest. This prioritisation is of great importance for projects which may potentially involve the collection of very large numbers of candidate pages, to ensure that key findings can be identified in a timely manner, particularly when analyst time may be a high-cost resource. Prioritisation is also relevant in the identification of priority targets for enforcement and content tracking.

In this study, we investigate the use of 'relevance keywords' in the identification of specific pages relevant to a particular area of content, within a larger 'pool' of more general webpages. The methodology is an adaptation of that outlined previously for sentiment analysis of the top 100 global brands[1], but where relevance keywords are here used in place of sentiment keywords. Note that the approach is distinct from that outlined for cases where a particular brand name can occur in altogether relevant or non-relevant contexts (e.g. where the name of the Google 'Gemini' brand could occur in relation to AI (artificial intelligence) or in relation to astrology), where we suggest the use of a keyword-based overall content scoring approach to determine the subject area of the page as a whole[2].

The methodology is also different from the concept of content scoring for just the brand name (as is used in measurement of brand prominence), which - unless additional filtering is applied - will not distinguish between relevant and non-relevant / third-party references. In the relevance-keyword methodology outlined here, we consider only the appearance of the brand name in close proximity to instances of relevant keywords. In so doing, we can calculate a potential relevance score for the brand on each page, analogous to the sentiment score described in the 'top 100 brands' study.

The methodology is broadly applicable to an area of brand protection referred to as 'issue monitoring' - i.e. where content relating to a brand is of interest only if it relates to a particular subject area. This might pertain to a specific sub-brand or product, a news story, or an association with a particular individual, other company, or category of content; these references can accordingly be identified by searching for references to the brand near to keywords relating to the issue in question. Although this 'targeting' approach can be addressed to some degree through the incorporation of the relevance keywords into the search queries used to return the set of candidate pages for analysis, it will not always result in a 'clean' set of results for a number of reasons, not least the way in which the search-source handles multi-word search terms (e.g. whether it requires the results to feature one or both terms, whether it suggests alternatives, and the usual inability of search engines to return results only where the terms appear in close proximity to each other). These points are explored below through a number of case studies[3].

Case studies

Case study 1: News relating to the WH Smith rebrand

This first case study relates to the January 2024 news story about the re-branding of a number of WH Smith stores with a new logo appearing similar to that of the NHS[4]. Superficially, we might expect to be able to identify references to this news story by searching for 'whs nhs' (which, on Google, adds an implicit Boolean 'AND' - i.e. requires the results returned to feature both 'whs' and 'nhs', but with no explicit condition on the proximity on the page of the two terms). However, there are a number of reasons why this is not effective. Firstly, Google suggests 'wsh' as an alternative for 'whs', and presents results for both terms together (which, in an automated monitoring tool, would just be added to the same dataset for processing) (Figure 1).

Figure 1: A search suggestion presented by google.com

Secondly, both search terms ('whs' and 'nhs') are sufficiently generic that they can occur in unrelated contexts. Our assertion is that a page would be more likely to relate to the WHS / NHS news story if both terms appear close to each other on the page. In order to test this, we treat 'WHS' (together with checking for variants such as 'WH Smith', 'W H Smiths', etc.) as the brand name, and calculate a relevance score based on each appearance of the brand name near to the relevance keyword(s) (just 'NHS' in this case), with greater numbers of closer mentions generating higher scores.
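The scoring idea described above can be sketched in a few lines of Python. This is a minimal illustration only: the exact formula used in the study is not given, so the exponential decay, the default half-life of ten words, the placeholder token and the names relevance_score and WHS_PATTERN are all assumptions, and the two sample texts are invented.

```python
import re

BRAND_TOKEN = "brandtoken"  # hypothetical placeholder standing in for any brand variant

def relevance_score(text, brand_pattern, keywords, half_life=10.0):
    """Sum a decaying weight over every (brand mention, keyword mention) pair.

    Assumed form: the weight of a pair halves for every `half_life` words of
    separation, so more mentions and closer mentions both raise the score.
    """
    # Collapse all brand variants ('WHS', 'WH Smith', ...) into one token.
    normalised = re.sub(brand_pattern, f" {BRAND_TOKEN} ", text.lower())
    tokens = re.findall(r"[a-z0-9]+", normalised)
    brand_pos = [i for i, t in enumerate(tokens) if t == BRAND_TOKEN]
    keyword_pos = [i for i, t in enumerate(tokens) if t in keywords]
    return sum(
        0.5 ** (abs(b - k) / half_life)
        for b in brand_pos
        for k in keyword_pos
    )

# Assumed variant pattern: 'whs', 'wh smith', 'w h smiths', 'whsmith', ...
WHS_PATTERN = r"\bw\s?h\s?smiths?\b|\bwhs\b"

# Invented sample texts: brand and keyword close together vs. far apart.
near = "WH Smith has been criticised after its new WHS logo was compared to the NHS logo."
far = "WHS announces results. " + "filler " * 60 + "The NHS published new guidance."

near_score = relevance_score(near, WHS_PATTERN, {"nhs"})
far_score = relevance_score(far, WHS_PATTERN, {"nhs"})
```

With these inputs the first text scores well above the second, even though both contain 'WHS' and 'NHS'. The absolute scale is arbitrary, and only the ordering of pages matters for prioritisation.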

In this case, only one of the pages returned by Google in the first page of results actually relates to the news story in question, and this result is correctly picked up as the highest-scoring result within the dataset (Table 1).

Table 1: All non-zero relevance score webpages from the first page of Google results for 'whs nhs'

Case study 2: Announcement of Havaianas' new CEO

In December 2023, Alpargatas - the owner of footwear brand Havaianas - announced Mondelēz executive Liel Miranda as the new CEO, to take effect from February 2024[5,6]. In monitoring for references to this story, we may wish to search for (say) 'havaianas ceo'. In practice, this returns a range of results, including references to previous CEOs and other news stories, and references to the CEOs of other companies alongside mentions of Havaianas[7].

However, if we apply the same keyword-based filtering approach to the set of pages returned - namely, classifying on the basis of relevance score for mentions of Havaianas (or Alpargatas) near to the relevance keywords 'miranda' and 'mondelez' - we again find a relatively clean separation of the relevant results (Table 2) from the remainder.

Table 2: All non-zero relevance score webpages from the first page of Google results for 'havaianas ceo'

Case study 3: References to the Facebook 'news tag'

For the next case study, we suppose that it is necessary to search for explicit references to the Facebook news tag (for example, as referenced in the news story that Meta was planning to deprecate the feature[8]). A search for 'facebook news tag' is not a particularly efficient way of collecting relevant pages, because the three individual terms are generic when used in conjunction with each other (i.e. many references to Facebook occur alongside mentions of news, newsfeeds, etc., and of tags - e.g. photo tagging). Whilst this can be mitigated to some extent through the use of exact-phrase searching ('facebook "news tag"' or even "facebook news tag" explicitly), this will not capture content where the phrases are not used in this exact format (e.g. an exact-phrase search for "facebook news tag" will not return content where the terms are referenced differently, such as "news tag on facebook"). Instead, it is possible to filter the pages by analysing for references to Facebook in conjunction with the relevance keywords 'news' and 'tag', which has the added benefit of further upweighting pages on which both terms appear near the brand name (i.e. multi-term matching) and which are therefore likely to be the most relevant (Table 3). This approach also provides a much better prioritisation than using just one keyword (say, 'news'), for which the scores are also shown in Table 3 (where several relevant results score zero and would have been missed).
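The contrast between exact-phrase matching and multi-term proximity matching can be illustrated with a short Python sketch. The three page snippets are invented, and the additive scoring with an exponentially decaying weight (function name proximity_score, half-life of ten words) is an assumed form rather than the scoring actually used for Table 3.

```python
import re

def proximity_score(text, brand, keywords, half_life=10.0):
    """Add a decaying weight for each keyword occurrence near each brand mention."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    brand_pos = [i for i, t in enumerate(tokens) if t == brand]
    score = 0.0
    for i, token in enumerate(tokens):
        if token in keywords:
            score += sum(0.5 ** (abs(i - b) / half_life) for b in brand_pos)
    return score

# Invented snippets standing in for page content.
page_reworded = "Publishers say the news tag on Facebook will be withdrawn."
page_news_only = "Facebook shared company news about its quarterly results."
page_unrelated = "Follow us on Facebook for photos of the event."

# Exact-phrase matching misses the reworded page entirely.
exact_hit = "facebook news tag" in page_reworded.lower()

# Multi-term matching: both keywords near the brand outscore one keyword alone.
keywords = {"news", "tag"}
score_reworded = proximity_score(page_reworded, "facebook", keywords)
score_news_only = proximity_score(page_news_only, "facebook", keywords)
score_unrelated = proximity_score(page_unrelated, "facebook", keywords)
```

Because the contributions of the individual keywords are simply added, a page on which both 'news' and 'tag' sit close to the brand name is naturally ranked above a page featuring only one of them.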

* This is the link to the news story referenced above

Table 3: All webpages from the first page of Google results for 'facebook news tag' with relevance scores of 100 or greater

Case study 4: Searches for Gucci handbags

As an illustration of how the same approach could be applied to a brand / product combination, we consider the case of searching for Gucci handbags (using 'gucci handbag' as our initial search term). Whilst the vast majority of the pages returned by a query of this type will be generally relevant, there may be a range of content types, including e-commerce sites, informational sites, and a mixture of sites dedicated specifically to Gucci handbags versus those also featuring content relating to other brands or products. However, use of the relevance-keyword approach (looking for references to Gucci near 'handbag(s)' or 'bag(s)') provides a basis for prioritising the results according to the degree to which the pages relate to Gucci handbags specifically (comparable to the content scoring approach for a single brand name or term) (Table 4).

Table 4: All webpages from the first page of Google results for 'gucci handbag' with relevance scores of 1000 or greater

As a further step, it would be possible to (for example) incorporate analysis of e-commerce-related keywords (on either a proximity or a content-scoring basis) to further rank the results on the basis of their likelihood to be offering the sale of relevant products. This type of analysis is central to the 'discovery' process for e-commerce sites (i.e. identifying sites which were not known at the outset of monitoring).
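As a rough sketch of the content-scoring variant of this further step, the following Python counts occurrences of e-commerce keywords on each page and ranks the pages accordingly. The keyword list, the function names and the example.com pages are all illustrative assumptions; a real configuration would be tuned to the brand and product in question.

```python
import re
from collections import Counter

# Illustrative keyword list only.
ECOMMERCE_KEYWORDS = {"buy", "cart", "checkout", "price", "shipping", "discount"}

def content_score(text, keywords):
    """Count how many times any of the keywords occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return sum(counts[k] for k in keywords)

def rank_pages(pages, keywords=ECOMMERCE_KEYWORDS):
    """pages maps URL -> page text; returns URLs, highest keyword count first."""
    return sorted(
        pages,
        key=lambda url: content_score(pages[url], keywords),
        reverse=True,
    )

# Invented pages: one informational, one shop-like.
pages = {
    "https://example.com/history": "A history of the Gucci handbag and its design",
    "https://example.com/shop": "Buy Gucci handbags: add to cart, free shipping, best price",
}
ranking = rank_pages(pages)
```

In practice this count could be combined with the proximity-based relevance score, so that only pages already judged relevant to the brand and product are ranked by their e-commerce content.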

Conclusion

The approach outlined in this study allows us to prioritise results gathered from search engines, according to the proximity of the name of the brand under consideration to any of a list of keywords pertaining to the content area(s) of interest. These ideas can be incorporated into automated monitoring tools to provide a means of building efficiency into the analysis process, ease of identification of the highest-relevance results, and a reduction in false positives. The methodology sits alongside other related ideas, such as the use of content scoring and prominence and sentiment analysis. As with these other areas, the approach is flexible and can be tailored to specific requirements (e.g. by varying the keyword configurations and the proximity range over which the matching is carried out (i.e. the 'half life' of the decaying proximity function)).

These frameworks can be applied to a range of content types, including search results drawn from different search engines and platforms, and are of particular use in cases where multi-word search terms may not be handled by the platform in question in the expected way, resulting in a need for these types of post-processing to focus in on the desired findings.

References

[1] https://www.iamstobbs.com/online-brand-prominence-and-sentiment-ebook

[2] https://www.iamstobbs.com/google-gemini-ebook

[3] Findings based on results returned and analysis of content carried out on 03-Jan-2024

[4] https://www.theguardian.com/business/2023/dec/27/wh-smith-whs-rebrand-criticised-for-similarity-to-nhs-logo

[5] https://fashionunited.uk/news/people/havaianas-owner-alpargatas-appoints-liel-miranda-as-new-ceo/2023121373112

[6] https://www.drapersonline.com/news/havaianas-owner-announces-new-ceo

[7] e.g. https://www.opticaljournal.com/alpargatas-and-safilo-renew-havaianas-eyewear-licensing/ - "We are very proud of this early renewal, which aims to strengthen a project initiated in 2016," commented Angelo Trocchia, CEO of Safilo Group. "We want to grow the havaianas eyewear business through collections that reflect the unique personality and creative simplicity of this important Brazilian brand, which is receiving an exceptional reception, particularly in Southern Europe."

[8] https://www.wired.co.uk/article/facebook-is-giving-up-on-news-again

This article was first published as an e-book on 6 February 2024 at:

https://www.iamstobbs.com/utilisation-of-relevance-keywords-ebook

Thursday, 1 February 2024

Think globally, act locally: An overview of infringement hotspots around the world

by David Barnett and Jessica Wolff

Introduction

Insights into the geographical locations which are most commonly associated with brand infringements have a number of applications in brand protection. This knowledge can help inform policies on where IP protection (such as the registration of relevant trademarks) should be put in place, can identify areas of focus for online monitoring and enforcement, and can provide direction on regions where (if possible and appropriate) on-the-ground initiatives (such as investigations, raids and customs seizures) should be put in place.

When devising a trademark strategy, it is important for a brand owner to consider the countries in which infringements are likely to be encountered. This will typically depend on business model, the goods or services produced, and where infringing activity is already being encountered. It is also advisable to take a wider view, and consider which countries are typically seeing higher levels of infringing activity across all types of business.

In this study, we review a number of separate pieces of research giving information on the countries in which different types of infringing behaviour are typically concentrated. There may be a variety of reasons why certain geographical areas are more popular with infringers, ranging from factors such as the proximity to manufacturing centres and the cost of labour, to differences in local laws, ease of anonymisation for infringers, and the typical policies of service providers operating in these regions. We consider three main areas of relevance (listed below), and create a dataset of infringement frequency 'scores' for all relevant countries, for each of these areas:

  1. Degree to which countries are involved in the manufacture and distribution of counterfeit (and otherwise infringing) goods - including both online and offline channels

  2. Frequency of association with online service providers used by infringers

  3. Level of risk of the TLDs (top-level domains, or domain extensions / country codes) associated with each country

Each of the above three datasets is composed of multiple sub-datasets, which can be either qualitative - i.e. just lists of countries where infringements are prevalent (in which case, all countries in the list are 'scored' equally in our analysis) - or quantitative (in which case, relative scores are used). When sub-datasets are combined, the values in each sub-dataset are 'normalised' (i.e. rescaled) so that the average score is 1, across all of the countries featured in the list. The individual country scores from each sub-dataset can then simply be added together (and the final scores then re-normalised), so that countries featuring in multiple lists are scored more highly. This then allows us to calculate a total 'infringement risk' score for each country, encompassing all of the measures considered. Even then, the true picture is likely to be somewhat more complex, as other factors - such as transportation routes for infringing goods - are also likely to come into play when formulating a brand-protection strategy.
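The normalise-add-renormalise procedure can be sketched as follows. The country labels and input values are placeholders rather than figures from the study, and the function names are hypothetical.

```python
def normalise(scores):
    """Rescale scores so that the mean across the listed countries is 1."""
    mean = sum(scores.values()) / len(scores)
    return {country: value / mean for country, value in scores.items()}

def combine(sub_datasets):
    """Normalise each sub-dataset, add the scores per country, then re-normalise."""
    totals = {}
    for sub in sub_datasets:
        for country, value in normalise(sub).items():
            totals[country] = totals.get(country, 0.0) + value
    return normalise(totals)

# Placeholder inputs (not real data).
qualitative = {"A": 1, "B": 1, "C": 1}   # a plain list: every country scored equally
quantitative = {"A": 30.0, "D": 10.0}    # relative scores from a quantitative source

combined = combine([qualitative, quantitative])
# A appears in both lists and ends up with the highest score (2.0);
# B and C score 0.8 each, and D scores 0.4, giving a mean of 1.
```

Normalising each sub-dataset before adding prevents a source with large raw values (e.g. monetary amounts) from swamping a source that is only a list of names.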

Datasets

Dataset 1: Association with counterfeiting activity

This first dataset includes information from four data-sources: lists of countries commonly associated with counterfeiting, as provided by the USTR 2022 Review of Notorious Markets for Counterfeiting and Piracy[1] (which also gives a more granular list of the highest-risk individual online e-commerce marketplaces) and the Wikipedia overview of counterfeiting[2] (which itself draws data from other primary sources, including the Asia Business Council report referenced below); the OECD 'Global Trade in Fakes' 2021 report[3,4], which gives the top 25 countries "in terms of their propensity to export counterfeit products", quantified as a metric reflecting the value of counterfeit goods and the share of trade in counterfeit goods; and a much older report (2005) from the Asia Business Council[5], giving absolute monetary values of pirated copyright materials used by a list of top infringing countries.

The scores given by just this single combined dataset provide some useful insights in their own right, highlighting the top countries where - broadly - counterfeit goods tend to originate (which is a key area of focus in many brand-protection initiatives). The findings are summarised as a heat map in Figure 1, showing a familiar pattern of counterfeit hotspots particularly in China, Russia, Turkey, South and South-East Asia, and South America.

Figure 1: Heat map of degree of country association with counterfeiting activity (dataset 1)

Dataset 2: Service providers used by infringers

In many cases, specific countries or individual service providers (typically domain registrars and web-hosting providers) are disproportionately popular with infringers, based on factors such as cost, degree of customer identity checks, local regulations, and inherent level of service-provider compliance with enforcement requests. The second dataset aims to reflect these points, using information drawn from the following sources:

  • Spamhaus' list of the ten most abused domain registrars[6] (as of 10-Jan-2024), in which each registrar is assigned a 'badness index', and from which we take the country in which each of the registrars is based (and assign the 'badness index' score to that country, adding the scores together for any countries which feature more than once in the list).
  • Information on 'bulletproof hosting' (BPH) providers - illicit service providers whose business model specifically states that they are non-compliant to enforcement requests - for which a list of common host countries is given by Wikipedia[7], and a list of 'best' individual BPH providers is given by Hostings.info[8]. From these we construct sub-datasets based on both server country location and business country location, in each case 'multiple counting' any countries which appear more than once.
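The per-country aggregation described in these two points can be sketched as below. The registrar names, countries and index values are made up for illustration and are not taken from the Spamhaus, Wikipedia or Hostings.info lists; the function name is hypothetical.

```python
from collections import defaultdict

def scores_by_country(entries):
    """Sum scores per country from (provider, country, score) entries."""
    totals = defaultdict(float)
    for _provider, country, score in entries:
        totals[country] += score
    return dict(totals)

# Made-up quantitative list: each registrar carries an index value,
# which is assigned to the country in which the registrar is based.
registrars = [
    ("Registrar A", "Country X", 2.0),
    ("Registrar B", "Country Y", 1.5),
    ("Registrar C", "Country X", 0.5),
]
registrar_scores = scores_by_country(registrars)

# Made-up qualitative list: every appearance of a country counts as 1,
# so a country listed twice is 'multiple counted'.
bph_server_countries = ["Country X", "Country Z", "Country X"]
bph_scores = scores_by_country((None, c, 1.0) for c in bph_server_countries)
```

The resulting per-country dictionaries are in the right shape to be normalised and combined with the other sub-datasets.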

It is worth noting that increased levels of IP protection will not necessarily be an effective solution against some of the types of service providers represented within this particular dataset, who - by definition - are likely to be non-compliant even against legitimate enforcement actions. However, it is still meaningful to consider this data in the context of brand protection focus, as many of the factors leading to the concentration of infringements in these regions will still apply more generally.

Dataset 3: Highest-threat TLDs

The third dataset is based on a single study[9], which itself incorporates data drawn from multiple sources (Spamhaus' list of most abused TLDs, Netcraft's list of TLDs with the highest cybercrime rates, Palo Alto Networks' list of TLDs with the highest rate of malicious domains, and CSC's phishing data), quantifying the frequency with which individual country-code TLDs (ccTLDs) are associated with infringing content. Again, there are a number of reasons why some ccTLDs are popular with infringers, including factors such as domain registration cost and security policies, and even the degree of wealth of the country (which can affect the level of technical expertise of Internet service providers, and therefore the likelihood of compromise)[10].

Overall findings

By combining together the information from the three distinct datasets, we can produce an overall infringement risk score for all countries featured. The findings are shown in Figure 2 and Appendix A.

Figure 2: Heat map of overall degree of country association with infringing activity, encompassing all three datasets

This 'master' dataset shows a much more widespread geographical distribution of infringement focuses, reflecting the fact that different factors are favourable for the popularity of infringements as measured by the characteristics represented in each of the three distinct datasets. However, some countries and regions do stand out as overall hubs of infringement, notably China and Bangladesh (the only countries to appear in all three datasets, achieving first and twelfth place respectively in the overall rankings), Russia, Hong Kong and India (all of which appear in both datasets 1 and 2), and a number of other centres in North and South America, Europe, Africa, and South-East Asia.

This information provides a key input into formulations of strategies for IP protection and enforcement, although it is always worth bearing in mind that this is just one factor to weigh up against the brand owner's budget, geographical footprint, expansion plans and other priorities.
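As a consistency check on the combination step, the rows of Appendix A that carry more than one dataset score can be compared with their 'All datasets' values. The sketch below computes the ratio of the overall score to the sum of the individual dataset scores. The idea that this ratio should be a single re-normalisation constant is an inference from the published figures, not something stated explicitly, and the published values are rounded to two decimal places.

```python
# (individual dataset scores, 'All datasets' score), copied from Appendix A.
appendix_rows = {
    "China": ([4.18, 1.56, 0.30], 5.07),
    "Russia": ([2.64, 1.50], 3.47),
    "Hong Kong": ([0.98, 2.10], 2.58),
    "India": ([1.49, 1.03], 2.12),
    "Bangladesh": ([0.65, 0.67, 0.80], 1.78),
}

# Ratio of the overall score to the plain sum of the dataset scores.
ratios = {
    country: total / sum(parts)
    for country, (parts, total) in appendix_rows.items()
}

# If the overall score is 'sum, then re-normalise', every ratio should be
# (approximately) the same constant.
factor = sum(ratios.values()) / len(ratios)
max_deviation = max(abs(r - factor) for r in ratios.values())
```

For these five rows the ratios all lie close to 0.84, which is consistent with the overall score being the sum of the dataset scores rescaled by a common factor.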

Appendix A: Infringement risk scores for all countries featured in any dataset

  Country  Dataset 1  Dataset 2  Dataset 3  All datasets
  China 4.18 1.56 0.30 5.07
  Russia 2.64 1.50 3.47
  Netherlands 3.14 2.63
  Hong Kong 0.98 2.10 2.58
  USA 2.98 2.50
  Germany 2.69 2.26
  India 1.49 1.03 2.12
  Turkey 2.44 2.05
  Ivory Coast 2.28 1.91
  Zimbabwe 2.28 1.91
  Sint Maarten 2.15 1.80
  Bangladesh 0.65 0.67 0.80 1.78
  UAE 1.44 0.60 1.71
  Malawi 1.96 1.65
  Pakistan 0.94 0.90 1.54
  France 1.79 1.50
  Canada 1.07 0.67 1.46
  Malaysia 0.73 0.90 1.37
  Singapore 0.75 0.67 1.19
  Cambodia 1.40 1.18
  Armenia 1.38 1.16
  Brazil 1.38 1.16
  Mexico 1.34 1.12
  Bulgaria 0.73 0.60 1.12
  UK 1.22 1.02
  Panama 0.61 0.60 1.01
  Paraguay 1.18 0.99
  Syria 0.98 0.82
  Dominican Rep. 0.97 0.81
  Georgia 0.92 0.77
  Ukraine 0.90 0.75
  DR Congo 0.89 0.75
  Kenya 0.87 0.73
  Lebanon 0.86 0.72
  Senegal 0.82 0.69
  Libya 0.81 0.68
  Afghanistan 0.75 0.63
  Argentina 0.73 0.61
  Indonesia 0.73 0.61
  Kyrgyzstan 0.73 0.61
  North Korea 0.73 0.61
  Peru 0.73 0.61
  Philippines 0.73 0.61
  Taiwan 0.73 0.61
  Thailand 0.73 0.61
  Vietnam 0.73 0.61
  Benin 0.72 0.60
  Morocco 0.68 0.57
  Nigeria 0.67 0.57
  Switzerland 0.67 0.57
  Curaçao 0.62 0.52
  Belize 0.60 0.50
  Moldova 0.60 0.50
  Romania 0.60 0.50
  Seychelles 0.60 0.50
  Tokelau 0.57 0.48
  Albania 0.57 0.48
  Italy 0.55 0.46
  Palau 0.55 0.46
  Serbia 0.54 0.45
  South Korea 0.48 0.41
  Australia 0.45 0.38
  Eq. Guinea 0.44 0.37
  Laos 0.43 0.36
  Japan 0.40 0.34
  Cent. Afr. Rep. 0.38 0.32
  Gabon 0.37 0.31
  Mali 0.36 0.30
  Austria 0.22 0.19
  Poland 0.22 0.19
  Sweden 0.22 0.19
  Spain 0.22 0.18
  Brit. Ind. Oc. Terr. 0.19 0.16

References

[1] https://ustr.gov/sites/default/files/2023-01/2022%20Notorious%20Markets%20List%20(final).pdf

[2] https://en.wikipedia.org/wiki/Counterfeit

[3] https://www.oecd.org/gov/global-trade-in-fakes-74c81154-en.htm

[4] https://www.oecd-ilibrary.org/sites/771e7a68-en/index.html?itemId=/content/component/771e7a68-en

[5] https://www.asiabusinesscouncil.org/docs/IntellectualPropertyRights.pdf

[6] https://www.spamhaus.org/statistics/registrars/

[7] https://en.wikipedia.org/wiki/Bulletproof_hosting

[8] https://hostings.info/hostings/rating/bulletproof-hosting

[9] https://circleid.com/posts/20230117-the-highest-threat-tlds-part-2

[10] https://circleid.com/posts/20230112-the-highest-threat-tlds-part-1

This article was first published on 1 February 2024 at:

https://www.iamstobbs.com/opinion/think-globally-act-locally-an-overview-of-infringement-hotspots-around-the-world
