Friday, 26 January 2024

"And the award goes to…": the online presence of the Oscars best-picture nominees

Following the announcement on 23-Jan-2024 of the nominees for this year's Academy Awards, we analyse the online prominence and sentiment of the contenders for Best Picture, to see if it may yield any insights into where the prize may go.

We apply the same methodology as in our recent study of the top 100 global brands, using generic queries ('movie', 'film' and 'picture', in conjunction with terms such as 'best', 'popular' and 'favorite' [sic - targeting US-centric content], and phrases such as 'Oscars 2024') to bring back relevant pages for analysis. The resulting dataset consisted of around 1,500 distinct, highly-ranked pages from Google (with searches and analysis carried out on 24-Jan-2024).

The findings are shown in Figures 1 and 2.

Figure 1: Online prominence scores for the titles of the ten nominees for the Best Picture award, within the dataset of movie-related webpages

Figure 2: Online sentiment scores for the titles of the ten nominees for the Best Picture award, within the dataset of movie-related webpages

It is striking that - at least in the set of results most highly ranked by the search engine on the day after the announcement - Barbie and Oppenheimer show by far the highest degree of online prominence, with little difference between their scores. Oppenheimer also features as the most positively-referenced brand, though it is notable that Barbie's sentiment score is low.

These trends are perhaps not surprising; Oppenheimer has been widely talked about as the most highly nominated movie overall. Much of the apparent negativity surrounding Barbie may stem from the language used in reports of the lack of nominations for director Greta Gerwig and lead actress Margot Robbie; three of the top five most negatively scored pages feature headlines where the word 'snub' is used alongside 'Barbie'[1]. There may also, of course, be some unfavourable comments about the film itself, though even some of the identified content which is broadly positive does reference the title in conjunction with negative keywords: "Any description of Barbie’s big themes (toxic masculinity)...", "Perky, playful, and deceptively caustic, Barbie is one of just a few films..."[2].

It is also noteworthy that these two films continue to dominate the online landscape, after the enormous amount of buzz surrounding their simultaneous launch in July 2023. That launch was associated with a massive spike in online infringements relating to both titles, and to their joint neologism, 'Barbenheimer' (Figure 3), as is often the case for brands of all types when their online prominence spikes.

Figure 3: Growth in numbers of registered domains with names containing 'barbie', 'oppenheimer' and 'barbenheimer' (from Jan to Jul 2023), compared with the launch dates of the films (dotted line)

Overall, it will be interesting to see whether any predictions from the data on the relative prominence and sentiment of the nominee titles will be borne out when the awards are presented on 10th March...

References

[1] e.g. https://scrippsnews.com/stories/did-the-academy-awards-snub-barbie-star-director/

[2] https://www.polygon.com/what-to-watch/23597815/best-movies-2023

This article was first published on 26 January 2024 at:

https://www.iamstobbs.com/opinion/and-the-award-goes-to-the-online-presence-of-the-oscars-best-picture-nominees

Thursday, 25 January 2024

AI-t's raining brands: The top generative-AI brands in 2024

by David Barnett and Rebecca Newman

As the online buzz regarding artificial intelligence technologies[1] carries on into the new year, we take a look at which generative AI brands have the greatest degree of online presence at the start of 2024. In order to do so, we use our new methodology for measuring online prominence, as outlined in our recent analysis of the top 100 global brands[2] and subsequently applied in studies of the top fashion brands[3] and cryptocurrencies[4]. In this case, the analysis considers a set of 175 of the most popular generative AI brand names (including tools, models, etc.), drawn from various sources[5]. Additional details of the methodology are given in Appendix A.

From this analysis, the top thirty brands and their respective prominence scores (measuring their relative degrees of online prominence) are shown in Figure 1.

Figure 1: Prominence scores for the top thirty most prominent generative-AI brands

It is perhaps unsurprising that the very popular GPT / ChatGPT brand, and its parent company OpenAI, are the most prominent brands online by a significant margin. However, other well-known names, including Copilot, Bard, Gemini, DeepMind and Midjourney all have significant presences, all featuring within the top 12. It is also noteworthy that the high-profile brand names Bing Chat / Bing AI (Microsoft) and Grok[6] (xAI) are not currently well represented in the dataset of web content, with neither achieving top-30 placings (gaining scores of 0.165 (position 32) and 0.135 (position 35), respectively).

In terms of more in-depth trends, we also note the following:

  • Enterprise use - Jasper and Copilot achieve third and fourth place (respectively) in the overall ratings, with both BLOOM and Cohere in the top 15. This is consistent with the increasing focus on the application of AI within enterprise, where AI tools can provide templates, insights and increased workflow efficiency. As AI moves toward a phase of consistent productivity (rather than the initial spikes of hype and disillusionment), users are demanding tools with a practical application, which go beyond the immediate novelty factor of the first phase. Similar comments also apply to Copy AI (position 16) and ClickUp (position 21).
  • Relevance of open source - TensorFlow, HuggingFace and PyTorch (all in the top 20) are open-source repositories and communities for machine learning and AI. The prominence of these platforms reflects a growing recognition of open-source collaboration in AI development. LLaMA (position 24) has also been part of this conversation. The weights and starting code for LLaMA 2 were released amid much hype in Summer 2023, prompting accusations that Meta had leveraged the publicity around promising an open-source LLM whilst offering a model which is not truly open-source.
  • New complex content types - Synthesia, a platform for AI-generated video content, also scores highly (position 8), which is consistent with focus on the increasing capability of these complex media generation models. The same is also true of Murf (position 25), an AI voice generator.
  • The prominence of Google - Google models feature highly in the list, which is to be expected after their launch event in December last year, where they released the new Gemini LLM which now powers Bard, Gemini Nano and Gemini Pro. Google held off on their most anticipated model (Gemini Ultra, expected to outperform ChatGPT across a number of benchmarks), which is expected to be released in late Jan 2024. DeepMind - which has not been particularly prominent in the hype cycles over the last year, preferring to stay below the radar and continue its research - also features at position 9. This may in part be due to the publication of research in November, revealing that the DeepMind AI tool GNoME has discovered nearly 400,000 new stable materials[7] which could power future technology. For context, these stable materials are drawn from a total of 2.2 million new crystal structures identified by the tool, compared to the 28,000 discovered in the last decade of human research.
  • Small LLMs (Large Language Models) - One trend we may expect to see going forward is a growth in prominence of 'small' or 'lightweight' LLMs, which have smaller neural networks and fewer parameters, and can be used offline and on mobile devices. Examples include Falcon 7b and Microsoft's Orca 2-7b (Falcon currently appears only at position 97 in the list, and Orca at 156) and Gemini Nano (which, as an explicit brand-name phrase - i.e. a subset of the results for Gemini generally - appears at position 154).

The data shows that, as we enter 2024, the public interest in AI is focusing on real-world outputs. These are diverse - in this article alone we have touched on various fields, including content, enterprise, productivity and materials science. There is increasingly a voice for the open-source proponents, not least because the models being developed are equalling (or even outperforming) the big players across a number of key benchmarks. The fight to secure adequate compute power is likely to continue, but we may see a counter-camp advocating for small, targeted and cheaper (for providers, users and arguably also the planet) technology, with the rise of the small language model.

The reproducibility of our methodology means that the same queries can be run in future studies, to provide a means by which changes in prominence over time can be quantified on a like-for-like basis. This will allow us to track the relative fortunes of the generative-AI key players over coming months, monitor new brands as they emerge, and determine how the forthcoming data reflects our predictions!

Appendix A: Methodology

  • We use a series of generic search queries[8] relating to generative AI, to bring back a sample of pages for analysis, and then measure the number and prominence of mentions of each AI brand on each page, using the 'content scoring' approach.
  • The overall prominence score for each brand is calculated as the mean of the content scores on each page (calculated across the whole dataset)[9].
  • In general, the matching is carried out on a 'wildcard' basis (i.e. allowing the reference to be counted even if the brand-term appears as only a sub-string within a longer word), since most of the brands are relatively distinctive, and it is desirable to be able to capture brand variations and adaptations, and consider them to be 'references' to the brand (e.g. for GPT, we wish to include references to GPT(-)4, GPT(-)5, ChatGPT, AutoGPT, etc.).
  • The risk of 'false positives' (i.e. references to the same brand names in unrelated contexts) is also reduced through the use of the subject-area-specific search queries.
  • However, for the less distinctive brand names, we make use of explicit filtering where necessary (e.g. for 'Descript', we require the string not to be suffixed with any additional alphabetical characters, so as to avoid counting uses of words such as 'description').
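
The matching and averaging steps above can be sketched roughly as follows. This is a minimal illustration rather than the script used in the study: the 'content scoring' calculation is not specified here, so a raw mention count stands in for the content score (an assumption), and the names BRAND_PATTERNS, count_mentions and prominence_score are hypothetical.

```python
import re
from statistics import mean

# Hypothetical pattern table. 'GPT' uses wildcard (sub-string) matching, so
# ChatGPT, AutoGPT and GPT-4 all count; 'Descript' uses an explicit filter
# that rejects any trailing alphabetical character (e.g. 'description').
BRAND_PATTERNS = {
    "GPT": re.compile(r"gpt", re.IGNORECASE),
    "Descript": re.compile(r"descript(?![a-z])", re.IGNORECASE),
}

def count_mentions(brand, text):
    """Number of references to the brand in one page's text."""
    return len(BRAND_PATTERNS[brand].findall(text))

def prominence_score(brand, pages):
    """Mean per-page score across the dataset.

    Assumption: the raw mention count stands in for the content score,
    which in the study also reflects the prominence of each mention.
    """
    return mean(count_mentions(brand, page) for page in pages)
```

Pages on which a brand is not mentioned contribute a score of zero to the mean, so a brand referenced heavily on only a handful of pages can still rank below one mentioned more modestly across many.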

References

[1] https://www.iamstobbs.com/opinion/trends-in-web3-part-1-a-look-at-blockchain-domains

[2] https://www.iamstobbs.com/online-brand-prominence-and-sentiment-ebook

[3] https://www.iamstobbs.com/measuring-brand-prominence-of-fashion-brands-ebook

[4] https://www.iamstobbs.com/opinion/coining-success-trends-in-the-online-brand-prominence-and-overall-value-of-cryptocurrencies

[5] 

https://en.wikipedia.org/wiki/Generative_artificial_intelligence

https://www.simplilearn.com/tutorials/artificial-intelligence-tutorial/top-generative-ai-tools

https://www.turing.com/resources/generative-ai-tools

https://scribehow.com/library/generative-ai-tools

https://aimagazine.com/top10/top-10-generative-ai-tools

https://www.bardeen.ai/posts/generative-ai-tools

https://www.cnbc.com/2023/12/24/the-top-10-ai-tools-of-2023-and-how-to-use-them-to-make-more-money.html

https://www.techopedia.com/6-free-generative-ai-tools-that-are-great-for-beginners

https://www.g2.com/categories/generative-ai

https://businesschief.com/top10/top-10-generative-ai-platforms

https://clickup.com/blog/ai-tools/

https://101blockchains.com/top-generative-ai-tools/

https://www.kommunicate.io/blog/19-generative-ai-tools-like-chatgpt-that-you-cannot-ignore-in-2023/

https://zapier.com/blog/best-ai-image-generator/

https://www.xcubelabs.com/blog/the-top-generative-ai-tools-for-2023-revolutionizing-content-creation/

https://themehunk.com/best-generative-ai-tools/

https://seo.ai/blog/generative-ai-applications

https://writesonic.com/blog/generative-ai-tools

https://www.neebal.com/blog/top-generative-ai-tools-for-innovation

https://www.synthesia.io/post/ai-tools

https://aithority.com/machine-learning/35-generative-ai-tools-for-2023-that-you-should-be-using-right-now/

https://redblink.com/generative-ai-tools-for-marketing-use-cases/

https://fortunescrown.com/ai-for-the-future-top-generative-ai-tools-to-check-out-in-2024/

https://www.unite.ai/best-ai-tools-for-business/

https://www.edureka.co/blog/top-12-artificial-intelligence-tools/

[6] https://www.iamstobbs.com/opinion/cant-stop-the-grok-domain-infringements-following-xs-ai-brand-launch

[7] https://deepmind.google/discover/blog/millions-of-new-materials-discovered-with-deep-learning/

[8] The queries consist of the terms 'AI', 'artificial intelligence' and 'generative AI', both in isolation and in combination with each of the terms 'brands', 'tools', 'products', 'models', and 'applications', and with each combination also submitted in conjunction with the terms 'top' and 'popular'. URLs are taken from the first page of results from google.com.

[9] Findings are based on searches and analysis carried out on 05-Jan-2024 and 07 to 08-Jan-2024, respectively. The dataset consisted of 1,889 unique webpages, and the means are calculated across the subset of pages which were accessible via the automated analysis script.

This article was first published on 25 January 2024 at:

https://www.iamstobbs.com/opinion/the-top-generative-ai-brands-in-2024

Friday, 19 January 2024

Seek and ye shall (not necessarily) find: the dangers of relying (just) on SEO

An ill-advised marketing trend adopted by many large brands in recent years is the reliance on just advising customers to "search [for] X Y Z" as a means of directing users to the main corporate website. The implication is that the official site will be the top-ranked result for the search-term in question (i.e. a successful implementation of search-engine optimisation, or SEO), but this is frequently not the case. The ranking of a URL in a page of search-engine results is dependent on a number of factors, including the search-engine used (all of which have proprietary algorithms for prioritising results), the location from which the search is carried out, the presence of sponsored ads (where third parties may have paid the search engine to be returned in response to a specific query) and potentially even the user's own search and browsing history (particularly for engines such as Google, where the user may be logged into their own account). This is, of course, not even factoring in instances where search terms may be misspelled or mistyped.

The fact that non-legitimate sites and other third parties can configure their websites (and adopt other SEO measures) to boost their own search-engine ranking and misdirect web users is essentially one of the primary reasons why online brand protection per se is important. At the very least, organisations should always explicitly give their official domain name (i.e. web address) in all advertising initiatives, so that customers know which link to click on within the search results (and, arguably, if this is done, there's no need to specify an appropriate search term to be used at all).

In this study, we look at the success rate of particular search queries in returning the desired website as the top-placed search-engine result.

In the first part of the analysis, we consider the website-access details given in a range of award-winning and 'best-in-class' advertisements from UK commercial radio over approximately the last four years[1]. We choose radio as the medium to consider since the only information given is that contained in the audio (with no accompanying on-screen cues to provide additional website information), and select a UK focus to reflect the fact that our research is carried out via a UK IP address. Of the 151 advertisements analysed, only 56 (37%) give any guidance at all on how to access the corresponding website of the organisation in question, despite several others encouraging users to 'visit online', 'search online' or 'download the app'. This clearly presents potential for the creation of deliberately misleading sites by bad actors. Of the 56, 47 (84%) do (reassuringly) give the domain name of the official website but, even then, there is potential for misdirection for misheard, misspelled, or mistyped URLs. For example, the domains train-line.com, netflicks.co.uk and morissons.com (all plausible erroneous replacements for the official sites trainline.com, netflix.co.uk and morrisons.com for a user based only on audio guidance) are currently registered and resolve to live sites. One offers the domain name for sale, and the other two feature pay-per-click links, which not only provide potential for customer misdirection in their current form, but also present the risk for the creation of actively fraudulent sites in the future.

An analysis of the effectiveness of the 'search' instructions for the remainder of the advertisements is shown in Table 1, which gives the (highest) position within the search results of the official website(s) in question, in response to the query-term cited, for the five most popular search engines for western users[2,3]. The analysis encompasses only the first page of results (according to default settings) returned by each search engine, and gives the position of the relevant result within both the organic (i.e. the algorithm-based) listings ('org') and the overall position ('all') on the page (given that sponsored results in some cases are displayed above the organic listings).

'-' denotes that the official site does not appear anywhere within the first page of results

Table 1: Search-engine position of the official website(s) of the advertised organisations, in response to the search-terms given, for the five most popular western search engines (all .com)
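
The distinction between the 'org' and 'all' positions can be illustrated with a short sketch. It assumes each first-page result has already been captured as a (URL, is-sponsored) pair in on-page order; the function names and the example.com-style domains are hypothetical, and treating an official sponsored ad as counting towards the 'all' position is also an assumption.

```python
from urllib.parse import urlparse

def is_official(url, official_domains):
    """True if the URL's host is an official domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in official_domains)

def official_positions(results, official_domains):
    """Return (org, all) as 1-based positions of the official site.

    results: list of (url, is_sponsored) tuples in on-page order.
    A value of None corresponds to '-' in the table (not on the first page).
    """
    org_pos = all_pos = None
    organic_rank = 0
    for overall_rank, (url, sponsored) in enumerate(results, start=1):
        if not sponsored:
            organic_rank += 1  # rank within the organic listings only
        if is_official(url, official_domains):
            if all_pos is None:
                all_pos = overall_rank
            if not sponsored and org_pos is None:
                org_pos = organic_rank
    return org_pos, all_pos
```

A site that tops the organic listings but sits beneath two third-party ads would be reported as org 1, all 3.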

There are a number of points to take from this analysis:

  • In general, the suggested search queries are effective at returning the desired website at the top of the organic results, although the performance does vary by search engine (with Ask.com generally performing poorly).
  • However, in many cases, third-party sponsored links are displayed above the organic results, so the desired website is not always returned at the top of the page. This could provide a source of confusion for users (particularly when there is little visible distinction between the sponsored and organic listings) (Figure 1).
  • Additionally, in some cases, it is unclear which is the 'intended' website for customers - this is particularly true for Highways England, where up to four potentially-relevant websites are returned within the results. Also, for some less well-known brands, it may not be clear to users which is the intended website (e.g. the second result returned for '360 ict' on some search engines is for ict360.com, which is a separate organisation). This may be problematic where the official website domain has not been stated in the advertisement in question.

Figure 1: Example of a page of search results for 'smarty mobile' on Yahoo.com, where the organic results appear underneath sponsored ads for third-party companies

Any such problems will likely be accentuated in cases where the user is (for example) not searching from a UK IP address. This may be the case for an individual streaming the radio transmission via the Internet. When searching from the US[4], for example, the top results for 'nursing careers' and '360 ict' are registerednursing.org and ict360.com respectively - i.e. different organisations altogether.

A further illustration of the dangers of relying just on SEO to direct traffic to official websites can be obtained by turning the problem around, and looking at the queries which users are actually utilising. For the top five most popular websites globally (for example), several of the top five queries driving organic search traffic to the sites in each case do not feature the brand name[5], namely:

  • for youtube.com:
    • 'yt' (position 2)
    • 'y' (position 5)
  • for facebook.com:
    • 'fb' (position 2)
    • 'face' (position 5)
  • for instagram.com:
    • 'ig' (position 2)
    • 'insta' (position 4)
    • 'ins' (position 5)
  • for twitter.com:
    • 'tw' (position 4)

Whilst in many cases (based on UK searches on Google.com on 04-Jan-2024) these searches do return the respective site as the top result, there is no guarantee that this will always be the case. Furthermore, the top result for 'ig' is actually the website for trading platform ig.com, not instagram.com. Additionally, although Google correctly 'guesses' that searches for 'face' and 'insta' should include results for 'facebook' and 'instagram' (and accordingly returns the respective sites as the top results), this is not the case if we specify that the search engine returns only the results for the exact query as submitted. In this case, the top results are https://en.wikipedia.org/wiki/Face and https://www.instagram.com/insta/ (the specific account page for the user @insta), respectively.

There are a number of take-aways for brand owners. Firstly, it is certainly worth ensuring that official websites are effectively search-engine optimised, and are highly ranked in search-engine results in response to relevant and popular query terms. Bidding on appropriate keywords so as to be featured in sponsored advertisements can also be part of this process. Secondly, it is strongly recommended to always reference the domain name of the official website in advertising collateral, so that users know where to click even if they do run a search rather than browsing directly. There is no need to tell customers to "search [brand]" when it is just as easy (and better) to say "visit [brand.com]". Furthermore, there is no real advantage to encouraging users to access content via search engines - this simply exposes them to competitor content. Finally, an enforcement programme against infringing sites is also a key part of a successful marketing programme. Sponsored ads making unauthorised use of protected IP can be removed (providing certain criteria are met), and fraudulent sites can in many cases be delisted from organic search-engine results.

References

[1] https://www.radiocentre.org/how-to-do-it/creativity/get-inspired/

[2] https://www.reliablesoft.net/top-10-search-engines-in-the-world/

[3] Based on analysis carried out on 04-Jan-2024

[4] As determined via the use of a proxy

[5] https://www.similarweb.com/top-websites/

This article was first published on 19 January 2024 at:

https://www.iamstobbs.com/opinion/seek-and-ye-shall-not-necessarily-find-the-dangers-of-relying-just-on-seo

Tuesday, 16 January 2024

Coining Success: trends in the online brand prominence and overall value of cryptocurrencies

Following our initial proof-of-concept studies on the measurement of online brand prominence (looking at the top twenty fashion brands[1] and the 100 most valuable global brands[2]), we apply the same methodology to an analysis of the relative prominences of the top twenty cryptocurrencies[3,4], to determine which currencies are referenced most widely on the Internet, and determine whether this measure of prominence is correlated with their overall market capitalisation (i.e. total value).

The analysis is based on a set of approximately 500 webpages which are highly ranked on Google in response to one or more of a set of queries designed to return pages relating to cryptocurrency generally[5], but without specifying the name of any given currency specifically. An individual cryptocurrency is deemed to have been mentioned if a reference to either its name (specified as a 'match string') or its currency abbreviation is identified on the page, and a brand content score is then calculated for each individual cryptocurrency name on each page, based on the total number of mentions and the prominence on the page of each mention. The overall prominence score for each cryptocurrency is then calculated as the mean of the brand content scores across the set of pages analysed[6].

  Cryptocurrency      Match string     Currency abbreviation
  Bitcoin             bitcoin          BTC
  Ethereum            ethereum         ETH
  Tether USDt         tether           USDT
  BNB Binance Coin    binance          BNB
  Solana              solana           SOL
  XRP Ripple          ripple           XRP
  US Dollar Coin      dollar.?coin     USDC
  Cardano             cardano          ADA
  Avalanche           avalanche        AVAX
  Dogecoin            dogecoin         DOGE
  Polkadot            polkadot         DOT
  TRON                tron             TRX
  Chainlink           chainlink        LINK
  Toncoin             toncoin          TON
  Polygon             polygon          MATIC
  Shiba Inu           shiba            SHIB
  Dai                 dai              DAI
  Litecoin            litecoin         LTC
  Bitcoin Cash        bitcoin.?cash    BCH
  Cosmos              cosmos           ATOM

where '.?' denotes any or no (i.e. optional) additional single character

Table 1: Top twenty cryptocurrencies by market capitalisation, and the terms used to identify a mention of each cryptocurrency on a webpage

In order to treat all currencies identically as far as possible, each classified reference requires the match string or currency abbreviation to be preceded and succeeded by a character other than an alphabetical letter, for all currencies considered, to avoid false positives (e.g. for 'matic', we would not want to match words such as 'automatic', etc.). We also avoid the use of keyword-based filtering[7], working on the basis that the names are relatively distinctive, and all pages considered should relate to cryptocurrency and that the names are therefore unlikely to arise in unrelated contexts.
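
A minimal sketch of this matching rule, using three of the entries from Table 1, might look as follows. The function names are hypothetical, and two details are assumptions rather than statements about the study's script: matching is case-insensitive, and the start or end of the text is treated in the same way as a non-alphabetical character.

```python
import re

# (match string, currency abbreviation) pairs copied from Table 1.
TERMS = {
    "Bitcoin": ("bitcoin", "BTC"),
    "Polygon": ("polygon", "MATIC"),
    "Bitcoin Cash": ("bitcoin.?cash", "BCH"),
}

def mention_pattern(match_string, abbreviation):
    """Compile a pattern that rejects matches embedded inside longer words.

    The match string is already a regex fragment ('.?' = optional character);
    the lookarounds forbid an alphabetical letter on either side.
    """
    body = f"{match_string}|{re.escape(abbreviation)}"
    return re.compile(rf"(?<![A-Za-z])(?:{body})(?![A-Za-z])", re.IGNORECASE)

def count_mentions(currency, text):
    """Count references to a currency by name or abbreviation."""
    return len(mention_pattern(*TERMS[currency]).findall(text))
```

Under these rules 'automatic' is not counted as a reference to MATIC, while 'Bitcoin Cash' (with a space) is counted as a reference to Bitcoin but 'BitcoinCash' is not, which is the behaviour described in note [8].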

On the above basis, the relative online prominence scores for the twenty cryptocurrency brand names are shown in Table 2.

  Cryptocurrency      Prominence score
  Bitcoin             10.693[8]
  Ethereum            2.735
  Tether USDt         0.304
  BNB Binance Coin    1.728
  Solana              0.588
  XRP Ripple          0.914
  US Dollar Coin      0.084
  Cardano             0.502
  Avalanche           0.288
  Dogecoin            0.669
  Polkadot            0.200
  TRON                0.097
  Chainlink           0.477
  Toncoin             0.070
  Polygon             0.198
  Shiba Inu           0.191
  Dai                 0.043
  Litecoin            0.741
  Bitcoin Cash        0.193
  Cosmos              0.078

Table 2: Overall online prominence scores for the twenty cryptocurrencies

Figure 1 shows the relationship between these overall prominence scores and the market capitalisation of the cryptocurrencies in question.

Figure 1: (Log-log) plot of overall online prominence score against market capitalisation for the top twenty cryptocurrencies (with best-fit trend line as defined by a power law)

For the cryptocurrency brands, the correlation between online prominence and market capitalisation is striking (correlation coefficient = +0.984). Overall, the cryptocurrencies which have the greatest degree of online presence are those which are most valuable overall. It is also noteworthy that this is a much stronger relationship than was seen in the 'top 100 brands' study, where the brands were in a range of different industry sectors, and it was therefore much more difficult to assemble a set of webpages for analysis on which the brands could be compared on a like-for-like basis (as is much more the case for the cryptocurrency brands in this study).
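
For readers wishing to reproduce this type of comparison, the sketch below fits a power-law trend line (score = coefficient x cap^exponent) by linear regression on the log-transformed values and reports the Pearson correlation between the logs. It is an illustration only: the function names are hypothetical, no market-capitalisation figures from the study are included, and it is an assumption that the quoted coefficient was calculated on the log-transformed values rather than the raw ones.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fit_power_law(caps, scores):
    """Least-squares fit of log10(score) against log10(cap).

    Returns (r, exponent, coefficient), where r is the correlation between
    the logged values and score is approximated by coefficient * cap ** exponent.
    """
    lx = [math.log10(c) for c in caps]
    ly = [math.log10(s) for s in scores]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(lx, ly)) / sum(
        (x - mx) ** 2 for x in lx
    )
    intercept = my - slope * mx
    return pearson(lx, ly), slope, 10 ** intercept
```

Working in log space prevents the largest values (here, Bitcoin) from dominating the fit, which is presumably why the relationship is plotted on log-log axes.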

In future studies, it will be interesting to determine whether similar relationships hold for other groups of brands in different industry verticals. Additionally, the methodology presented in this study will allow trends in prominence over time to be measured on a like-for-like basis; it will be instructive to track how closely the changing popularity of cryptocurrencies over time (and the emergence of new ones) follows variations in their value.

References

[1] https://www.iamstobbs.com/measuring-brand-prominence-of-fashion-brands-ebook

[2] https://www.iamstobbs.com/online-brand-prominence-and-sentiment-ebook

[3] https://coinmarketcap.com/

[4] https://cryptoslate.com/coins/ 

[5] 'crypto', 'cryptocurrency', 'cryptocurrency exchange', 'cryptocurrency invest', 'cryptocurrency market', 'cryptocurrency prices', 'cryptocurrency trading'

[6] Findings are based on searches and analysis carried out on 21-Dec-2023

[7] https://www.iamstobbs.com/google-gemini-ebook

[8] References to 'Bitcoin Cash' will also be counted as references to 'Bitcoin' unless the brand is referenced as 'BitcoinCash' (with no space) or 'BCH'. If we wanted to add a 'correction' to account for this double-counting, the worst-case scenario would be where all references to 'Bitcoin Cash' are double-counted in this way. If this were the case, we can adjust the brand content score for Bitcoin for each page, by subtracting the score for Bitcoin Cash for the page, before calculating the mean. This adjustment gives an overall prominence score for Bitcoin of 10.500 (i.e. a difference of under 2% from the original value) - and, in reality, the most representative score for Bitcoin will actually be somewhere between these two values. In this final study, however, the adjustment has not been applied, since it is also reasonable to make the case that any reference to Bitcoin Cash should also be counted as a reference to Bitcoin, since the name of the former cryptocurrency is derived from that of the latter (https://www.independent.co.uk/tech/bitcoin-cash-cryptocurrency-roger-ver-a8346816.html). 
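
The worst-case adjustment described in note [8] can be checked with a short calculation. Provided both means are taken over the same set of pages, subtracting the Bitcoin Cash score page by page before averaging is equivalent to subtracting the two overall means, which is consistent with the Table 2 figures (10.693 - 0.193 = 10.500). The per-page scores below are hypothetical and are included only to demonstrate the equivalence.

```python
from statistics import mean

# Hypothetical per-page brand content scores (not the study's data).
bitcoin_scores = [12.0, 9.5, 10.0, 11.3]
bitcoin_cash_scores = [0.4, 0.0, 0.2, 0.1]

# Worst-case correction: subtract the Bitcoin Cash score on each page,
# then take the mean across pages.
adjusted = mean(b - c for b, c in zip(bitcoin_scores, bitcoin_cash_scores))

# The same result, obtained as the difference of the two overall means.
difference_of_means = mean(bitcoin_scores) - mean(bitcoin_cash_scores)

# Applying the same identity to the scores reported in Table 2.
reported_adjusted = round(10.693 - 0.193, 3)
relative_change = 0.193 / 10.693  # fraction of the original Bitcoin score
```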

This article was first published on 16 January 2024 at:

https://www.iamstobbs.com/opinion/coining-success-trends-in-the-online-brand-prominence-and-overall-value-of-cryptocurrencies

Friday, 12 January 2024

Searching for Google Gemini: A case study in handling false positives in brand monitoring

BLOG POST

The launch of Google's new artificial intelligence model 'Gemini' on 6 December was one of the most significant recent brand launches, and was followed almost immediately by a spike in associated brand infringements. The findings highlight the importance of a proactive programme of brand monitoring, as part of a wider brand protection (BP) initiative, but monitoring can be difficult in instances where - as in this case - the brand name is a generic term which is frequently used in unrelated contexts (specifically astrology in this instance).

In our latest study, we show how the application of 'positive' (relevance) and 'negative' (exclusion, or non-relevance) keywords used together can achieve an effective separation between relevant and non-relevant results, particularly when also combined with the utilisation of focused search keywords in the brand monitoring configuration.

These ideas are key to the implementation of an efficient BP programme, to avoid spending excessive time on manually reviewing and filtering out false positives, and also to minimise the chances of significant findings being overlooked. It is also worth noting, however, that part of a holistic BP initiative should involve the review of 'borderline' results (which are neither obviously explicitly relevant nor non-relevant), to identify significant third-party brand uses - accordingly, it is essential to bear this point in mind when selecting the search terms to be used and the thresholds to be set for the relevance cut-offs.
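
As a rough illustration of the idea (not the scoring scheme used in the white paper), the sketch below counts 'positive' and 'negative' keyword hits on a page and applies two thresholds, leaving a 'borderline' band for manual review. The keyword lists, threshold values and function names are all hypothetical.

```python
import re

# Hypothetical keyword lists for the Google Gemini example.
POSITIVE = {"google", "deepmind", "ai", "llm", "chatbot"}
NEGATIVE = {"zodiac", "horoscope", "astrology", "constellation"}

def relevance_score(text):
    """Positive-keyword hits minus negative-keyword hits, on whole words."""
    words = re.findall(r"[a-z]+", text.lower())
    hits_pos = sum(word in POSITIVE for word in words)
    hits_neg = sum(word in NEGATIVE for word in words)
    return hits_pos - hits_neg

def classify(text, relevant_at=1, non_relevant_at=-1):
    """Label a page using two cut-offs; anything in between is borderline."""
    score = relevance_score(text)
    if score >= relevant_at:
        return "relevant"
    if score <= non_relevant_at:
        return "non-relevant"
    return "borderline"
```

Moving the two cut-offs further apart sends more pages to manual review, while moving them closer together reduces the review effort at the risk of misclassifying third-party brand uses.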

This article was first published on 12 January 2024 at:

https://www.iamstobbs.com/opinion/searching-for-google-gemini-a-case-study-in-handling-false-positives-in-brand-monitoring

* * * * *

WHITE PAPER

Introduction

The launch of Google DeepMind's new artificial intelligence (AI) model 'Gemini'[1] on 6-Dec-2023 was one of the latest in a line of high-profile (even if somewhat controversial[2,3]) brand launches, and potentially the most significant in the AI arena since ChatGPT[4,5,6]. Within a week of its launch, Gemini was predictably already subject to large numbers of potential brand infringements, including significant numbers of domain registrations utilising the brand name as a means of passing themselves off as official or affiliated sites, misdirecting users to their own content, monetising search-based web traffic, or offering the domain names for sale (Figure 1). In total, 383 domains containing 'gemini' (including 43 containing 'ai', and 16, 22 and 16 containing 'ultra', 'pro' and 'nano' (the three variant Gemini models offered by Google), respectively) were registered in the two days between 6th and 7th December, compared with an average of 7.97 registrations per day across 2023 prior to 6th December[7].

Figure 1: Examples of apparently unofficial live websites hosted on Gemini-specific domain names within a week of brand launch

The generic nature of the Gemini brand name (which is commonly used in the context of astrology) means it presents a number of complications from a brand-protection point of view. Aside from the potential difficulties in securing relevant intellectual property protection and conducting enforcement actions against infringements, it is not even straightforward to monitor for relevant online content which may be of brand relevance or infringing, and to separate out false positives (i.e. mentions which are unrelated to the brand) - although this is clearly of importance when a high-profile brand is launched.

In this study, we consider the use of appropriate search strategies and keyword-based filtering which can be utilised to construct an effective and efficient Internet monitoring programme. The study builds on the concept of brand content scoring (together with ideas relating to search strategies and keyword matching), discussed in previous Stobbs studies of online prominence and sentiment[8,9] as a means of quantifying the amount and prominence of brand- or keyword-related content on a webpage.

Analysis

In the initial part of the study, we consider simply the first page of search results (94 URLs) returned by google.com in response to the search term 'gemini'[10]. As might be expected, this returns a mixture of content, including material relating to the Google Gemini brand, content relating to the term in an astrological context, and other webpages, including third-party usage of the same brand name in a way which may or may not be infringing.

In general, it is often desirable to be able to separate out these categories of content. This is necessary not only in general brand monitoring, where it is primarily only brand-related or potentially infringing content which is of interest, but also in (for example) studies of comparative online prominence, which are generally most meaningful if 'false positive' references can be excluded.

The simplest way to carry out the filtering is via the use of 'positive filtering' (relevance) or 'negative filtering' (exclusion) keywords. (Note that, in this context, we are not using the terms 'positive' and 'negative' to denote sentiment!) In this study, we consider the page to be likely to be relevant to the Gemini brand if it also contains any of the following (company- or industry-specific) ('positive') keywords:

  • DeepMind
  • Google
  • AI*
  • A.I.*
  • LLM* (Large Language Model)
  • MMLU* (Massive Multitask Language Understanding)
  • GPT*
  • ChatGPT

*terms marked with an asterisk must appear in isolation, or prefixed or suffixed by characters other than letters - e.g. we do not want to consider words where (for example) 'ai' appears as a sub-string (such as 'traits')
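The matching rule for the asterisked terms can be sketched in Python as below. This is a minimal illustration rather than the tooling used in the study: the function names, the case-insensitive matching, and the use of ASCII letters as the 'letter' class are all assumptions.

```python
import re

# Keywords from the list above; the starred terms must not be embedded in longer words.
PLAIN_TERMS = ["DeepMind", "Google", "ChatGPT"]
ISOLATED_TERMS = ["AI", "A.I.", "LLM", "MMLU", "GPT"]


def build_pattern(term: str, isolated: bool) -> re.Pattern:
    """Compile a case-insensitive pattern for one keyword."""
    core = re.escape(term)
    if isolated:
        # Reject matches with a letter immediately before or after the term.
        core = rf"(?<![A-Za-z]){core}(?![A-Za-z])"
    return re.compile(core, re.IGNORECASE)


PATTERNS = {t: build_pattern(t, False) for t in PLAIN_TERMS}
PATTERNS.update({t: build_pattern(t, True) for t in ISOLATED_TERMS})


def find_positive_keywords(text: str) -> dict[str, int]:
    """Return the number of matches of each keyword that occurs in the text."""
    counts = {term: len(p.findall(text)) for term, p in PATTERNS.items()}
    return {term: n for term, n in counts.items() if n > 0}
```

With this rule, 'ai' inside 'traits' and 'GPT' inside 'ChatGPT' are not counted as isolated matches, whereas 'GPT-4' is, because the character following the term is not a letter.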

Similarly, we can construct a list of 'negative' keywords, which are likely to be present only if the Gemini name is mentioned in an explicitly non-relevant context (i.e. astrology):

  • Capricorn
  • Aquarius
  • Pisces
  • Taurus
  • Scorpio
  • Sagittarius
  • 'astrolog' (covers wildcard variants such as 'astrology' and 'astrologer')
  • zodiac
  • Castor
  • Pollux (Castor and Pollux are the 'twin' stars in the Gemini constellation)

It is advisable to select these keywords carefully to avoid including any terms which are more generic (and can appear in unrelated contexts) or which can appear as sub-strings of longer words (if wildcard matching is used)[11].

Rather than simply treating pages as either relevant or non-relevant based on the appearance of any mention of a corresponding keyword, we adopt the approach of calculating the webpage content score for each of the above keywords. The sum of the scores of the 'positive' keywords thereby provides a measure of the likelihood of the page being relevant, and the sum of the scores of the 'negative' keywords a measure of the likelihood of it being non-relevant. A useful measure of overall potential relevance is then the difference between the total 'positive' (relevance) score and the total 'negative' (non-relevance) score.
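The net calculation can be sketched as follows. The webpage content score used in the study reflects both the amount and the prominence of a term on a page and is not defined in this article; as a clearly labelled assumption, the sketch substitutes a raw occurrence count. The function names and the shortened keyword lists are illustrative only.

```python
import re

POSITIVE = ["deepmind", "google", "chatgpt"]              # subset, for brevity
NEGATIVE = ["zodiac", "astrolog", "capricorn", "pisces"]  # subset, for brevity


def content_score(text: str, keyword: str) -> int:
    """Stand-in content score: a case-insensitive occurrence count.

    ASSUMPTION: the study's real score also weights prominence on the page;
    that weighting is not reproduced here.
    """
    return len(re.findall(re.escape(keyword), text, flags=re.IGNORECASE))


def potential_relevance(text: str) -> int:
    """Total 'positive' score minus total 'negative' score."""
    positive_total = sum(content_score(text, k) for k in POSITIVE)
    negative_total = sum(content_score(text, k) for k in NEGATIVE)
    return positive_total - negative_total
```

A page mentioning both sets of terms in equal measure scores zero under this sketch, which is why a net score near zero is treated as 'borderline' rather than as non-relevant.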

There are a couple of specific points to note with this approach:

  1. We would not want to include or exclude a page purely on the basis of any mention of a keyword, as these terms can sometimes appear in their 'opposite' context. For example, we may find some pages primarily relating to Google Gemini but which also feature minor references to astrology[12] (such as may occur if a page makes reference to the fact that Gemini is named after the astrological star sign). Conversely, a relevance keyword may appear on an otherwise non-relevant site (particularly with a term such as 'Google' which can appear in links, advertisements, tracking functionality, etc. on websites).

  2. We do not necessarily want to set the 'threshold' at which we consider a page to be relevant / non-relevant at a net score of zero. As described above, some relevant pages may feature 'negative' keywords and, overall, some of the most significant pages (particularly when considering third-party uses of the same brand name) may yield scores of around zero, particularly when pertaining to less directly relevant business areas[13]. Accordingly, when collecting a set of pages for consideration from a brand analysis point of view, it is likely to be advisable to retain all pages down to a (small) negative score. Additionally, the zero-point is somewhat arbitrary, particularly if the analysis is utilising differing numbers of 'positive' and 'negative' keywords.

Overall, this technique does provide a good means of filtering by relevance; the top five (i.e. most likely to relate to Google Gemini) and bottom five (i.e. most likely to relate to astrology) pages, as ranked by overall potential relevance score, are shown in Tables 1 and 2.

  Page title                                                     Webpage host domain    Potential relevance score

  Google launches Gemini, the AI model it hopes will take ...    theverge.com           444
  Google Gemini Vs OpenAI ChatGPT: What's Better?                businessinsider.com    359
  Google I/O 2023: Making AI more helpful for everyone           blog.google            358
  Google says new AI model Gemini outperforms ChatGPT in ...     theguardian.com        322
  Google's New AI, Gemini, Beats ChatGPT In 30 Of 32 Test ...    forbes.com             319

Table 1: Top five pages by potential relevance score 

  Page title                                                     Webpage host domain    Potential relevance score

  Gemini Zodiac Sign: Characteristics, Dates, & More             astrology.com          -248
  Gemini Zodiac Sign: Horoscope, Dates & Personality Traits      zodiacsign.com         -232
  The Gemini - Zodiac Sign Dates and Personality                 thoughtcatalog.com     -200
  Gemini Personality Traits - The Times of India                 indiatimes.com         -186
  All About Gemini                                               tarot.com              -174

Table 2: Bottom five pages by potential relevance score

By manual inspection, we find that - in general - all pages with a potential relevance score below approximately -20 (negative 20) in this particular dataset are non-relevant (i.e. primarily about astrology); the remainder would be potential candidates for further analysis from a brand-monitoring point of view (i.e. are potentially relevant) - this equates to 73 of the 94 results returned by the search query (i.e. 78%).
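The cut-off described above can be sketched as a simple partition. The function name is hypothetical, and the default of -20 is the value found by manual inspection for this particular dataset, not a general constant; the sample scores are taken from Tables 1 and 2 and reference [13].

```python
def split_by_threshold(scores: dict[str, int], cutoff: int = -20):
    """Partition pages into (retained, discarded) by potential relevance score.

    Pages scoring below `cutoff` are discarded as likely non-relevant; all
    others, including those with small negative scores, are kept for review.
    """
    retained = {page: s for page, s in scores.items() if s >= cutoff}
    discarded = {page: s for page, s in scores.items() if s < cutoff}
    return retained, discarded


# Keys are host domains; values are potential relevance scores.
sample = {
    "theverge.com": 444,
    "gemini.com": 1,
    "tarot.com": -174,
    "astrology.com": -248,
}
retained, discarded = split_by_threshold(sample)
```

Here the cryptocurrency site at gemini.com, with a score of 1, is retained for review alongside the clearly relevant page, while the two astrology pages fall below the cut-off.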

An alternative way of visualising this categorisation is to manually classify each of the 94 pages as definitively relevant (i.e. relating to Google Gemini), definitively non-relevant (i.e. relating to a false positive (astrology)), or 'neutral' (i.e. potential third party / other references to the Gemini name), and plot the set of pages according to their potential relevance score. This relationship is shown in Figure 2, highlighting that the categorisation by the use of potential relevance score is relatively 'clean' (i.e. the relevant pages generally appear at the top (with high potential relevance scores) and the non-relevant pages at the bottom).

Figure 2: Relationship between potential relevance score and manual categorisation of actual relevance, for each of the pages in the dataset

(N.B. the horizontal axis shows the brand content score for the term 'gemini' in each case which, in itself, is not a helpful basis for categorisation, since the term can appear in either relevant or non-relevant contexts)

It is important to note that this approach (i.e. considering the balance between the 'positive' and 'negative' keywords) can be seen to provide a better ('cleaner') separation of relevant from non-relevant results than just using either set of keywords in isolation (i.e. applying only a 'positive filtering' or only a 'negative filtering' approach, for the reasons discussed in point (1.) above). This comparison is shown in Figures 3 and 4.

Figure 3: Relationship between total 'positive' (relevance) score only and manual categorisation of actual relevance, for each of the pages in the dataset

Figure 4: Relationship between total 'negative' (non-relevance) score only and manual categorisation of actual relevance, for each of the pages in the dataset

Furthermore, it is possible to take this approach (i.e. the application of techniques to filter down a set of 'raw' results to the subset most likely to be relevant) one step further, through the explicit use of relevant (focused) search terms (i.e. the 'positive' keywords). In order to demonstrate this, we construct a series of searches in which the Gemini name is combined in turn with each of the 'positive' keywords, and then again extract the first page of results returned by google.com in each case - i.e. we search for:

  • gemini deepmind
  • gemini google
  • gemini ai

etc.

Once the results are de-duplicated (since the same URL may, in general, be returned by more than one of the above search terms), this yields a dataset of 486 distinct URLs.
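The construction of the focused queries and the de-duplication step can be sketched as below. No search API is called: the result lists are assumed to have been collected already, and the helper names and the URL normalisation rule (ignoring host case, fragments and trailing slashes) are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

BRAND = "gemini"
POSITIVE_KEYWORDS = ["deepmind", "google", "ai", "a.i.", "llm", "mmlu", "gpt", "chatgpt"]


def build_queries(brand: str, keywords: list[str]) -> list[str]:
    """Combine the brand name with each relevance keyword in turn."""
    return [f"{brand} {kw}" for kw in keywords]


def normalise(url: str) -> str:
    """ASSUMPTION: URLs differing only in host case, fragment or trailing
    slash are treated as the same page."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def deduplicate(result_sets: dict[str, list[str]]) -> list[str]:
    """Merge per-query result lists, keeping the first occurrence of each URL."""
    seen = set()
    unique = []
    for urls in result_sets.values():
        for url in urls:
            key = normalise(url)
            if key not in seen:
                seen.add(key)
                unique.append(url)
    return unique
```

A stricter or looser normalisation rule would change the number of distinct URLs obtained, so the rule chosen should be recorded alongside any reported dataset size.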

In this case, the 'hit-rate' of relevant results is (unsurprisingly) much higher - none of these URLs yields a potential relevance score below zero, and - by manual inspection - all appear potentially relevant, with none of the results appearing to relate primarily to astrology (Figure 5). However, what this dataset may not encompass is usage of the Gemini name by third parties in industries outside AI; if this type of content is of interest, this consideration must be borne in mind when selecting the search queries for a programme of brand monitoring.

Figure 5: Comparison of the spread of potential relevance scores across the sets of pages returned by an unfocused search query (brand name only) and by focused queries (brand name plus relevance keywords)

It is also worth noting that this type of relevance filtering is also a consideration when measuring the online prominence of brands. If, for example, we just calculate the average brand content score for the term 'Gemini' across the sets of pages returned in each case, we obtain values of 107 for the unfocused dataset and 81 (i.e. actually a lower value) for the focused dataset, on which far more of the pages are actually relevant and relate to the brand in question. Of course, the difference is that many of the references to 'Gemini' in the former dataset will be unrelated to the brand in question (in many cases, referring to astrology), but this would not necessarily be apparent if purely the raw numbers were considered (see also Figure 2).

Key take-aways

In a brand-monitoring context, where the brand name itself is a generic term, the use of 'positive filtering' (relevance) and 'negative filtering' (exclusion) keywords together provides an effective means of separating relevant from non-relevant pages. The approach applies content scoring (a means of quantifying the amount and prominence of mentions of a particular term on a page - or the extent to which a page is 'about' that term) to these keywords, rather than to the brand name.

When combined with the application of relevance keywords in the search queries used to generate results, this approach is an efficient way of collecting a sample of pages relevant to a particular brand, and minimises the amount of analysis time required to filter out false positives. However, in many brand-monitoring contexts, pages which are essentially 'neutral' in character (e.g. relating to third-party use of the same brand name in potentially separate business areas) can be of interest, and it is therefore necessary to carefully select the search terms and scoring thresholds used, so as to avoid missing potentially significant findings.

Overall, however, the ideas presented in this study are key to the configuration of an efficient brand-monitoring solution which can effectively exclude false positives. Without these approaches, the sets of results identified through automated technologies can be dominated by non-relevant findings, which not only increases the cost of a service from the point of view of the amount of resource required to review the results, but can also lead to an increased possibility of relevant findings being overlooked. When combined with a content-scoring approach to prioritise the remaining (relevant) findings (in order to identify priority targets for more in-depth analysis, content tracking, or enforcement), a highly efficient brand-protection programme can be implemented.

References

[1] https://deepmind.google/technologies/gemini/ - Interestingly, a good example of a 'dot-brand' domain for an official website

[2] https://www.theverge.com/2023/12/7/23992737/google-gemini-misrepresentation-ai-accusation

[3] https://www.bloomberg.com/news/newsletters/2023-12-07/google-s-demo-for-chatgpt-rival-criticized-by-some-employees

[4] https://finance.yahoo.com/news/google-debuts-powerful-gemini-generative-ai-model-in-strike-at-openai-microsoft-150025435.html

[5] https://www.zdnet.com/article/what-is-google-gemini/

[6] https://www.forbes.com/sites/chriswestfall/2023/12/12/googles-new-ai-gemini-beats-chatgpt-in-30-of-32-test-categories/?sh=4e3f73566c80

[7] Considering only gTLD registrations as present in zone-files available via ICANN's CZDS service as of 13-Dec-2023, and where whois information is available via an automated look-up

[8] https://www.iamstobbs.com/measuring-brand-prominence-of-fashion-brands-ebook

[9] https://www.iamstobbs.com/online-brand-prominence-and-sentiment-ebook

[10] Results are based on searches and analysis carried out on 13-Dec-2023

[11] For example, in an early test, the word 'aries' (subsequently excluded from the list) was identified on a webpage (https://www.ft.com/content/e5cc4e36-efe3-4491-b435-75a712533257) featuring an article about Google Gemini, within a link on the page to the 'obituaries' section of the website

[12] e.g. https://www.news9live.com/technology/googles-deepmind-debuts-gemini-ai-model-to-compete-with-openais-gpt-4-2370930, which also features a link to the 'astrology' section of their website

[13] A good example is https://www.gemini.com/ (actually the top ranked result in the page of Google results), which yields an overall potential relevance score of 1. The site offers a cryptocurrency trading service, and it is unclear even on initial manual inspection whether it is passing off as being related to the Google Gemini brand, or whether it is using the same name as an unrelated third party. Even if so, further analysis would be needed to determine whether it is infringing Google IP, based on factors such as the trademark classes in which protection is held, geographical presence, and pre-existence.

This article was first published as an e-book on 12 January 2024 at:

https://www.iamstobbs.com/google-gemini-ebook

Wednesday, 10 January 2024

A review of the current state of the new-gTLD programme - Part 2: Dot-brands

Our previous overview of the new-gTLD (domain extension) programme[1] comprised a top-level summary of the overall landscape; in this follow-up, we take a deeper dive into the set of dot-brand extensions.

Dot-brands are a special class of restricted new-gTLDs, where a brand owner has been granted the responsibility of overseeing the infrastructure of their own, brand-specific domain extension, with examples including .barclays and .bmw. This can be an attractive prospect for an organisation, as it gives them full control over all domains registered across the extension in question[2] and - providing they host all official websites on this branded extension and can successfully educate their customer base that this is the case - can make it a much more complex prospect for fraudsters to create fake sites. However, it is a costly enterprise to apply for (with just the initial evaluation fee reaching $185,000[3]) and run a dot-brand TLD, whereby the brand owner needs to act as a domain registry in their own right, requiring extensive investment in the necessary technological infrastructure[4]. There are also other considerations, such as the requirements to 'rebuild' search-engine rankings when switching over to a new corporate domain name[5]. Overall, these facts have led to a number of dot-brand applicants subsequently discontinuing use. Nevertheless, for large corporates, it is a possibility worth considering, particularly in view of the new round of gTLD applications set to launch in 2026[6].

As of December 2023, there are 421 dot-brand extensions currently delegated[7,8] (i.e. added into the highest level of Internet infrastructure, the Root Zone, and therefore available for use). A significant study of the extent of their utilisation has been carried out by management consultancy DOTZON, in the sixth edition of their Digital Company Brands report[9]. This study considers factors such as the number of registered domains on each extension, the extent of use for e-mail communication, the proportion of domains resolving to live sites, and other search-engine optimisation statistics, to calculate a metric of the 'digitalness' of each company and its dot-brand TLD[10]. It is significant to note that European entities, particularly in Germany and France, feature highly in the list of dot-brand adopters, with insurance and finance as the top industries[11].

A simple piece of analysis is just to consider the total number of registered domains across each of the dot-brand extensions (where data is available, i.e. for 334 of the extensions)[12]. The top extensions, by numbers of domains, are shown in Figure 1.

Figure 1: Top dot-brand extensions, by number of registered domains (where data available)

The top ten are listed in Table 1.

  TLD          Owner                                                      No. domains

  .ovh         OVH                                                        94,090
  .quest       Quest Software                                             38,918
  .dvag        Deutsche Vermögensberatung                                 5,693
  .kred        KredTLD                                                    3,158
  .giving      Giving Limited                                             1,845
  .mma         MMA IARD                                                   1,667
  .allfinanz   Allfinanz Deutsche Vermögensberatung Aktiengesellschaft    1,264
  .crs         Federated Co-operatives (Co-Operative Retailing System)    1,127
  .leclerc     E.Leclerc                                                  1,109
  .gmx         1&1 Mail & Media (GMX, Global Message Exchange)            949

Table 1: Top dot-brand extensions, by number of registered domains (where data available)

Overall, 48 of the extensions analysed have more than 100 domains registered. Of the dot-brands with fewer than this number, there is a significant proportion with only very small numbers of domains, as shown in Figure 2.

Figure 2: Number of dot-brand extensions with each number (between 1 and 100) of registered domains (where data available)

In total, 160 of the extensions analysed (48%) have ten registered domains or fewer.

Of course, the total number of registered domains on a dot-brand extension is not, in itself, a measure of how 'well' a brand owner is utilising the extension; it would be perfectly valid for a brand owner just to use (say) 'www.[brand]' as their sole corporate website, and nothing else (and, arguably, this presents the least risk of customer confusion). Nevertheless, a dot-brand extension does provide a number of compelling use-cases for brand owners, including possibilities for region-, product-, or corporate-division-specific sub-sites. For example, .mma and .allfinanz use subdomain names for individual financial consultants, and .bmw and .audi do likewise for specific dealerships. Studies from the last few years have consistently found that around three-quarters of dot-brand domains tend to resolve to active websites[13].

It is also informative to look at the most frequently used second-level domain names (SLDs, i.e. the part of the domain name to the left of the dot), across the dot-brand landscape. The top ten are shown in Table 2.

  SLD        No. instances

  www        63
  home       61
  go         42
  my         39
  mail       38
  careers    34
  global     33
  api        32
  cloud      31
  jobs       28

Table 2: Most frequently used second-level domain names (for dot-brands where data available)
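A count of this kind can be sketched as follows, assuming the input is a list of domain names extracted from the dot-brand zone files. The function names are hypothetical, and counting each (SLD, TLD) pair only once is an assumption about how repeated or deeper zone-file entries should be handled.

```python
from collections import Counter


def split_domain(domain: str) -> tuple[str, str]:
    """Return (SLD, TLD), e.g. ('home', 'bmw') for 'home.bmw'."""
    labels = domain.lower().rstrip(".").split(".")
    if len(labels) < 2:
        raise ValueError(f"not a registered domain: {domain!r}")
    return labels[-2], labels[-1]


def top_slds(domains, n=10):
    """Rank SLDs by the number of distinct extensions on which they appear."""
    # Count each (SLD, TLD) pair once, so that deeper names such as
    # 'www.home.bmw' do not inflate the figure for 'home'.
    pairs = {split_domain(d) for d in domains}
    return Counter(sld for sld, _ in pairs).most_common(n)
```

For example, 'home.bmw', 'home.audi' and 'www.home.bmw.' together contribute two instances of 'home', since the last entry repeats an SLD and extension already seen.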

These trends are largely unchanged from a similar study carried out five years earlier[14], from which the top five SLDs all still appear in the current top six (though now with the addition of 'go').

A special case is the two-character SLD - two-character strings are often used to denote country codes, and in the case of the SLD for a dot-brand extension, can be used to create regional sub-sites. The top ten are shown in Table 3.

  SLD    No. instances

  go     42
  my     39
  id     21
  it     17
  de     17
  ai     17
  uk     16
  us     16
  in     14
  ru     13

Table 3: Most frequently used two-character SLDs (for dot-brands where data available)
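Because a two-character label can be both a country code and a readable word, a first-pass classification can be useful before manual review. The sketch below is illustrative only: the two sets are small hand-picked subsets covering the labels in Table 3 (an assumption, not a complete ccTLD list or dictionary), and the function name is hypothetical.

```python
# Illustrative subsets only (ASSUMPTION): a real check would use the full
# ccTLD list and a proper word list.
CCTLD_CODES = {"de", "uk", "us", "ru", "it", "in", "id", "my", "ai"}
READABLE_WORDS = {"go", "my", "id", "it", "in", "ai"}


def classify_two_char(sld: str) -> str:
    """Label a two-character SLD as 'country', 'keyword', 'ambiguous' or 'unknown'."""
    label = sld.lower()
    if len(label) != 2:
        raise ValueError("expected a two-character label")
    is_code = label in CCTLD_CODES
    is_word = label in READABLE_WORDS
    if is_code and is_word:
        return "ambiguous"  # needs manual review of the sites themselves
    if is_code:
        return "country"
    if is_word:
        return "keyword"
    return "unknown"
```

Labels returned as 'ambiguous' (such as 'my' or 'ai') cannot be resolved from the string alone; the content of the sites concerned would need to be inspected to determine the intended meaning.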

Many of these (particularly de, uk, us, ru) are likely to be used most frequently to refer to their respective countries (Germany, UK, US, Russia), whilst others ('go', 'my') are more likely to be used as readable keywords. It is also striking that 'ai' appears in the top ten, most likely a reflection of the growing popularity of artificial intelligence (AI) technologies, and their adoption by major corporations. 

In general, insight into these frequently used keywords can be beneficial to organisations potentially looking to build out a dot-brand presence, providing guidance on common trends used to make websites navigable and avoid customer confusion, and also potentially to assist with conventions which may help to improve search-engine optimisation strategies. 

References

[1] https://www.iamstobbs.com/opinion/expert-.watches-.new-.online-.website-.news-.lol-a-review-of-the-current-state-of-the-new-gtld-programme

[2] https://icannwiki.org/Brand_TLD

[3] https://newgtlds.icann.org/en/applicants/global-support/faqs/faqs-en

[4] https://circleid.com/posts/20200822-why-you-should-not-apply-for-a-dot-brand-new-gtld

[5] https://circleid.com/posts/20160927_seo_and_new_dot_brand_gtld

[6] https://www.iamstobbs.com/opinion/the-new-new-gtlds

[7] https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

[8] https://www.iana.org/domains/root/db

[9] https://dotzon.consulting/studien/

[10] https://circleid.com/posts/20231130-dotzon-study-digital-company-brands-2023; this study gives the top ten dot-brand TLDs as: leclerc, schwarz, audi, weber, mma, google, abbott, cern, lundbeck and allfinanz.

[11] https://www.worldipreview.com/contributed-article/global-adoption-of-dotbrand-domains

[12] This analysis is based on the versions of the ICANN zone-files downloaded on 04-Dec-2023. Analysis is only possible in cases where the respective zone-file is generally publicly available, which was not the case for 87 of the 421 extensions; however, zone-files were available for all of DOTZON’s top TLDs, with the exception of .audi.

[13] https://www.cscdbs.com/en/resources-news/dot-brand-report/

[14] https://www.cscglobal.com/cscglobal/pdfs/DBS/CSC_Dot_Brand_Insights_Report_Jun2018.pdf

This article was first published on 10 January 2024 at:

https://www.iamstobbs.com/opinion/a-review-of-the-current-state-of-the-new-gtld-programme-dot-brands
