Monday, 11 November 2024

Phishing trends 2024 - and a look at some new data for domain threat quantification

Overview

This year's annual phishing report by Internet technology consultants Interisle[1] has provided a number of key insights into the current state of the phishing landscape[2]. Phishing - that is, the use of websites to impersonate a brand or other trusted entity with a view to stealing personal details, financial information or funds, often as a 'gateway' for subsequent further cybercrime - continues to be a popular model for online criminals. This is a significant concern for brand owners and consumers alike. Of the main findings from the report, some of the most significant are:

  • The number of phishing attacks continues to see year-on-year growth, having increased by around 50,000 to 1.9 million incidents, with an estimated financial loss of $12.5 billion. The top three most targeted brands were Facebook, Gazprom (a Russian energy corporation), and the United States Postal Service (USPS).
  • 1 million unique domain names were utilised in the identified set of phishing attacks, though with a decrease in popularity in the use of domain names containing (an exact match to) the name of the targeted brand, in part probably due to the ease in detection of such names. The majority of attacks take place on specifically maliciously registered domains, rather than on compromised sites.
  • Other styles of attack have also increased, notably subdomain-based attacks (i.e. where the phishing-site URL takes the form [brand-string].[domain].TLD, using e.g. blogspot.com, duckdns.org or weebly.com), which account for nearly one-quarter of all cases; attacks where the brand string appears elsewhere in the URL; and the use of the InterPlanetary File System (IPFS) - a Web3 P2P-based technology - to host phishing content (most usually through a Web2-based 'gateway' provider such as dweb.link or ipfs.io).
  • New-gTLD extensions continue to be popular for domains used for phishing (42% of cases), primarily due to the low-cost and ease of registration (i.e. fewer verification checks). ccTLDs have seen a drop in fraudulent usage – to a significant extent, as a result of the exit of Freenom (the former provider of domains on the .tk, .ml, .ga, .cf and .gq extensions) from the registrar business[3], following the termination of their ICANN agreement[4] in response to reports of extensive criminal domain use. Overall, the most common TLDs used for phishing sites in the analysis period were .com, .top, .xyz, .cn, and .info, though when normalised to reflect the numbers of phishing sites as a proportion of the total domains across the extension in question, the highest-risk TLDs were found to be .lol, .bond, .support, .top, and .sbs.
  • Bulk / automated registration of domains has increased in popularity as a methodology used by phishers, accounting for over one-quarter of all phishing-related domains. These most usually make use of strings of random characters or random combinations of dictionary words. The largest set of such domains used for a coordinated set of attacks was a group of over 17,000 domains generally consisting of eight-letter (second-level domain name, or SLD) random strings, such as gzraxywl.lol and htcjkpzb.lol.
  • The set of gTLD registrars most frequently associated with domains used for phishing content continues to be dominated by retail-grade providers, with the top five found to be NameSilo, GoDaddy, GMO d/b/a Onamae, PublicDomainRegistry, and NameCheap. Normalising the figures by the total numbers of domains under management, the top five most frequently abused are found to be NiceNIC, URL Solutions, Aceville, WebNic, and OwnRegistrar. The first of these has seen exceptional levels of abuse, with 45% of its gTLD portfolio reported for phishing.

A new basis for quantifying domain-name threat?

The use of fixed-length random strings for phishing domains (as mentioned above in the case of the .lol examples) raises the possibility for a new methodology for identifying such strings, clustering together related findings, and providing an additional input into general algorithms for quantifying the potential level of threat posed by registered domain names[5].

Previous studies[6,7] have explored the use of a metric known as domain name entropy - essentially, a measure of the number and variability of characters within the domain-name string - as an indicator of automated registrations. However, although this idea may be useful in cases where the registration scripts generate very long domain names, it is not really very effective for the shorter names described here. This is because a string such as 'gzraxywl' will have an identical entropy value to any other string consisting of eight distinct characters, including dictionary words (as may correspond to other / legitimate registrations). Instead, it may be preferable to make use of phonotactic analysis. This concept has previously been explored in the context of identifying unregistered domains which may be attractive from a 'brandability' point of view[8]. In that case, strings producing a low 'phonotactic violation' score[9] (i.e. those which are most readable or 'word-like') are preferred. Conversely, however, when identifying the (pseudo-)random strings generated by automated registration scripts, those producing the highest scores may be the most likely candidates.
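The limitation of entropy for short strings can be illustrated with a minimal sketch. It assumes that 'entropy' is taken to mean the Shannon entropy of the character distribution within the SLD (the cited studies may use a differently normalised formulation); the function name and the comparison word 'keyboard' are illustrative choices only.

```python
import math
from collections import Counter


def shannon_entropy(label: str) -> float:
    """Shannon entropy (bits per character) of the characters in a string.

    Assumption: this is one common formulation of 'domain name entropy';
    it is not necessarily the exact metric used in the cited studies.
    """
    counts = Counter(label)
    total = len(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# A random-looking SLD from the report and an ordinary dictionary word,
# both consisting of eight distinct characters.
random_sld = "gzraxywl"
dictionary_word = "keyboard"

print(shannon_entropy(random_sld))
print(shannon_entropy(dictionary_word))
```

Both strings contain eight distinct characters, so the two values are identical, which is why a character-frequency measure alone cannot separate this style of random string from a legitimate word-based name.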

As an example, I consider the set of 8-character alphabetical .lol domains (i.e. the dataset including the examples referenced previously). As of October 2024, there are 78,446 such domains. The distribution of phonotactic scores across this dataset is shown in Figure 1.

Figure 1: Distribution of phonotactic violation scores across the set of 8-character alphabetic .lol domains

These scores range up to a value of 73.06 (tlbtwxil.lol), with the remainder of the top five found to be mslpjbpw.lol (66.20), rfmtgliz.lol (66.13), pzvuznnj.lol (64.73), nzktgzhv.lol (64.57) (noting that 7,275 domains in the dataset do not generate a valid score - shown as the bar at a value of -1 in Figure 1 - many of which will also be random or pseudo-random strings).

Considering the top 1,000 domains (all of which achieve scores greater than 33 and do comprise strings which appear visually random), 983 are privacy-protected domains registered through GMO Internet Group Inc. d/b/a Onamae.com (one of the high-threat registrars referenced above) with alidns.com nameservers, and all were registered between 19-Mar-2024 and 08-Aug-2024. Within this set, there are some even more obvious (sub-)clusters, with 50 domains all registered on 23-Apr, 55 on 22-May, 558 on 18-Jul, 52 on 31-Jul, and 261 on 08-Aug (Figure 2). It seems highly likely that these groups do indeed represent coordinated registration events by one or more specific entities. The group of 08-Aug registrations do not, as of the date of analysis (27-Oct), generally resolve to any live site content, but it is not uncommon for phishing sites to be used for just a short period of time before being deactivated.

Figure 2: Distribution of registration dates for the top 1000 8-character alphabetic .lol domains (by phonotactic violation score), by registrar
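The grouping of high-scoring domains by registrar and registration date can be sketched as follows. This is an illustration only: the records are hypothetical placeholders (not the domains from the analysis), and the function name and the minimum cluster size of 50 are assumptions rather than part of the original methodology.

```python
from collections import Counter
from datetime import date


def registration_clusters(records, min_size=50):
    """Count domains per (registrar, registration date) pair and keep
    only the groups large enough to suggest a bulk registration event.

    `records` is an iterable of (domain, registrar, created) tuples.
    The threshold `min_size` is an illustrative assumption.
    """
    counts = Counter((registrar, created) for _domain, registrar, created in records)
    return {key: n for key, n in counts.items() if n >= min_size}


# Hypothetical placeholder data: one bulk batch and a few unrelated names.
records = (
    [(f"batch{i:04d}.example", "Registrar A", date(2024, 7, 18)) for i in range(60)]
    + [(f"other{i:04d}.example", "Registrar B", date(2024, 3, 2)) for i in range(3)]
)

for (registrar, created), size in registration_clusters(records).items():
    print(registrar, created.isoformat(), size)
```

In practice the same grouping could be extended with further shared attributes mentioned above (such as nameserver provider or privacy-protection status) to tighten the clusters.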

Summary and key points

The statistics highlight the significant continuing scale of phishing activity, and the importance of proactive programmes of monitoring and enforcement by brand owners. The apparent evolution in methodology by infringers, away from a basis of the use of branded domain names, shows that monitoring needs to encompass not only domain monitoring (covering exact matches and brand-name variants) but must also address general Internet content and make use of additional data sources (such as spam traps, webserver log monitoring and customer abuse reports). This is especially true given the mix of TLDs utilised in phishing domains, some of which may not have zone-file data readily available.

Analysis of the TLDs which are popular with infringers also serves other purposes, including:

  1. Helping to inform domain registration policies for brand owners[10], as part of an initiative to secure key brand terms across high-risk extensions as defensive registrations, in order to prevent them being registered and utilised by fraudsters.

  2. Informing the construction of algorithms to assess the likely future level of threat which may be posed by newly identified domain registrations[11]. Similar comments are also true regarding intelligence on those registrars which are most commonly associated with abusive registrations, especially in view of the 2024 amendments to registrars' obligations to implement more robust Domain Name System (DNS) abuse mitigation, including the suspension of domains and disabling of phishing websites[12,13].

  3. Enhancing algorithms (via the use of phonotactic analysis techniques) for quantifying domain threat and clustering together related results - which can itself help to lend efficiency to the overall takedown process.

References

[1] https://interisle.net/insights/phishing-landscape-2024-an-annual-study-of-the-scope-and-distribution-of-phishing

[2] https://www.linkedin.com/pulse/phishing-2024-what-domain-owners-brands-need-know-forum-adr-mxx3c/

[3] https://web.archive.org/web/20240213203456/https://www.freenom.com/en/freenom_pressstatement_02122024_v0100.pdf

[4] https://www.icann.org/uploads/compliance_notice/attachment/1219/hedlund-to-zuubier-9nov23.pdf

[5] see also 'Patterns in Brand Monitoring' by D.N. Barnett (Business Expert Press, 2025), Chapter 5: 'Prioritisation criteria for specific types of content'

[6] https://www.linkedin.com/pulse/investigating-use-domain-name-entropy-clustering-results-barnett/

[7] https://circleid.com/posts/20230703-an-overview-of-the-concept-and-use-of-domain-name-entropy

[8] https://circleid.com/posts/20240903-unregistered-gems-identifying-brandable-domain-names-using-phonotactic-analysis

[9] https://linguistics.ucla.edu/people/hayes/BLICK/

[10] https://www.iamstobbs.com/opinion/strategies-for-constructing-a-domain-name-registration-and-management-policy

[11] https://circleid.com/posts/20230117-the-highest-threat-tlds-part-2

[12] https://www.icann.org/resources/pages/global-amendment-2024-en

[13] see also 'Patterns in Brand Monitoring' by D.N. Barnett (Business Expert Press, 2025), Chapter 1: 'Overview of online brand protection'

This article was first published on 11 November 2024 at:

https://www.iamstobbs.com/opinion/phishing-trends-2024-and-a-look-at-some-new-data-for-domain-threat-quantification

Thursday, 7 November 2024

"It’s beginning to look a lot like...": Domain patterns in the approach to the holiday shopping season 2024

Introduction 

As we approach the start of this year's holiday shopping season, dominated by the Chinese-focused Singles' Day (11/11) and the western Black Friday and Cyber Monday events (this year on 29-Nov and 02-Dec) - but also including platform-specific promotions such as Amazon Prime Day(s) and the general ramp-up in spending towards December - we conduct a revisit of last year's analysis[1] looking at the registration of related domain names.   

The holiday shopping period provides an opportunity for brand owners and infringers alike, to take advantage of increased levels of online spend and related searches to drive consumers to their own content. As part of this initiative, many will register specific domain names related to the events in question, and in this study, we consider the landscape of such domains.  

Landscape data overview and deep-dive 

As of the date of analysis (11-Oct-2024), zone-file searches revealed the existence of 6,667 active registered domains with names containing 'black(-)friday', 'cyber(-)monday' or 'singles(-)day' (hereafter referred to as 'holiday shopping' domains). The analysis also focuses only on gTLD domains, likely to be most relevant to the landscape of potential infringements (in view of their typical lower cost and lower levels of registration restrictions). Of these, 519 were disregarded from further analysis on the basis of being registered via enterprise-level corporate registrars and thereby most likely to be representing legitimate brand promotions.   
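The keyword matching described above, where '(-)' denotes an optional hyphen, might be expressed along the following lines. This is a minimal sketch: the pattern, the function name and the example domains are assumptions for illustration, and the actual zone-file search may have been implemented differently.

```python
import re

# Optional hyphen between the two words of each event name.
HOLIDAY_PATTERN = re.compile(r"black-?friday|cyber-?monday|singles-?day")


def is_holiday_shopping_domain(domain: str) -> bool:
    """Return True if the name (excluding the final TLD label) contains
    one of the holiday-shopping keywords, with or without a hyphen."""
    name = domain.lower().rsplit(".", 1)[0]
    return HOLIDAY_PATTERN.search(name) is not None


# Hypothetical example names, not taken from the dataset.
for candidate in ["black-friday-deals.shop", "bestcybermonday.com", "friday-sale.com"]:
    print(candidate, is_holiday_shopping_domain(candidate))
```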

Within the remainder of the dataset[2], a range of domain ages are represented, with the oldest registrations dating back to 2001. However, the striking annual cycle of activity noted in the previous study continues to be apparent, with the vast majority of the domains registered in the latter half of each year, in the run-up to the season in question (Figure 1). The apparent drop-off and smaller size of the 2024 peak is likely to be an artefact of the fact that the analysis was carried out early in October, and so the final data point represents a (significantly) incomplete month; the numbers for the previous months do actually show a year-on-year increase (08-2023 = 33; 08-2024 = 36; 09-2023 = 91; 09-2024 = 98).

Figure 1: Numbers of active holiday shopping domains (as of 11-Oct-2024), by original month of registration 

Considering specific indicators of the likely nature of activity of the domains within the dataset we find that:  

  • 1,887 of the set of 6,148 (i.e. 31%) return some sort of live website response
  • 1,031 (i.e. 55% of the live sites) include at least one high-risk keyword or other term ('login', 'shop', 'store', 'discount', 'replica', or 'cheap') indicating that the primary focus of the website is (potentially non-legitimate) e-commerce  
  • 2,461 (i.e. 40%) feature active MX records - indicating that the domain has been configured to be able to send and receive e-mails - meaning that, even in the absence of a live website, the domain may be associated with phishing or other types of e-mail-based scams 
  • Considering the domain extensions within the dataset (i.e. the top-level domains, or TLDs), six of the top ten are new gTLD extensions, which have previously been noted as disproportionately being associated with infringing use (#2 .site (387 domains), #4 .shop (208), #5 .today (153), #6 .online (148), #8 .click (90), #10 .xyz (73)).  

Conducting a deeper dive into the dataset, it seems to be the case that a smaller proportion of the sites are directly targeting specific individual brands (either through the inclusion of brand terms in the domain names themselves, or of brand references in the site content) than in previous years, although some such examples were identified (Figure 2).  

Figure 2: Examples of holiday shopping domains resolving to apparently infringing websites targeting specific brands 

Much more common are examples of generic e-commerce sites, in some cases targeting multiple different brands ('multi-brand' sites) (Figure 3), examples of sites giving general shopping or product information, linking to specific marketplaces (presumably as part of affiliate promotions), or referencing potentially unofficial coupon or voucher codes.  

Figure 3: Examples of holiday shopping domains resolving to multi-brand e-commerce sites

A striking new development this year - perhaps a reflection of the current economic landscape - is the large number of websites using the holiday period to promote their own 'payday-loan'-style offerings (Figure 4).

Figure 4: Examples of holiday shopping domains resolving to websites offering 'payday loans'

Summary and key points 

Overall, this data review shows that there continues to be a significant amount of illegitimate online activity targeting consumers and, in many cases, abusing trusted household brands. At times of increased numbers of infringements, it becomes all the more important for brand owners to monitor the landscape and conduct proactive programmes of takedowns against egregious findings, as part of a comprehensive brand protection initiative. Of course, domain registrations are only part of the picture; as the boundaries between online channels become increasingly blurred, monitoring initiatives must also take account of a range of platforms, including e-commerce marketplaces (including the increasingly large numbers of product- and region-specific examples), social media, mobile apps, and other general Internet content. This will help brands to protect consumers from infringement types such as counterfeiting and phishing, including examples making use of trending techniques such as hidden links[3]. Brand protection teams may wish to bear such issues in mind when deciding where and how much resource to allocate this coming holiday shopping season.

References

[1] https://www.iamstobbs.com/opinion/web-dot-coms-but-once-a-year-holiday-shopping-activity-part-1-black-friday-domains

[2] Considering those where domain registration dates are available via an automated whois look-up

[3] https://circleid.com/posts/20220510-breaking-the-rules-on-counterfeit-sales-the-use-of-hidden-links

This article was first published on 7 November 2024 at:

https://www.iamstobbs.com/opinion/its-beginning-to-look-a-lot-like-domain-patterns-in-the-approach-to-the-holiday-shopping-season-2024

Saturday, 19 October 2024

What degree of variability might be covered within a colour-mark protection framework?

Introduction

The concept of specific colours being protectable as brand-specific trademarks (or as components of broader or more complex marks) is now well-established, but colour-mark protection is not enormously robust, and lacks specific definition of the degree to which 'nearby' colours should also be protected (beyond the vague statement that protection should cover variants such that the difference between the shades is 'barely noticeable').

My recent series of articles on the subject[1,2,3,4] have outlined a series of potential definitions for specifying the similarity between colours, and have included suggestions as to how a more formalised protection framework could theoretically be constructed. In this framework, colours are specified according to their RGB (red/green/blue component) values, with each component expressed as an integer from 0 to 255, providing a colour 'universe' (ranging from [0,0,0] (black) to [255,255,255] (white)) of 16.8 million colours, which can be considered as points within a 3D colour space. From this, a geometric distance (d) (in RGB units) between any two colours can be calculated, from which a similarity score (Scol) can be defined. 

Furthermore, it was proposed that it might be appropriate for the protection for a specific colour (within appropriate goods and services classes) to cover not only that colour exactly, but also a sphere of points (representing nearby similar colours) surrounding it in colour space (to account for - for example - variations in printing and digital display processes), up to a specified radius (of the order of, say, d = 10 RGB units). The assertion is that a maximum distance of order 10 units would encompass minor variations, whilst still covering a space of points which are all nominally 'more-or-less the same colour'. 

In this article, I consider the visualisation of the degree of variability which would be encompassed by a framework of this nature. 

Analysis

As defined above, the set of points covered by a protected 'bubble' (sphere) of radius d would sit wholly inside a cube of side-length 2d, with the surface of the sphere just touching the central points of the faces of the cube (Figure 1). For d = 10 units, the sphere would contain approximately 4,189 (i.e. ⁴⁄₃πd³) points (the number of individual protected colours), and the cube 8,000 (i.e. (2d)³) - i.e. the upper limit of the portion of colour space which would need to be searched to identify all protected variants - although the total numbers would be lower if the colour at the centre ([Rcentral,Gcentral,Bcentral]) was near an edge or corner of the overall colour space (as the components of any colour cannot be less than 0 or greater than 255).

Figure 1: Schematic of a protected 'bubble' (sphere) (of radius d) of points within colour space

As an illustration, we can first consider a colour near the overall centre of the colour space (say, [128,128,128], a shade of mid-grey). In this case, there are actually 4,169 distinct colours within a sphere of radius d = 10 units (noting that R, G and B can only take integer values) (i.e. ranging from [118,128,128], [128,118,128] and [128,128,118] to [138,128,128], [128,138,128] and [128,128,138]). The amount of variability contained within this set of colours is shown in Figure 2, in which (for convenience of manual review) the colours are sorted by their H (hue) values (i.e. the position within the spectrum of their dominant / 'base' colour, neglecting saturation (intensity), value (darkness) and luminosity)[5].

Figure 2: Visualisations of the single colour [128,128,128] (left), and (shown as vertical bands) of the range of colours surrounding it up to a distance, d, of 10 RGB units (sorted by H (hue) values, left to right, then top to bottom) (right)
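The count of colours inside the protected 'bubble' can be reproduced with a short enumeration. The sketch below is illustrative rather than the code used for the article: the function names are assumptions, d is assumed to be an integer, and the hue ordering uses the HSV conversion from Python's standard colorsys module, which may differ in detail from the H values used to produce the figures.

```python
import colorsys
from itertools import product


def colours_within(centre, d=10):
    """Return every integer RGB triple within Euclidean distance d of
    `centre`, discarding points that fall outside the 0-255 colour cube."""
    found = []
    offsets = range(-d, d + 1)
    for dr, dg, db in product(offsets, repeat=3):
        if dr * dr + dg * dg + db * db > d * d:
            continue  # outside the sphere
        colour = (centre[0] + dr, centre[1] + dg, centre[2] + db)
        if all(0 <= component <= 255 for component in colour):
            found.append(colour)
    return found


def hue(colour):
    """Hue (0-1) of an RGB colour, via the standard HSV conversion."""
    r, g, b = (component / 255 for component in colour)
    return colorsys.rgb_to_hsv(r, g, b)[0]


mid_grey = colours_within((128, 128, 128))
print(len(mid_grey))  # 4169, matching the figure quoted in the text

# Ordering for visual review, as in the banded visualisation.
mid_grey_by_hue = sorted(mid_grey, key=hue)

# A colour at a corner of the colour space yields a truncated sphere.
print(len(colours_within((0, 0, 0))))
```

Only the 21 × 21 × 21 cube of offsets around the centre is searched, reflecting the observation that the enclosing cube gives an upper limit on the region which needs to be examined.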

As mentioned above, for other colours near to (i.e. less than the distance d from) an edge of the colour space, there will be a smaller number of possible variants contained within the (truncated) sphere of radius d = 10 units. The statistics and visualisations for some other basic colours are shown below in Table 1 and Figure 3.

Table 1: A selection of basic colours and the number of distinct colours in RGB space within a surrounding (truncated) sphere of radius d = 10 units in each case

Figure 3: Visualisations of the colours shown in Table 1 (left) and, in each case, (shown as vertical bands) the range of colours surrounding it up to a distance, d, of 10 RGB units (sorted by H (hue) values, left to right, then (where more than one row shown) top to bottom) (right)

Conclusions

The visualisations in this article indicate that the proposed value of d = 10 RGB units, to define the extent of the protection offered by a colour mark registration, does seem to be reasonable, in terms of allowing a small amount of variability as may arise from brand and product production and visualisation processes, whilst still covering only a range of colours which subjectively appear nominally similar. The application of a quantitative framework along the lines suggested in these studies does offer the potential for objective comparisons between colour marks, and for the removal of subjective descriptions of degrees of difference.

It is, however, important to note that the suggested value of 10 units is somewhat arbitrary, and would certainly be up for discussion if a specific value was to be adopted as part of a formalised protection framework. The association of colour with branding incorporates a number of psychological considerations - for example, a previous study by Kumar (2017)[6] found that colour increases brand recognition by 80%, and accounts for between 62% and 90% of a consumer's initial judgement of a product. Furthermore, recent comments by Lord Clement-Jones, following on from the Influence at Work / Stobbs study 'The Psychology of Lookalikes'[7], have highlighted the importance of considering psychological and behavioural analyses in IP disputes, particularly in relation to brand lookalikes[8]. Accordingly, if any such framework were to be adopted, it would likely require a foundation based on research into the impact of colour variations on subjective perceptions of brand association.

Finally, in order to construct an effective basis for a colour-mark protection framework, it would also be necessary to incorporate additional considerations. The degree of overlap of the areas of the goods and services of two potentially competing marks would be likely to be highly relevant, and the thresholds may need to vary depending on whether single colours or colour combinations were being protected, for example. 

References

[1] https://www.linkedin.com/pulse/measuring-similarity-marks-overview-suggested-ideas-david-barnett-zo7fe/

[2] https://circleid.com/pdf/similarity_measurement_of_marks_part_1.pdf

[3] https://circleid.com/posts/further-developing-a-colour-mark-similarity-measurement-framework-building-a-database

[4] https://circleid.com/pdf/similarity_measurement_of_marks_part_4.pdf

[5] https://circleid.com/pdf/similarity_measurement_of_marks_part_6.pdf

[6] https://www.semanticscholar.org/paper/The-Psychology-of-Colour-Influences-Consumers%E2%80%99-%E2%80%93-A-Kumar/f7c3b2a780a7a3bf907ef807085b86a63f0d8d0a?p2df

[7] https://www.iamstobbs.com/the-psychology-of-lookalikes

[8] https://www.linkedin.com/posts/geoff-steward-20404015_good-to-know-that-the-psychology-of-lookalikes-activity-7183447412542181377-6omH

This article was first published on 19 October 2024 at:

https://www.linkedin.com/pulse/what-degree-variability-might-covered-within-david-barnett-ajyoe/

Thursday, 17 October 2024

Measuring the similarity of marks: an overview of suggested ideas

Introduction

A comparison of the similarity between pairs of marks is a key component of many intellectual property disputes. A key point to note is that the overall assessment of similarity needs to take account of a number of components, several of which involve subjective determinations. These components might typically include: 'inherent' characteristics of the marks in question; the meaning of any terms (i.e. conceptual analysis); their distinctiveness, strength and degree of renown; the influence of any associated logos, imagery or mark stylisation; the degree of overlap of associated goods and services; documented evidence of actual confusion; and the degree of attention paid by a typical consumer - many of which may vary between different geographical regions. All of these factors contribute to the overall assessment of the likelihood of confusion between the marks. 

Nevertheless, there are certain characteristics (generally falling under the 'inherent' category referenced above) of some types of marks which do lend themselves to a quantitative, objective measurement of similarity. The most obvious such examples are colours, and the spelling and pronunciation of word marks (which contribute to visual and aural similarity, respectively). Whilst any measurement of such characteristics cannot provide a quantification of overall mark similarity, the associated algorithms can provide a useful tool to be utilised in the assessment process. 

Algorithms to measure colour similarity, and visual and aural similarity for word marks, have a number of obvious applications. Firstly, they offer the potential for greater consistency (and greater granularity) in the assessment of the respective types of similarity across dispute cases, and secondly, they offer the potential to be able to specify quantifiable thresholds up to which IP protection might apply. In addition, they have other applications, such as the option to post-process results from trademark watching services, to (better) sort and prioritise the findings and assist with the review process.

A group of suggested frameworks for formulating algorithms of this type was set out in a series of six articles[1] recently published on the CircleID website. This overview presents a summary of the key ideas from the series.

Similarity of colours

Colours occupy a unique position in the set of mark types, due to the fact that the specification for a colour can be exactly defined. One of the most common frameworks (particularly in the context of digital display systems) is the RGB framework, where a colour is defined in terms of its red, green and blue components, each expressed as an integer value from 0 to 255 (giving 256³ or 16.8 million definable colours in total). This 'universe' of possible colours can therefore be visualised as a three-dimensional cube (or colour 'space'), with red increasing along one axis, green along another, and blue along the third, with each distinct colour occupying a unique point within the space.

Using this framework, the (degree of) difference between any two colours can be expressed in terms of their geometric distance (d) from each other in the colour space. From this, a difference score (Dcol) can be formulated (by expressing d as a proportion of the maximum possible distance between two colours - i.e. between [0,0,0] (black) and [255,255,255] (white)), and from this, a colour similarity score (Scol) (equal to 1 (or 100%) minus Dcol).
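The definitions in the preceding paragraph can be written out as a brief sketch. The function and constant names are illustrative assumptions; the calculation itself follows the description above (Euclidean distance d, divided by the black-to-white distance to give Dcol, with Scol = 1 - Dcol).

```python
import math

BLACK = (0, 0, 0)
WHITE = (255, 255, 255)

# Maximum possible distance between two colours in RGB space.
MAX_DISTANCE = math.dist(BLACK, WHITE)


def colour_distance(c1, c2):
    """Geometric (Euclidean) distance d between two RGB colours."""
    return math.dist(c1, c2)


def colour_similarity(c1, c2):
    """Similarity score Scol = 1 - Dcol, where Dcol = d / MAX_DISTANCE."""
    return 1 - colour_distance(c1, c2) / MAX_DISTANCE


print(colour_distance((128, 128, 128), (138, 128, 128)))  # 10.0
print(colour_similarity(BLACK, WHITE))                    # 0.0
print(colour_similarity((128, 128, 128), (138, 128, 128)))
```

Multiplying the returned score by 100 gives the percentage form used elsewhere in this overview.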

The concept of a colour similarity measurement metric makes most sense (in the context of disputes) if there were to exist a framework in which the protection granted under a colour mark (within appropriate categories of goods and services) registration covered not just that colour exactly, but also very similar colours (up to a specified threshold). Current guidelines suggest that protection should cover variants such that the difference between the shades is 'barely noticeable', but the use of a numerical score would provide the potential to put in place a more explicit threshold and avoid ambiguity. 

Within a framework of this type, it would also be possible (and might be convenient) to maintain a database of protected colours (or the colours of elements within broader protected marks), to help determine the existence of possible clashes between existing or proposed new protected colours. A mock-up of how this might look is shown in Table 1, for a series of colours associated with well-known brands. In the table, the individual colours are sorted by their hue (H) values (part of an alternative (to RGB) framework for specifying colours), which orders them according to the position within the spectrum of their dominant colour, which can assist with visual review of data of this type.

Table 1: Mock-up of a database of protected colour marks, sorted by their H values

For context, the shades of orange used by Reese's and Home Depot (objectively the most similar pair of colours in the above table) have a similarity score of 95.9%.

Visual and aural similarity of word marks

For word marks (even just considering similarity in spelling (visual) and pronunciation (aural)), the situation is rather more complex. The frameworks suggested in the previous studies propose a separate similarity score for each of these two components (Svis and Saur, respectively), and an overall score (Swor) reflecting both types of word similarity, which is most simply calculated as the mean of the two components (but can be differently weighted if required).

The proposed algorithm for quantifying visual (spelling) similarity is itself composed of two components (i.e. utilises two distinct metrics), reflecting different aspects of the similarity in spelling. The first metric ('fuzz.ratio') is based on a measurement called Levenshtein distance, which quantifies the number of 'edits' (i.e. character insertions, deletions, or substitutions) necessary to transform one string into the other, but with the metric also including normalisation factors to take account of the length of the strings. The second (Jaro-Winkler similarity) is more complex, including an element which takes account of the proximity of the variations to the start of the strings (where, for example, a consumer might be more likely to be aware of any differences).
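To make the two ideas concrete, the following is a self-contained sketch of textbook Levenshtein distance and Jaro-Winkler similarity. It is not the implementation used in the original studies: the exact normalisation applied by 'fuzz.ratio' depends on the library version, so the simple length-normalised score shown here is an assumption, as are the function names and the conventional Jaro-Winkler settings (prefix weight 0.1, prefix length capped at four characters).

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions or substitutions
    needed to transform string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalised_levenshtein(a: str, b: str) -> float:
    """Illustrative length normalisation (an assumption, not fuzz.ratio)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def jaro(a: str, b: str) -> float:
    """Jaro similarity: matching characters within a sliding window,
    penalised for transpositions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit, b_hit = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


print(levenshtein("kitten", "sitting"))            # 3
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The prefix boost in the final function is the element which gives greater weight to agreement at the start of the strings, so that differences appearing later in the strings reduce the score by less.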

For aural (pronunciation) similarity, the calculation is carried out by first implementing an analysis process which converts each string to its IPA (International Phonetic Alphabet) representation, and then using the 'fuzz.ratio' metric to quantify the similarity between these representations.

For illustration, the similarity score values for a range of pairs of marks which were the subject of past disputes are shown in Table 2.

Table 2: Pairs of marks and their visual, aural and overall similarity scores

Conclusion

The algorithms proposed for quantifying the similarity of colour marks, and the visual and aural similarity of word marks, do seem to perform reasonably well, and (in the case of the word mark metrics) align with what might subjectively be reckoned according to manual analysis. The formulations are, of course, just one possible option, and it would certainly be possible to 'tune' the algorithms according to specific requirements.

Algorithms of this type do offer the potential for a more granular, continuous, repeatable and quantifiable expression of similarity and, with appropriate adoption into case analysis, offer a possible route towards greater consistency in dispute decisions. 

However, it is important to reiterate the statement made in the introduction, that such metrics cannot fully assess the overall similarity between marks, or replace the existing nuanced and multi-faceted approach of considering the full range of subjective factors which contribute to an assessment of the likelihood of confusion, but can provide a useful tool to be applied in such analyses and in other contexts.

Reference

[1] For colour marks:

For word marks:

This article was first published on 17 October 2024 at:

https://www.linkedin.com/pulse/measuring-similarity-marks-overview-suggested-ideas-david-barnett-zo7fe/

Further developing a colour mark similarity measurement framework - Part III: A method for sorting colours

Introduction

In my previous articles[1,2,3] looking at a framework for analysis of colour marks, I considered the use of the RGB definition for colours (specifying their individual red, green and blue components each as integer values between 0 and 255), and how this can be used to specify the 'distance' (d) between any two colours[4] and, equivalently, a similarity score (Scol)[5] for the pair. 
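The distance and similarity score referred to here (as defined in references [4] and [5]) can be sketched in a few lines of Python. The function names are illustrative assumptions; colours are taken to be (R, G, B) tuples of integers between 0 and 255.

```python
import math

# Largest possible distance in RGB space: black to white
MAX_DISTANCE = math.sqrt(3 * 255 ** 2)


def colour_distance(c1, c2):
    """Euclidean distance (d) between two RGB colours."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(c1, c2)))


def colour_similarity(c1, c2):
    """Similarity score (Scol) between 0 (black vs. white) and 1 (identical)."""
    return 1 - colour_distance(c1, c2) / MAX_DISTANCE
```

For instance, colour_distance((136, 72, 56), (40, 24, 24)) gives 112, corresponding to a similarity score of approximately 0.746.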

However, when considering colour marks, it can also be helpful to have an algorithm for sorting a list of colours into a convenient order for visual review. The obvious choice would be an ordering which resembles a spectrum of colours. This is distinct from a simple ordering based on just a sorting of (say) the numerical R values, followed by the G values and then the B values (as per Figure 1 in the previous article), which generates multiple, near-repeating series of coloured 'bands'.

The difficulty is that there is no simple algorithm for translating a 3D colour space into a (1D) linear series of colours in which all transitions are smooth and continuous. 

The situation can be appreciated by visualising the same set of 4,096 representative, 'regularly-spaced' colours as considered in the previous article (i.e. [8,8,8], [8,8,24], [8,8,40], … , [8,24,8], [8,24,24], [8,24,40], … , [24,8,8], … , [248,248,248]). 
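The set of 4,096 regularly-spaced colours described above can be generated as follows. This is a small sketch; the names LEVELS and regular_grid are illustrative assumptions.

```python
from itertools import product

# 16 evenly spaced levels per channel: 8, 24, 40, ..., 248
LEVELS = range(8, 256, 16)


def regular_grid():
    """All 16 x 16 x 16 = 4,096 combinations, with B varying fastest,
    then G, then R (matching the order listed in the text)."""
    return [list(rgb) for rgb in product(LEVELS, repeat=3)]
```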

Algorithms for sorting colours are frequently based on expression of the colours in HSV, rather than RGB, format[6]. This alternative representation also uses three components:

  • H (hue) - the 'base' colour (on a scale from 0 to 1 in 'spectral' order)
  • S (saturation) - the intensity of the colour
  • V ('value') - the brightness of the colour (lower values are darker)

Mathematical conversion of the RGB expression of a colour to its HSV equivalent involves a simple algorithm, and a number of pre-written library scripts[7] are available to implement it. 
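As a brief sketch of how such a library routine might be used, the example below wraps the rgb_to_hsv function from Python's standard colorsys module. The wrapper name rgb255_to_hsv is an assumption; the scaling step is included so that H, S and V all fall between 0 and 1 when the inputs are integers between 0 and 255.

```python
import colorsys


def rgb255_to_hsv(r, g, b):
    """Convert integer RGB components (0-255) to an (H, S, V) tuple,
    with each of the three values on a scale from 0 to 1."""
    return colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
```

For example, pure red ([255,0,0]) converts to H = 0, S = 1, V = 1.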

The simplest method of sorting colours is just straightforwardly by their H (hue) values. For the set of 4,096 colours considered previously, this gives an ordering as shown in Figure 1 (left to right, then top to bottom).

Figure 1: Ordering of 4,096 colours occupying regularly-spaced positions in RGB space, according to their H (hue) values
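A hue-based ordering of this kind can be sketched as below, again using the standard colorsys module. The function names are illustrative assumptions, and colours with identical hue (including greys, which are assigned H = 0) simply retain their input order.

```python
import colorsys


def hue(rgb):
    """H value (0 to 1) of an (R, G, B) colour with 0-255 components."""
    r, g, b = (component / 255 for component in rgb)
    return colorsys.rgb_to_hsv(r, g, b)[0]


def sort_by_hue(colours):
    """Return the colours in 'spectral' order of increasing hue."""
    return sorted(colours, key=hue)
```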

What is less satisfactory about this ordering is that it takes no account of the other two parameters, so there are (for example) rapid alternations between dark and light shades, but there is no way to entirely smooth out these discontinuities without losing the smoothness of the transitions according to the other parameter(s). 

One other option is the use of an additional parameter, L (luminosity). (Perceived) luminosity can be derived directly from the RGB values of a colour[8]; on its own, it does not provide a good basis for sorting colours, but can be combined with the use of H to provide smoother transitions. One such option is to divide the H values into 'blocks' and then sort by L within each block. However, this still results in sharp transitions between adjacent colours at the ends of blocks, so does not really add much value in many cases. In the remainder of this article, therefore, sorting by (just) the H parameter is utilised.
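The 'blocks' approach mentioned above might be sketched as follows, using the luminosity formula given in reference [8]. The number of blocks (eight) and the function names are arbitrary assumptions for illustration.

```python
import colorsys
import math


def luminosity(rgb):
    """Perceived luminosity (L), per the formula in reference [8]."""
    r, g, b = rgb
    return math.sqrt(0.241 * r + 0.691 * g + 0.068 * b)


def block_sort(colours, n_blocks=8):
    """Group colours into n_blocks equal-width hue bands, then order
    by luminosity (darkest first) within each band."""
    def key(rgb):
        h = colorsys.rgb_to_hsv(*(c / 255 for c in rgb))[0]
        return (int(h * n_blocks), luminosity(rgb))
    return sorted(colours, key=key)
```

As noted in the text, adjacent colours at the boundary between two blocks can still differ sharply in luminosity under this scheme.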

Applications of colour sorting / ordering

The first point to note is that the H value (i.e. the position of the colour in an ordered spectrum) does not in itself provide the basis for an effective metric for comparing the similarity of colours (compared with the geometric distance (d) in colour space discussed previously), in part due to the disregarding of the other two components which affect a colour's visual appearance. 

As a related point, the relationship between hue (H) and colour distances (d) is complex, due to the distribution of colours in 3D space. As one illustration of this, it is instructive to visualise the numerical distance (d) of each of the 4,096 colours shown in Figure 1 from a fixed colour, as a function of the H value of the individual (variable) colour in each case. This relationship is shown in Figure 2, for three fixed colours: pure red ([255,0,0]), pure green ([0,255,0]) and pure blue ([0,0,255]).

Figure 2: Distances (d) of each of the 4,096 colours in Figure 1 from (pure) red, green and blue, as a function of their H (hue) values (0 = red; 1 = violet)

Figure 2 also reveals the 'circular' nature of the colour spectrum (when ordered according to hue), with both the 'red' and 'violet' ends of the spectrum 'close' to 'pure' red (i.e. [255,0,0]) - another reason why hue (H) is a less satisfactory basis (than d) for quantifying the proximity of colour pairs.

However, there are practical uses for sorting by H, predominantly where it is useful to be able to visually review sets of colours as part of the analysis process for marks (e.g. where assessing disputes and potential colour 'clashes').

For example, if maintaining a database of registered colour marks, it may be useful to have them sorted into a meaningful order, to be able to visually review the proximity of similar marks and determine whether new proposed colour marks are close enough to others to present a potential problem. The set of colour marks considered in the 'Building a database' article in this series is again presented in Table 1, but here with the colours sorted by their H values, providing a preferable basis for manual review. (The table also illustrates how visually 'darker' shades are, in general, associated with lower L values.)

Table 1: Mock-up of a database of protected colour marks, sorted by their H values

In a second application, it might be helpful to be able to visualise (in an ordered form) the set of colours which are similar to a particular degree to another fixed colour, building on the idea of the similarity score (Scol) presented in the previous article.  

For example, taking an arbitrary colour somewhere near the centre of the colour space (say, [136,72,56], a shade of brown; Figure 3), it might be instructive to be able to visualise the set of colours (and the extent of their variability!) which would be deemed (by Scol) as being (for example) 75% similar (i.e. those colours sitting on the surface of a sphere in RGB space of appropriate radius - in this case, d = 110 RGB units - surrounding the colour in question). Such analyses might be informative in formulating guidelines regarding the thresholds up to which colour-mark protection might apply. Figure 4 shows the range of such colours, again sorted by H values, taken from the dataset of 4,096 colours considered previously. The examples range from [40,24,24], [232,24,24] and [232,104,104] (all H = 0.000) to [232,24,40] (H = 0.987).

Figure 3: A rectangle of colour RGB = [136,72,56]

Figure 4: Subset of the group of 4,096 colours occupying regularly-spaced positions in RGB space which are 75% (to the nearest percent) similar (according to Scol) to the colour [136,72,56], sorted by their H values
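The selection described above can be sketched as follows: the 4,096-colour grid is filtered to those colours whose similarity score to the target rounds to the required percentage, and the result is sorted by hue. The function names are illustrative assumptions, and 'to the nearest percent' is interpreted here using Python's built-in round.

```python
import colorsys
import math
from itertools import product

MAX_DISTANCE = math.sqrt(3 * 255 ** 2)


def similarity_pct(c1, c2):
    """Similarity score (Scol) between two RGB colours, as a percentage."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(c1, c2)))
    return 100 * (1 - d / MAX_DISTANCE)


def hue(rgb):
    """H value (0 to 1) of an (R, G, B) colour with 0-255 components."""
    return colorsys.rgb_to_hsv(*(c / 255 for c in rgb))[0]


def similar_colours(target, pct=75):
    """Grid colours whose similarity to target rounds to pct, sorted by hue."""
    grid = product(range(8, 256, 16), repeat=3)
    subset = [c for c in grid if round(similarity_pct(target, c)) == pct]
    return sorted(subset, key=hue)
```

Calling similar_colours((136, 72, 56)) returns a list which includes the example colours quoted in the text, such as (40, 24, 24) and (232, 24, 40).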

In summary, therefore, whilst the option for sorting colours into a meaningful order does not add much value to the framework for quantifying the degree of similarity between colours, it does provide a basis for being able to present colour information in a format which is more easily visually digestible. These ideas therefore could have applications in reviewing dispute cases, selecting options for new potential colour marks, and in formulating guidelines for IP protection thresholds.

References

[1] https://circleid.com/posts/towards-a-quantitative-approach-for-objectively-measuring-the-similarity-of-marks

[2] https://circleid.com/posts/further-developing-a-colour-mark-similarity-measurement-framework-building-a-database

[3] 'Further developing a colour mark similarity measurement framework - Part II: Defining a similarity score'

[4] d = √[(R₁ – R₂)² + (G₁ – G₂)² + (B₁ – B₂)²]

[5] Scol = 1 – [ d / √(3 × 255²) ]

[6] https://www.alanzucconi.com/2015/09/30/colour-sorting/

[7] e.g. https://github.com/python/cpython/blob/3.13/Lib/colorsys.py

def rgb_to_hsv(r, g, b):
    maxc = max(r, g, b)
    minc = min(r, g, b)
    rangec = (maxc-minc)
    v = maxc
    if minc == maxc:
        return 0.0, 0.0, v
    s = rangec / maxc
    rc = (maxc-r) / rangec
    gc = (maxc-g) / rangec
    bc = (maxc-b) / rangec
    if r == maxc:
        h = bc-gc
    elif g == maxc:
        h = 2.0+rc-bc
    else:
        h = 4.0+gc-rc
    h = (h/6.0) % 1.0
    return h, s, v

[8] L = √ [ 0.241 × R + 0.691 × G + 0.068 × B ]

This article was first published as a white paper on 17 October 2024 at:

https://circleid.com/pdf/similarity_measurement_of_marks_part_6.pdf

Further developing a word mark similarity measurement framework - Part II: Defining an improved similarity score

Introduction

My initial study on mark similarity measurement[1] focused on formulations for quantifying the objective similarity of pairs of marks, with a particular focus on colour marks and word marks. As discussed in previous articles in this series, mark similarity assessment is a key part of the resolution of many intellectual property disputes, and a more objective approach could have a number of advantages, including the potential to provide definitions which could be built into case law, offer greater consistency across dispute decisions, and specify thresholds for IP protection.

However, it is important to reiterate the key point that any objective algorithms of these types should only ever be considered as tools to be used as part of the overall assessment process, which includes significant degrees of subjectivity. In the first instance, the algorithmic frameworks presented in this series for word marks focus only on visual (spelling) and aural (pronunciation - with a specific basis in American English) similarity, with no account taken of conceptual similarity (i.e. meaning) or the influence of any associated logos, imagery or mark stylisation. Overall, dispute decisions are often reliant on an assessment of the likelihood of confusion between the marks in question, which is generally also dependent on a range of other factors, including the distinctiveness, degree of overlap of associated goods and services, strength and degree of renown of the marks, documented evidence of actual confusion, and the degree of attention paid by a typical consumer - many of which may vary between different geographical regions[2,3]. Some of the factors generally considered for the components which can be measured algorithmically (such as typically putting greater weight on comparisons between elements appearing at the start of the marks in question, and greater emphasis on differences appearing within shorter marks[4]) can be, and have been, built into the proposed algorithms wherever possible.

The degree of similarity (of each type) between marks is often specified in dispute cases as 'high', 'medium' or 'low'; with this in mind, it seems reasonable (where constructing any measurement algorithm) to formulate the output as a similarity score (as proposed for colour marks in the previous article[5] in this series), which aligns broadly with this framework but offers a more quantitative basis for comparison (though keeping in mind that all of the above caveats also still apply!).

Formulation of the similarity score algorithm

The similarity score used for comparison of pairs of word marks (Swor), in both the previous study and this follow-up, reflects both visual (spelling) and aural (pronunciation) similarity (only).

As in the initial version, visual similarity between the marks (i.e. in terms of their spelling) is quantified using two distinct algorithms, each of which reflects different aspects of the similarity. The two algorithms (each of which generates a score which can be expressed as a percentage) are:

  • The fuzz.ratio metric (FLev), an algorithm implemented in the Python package 'fuzzywuzzy'[6], based on the concept of Levenshtein distance - a way of quantifying the number of edits required to transform one string into the other - but also taking account of other factors (including the length of the strings).
  • The Jaro-Winkler similarity algorithm (and score (simj)) (as implemented in the Python package 'Levenshtein'[7]), which includes an element of consideration of the proximity of the matching / non-matching characters to the start of the strings.

In the simplest formulation of the overall algorithm (and as retained here), the score component reflecting overall visual similarity (Svis) is expressed just as the simple mean of the above two scores (as below), although it would be possible to apply different weightings if required.

Svis = (FLev + simj) / 2
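The calculation of Svis can be sketched using only the Python standard library, as below. This is a stand-in under stated assumptions rather than the packages named above: ratio_score uses difflib's SequenceMatcher (rounded to an integer percentage) in place of fuzz.ratio, and jaro_winkler is a plain textbook implementation (prefix scale 0.1, prefix capped at four characters) in place of the 'Levenshtein' package routine, so results may differ from those packages for some inputs.

```python
from difflib import SequenceMatcher


def ratio_score(s1, s2):
    """Stand-in for FLev: integer percentage from difflib's ratio."""
    return round(100 * SequenceMatcher(None, s1, s2).ratio())


def jaro(s1, s2):
    """Textbook Jaro similarity (0 to 1)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        # look for an unused matching character within the window
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count matched characters that appear in a different order
    out_of_order, k = 0, 0
    for i in range(len(s1)):
        if used1[i]:
            while not used2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix (up to 4 characters)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def visual_similarity(s1, s2):
    """Svis: simple mean of the two percentage scores."""
    return (ratio_score(s1, s2) + 100 * jaro_winkler(s1, s2)) / 2
```

For the pair boss / bossi, this sketch gives (89 + 96) / 2 = 92.5.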

For aural similarity, the proposed calculation framework is based on the creation of a phonetic representation of the marks / strings in question, and then a comparison of these representations (again, using the fuzz.ratio metric). 

The initial formulation also made use of two distinct algorithms for generating the phonetic representations, based on the Soundex and NYSIIS (New York State Identification and Intelligence System) encodings. However, both of these have certain shortcomings, not least the poor handling of vowel sounds within the strings, and (in Soundex) the inability to encode any consonants beyond the first four.

In this improved version, therefore, I instead propose the use of the Phonemizer algorithm[8,9] for generating the phonetic versions of the strings, which utilises IPA (International Phonetic Alphabet)[10] encoding, and which was explored in the previous follow-up study[11] and appears to perform well (although some data 'cleansing' is required in some cases, to ensure that the algorithm interprets the string as intended). The aural similarity score (Saur) can then be calculated simply as the output of the fuzz.ratio metric applied to the IPA representations as given by Phonemizer, i.e.:

Saur = FPho

As in the previous formulation, the overall (word mark) similarity score can then most simply be expressed just as the mean of the two individual components, i.e.:

Swor = (Svis + Saur) / 2
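The final two steps can be sketched as below, taking phonetic (IPA) transcriptions as given inputs rather than calling Phonemizer. As an assumption for illustration, difflib's SequenceMatcher is used as a stand-in for fuzz.ratio, and the function names are hypothetical. Scores computed from transcriptions copied by hand may not exactly reproduce published values, since they depend on the exact Phonemizer output.

```python
from difflib import SequenceMatcher


def aural_similarity(ipa1, ipa2):
    """Saur: integer percentage similarity of two IPA transcriptions
    (difflib used here as a stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, ipa1, ipa2).ratio())


def word_similarity(s_vis, s_aur):
    """Swor: simple (unweighted) mean of the visual and aural scores."""
    return (s_vis + s_aur) / 2
```

For example, visual and aural scores of 92.50 and 89.00 (the values listed for boss / bossi in Appendix A) combine to give an overall score of 90.75.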

Similarity scores for test-pairs of marks

As an illustration of the performance of this algorithm, I consider a set of approximately 200 pairs of word marks, mostly the subjects of recent trademark disputes (several of which were also considered in previous articles in this series), and with a primary focus on single-word marks (for simplicity). The full set of mark-pairs, and the calculated similarity scores, are presented in Appendix A.

The first point to note is that, generally, little pre-processing of the data is required in order to utilise the algorithm. All marks have been converted to lower-case, though this is generally a matter of choice, just to ensure that upper- and lower-case versions of the same letter are treated identically. The algorithms do also appear to correctly handle accented characters (albeit that the phonetic representations will generally reflect an English pronunciation). The only two modifications to the data required in these cases were a rewriting of 'OrangeryOS' as 'orangery-o-s' (to ensure that the pronunciation is rendered as 'oh-es') and (as in a previous study) of 'likeme' to 'like-me'. 

Elsewhere (as noted previously), the Phonemizer algorithm renders 'unreadable' strings as individual characters (e.g. 'immun44' as 'immun-four-four', '007' as 'zero-zero-seven', 'ch_t.' as 'see-aitch-tee', and 'mbfw' as 'em-bee-ef-doubleyu'), though these versions have been retained in an unmodified state in the analysis. Some of these representations may not be as originally intended when the marks were conceived, however - e.g. 'genv3rse' is rendered as 'genv-three-rse' (rather than the more likely 'genverse'), and 'm4tter' as 'em-four-tter' (rather than 'matter').
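The limited pre-processing described above (lower-casing, plus the two manual rewrites) could be sketched as follows. The names MANUAL_REWRITES and preprocess are illustrative assumptions; in practice, the table of rewrites would need to be maintained by hand for any marks whose pronunciation is not rendered as intended.

```python
# Manual rewrites noted in the text, keyed by the lower-cased mark
MANUAL_REWRITES = {
    "orangeryos": "orangery-o-s",
    "likeme": "like-me",
}


def preprocess(mark):
    """Lower-case a mark and apply any manual rewrite defined for it."""
    lowered = mark.strip().lower()
    return MANUAL_REWRITES.get(lowered, lowered)
```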

Overall, however, the algorithm does seem to provide a (subjectively) reasonable ranking of the mark-pairs by similarity. An attractive additional characteristic of this framework is that it is entirely repeatable, and independent of the number and types of pairs in the dataset (i.e. a particular word-pair will always give the same score), so it is always possible to compare like-with-like. Accordingly, it is instructive to consider some representative examples of word-pairs giving particular (approximate) scores (Swor), to provide a 'reckoner' of what the scores represent, i.e.:

  • Approx. 90%:
    • boss / bossi
    • billionaire / zillionaire
    • thermacare / thermocare
    • prinker / prink
    • intellicare / intelecare
    • chooey / chooee
    • mahendra / mahindra
  • Approx. 80%:
    • zara / zarzar
    • rabe / rase
    • retaron / retlron
    • createme / create.
    • spa / spato
    • thermomix / termomatrix
  • Approx. 70%:
    • kelio / kleeo
    • terry / terrissa
    • tygrys / tigris
    • nike / nuke
  • Approx. 60%:
    • nutella / mixitella
    • airbnb / francebnb
    • gallo / rampingallo
    • iphone / mifon
    • joy / bjoie
    • jd / jdyaoying
  • Approx. 50%:
    • zara / zorazone
    • quirón / quiromasté
  • Approx. 40%:
    • book / restaubook
    • h10 / motel 10

An additional attractive aspect of this approach is that it is also possible, if required, to consider the visual and aural similarity components separately. For example, the top pairs of marks by visual similarity score (Svis) (only) are fashiongo / fashionego (96.50%), configon / configo (95.25%) and casoria / castoria (95.04%), and by aural similarity score (Saur) (only) are sanytol / sanitol, testex / test-x, hobbit / hobbyt, kramer / cramer, kresco / cresco, and cylance / sylence (all 100%, i.e. deemed phonetically identical).

Discussion

Overall (and again as noted previously), it would not be reasonable to expect any significant correlation between the similarity scores and the findings reached in the associated disputes, because of the significant additional (and subjective) points considered in the analysis, as discussed in the introduction to this article. For example, in the Initio / Vinicio case, the marks were found to have 'below average' visual similarity (despite the quantitative objective visual similarity score of 80.96%), with consideration having been given in the case to the differing impact of the various elements and the overall impression of the respective marks, which feature significant differences in visual presentation[12].

Nevertheless, the similarity score does offer a useful tool to consider the 'pure' visual and aural similarity (only) of the word marks, as part of an overall analysis (for example, in dispute cases), in a framework which is repeatable and quantitative, providing the potential for a consistent approach to assessment of these characteristics. It also aligns with the familiar terminological descriptions of 'degrees' of similarity, whilst offering a more granular and continuous scale.

The algorithm does also offer additional possible use-cases, such as (for example) the ability to post-process the outputs from trademark watching services, so as to better sort the results by relevance (in cases where the sorting algorithm offered by the service performs less satisfactorily), and thereby aid in the review process.
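As a rough sketch of this kind of post-processing, the example below re-orders a list of watch-service results by their similarity to a protected mark. The function names are hypothetical, and simple_score (based on difflib's SequenceMatcher) is only a placeholder assumption; any other scoring function taking two strings, such as an implementation of Swor, could be passed in instead.

```python
from difflib import SequenceMatcher


def simple_score(mark, candidate):
    """Placeholder similarity (0-100) based on difflib's ratio."""
    return 100 * SequenceMatcher(None, mark.lower(), candidate.lower()).ratio()


def rank_watch_results(protected_mark, results, score=simple_score):
    """Return results ordered from most to least similar to the mark."""
    return sorted(results, key=lambda r: score(protected_mark, r), reverse=True)
```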

It is also worth noting that there is scope for possible future enhancements to the algorithms (some of which have been discussed previously), including (for example) assessments of the distinctiveness of the various elements or sub-elements (subsequences or substrings) of the marks, re-weighting the contribution of any trailing ‘s’, and so on. Distinctiveness and analysis of the 'types' of elements present in the marks may, in particular, be key to making a more meaningful overall assessment of similarity and, ultimately, likelihood of confusion. Relevant examples for consideration in the dataset include Cylance / Sylence (both 'clearly' allusions to the same common word ('silence')), Doctolib / Avocatlib (where the first portion of each mark makes reference to a profession), BMW / BMV (where the only difference is manifested as a pair of 'similar' letters), Immun44 / Immuno-19 (both featuring a similar root and, unusually, followed specifically by a number), iPhone / Mifon (with the similarity between 'I' and 'me' being of potential relevance), and Align / Clickalign (relevant because of the range of additional names cited by the latter party, suggesting the key point is the question of the distinctiveness of the term 'align' for the relevant goods and services).

Appendix A: Pairs of marks and their visual, aural and overall similarity scores

  Mark 1   Mark 2   Vis. sim. score (Svis)   Mark 1 (IPA)   Mark 2 (IPA)   Aur. sim. score (Saur)   Overall word mark sim. score (Swor)
  casoria   castoria 95.04   kæsoːɹiə   kæstoːɹiə 95.00 95.02
  sanytol   sanitol 89.67   sænɪtɑːl   sænɪtɑːl 100.00 94.83
  testex   test-x 88.17   tɛstɛks   tɛstɛks 100.00 94.08
  hobbit   hobbyt 88.17   hɑːbɪt   hɑːbɪt 100.00 94.08
  replay   re:play 94.10   ɹiːpleɪ   ɹiː pleɪ 94.00 94.05
  kramer   cramer 85.94   kɹeɪmɚ   kɹeɪmɚ 100.00 92.97
  kresco   cresco 85.94   kɹɛskoʊ   kɹɛskoʊ 100.00 92.97
  cintra   citra 93.28   sɪntɹə   sɪtɹə 92.00 92.64
  dekton   deton 93.28   dɛktən   dɛtən 92.00 92.64
  free   freen 92.50   fɹiː   fɹiːn 91.00 91.75
  goddess   godless 89.67   ɡɑːdəs   ɡɑːdləs 93.00 91.33
  boss   bossi 92.50   bɔs   bɔsi 89.00 90.75
  billionaire   zillionaire 92.47   bɪliənɛɹ   zɪliənɛɹ 89.00 90.73
  thermacare   thermocare 91.89   θɜːmɐkɛɹ   θɜːməkɛɹ 89.00 90.44
  prinker   prink 88.64   pɹɪŋkɚ   pɹɪŋk 92.00 90.32
  intellicare   intelecare 90.18   ɪntɛlɪkɛɹ   ɪntɛlᵻkɛɹ 90.00 90.09
  chooey   chooee 88.17   tʃuːi   tʃuːiː 92.00 90.08
  dcsl   dcs 90.08   diːsiːɛsɛl   diːsiːɛs 90.00 90.04
  mahendra   mahindra 91.08   mæhɛndɹə   mæhɪndɹə 89.00 90.04
  lucite   luci 86.67   luːsaɪt   luːsaɪ 93.00 89.83
  george   georgine 90.50   dʒɔːɹdʒ   dʒɔːɹdʒɪn 89.00 89.75
  tropico   tropicazo 91.78   tɹɑːpɪkoʊ   tɹɑːpɪkɑːzoʊ 87.00 89.39
  demiegod   demigods 91.50   dɛmɪeɪɡɑːd   dɛmɪɡɑːdz 86.00 88.75
  mbet   m-bets 85.00   ɛmbɛt   ɛmbɛts 92.00 88.50
  fashiongo   fashionego 96.50   fæʃəŋɡoʊ   fæʃəniːɡoʊ 80.00 88.25
  cylance   sylence 75.98   saɪləns   saɪləns 100.00 87.99
  ping   pingke 86.67   pɪŋ   pɪŋk 89.00 87.83
  pikdare   pi-kare 89.19   pɪkdɛɹ   paɪkɛɹ 86.00 87.60
  mbfw   mvfw 80.00   ɛmbiːɛfdʌbəljuː   ɛmviːɛfdʌbəljuː 94.00 87.00
  joy   joyme 82.83   dʒɔɪ   dʒɔɪm 91.00 86.92
  configon   configo 95.25   kənfɪɡən   kənfɪɡoʊ 78.00 86.62
  prinz   prinse 81.17   pɹɪnts   pɹɪns 92.00 86.58
  lovello   lovelle 90.14   lʌvloʊ   lʌvl 83.00 86.57
  energeo   enerjo 83.98   ɛnɚdʒeɪoʊ   ɛnɚdʒoʊ 89.00 86.49
  trucool   turcool 90.86   tɹuːkuːl   tɜːkuːl 82.00 86.43
  carbon   mycarbon 88.83   kɑːɹbən   maɪkɑːɹbən 84.00 86.42
  consiglieri   consigliera 93.68   kənsɪɡlɪɹi   kənsɪɡliɛɹə 78.00 85.84
  starbucks   charbucks 81.59   stɑːɹbʌks   tʃɑːɹbʌks 90.00 85.80
  realme   realmz 88.17   ɹɛlmi   ɹɛlmz 83.00 85.58
  axis   traxis 84.44   æksɪs   tɹæksɪs 86.00 85.22
  youtube   u-tubes 75.98   juːtuːb   juːtuːbz 94.00 84.99
  bimbo   gimbo 83.33   bɪmboʊ   ɡɪmboʊ 86.00 84.67
  tiktok   tiktaktok 85.00   tɪktɑːk   tɪktɐktɑːk 84.00 84.50
  z-biome   biome 86.74   ziːbaɪoʊm   baɪoʊm 82.00 84.37
  bacchus   cacchus 85.46   bækəs   kækəs 83.00 84.23
  philips   philzops 86.07   fɪlɪps   fɪlzəps 80.00 83.04
  patter   yatter 85.94   pæɾɚ   jæɾɚ 80.00 82.97
  noughty   naughtea 73.59   nɔːɾi   nɔːɾiə 92.00 82.79
  yorxs   yorks 85.33   joːɹksz   jɔːɹks 80.00 82.67
  jarlsberg   jørnsberg 82.33   dʒɑːɹlsbɜːɡ   dʒoːɹnsbɜːɡ 83.00 82.67
  globe-trotter   globetrotter xc 90.23   ɡloʊbtɹɑːɾɚ   ɡloʊbtɹɑːɾɚɹ ɛkssiː 75.00 82.62
  treca   trea 92.17   tɹɛkə   tɹiə 73.00 82.58
  resolution   resolute 84.75   ɹɛzəluːʃən   ɹɛzəluːt 80.00 82.38
  olympéa   olympe 83.98   əlɪmpeɪə   əlɪmp 80.00 81.99
  ellesse   elliss 83.22   ɛlɛs   ɛlɪs 80.00 81.61
  hugo   hug-o 92.17   hjuːɡoʊ   hʌɡoʊ 71.00 81.58
  initio   vinicio 80.96   ɪnɪɾɪoʊ   vɪnɪsɪoʊ 82.00 81.48
  bimbo   bimbolea 84.75   bɪmboʊ   baɪmboʊliə 78.00 81.38
  burgerme   burgerly 82.50   bɜːɡɚm   bɜːɡɚli 80.00 81.25
  1link   link 91.17   wʌn lɪŋk   lɪŋk 71.00 81.08
  repevax   epvax 86.74   ɹᵻpɛvæks   ɛpvæks 75.00 80.87
  free   freepour 78.50   fɹiː   fɹiːpɚ 83.00 80.75
  zara   zarzar 86.11   zɑːɹɹə   zɑːɹzɑːɹ 75.00 80.56
  rabe   rase 80.83   ɹeɪb   ɹeɪz 80.00 80.42
  retaron   retlron 89.67   ɹᵻtæɹən   ɹᵻtlɹɑːn 71.00 80.33
  createme   create. 86.07   kɹiːeɪɾiːm   kɹiːeɪt 74.00 80.04
  spa   spato 82.83   spɑː   spɑːɾoʊ 77.00 79.92
  thermomix   termomatrix 84.24   θɜːməmɪks   tɜːməmeɪtɹɪks 75.00 79.62
  atma   atmaspa 82.21   ætmə   ætmæspə 77.00 79.61
  live   vive 79.17   laɪv   vaɪv 80.00 79.58
  cana   canya 92.17   kɑːnə   kænjə 67.00 79.58
  l'oreal   joreal 80.96   ɛloːɹiəl   dʒoːɹiəl 78.00 79.48
  seiko   seycos 65.50   seɪkoʊ   seɪkoʊz 93.00 79.25
  pockit   mypocket 76.47   pɑːkɪt   maɪpɑːkɪt 82.00 79.24
  bisleri   bilseri 91.10   baɪslɜːɹi   bɪlsɚɹi 67.00 79.05
  kikkoman   kikomand 91.08   kɪkɑːmən   kɪkəmænd 67.00 79.04
  fido   fiio 80.83   faɪdoʊ   fɪɪoʊ 77.00 78.92
  waken   wakeful 77.21   weɪkən   weɪkfəl 80.00 78.61
  nutravita   nootrovita 79.17   nʌtɹɐviːɾə   nuːtɹəviːɾə 78.00 78.58
  um bongo   ubongo! 84.11   ʌm bɑːŋɡoʊ   juːbɑːŋɡoʊ 73.00 78.55
  pyra   prya 83.75   pɪɹə   pɹaɪə 73.00 78.38
  ulma   luma 83.33   ʌlmə   luːmə 73.00 78.17
  fransa   fanza 78.50   fɹænsə   fænzə 77.00 77.75
  chef   chefchy 82.21   ʃɛf   ʃɛftʃi 73.00 77.61
  boss   bossvel 82.21   bɔs   bɔsvəl 73.00 77.61
  hanson   hansol 88.17   hænsən   hænsɑːl 67.00 77.58
  lucozade   glucos-aid 72.67   luːkəzeɪd   ɡluːkoʊzeɪd 82.00 77.33
  asos   asas 80.83   ɐsoʊz   ɐsæz 73.00 76.92
  iqos   niccos 67.50   aɪkoʊz   nɪkoʊz 86.00 76.75
  zemo   zoomo 67.11   ziːmoʊ   zuːmoʊ 86.00 76.56
  hyprr   hypernft 72.83   haɪpɚ   haɪpɚnft 80.00 76.42
  free   freeyoung 75.44   fɹiː   fɹiːjʌŋ 77.00 76.22
  bimbo   bimbys 81.17   bɪmboʊ   bɪmbiz 71.00 76.08
  uber   youber 84.44   juːbɚ   jaʊbɚ 67.00 75.72
  dune   dne 89.25   duːn   diːɛniː 62.00 75.62
  scaffeze   scaffx 80.08   skæfɛz   skæfks 71.00 75.54
  foltene   foltex 83.98   foʊltiːn   foʊltɛks 67.00 75.49
  abanca   abaca 93.56   ɐbæŋkə   æbɑːkə 57.00 75.28
  ch   ch_t. 70.50   siːeɪtʃ   siːeɪtʃ tiː 80.00 75.25
  suntech   suntank 69.93   sʌntɛk   sʌntæŋk 80.00 74.96
  hotpatch   patch 78.92   hɑːtpætʃ   pætʃ 71.00 74.96
  huracán   huracanrace 77.53   hjʊɹɹɐkɑːn   hjʊɹɹɐkænɹeɪs 72.00 74.76
  free   freetalk 78.50   fɹiː   fɹiːɾɔːk 71.00 74.75
  free   freeloop 78.50   fɹiː   fɹiːluːp 71.00 74.75
  intelect   entelec 77.90   ɪntɛlᵻkt   ɛntɛlɛk 71.00 74.45
  maplab   maplab.world 78.50   mæplæb   mæplæb wɜːld 70.00 74.25
  sacher   sachi 81.17   sæʃɚ   sætʃaɪ 67.00 74.08
  fanta   fantarifa 81.06   fæntə   fæntɑːɹɹɪfə 67.00 74.03
  fiorelli   fioretto 73.50   fɪoːɹɛli   fɪoːɹɛɾoʊ 74.00 73.75
  sherco   charco 72.39   ʃɜːkoʊ   tʃɑːɹkoʊ 75.00 73.69
  vidas   vidya 85.33   viːdəz   vɪdɪə 62.00 73.67
  gobox   g-box 84.00   ɡoʊbɑːks   dʒiːbɑːks 63.00 73.50
  idee   idee-home 75.44   ɪdiː   ɪdiːhoʊm 71.00 73.22
  starbucks   sardarbuksh 76.21   stɑːɹbʌks   sɑːɹdɑːɹbʌkʃ 70.00 73.11
  orange   orangery-o-s 78.50   ɔɹɪndʒ   ɔɹɪndʒɚɹioʊɛs 67.00 72.75
  free   freeyond 78.50   fɹiː   fɹiːjɑːnd 67.00 72.75
  free   freepods 78.50   fɹiː   fɹiːpɑːdz 67.00 72.75
  sanytol   savisol 67.07   sænɪtɑːl   sævɪsɑːl 78.00 72.54
  snuggledown   snugglemore 81.05   snʌɡəldaʊn   snʌɡəlmoːɹ 64.00 72.52
  pez   pezeeu 77.67   pɛz   pɛziːuː 67.00 72.33
  zirco   cozirc 77.61   zɜːkoʊ   kɑːzɜːk 67.00 72.31
  glenfiddich   inverfiddich 74.10   ɡlɛnfɪdɪtʃ   ɪnvɜːfɪdɪtʃ 70.00 72.05
  salio   saliogen 84.75   sælɪoʊ   sælɪədʒən 59.00 71.88
  vallformosa   fermosa 70.77   vælfoːɹmoʊsə   fɜːmoʊsə 73.00 71.88
  noughty   nouti 76.17   nɔːɾi   naʊɾi 67.00 71.58
  tesla   teslapimp 81.06   tɛslə   tɛslɐpɪmp 62.00 71.53
  live   life's 70.00   laɪv   laɪfz 73.00 71.50
  e-bulli   bullit 80.96   iːbʊli   bʊlɪt 62.00 71.48
  bimbo   bims 75.92   bɪmboʊ   bɪmz 67.00 71.46
  genie   genai 85.33   dʒiːni   dʒɛnaɪ 57.00 71.17
  lakme   like-me 70.32   lækmi   laɪkmiː 71.00 70.66
  kelio   kleeo 70.25   kɛlɪoʊ   kliːoʊ 71.00 70.62
  terry   terrissa 74.00   tɛɹi   tɛɹɪsə 67.00 70.50
  tygrys   tigris 73.50   tɪɡɹiz   taɪɡɹɪs 67.00 70.25
  nike   nuke 80.00   naɪk   nuːk 60.00 70.00
  007   skx007 58.50   ziəɹoʊziəɹoʊ sɛvən   ɛskeɪɛks ziəɹoʊziəɹoʊ sɛvən 81.00 69.75
  geneverse   genv3rse 85.28   dʒɛnɪvɜːs   dʒɛnv θɹiː ɑːɹɹɛsiː 53.00 69.14
  lego   solego 76.11   lɛɡoʊ   sɑːliːɡoʊ 62.00 69.06
  perry   perryhome 81.06   pɛɹi   pɛɹɪhoʊm 57.00 69.03
  kadawe   kademae 80.89   kædɔː   keɪdmiː 57.00 68.94
  acutil   accudis 70.84   ɐkjuːɾɪl   ɐkjuːdiz 67.00 68.92
  bru   bruys 82.83   bɹuː   bɹaɪz 55.00 68.92
  bimbo   wimko 66.67   bɪmboʊ   wɪmkoʊ 71.00 68.83
  cazoo   carkoo 79.39   kæzuː   kɑːɹkuː 57.00 68.19
  doctolib   avocatlib 75.78   dɑːktəlɪb   ævəkætlɪb 60.00 67.89
  boss   kissboss 62.67   bɔs   kɪsbɔs 73.00 67.83
  bmw   bmv 74.61   biːɛmdʌbəljuː   biːɛmviː 61.00 67.81
  marca   plusmarca 57.35   mɑːɹkə   plʌsmɑːɹkə 78.00 67.68
  mdh   mhs 61.28   ɛmdiːeɪtʃ   ɛmeɪtʃɛs 74.00 67.64
  align   clickalign 60.17   ɐlaɪn   klɪkɐlaɪn 75.00 67.58
  ajona   avoma 68.00   ædʒoʊnə   ævoʊmə 67.00 67.50
  zara   zaraphora 75.44   zɑːɹɹə   zæɹɐfoːɹə 59.00 67.22
  levi's   levigo 76.83   lɛviz   lɛvɪɡoʊ 57.00 66.92
  zara   zareus 71.25   zɑːɹɹə   zɛɹəs 62.00 66.62
  zara   zareus 71.25   zɑːɹɹə   zɛɹəs 62.00 66.62
  naturli'   natureal 82.50   neɪɾɜːli   neɪtʃɚɹiəl 50.00 66.25
  moncler   northcler 70.29   mɔŋklɚ   nɔːɹθklɚ 62.00 66.14
  airbnb   airbrick 70.17   ɛɹbnb   ɛɹbɹɪk 62.00 66.08
  resolva   consolva 69.15   ɹᵻzɑːlvə   kənsɑːlvə 63.00 66.08
  sanytol   sanatio 78.83   sænɪtɑːl   sæneɪʃɪoʊ 53.00 65.92
  moncler   montec 73.39   mɔŋklɚ   mɔntɛk 57.00 65.19
  apiretal   a'peal 77.38   ɐpaɪɚɾəl   ɐpiːl 53.00 65.19
  very   veryco 86.67   vɛɹi   vɜːɹɪkoʊ 43.00 64.83
  bimbo   vibo 72.67   bɪmboʊ   viːboʊ 57.00 64.83
  head   headoniste 72.50   hɛd   hɛdəniːst 57.00 64.75
  saypha   shaype 73.50   seɪfə   ʃeɪp 55.00 64.25
  helios   delio 77.61   hɛlɪoʊz   dᵻliːoʊ 50.00 63.81
  coversyl   covixyl-v 69.94   kʌvɚsɪl   kɑːvɪksɪlviː 57.00 63.47
  simoniz   permanize 58.60   sɪmənɪz   pɜːmənaɪz 67.00 62.80
  vfh   vfhonline 67.22   viːɛfeɪtʃ   viːɛfhɑːnlaɪn 58.00 62.61
  rolex   dermarollex 49.03   ɹoʊlɛks   dɜːmɚɹoʊlɛks 76.00 62.52
  apple   alpineapple 62.89   æpəl   ælpɪniːpəl 62.00 62.45
  thermomix   zaubermix 63.19   θɜːməmɪks   zɔːbɚmɪks 60.00 61.59
  magnavox   multivox 58.33   mæɡnɐvɑːks   mʌltivɑːks 64.00 61.17
  nutella   mixitella 68.83   nuːtɛlə   mɪksaɪtɛlə 53.00 60.92
  airbnb   francebnb 59.65   ɛɹbnb   fɹænsɛbnb 62.00 60.82
  curve   crv 81.50   kɜːv   siːɑːɹviː 40.00 60.75
  gallo   rampingallo 52.52   ɡæloʊ   ɹæmpɪŋɡæloʊ 67.00 59.76
  iphone   mifon 62.50   aɪfoʊn   mɪfɑːn 57.00 59.75
  joy   bjoie 59.44   dʒɔɪ   bjɔɪ 60.00 59.72
  jd   jdyaoying 57.63   dʒeɪdiː   dʒeɪdaɪeɪɑːiɪŋ 61.00 59.31
  bally   ballyclare 78.50   bɔːli   bælɪklɛɹ 40.00 59.25
  swift   microswift 55.17   swɪft   maɪkɹoʊswɪft 63.00 59.08
  bloo   bluuwash 45.67   bluː   bluːwɑːʃ 71.00 58.33
  head   superhead 53.69   hɛd   suːpɚhɛd 62.00 57.84
  trek   gotrekfeel 68.50   tɹɛk   ɡɑːtɹɪkfiːl 47.00 57.75
  blippi   bbibbi 58.33   blɪpi   biːbɪbi 57.00 57.67
  immun44   immuno-19 73.70   ɪmʌn foːɹɾi foːɹ   ɪmjuːnoʊ naɪntiːn 40.00 56.85
  rolex   relxhome 57.17   ɹoʊlɛks   ɹᵻlkshoʊm 56.00 56.58
  kpn   opn 72.39   keɪpiːɛn   ɑːpən 40.00 56.19
  mc   macbeans 58.75   ɛmsiː   məkbiːnz 53.00 55.88
  ape   apecessories 61.25   eɪp   eɪpɪsɛsɚɹiz 50.00 55.62
  airbnb   marseillebnb 57.17   ɛɹbnb   mɑːɹseɪlɛbnb 53.00 55.08
  facebook   motherbook 60.08   feɪsbʊk   mʌðɚbʊk 50.00 55.04
  alaïa   azzaia 64.00   ɐlæiːə   æzeɪə 46.00 55.00
  puma   coma 58.33   puːmə   koʊmə 50.00 54.17
  bimbo   amorbimbi 55.17   bɪmboʊ   ɐmoːɹbɪmbaɪ 53.00 54.08
  azure   azurity 77.21   æʒɚ   æzjʊɹɹᵻɾi 29.00 53.11
  bimbo   binbokplay 65.83   bɪmboʊ   baɪnbɑːkpleɪ 40.00 52.92
  zara   zorazone 54.86   zɑːɹɹə   zoːɹɐzoʊn 47.00 50.93
  matters   m4tter 81.71   mæɾɚz   ɛm foːɹ tiːtɜː 19.00 50.36
  quirón   quiromasté 59.44   kwɜːɹɑːn   kwɪɹəmɐsteɪ 38.00 48.72
  joy   joïsta 55.33   dʒɔɪ   dʒɑːiːstə 40.00 47.67
  louboutin   lubov 61.74   laʊbaʊtɪn   luːbɑːv 33.00 47.37
  we   wecotton 60.00   wiː   wɛkəʔn̩ 33.00 46.50
  mcdonalds   mcsweet 44.13   məkdɑːnəldz   məkswiːt 48.00 46.07
  md   intimd 25.00   ɛmdiː   ɪntɪmdiː 67.00 46.00
  sane   cbdsane 36.50   seɪn   siːbiːdiːseɪn 53.00 44.75
  book   restaubook 28.50   bʊk   ɹᵻstaʊbʊk 57.00 42.75
  h10   motel 10 18.00   eɪtʃ tɛn   moʊtɛl tɛn 60.00 39.00
  coco   kokomarina 42.83   koʊkoʊ   kɑːkəmɚɹiːnə 30.00 36.42
  mi   lovmi 28.50   maɪ   lʌvmi 40.00 34.25
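
As a quick arithmetic check on the rows above, the final column appears to be consistent with the simple mean of the two component scores in each row (the first score following the pair of marks, and the second score following the two phonetic transcriptions). The sketch below illustrates this for a few rows copied from the table. The combination rule is an assumption inferred from the figures themselves, and the names `rows`, `mean_score` and `max_deviation` are hypothetical, introduced only for illustration.

```python
# A few rows copied from the table above:
# (mark 1, mark 2, first score, second score, reported overall score)
rows = [
    ("bimbo", "wimko", 66.67, 71.00, 68.83),
    ("boss", "kissboss", 62.67, 73.00, 67.83),
    ("ajona", "avoma", 68.00, 67.00, 67.50),
    ("mi", "lovmi", 28.50, 40.00, 34.25),
]


def mean_score(first, second):
    """Arithmetic mean of the two component scores (assumed combination rule)."""
    return (first + second) / 2


def max_deviation(table):
    """Largest absolute gap between the computed mean and the reported overall score."""
    return max(abs(mean_score(a, b) - overall) for _, _, a, b, overall in table)


for mark_1, mark_2, a, b, overall in rows:
    print(f"{mark_1} / {mark_2}: mean = {mean_score(a, b):.3f}, reported = {overall:.2f}")

print(f"largest deviation: {max_deviation(rows):.3f}")
```

Any small differences in the second decimal place are most likely due to the component scores having been rounded for display before tabulation.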

References

[1] https://circleid.com/posts/towards-a-quantitative-approach-for-objectively-measuring-the-similarity-of-marks

[2] https://bowmanslaw.com/insights/degrees-of-similarity-put-to-the-test/

[3] https://www.taylorwessing.com/en/insights-and-events/insights/2021/03/were-confused-how-the-general-court-decides-when-trade-marks-are-confusingly-similar

[4] https://guidelines.euipo.europa.eu/1803468/1787906/trade-mark-guidelines/3-5-conclusion-on-similarity

[5] https://circleid.com/pdf/similarity_measurement_of_marks_part_4.pdf

[6] https://pypi.org/project/fuzzywuzzy/

[7] https://rapidfuzz.github.io/Levenshtein/levenshtein.html#jaro-winkler

[8] M. Bernard and H. Titeux (2021). 'Phonemizer: Text to Phones Transcription for Multiple Languages in Python', J. Open Source Software, 6(68), p.3958.

[9] https://pypi.org/project/phonemizer/

[10] https://www.internationalphoneticassociation.org/content/ipa-chart

[11] https://circleid.com/posts/further-developing-a-word-mark-similarity-measurement-framework

[12] Stobbs CaseFest #16, London, 02-Oct-2024

This article was first published as a white paper on 17 October 2024 at:

https://circleid.com/pdf/similarity_measurement_of_marks_part_5.pdf
