Wednesday, 15 January 2025

Unregistered Gems Part 6: Phonemizing strings to find brandable domains

Introduction

The UnregisteredGems.com series of articles explores a range of techniques for filtering and searching the universe of unregistered domain names, in order to find examples which may be compelling candidates for entities selecting a new brand name (and its associated domain). The previous instalment of the series[1] categorised candidate names according to the phonetic characteristics of their constituent consonants, using a simple one-to-one mapping between each consonant and a corresponding phonetic group.

In this study, I explore a more formal phonetic representation of each string, based on its conversion to IPA (International Phonetic Alphabet) notation[2]. This has a number of advantages over the previous approach: it properly handles context-dependent differences in the pronunciation of particular characters, it handles character combinations, and it generalises to strings of arbitrary length and consonant/vowel pattern.
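As a rough illustration of the conversion step, the sketch below wraps Phonemizer's `phonemize` function and applies the affricate merge described in note [6]. The function names (`to_ipa`, `merge_affricates`, `_espeak_backend`) are hypothetical, and the choice of the espeak backend with `language="en-us"` is an assumption based on the American English remark in this article, not a confirmed detail of the original analysis. The backend is passed as a parameter so that it can be swapped or stubbed.

```python
from typing import Callable, Iterable, List


def merge_affricates(ipa: str) -> str:
    """Collapse the two-character affricates into single symbols (see note [6])."""
    return ipa.replace("tʃ", "ʧ").replace("dʒ", "ʤ")


def _espeak_backend(batch: List[str]) -> List[str]:
    # Imported lazily: Phonemizer and an espeak installation are only
    # required when this default backend is actually called.
    from phonemizer import phonemize

    # Assumed settings: American English via the espeak backend.
    return phonemize(batch, language="en-us", backend="espeak", strip=True)


def to_ipa(names: Iterable[str],
           backend: Callable[[List[str]], List[str]] = _espeak_backend) -> List[str]:
    """Return one IPA string per input name, with affricates merged."""
    return [merge_affricates(ipa) for ipa in backend(list(names))]
```

Passing the whole list of names in one call, rather than one name at a time, is likely to reduce the per-call overhead of the conversion, which matters given how slow this step is.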

Framework

As in the previous study, the strings are classified according to the phonetic categories of their constituent consonants, with all vowel sounds combined into a single group. This approach follows from the assertion that the consonants comprise the core 'structure' of the word, and it avoids having to handle the more complex nature of vowel sounds: the presence of diphthongs, variations in length (i.e. 'long' vs 'short' sounds), and the impact of the speaker's accent (noting that the IPA conversion tool used is based on American English).

The consonant sounds are divided into the following groupings, again following the framework used in the previous study, and with the phoneme symbols taking their usual IPA meanings[3,4].

  Top-level group          Group   Type                           Consonant phonemes
  1 (plosive)              1A      Bilabial plosive               b, p
  1 (plosive)              1B      Alveolar plosive               d, t, ɾ
  1 (plosive)              1C      Velar plosive                  ɡ, k
  2 (nasal)                2A      Bilabial nasal                 m
  2 (nasal)                2B      Alveolar nasal                 n
  2 (nasal)                2C      Velar nasal                    ŋ
  3 (fricative)            3A      Labiodental fricative          f, v
  3 (fricative)            3B      Dental fricative               θ, ð
  3 (fricative)            3C      Alveolar fricative             s, z
  3 (fricative)            3D      Postalveolar fricative         ʃ, ʒ
  3 (fricative)            3E      Glottal fricative              h
  4 (approximant)          4A      Labial-velar approximant       w
  4 (approximant)          4B      Retroflex approximant          ɹ, r[5]
  4 (approximant)          4C      Palatal approximant            j
  5 (lateral approximant)  5A      Alveolar lateral approximant   l
  6 (affricate)[6]         6A      Postalveolar affricate         ʧ, ʤ

Table 1: Groupings assigned to individual consonant phonemes as used in the analysis

Any string can then be represented as a 'code' (the 'word type'), comprising the top-level group numbers of its consonants in the order in which they appear in the string, with any vowel sound, or sequence of consecutive vowel sounds, denoted simply as 'V'.

For example, the string 'rolex' is rendered in IPA as 'ɹoʊlɛks', which is assigned word type 4V5V13 (ɹ = 4, oʊ = V, l = 5, ɛ = V, k = 1, s = 3).
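The encoding can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the analysis: the name `word_type` is hypothetical, the ASCII 'g' is accepted alongside the IPA 'ɡ', stress and length marks are skipped, and every remaining symbol that is not a Table 1 consonant (including r-coloured vowels such as 'ɚ') is treated as a vowel.

```python
import unicodedata

# Top-level groups from Table 1, keyed by group number.
GROUPS = {
    "1": "bpdtɾɡgk",   # plosives (ASCII 'g' added as an assumption)
    "2": "mnŋ",        # nasals
    "3": "fvθðszʃʒh",  # fricatives
    "4": "wɹrj",       # approximants
    "5": "l",          # lateral approximant
    "6": "ʧʤ",         # affricates, as single symbols (see note [6])
}
CONSONANT_GROUP = {ph: grp for grp, phs in GROUPS.items() for ph in phs}

# Assumption: stress marks, length marks and spaces carry no sound of their own.
IGNORED = set("ˈˌː ")


def word_type(ipa: str) -> str:
    """Encode an IPA string as its word type, e.g. 'ɹoʊlɛks' -> '4V5V13'."""
    # Merge two-character affricates so each counts as one phoneme.
    ipa = ipa.replace("tʃ", "ʧ").replace("dʒ", "ʤ")
    code = []
    for symbol in ipa:
        # Skip ignored marks and combining diacritics.
        if symbol in IGNORED or unicodedata.combining(symbol):
            continue
        token = CONSONANT_GROUP.get(symbol, "V")
        # A run of consecutive vowel symbols collapses to a single 'V'.
        if token == "V" and code and code[-1] == "V":
            continue
        code.append(token)
    return "".join(code)


print(word_type("ɹoʊlɛks"))  # 4V5V13
```

Treating everything outside the consonant table as a vowel keeps the sketch short, but it means any unexpected symbol in the IPA output is silently counted as 'V', so the output alphabet is worth checking before relying on it.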

Analysis

By analogy with the previous study, it is informative to consider again the same set of the 2,000 most popular names with a 5-character second-level domain (SLD, i.e. the part of the domain name to the left of the dot) offered for sale on the domain marketplace Atom.com[7], and to determine any patterns or common word types within this dataset. By virtue of their inclusion on the marketplace, these names have independently been deemed attractive from a brandability point of view.

There are 627 distinct word types represented in this dataset, of which the top ten are shown in Table 2. (Note that there are here 7 distinct groups to which the phonemes can be assigned, cf. 6 in the previous study, and that there is no upper limit on the length of the word-type 'code'.)

  Word type   No. domains
  3V13V       62
  3V3V        62
  1V13V       48
  1V3V        47
  3V1V        35
  4V3V        33
  4V13V       32
  3V35V       31
  3V23V       30
  1V1V        29

Table 2: Top ten word types represented in the dataset of 2,000 most popular 5-character (made-up, up to two syllable) names on the Atom.com domain marketplace
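A tally of this kind can be produced with a simple frequency count. The sketch below assumes the word types have already been computed for each name; the function name `top_word_types` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_word_types(word_types: Iterable[str], n: int = 10) -> Tuple[int, List[Tuple[str, int]]]:
    """Return the number of distinct word types and the n most common ones."""
    counts = Counter(word_types)
    # most_common() sorts by descending count; ties keep first-seen order.
    return len(counts), counts.most_common(n)
```

Applied to the word types of the full set of names, the first returned value corresponds to the number of distinct word types and the second to the rows of a table such as Table 2.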

Accordingly, 62 of the 2,000 domains have (SLD) names fitting the (joint) most common word type (3V13V) in this set of popular domains; these are listed below.

Word type 3V13V:

  • vodzy         
  • vebsy         
  • hauxa         
  • fexie         
  • hixxi
  • xaxor
  • zetza
  • suxxo
  • xaxxy
  • huxxa
  • vudzi
  • sedza
  • hydso
  • vitvy
  • phexy
  • cipza
  • votvy
  • xuxxo
  • zuxxa
  • cexxi
  • zeexo
  • zogzy
  • zepvi
  • ciexa
  • soaxy
  • vapzy
  • vycci
  • fudfy
  • vybsy
  • veexy
  • foxxu
  • vodvi
  • fiexa
  • vuxxy
  • vauxa
  • fabvy
  • zotvo
  • cerxa
  • zatva
  • zepfy
  • vapzi
  • hoxor
  • serxa
  • huxey
  • vegvy
  • vuxoo
  • fotvi
  • vuxxi
  • xoxxy
  • cixxa
  • suxxa
  • vibsi
  • hooxo
  • fauxo
  • zopzy
  • zabvi
  • virxi
  • huxee
  • voixi
  • huxxo
  • zirxo
  • zopvi

Discussion

As discussed in the previous instalment, this type of analysis may allow steps towards a set of 'guidelines' as to which word types (i.e. sound patterns) might constitute the most preferred names from a brandability point of view. If so, these ideas could be used as a basis for filtering large datasets to identify candidate names of interest. One downside is that, as with phonotactic analysis[8], the framework presented here requires each string to be converted to a phonetic representation, which is computationally relatively slow. However, unlike phonotactic analysis, the new methodology provides a basis for a more granular clustering of candidate names and, provided the preferred word types are correctly selected, may offer a more effective 'mapping' between candidate names and their potential desirability.

If (for example) we assume that word type 3V13V is a 'good' pattern for brandable names, it is informative to investigate its use as a filter. For illustration, we can consider the set of unregistered .com names of the form CVCCV ('C' = consonant, 'V' = vowel) from the original study in this series, taking the subset beginning with the letter 's' (a 'group 3'-type sound) as an example. There are 6,044 such names, of which 567 (9.4%) are found to be of word type 3V13V[9]. It might be reasonable to assume that at least some of these are candidates for brandability at least as credible as the Atom.com names listed above. Examples from this filtered dataset include sagsy, sedsi, sicsy, sodsy, sudci, suqsy, sybzi, sycci, sygzy, syksi and sytzo.
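The filtering step itself is straightforward once each candidate name has a word type. The sketch below assumes a dictionary mapping each name to its precomputed word type; the function name `filter_by_word_type` is hypothetical. The final line reproduces the proportion quoted above.

```python
from typing import Dict, List, Tuple


def filter_by_word_type(types_by_name: Dict[str, str],
                        target: str = "3V13V") -> Tuple[List[str], float]:
    """Return the names matching the target word type and their percentage share."""
    matches = sorted(name for name, wt in types_by_name.items() if wt == target)
    share = 100 * len(matches) / len(types_by_name) if types_by_name else 0.0
    return matches, share


# The share quoted in the text: 567 matches out of 6,044 names.
print(round(100 * 567 / 6044, 1))  # 9.4
```

Because the phonetic conversion is the slow part, it may be worth storing the word type for each name once and then filtering against several target patterns, rather than recomputing it per filter.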

References

[1] https://circleid.com/posts/unregistered-gems-part-5-using-groupings-to-find-brandable-domains

[2] The conversion is carried out using the Python module Phonemizer, as was also used in a previous study on the analysis of strings for the purposes of mark similarity quantification: 

[3] https://home.cc.umanitoba.ca/~krussll/phonetics/articulation/describing-consonants.html

[4] https://www.dyslexia-reading-well.com/44-phonemes-in-english.html

[5] Technically, the 'r' phoneme represents a (voiced) alveolar trill, but is grouped together in this analysis with the 'ɹ' sound due to the similarity/ambiguity between the two. 

[6] The Phonemizer module actually outputs these symbols each as two distinct characters ('tʃ' and 'dʒ', respectively), so they are first converted to single characters ('ʧ' and 'ʤ') wherever they appear in the IPA representations, to ensure they are treated as single phonemes ('ch' and 'dg', respectively) in the subsequent analysis. 

[7] https://www.atom.com/premium-domains-for-sale/all/length/5%20Letters

[8] https://circleid.com/posts/20240903-unregistered-gems-identifying-brandable-domain-names-using-phonotactic-analysis

[9] This is the second most common word type in the sVCCV dataset, after 3V11V (651 instances); 94 distinct word types are represented in the dataset in total.

This article was first published on 14 January 2025 at:

https://circleid.com/posts/unregistered-gems-part-6-phonemizing-strings-to-find-brandable-domains
