Introduction
The UnregisteredGems.com series of articles explores a range of techniques to filter and search through the universe of unregistered domain names, in order to find examples which may be compelling candidates for entities looking to select a new brand name (and its associated domain). The previous instalment of the series[1] looked at the categorisation of candidate names according to the phonetic characteristics of their constituent consonants, using a simple one-to-one mapping between each consonant and a corresponding phonetic group.
In this study, I explore the use of a more formal phonetic representation of each string, involving its conversion to its IPA (International Phonetic Alphabet) representation[2]. This has a number of advantages over the previous approach, including the ability to properly handle differences in pronunciation of particular characters according to their context, handling of character combinations, and the ability to generalise the approach to strings of arbitrary consonant/vowel patterns and length.
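As a minimal sketch of this conversion step, the snippet below wraps the Phonemizer call mentioned in reference [2] and applies the affricate merging described in note [6]. The espeak backend and the 'en-us' language code are assumptions (the article states only that the tool is based on American English), and the helper names `to_ipa` and `merge_affricates` are hypothetical rather than taken from the original analysis.

```python
def merge_affricates(ipa: str) -> str:
    """Collapse the two-character affricates into single symbols (see note [6])."""
    return ipa.replace("tʃ", "ʧ").replace("dʒ", "ʤ")


def to_ipa(word: str) -> str:
    """Convert a string to IPA; needs the phonemizer package and an espeak install."""
    # Imported lazily so merge_affricates() can be used without phonemizer installed.
    from phonemizer import phonemize

    # Assumption: espeak backend with American English ('en-us').
    ipa = phonemize(word, language="en-us", backend="espeak", strip=True)
    return merge_affricates(ipa)
```

Calling `to_ipa("rolex")` would return the IPA string for that name, with any affricates already reduced to single characters.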
Framework
As in the previous study, the strings are classified according to the phonetic categories of their constituent consonants, with all vowel sounds combined into a single group. This approach follows from the assertion that the consonants comprise the core 'structure' of the word, and avoids having to handle the more complex nature of vowel sounds: the presence of diphthongs, variations in length (i.e. 'long' vs 'short' sounds), and the impact of the speaker's accent (noting that the IPA conversion tool used is based on American English).
The consonant sounds are divided into the following groupings, again following the framework used in the previous study, and with the phoneme symbols taking their usual IPA meanings[3,4].
Top-level group | Group | Type | Consonant phonemes |
---|---|---|---|
1 (plosive) | 1A | Bilabial plosive | b, p |
1 (plosive) | 1B | Alveolar plosive | d, t, ɾ |
1 (plosive) | 1C | Velar plosive | ɡ, k |
2 (nasal) | 2A | Bilabial nasal | m |
2 (nasal) | 2B | Alveolar nasal | n |
2 (nasal) | 2C | Velar nasal | ŋ |
3 (fricative) | 3A | Labiodental fricative | f, v |
3 (fricative) | 3B | Dental fricative | θ, ð |
3 (fricative) | 3C | Alveolar fricative | s, z |
3 (fricative) | 3D | Postalveolar fricative | ʃ, ʒ |
3 (fricative) | 3E | Glottal fricative | h |
4 (approximant) | 4A | Labial-velar approximant | w |
4 (approximant) | 4B | Retroflex approximant | ɹ, r[5] |
4 (approximant) | 4C | Palatal approximant | j |
5 (lateral approximant) | 5A | Alveolar lateral approximant | l |
6 (affricate)[6] | 6A | Postalveolar affricate | ʧ, ʤ |
Table 1: Groupings assigned to individual consonant phonemes as used in the analysis
Any string can then be represented as a 'code' (the 'word type'), comprising the top-level group numbers of the consonants (and with any vowel sounds, or sequences of consecutive vowel sounds, denoted simply as a 'V'), expressed in the order in which they appear in the string.
For example, the string 'rolex' is encoded in IPA representation as 'ɹoʊlɛks', which is assigned word type 4V5V13 (ɹ = 4, oʊ = V, l = 5, ɛ = V, k = 1, s = 3).
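The encoding can be sketched in a few lines of Python. The group assignments are transcribed from Table 1; the function name `word_type` is hypothetical, and the sketch assumes that any symbol not listed in Table 1 (including length marks) belongs to a vowel sound, which may not hold for every symbol a phonemizer can emit.

```python
# Top-level group for each consonant phoneme, transcribed from Table 1.
CONSONANT_GROUPS = {
    "1": "bpdtɾɡgk",  # plosives (ASCII 'g' added as an assumption alongside IPA 'ɡ')
    "2": "mnŋ",       # nasals
    "3": "fvθðszʃʒh", # fricatives
    "4": "wɹrj",      # approximants
    "5": "l",         # lateral approximant
    "6": "ʧʤ",        # affricates (already merged to single characters)
}
PHONEME_TO_GROUP = {p: g for g, ps in CONSONANT_GROUPS.items() for p in ps}


def word_type(ipa: str) -> str:
    """Return the word-type code for an IPA string."""
    code = []
    for symbol in ipa:
        if symbol.isspace():
            continue
        group = PHONEME_TO_GROUP.get(symbol)
        if group is not None:
            code.append(group)
        elif not code or code[-1] != "V":
            # Assumption: anything outside Table 1 is part of a vowel sound;
            # consecutive vowel symbols collapse into a single 'V'.
            code.append("V")
    return "".join(code)


print(word_type("ɹoʊlɛks"))  # 4V5V13
```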
Analysis
By analogy with the previous study, it is informative to consider again the same set of the 2,000 most popular 5-character names offered for sale on the domain marketplace Atom.com[7], in order to determine any patterns or common word types within the dataset. The length refers to the second-level domain name (SLD), i.e. the part of the domain name to the left of the dot. By virtue of their inclusion on the marketplace, these names have independently been deemed to be attractive from a brandability point of view.
There are actually 627 distinct word patterns represented in this dataset (noting that there are 7 distinct groups into which the phonemes can be assigned (cf. 6 in the previous study), and that there is here no upper limit to the total possible length of the word 'code' representation), of which the top ten are shown in Table 2.
Word type | No. domains |
---|---|
3V13V | 62 |
3V3V | 62 |
1V13V | 48 |
1V3V | 47 |
3V1V | 35 |
4V3V | 33 |
4V13V | 32 |
3V35V | 31 |
3V23V | 30 |
1V1V | 29 |
Table 2: Top ten word types represented in the dataset of 2,000 most popular 5-character (made-up, up to two syllable) names on the Atom.com domain marketplace
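A ranking of this kind can be produced with a simple frequency count. The sketch below assumes the word type of each name has already been computed; the function name `top_word_types` and the toy input are hypothetical and are not drawn from the Atom.com dataset.

```python
from collections import Counter


def top_word_types(word_types, n=10):
    """Return the n most frequent word types with their counts."""
    return Counter(word_types).most_common(n)


# Hypothetical toy input: one word-type code per domain name.
sample = ["3V13V", "3V3V", "3V13V", "1V3V", "3V3V", "3V13V"]

print(len(set(sample)))             # number of distinct word types: 3
print(top_word_types(sample, n=2))  # [('3V13V', 3), ('3V3V', 2)]
```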
Accordingly, there are 62 of the 2,000 domains whose (SLD) names fit the (joint) most common word-type pattern (3V13V) represented amongst this set of popular domains, which are listed below.
Word type 3V13V:
[List of 62 names not preserved]
Discussion
As discussed in the previous instalment, this type of analysis may allow steps towards the development of a set of 'guidelines' as to which word types (i.e. sound patterns) might constitute the most preferred names from a brandability point of view. If so, these ideas could be used as a basis for filtering large datasets to identify possible candidate names of interest. One downside to this approach is that, as with the use of phonotactic analysis[8], the framework presented here involves the conversion of each string to a phonetic representation, which is computationally relatively slow. However, unlike phonotactic analysis, this new methodology provides a basis for a more granular clustering of candidate names, and (provided the preferred word types are correctly selected) may provide a more effective 'mapping' between candidate names and their potential desirability.
If (for example) we assume that word type 3V13V is a 'good' pattern for brandable names, it is informative to investigate its use as a filter. For illustration, we can consider the set of unregistered .com names of the form CVCCV ('C' = consonant, 'V' = vowel) from the original study in this series, using the subset beginning with the letter 's' (a 'group 3'-type sound) as an example. There are 6,044 such names. Of these, 567 (9.4%) are found to be of word type 3V13V[9], and it might be reasonable to assume that (at least some of) these may be candidates for brandability which are at least as credible as the names taken from Atom.com listed above. Some examples of the names in this new filtered dataset include sagsy, sedsi, sicsy, sodsy, sudci, suqsy, sybzi, sycci, sygzy, syksi and sytzo.
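The filtering step and the quoted percentage can be sketched as follows. The sketch assumes names are lowercase a-z strings whose word types have already been computed, and it treats 'y' as a vowel letter, as the quoted examples (e.g. 'sybzi') suggest. The function names are hypothetical, and 'sagdi' is a made-up entry included only to show a name being rejected.

```python
VOWEL_LETTERS = set("aeiouy")  # assumption: 'y' counts as a vowel letter


def letter_pattern(name: str) -> str:
    """Map each letter to 'C' or 'V', e.g. 'sagsy' -> 'CVCCV'."""
    return "".join("V" if ch in VOWEL_LETTERS else "C" for ch in name.lower())


def filter_candidates(word_types_by_name, target="3V13V", start="s"):
    """Keep CVCCV names with the given first letter and target word type."""
    return [
        name
        for name, wt in word_types_by_name.items()
        if name.startswith(start)
        and letter_pattern(name) == "CVCCV"
        and wt == target
    ]


def share(matches: int, total: int) -> float:
    """Percentage of matches, rounded to one decimal place."""
    return round(100 * matches / total, 1)


# 'sagsy' and 'sedsi' are quoted in the text as 3V13V; 'sagdi' is hypothetical.
toy = {"sagsy": "3V13V", "sedsi": "3V13V", "sagdi": "3V11V"}
print(filter_candidates(toy))  # ['sagsy', 'sedsi']
print(share(567, 6044))        # 9.4
```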
References
[1] https://circleid.com/posts/unregistered-gems-part-5-using-groupings-to-find-brandable-domains
[2] The conversion is carried out using the Python module Phonemizer, as was also used in a previous study on the analysis of strings for the purposes of mark similarity quantification:
- https://circleid.com/pdf/similarity_measurement_of_marks_part_3.pdf
- https://www.linkedin.com/pulse/measuring-similarity-marks-overview-suggested-ideas-david-barnett-zo7fe/
[3] https://home.cc.umanitoba.ca/~krussll/phonetics/articulation/describing-consonants.html
[4] https://www.dyslexia-reading-well.com/44-phonemes-in-english.html
[5] Technically, the 'r' phoneme represents a (voiced) alveolar trill, but is grouped together in this analysis with the 'ɹ' sound due to the similarity/ambiguity between the two.
[6] The Phonemizer module actually outputs these symbols each as two distinct characters ('tʃ' and 'dʒ', respectively), so they are first converted to single characters ('ʧ' and 'ʤ') wherever they appear in the IPA representations, to ensure they are treated as single phonemes ('ch' and 'dg', respectively) in the subsequent analysis.
[7] https://www.atom.com/premium-domains-for-sale/all/length/5%20Letters
[9] This is the second most common word type in the sVCCV dataset, after 3V11V (651 instances); in total, there are 94 distinct word types represented in that dataset.
This article was first published on 14 January 2025 at:
https://circleid.com/posts/unregistered-gems-part-6-phonemizing-strings-to-find-brandable-domains