Thursday, 14 August 2025

Further explorations in clustering - use of Google advertising tracking links

Part of the 'Patterns in Brand Monitoring: Brand Protection Data is Beautiful' series of articles[1,2,3,4]

Introduction

'Clustering' in brand protection is the process of discovering features shared between distinct findings (such as websites), as a means of establishing a connection between those results. In general terms, this type of analysis is beneficial because it allows the most significant infringements (i.e. those associated with extensive networks of activity) to be identified, and can provide investigative insights into the underlying entity or entities (i.e. the owners / administrators of the content in question).

Our previous discussion on 'clustering' analysis considered the case of e-mail addresses as a potential basis for establishing a link[5], and it is similarly possible to use other features, such as telephone numbers (though this is made more complicated by the wide range of formats in which the numbers can be written) or hyperlinks to (for example) associated social-media pages. It is also worth noting that the general process of establishing data clusters is one compelling potential application for AI functionality, which is theoretically able to interpret data presented in a wide range of different ways and contexts[6].

In this new article, we consider the case of Google tracking links as a suitable feature for establishing connections between websites. The analysis in this study is based specifically on the Google Tag Manager system, which is used for functionality relating to website tracking and marketing / advertising, and utilises links incorporating identity ('ID') codes unique to the account of the owner of the website[7,8]. Previous analysis has established that many infringers tend to utilise the same ID code across large numbers of sites under their operation, to monitor the performance of their portfolio, rather than using a unique code / Google account for each site. Accordingly, identifying the same tracker code on multiple different sites provides a means of definitively establishing a connection between those sites.

The analysis consists of utilising 'scraper' functionality to identify these links in the HTML source code of websites of potential interest (for those examples utilising the Google tracking functionality) and extracting the user-specific tracker-ID codes from them. We consider links of the general form googletagmanager[.]com/***?id=XXXXX (where '***' is an arbitrary string of characters, and 'XXXXX' is the tracker ID-code, written as an alphanumeric string).
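As an illustration of the extraction step, the following is a minimal Python sketch, not the scraper actually used in this study. It assumes the page HTML has already been fetched as a string; the function name `extract_tracker_ids`, the regular expression and the sample IDs are all hypothetical. Only links of the general form described above are matched, so any ID that a page passes to Google's script in some other way (for example, as a JavaScript argument rather than in a URL) would be missed.

```python
import re

# Assumed pattern for links of the form googletagmanager.com/***?id=XXXXX.
# The path part ('***') is allowed to be any run of characters that does not
# end the URL; the ID is taken to be letters, digits, hyphens or underscores.
TRACKER_LINK = re.compile(
    r"googletagmanager\.com/[^\s\"'<>?]*\?id=([A-Za-z0-9_-]+)",
    re.IGNORECASE,
)


def extract_tracker_ids(html: str) -> set[str]:
    """Return the distinct tracker IDs found in a page's HTML source."""
    return set(TRACKER_LINK.findall(html))


# Hypothetical page source containing two tracking links.
sample_html = """
<script async src="https://www.googletagmanager.com/gtag/js?id=G-EXAMPLE123"></script>
<iframe src="https://www.googletagmanager.com/ns.html?id=GTM-EXAMPLE"></iframe>
"""

print(sorted(extract_tracker_ids(sample_html)))
```

Returning a set means that a code repeated several times on the same page is counted only once for that site.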

Furthermore, open-source databases such as that offered by publicwww[.]com make it possible to carry out wider searches for other appearances of the same code, and therefore build a bigger picture of infringer activity.

Analysis

The analysis uses the same set of around 4,500 websites considered in the previous article; these are brand-specific domain names resolving to live web content, pertaining to a particular fashion brand.

In this new study, Google tracking links were identified on around 900 of the sites in question, and 50 of the identified tracking-code IDs were found on more than one site, thereby providing criteria for establishing clusters.
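The grouping step can be sketched as follows, assuming the tracker IDs have already been extracted for each site. The function name `cluster_by_tracker` and the domain names and codes in the example are hypothetical; the sketch simply inverts a site-to-codes mapping and keeps only those codes seen on more than one site.

```python
from collections import defaultdict


def cluster_by_tracker(site_to_ids: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each tracker ID to its sites, keeping only IDs seen on 2+ sites."""
    code_to_sites: dict[str, set[str]] = defaultdict(set)
    for site, ids in site_to_ids.items():
        for tracker_id in ids:
            code_to_sites[tracker_id].add(site)
    # A code found on a single site does not link anything, so it is dropped.
    return {code: sites for code, sites in code_to_sites.items() if len(sites) > 1}


# Hypothetical input: three sites, one code shared between two of them.
site_to_ids = {
    "shop-one.example": {"GTM-AAAA"},
    "shop-two.example": {"GTM-AAAA", "G-BBBB"},
    "shop-three.example": {"UA-CCCC"},
}

clusters = cluster_by_tracker(site_to_ids)
for code, sites in clusters.items():
    print(code, sorted(sites))
```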

Upon deeper analysis, certain clusters turn out not to reveal any significant insights - for example, one of the tracking codes, which actually appears on 167 distinct sites, seems just to relate to a particular web-hosting service provider (whose parking page appears in association with many of the domain names in question), rather than actually pertaining to the underlying website owners.

However, many of the clusters do seem to reveal meaningful links, such as a group of 14 sites (the next-largest cluster in the dataset) all featuring the same tracking code, but which would not otherwise easily have been known to be linked (Figures 1 and 2). Information from publicwww[.]com shows that this same code actually appears on over 232,000 distinct websites across the wider Internet.

Figure 1: (Redacted) examples of websites from the cluster of 14 all determined to be linked by virtue of the use of the same tracking code

Figure 2: (Redacted) website source code snippet present on all sites shown in Figure 1

In order to carry out deeper dives into the data, the dataset can be processed in a range of different ways to reveal and visualise the nature of the clusters. One convenient first step is the production of an 'adjacency matrix' (Figure 3) for the sets of sites (vertical axis) and distinct tracking codes (horizontal axis) in the dataset, in which a row/column intersection is marked with a '1' (red highlighting) if the code appears on the site in question, and '0' otherwise. Even from this raw data, some insights can be drawn, such as the identification of the large cluster associated with the tracking code shown third from the right in the screenshot, for which many of the row entries (corresponding to distinct associated websites) are highlighted in red.
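A matrix of this kind can be built in a few lines. The sketch below is illustrative only and assumes a mapping from each tracking code to the set of sites on which it appears; the name `build_adjacency_matrix` and the example data are hypothetical.

```python
def build_adjacency_matrix(
    code_to_sites: dict[str, set[str]],
) -> tuple[list[str], list[str], list[list[int]]]:
    """Return (sites, codes, matrix) with one row per site, one column per code."""
    codes = sorted(code_to_sites)
    sites = sorted({site for members in code_to_sites.values() for site in members})
    # A cell is 1 if the code (column) appears on the site (row), else 0.
    matrix = [
        [1 if site in code_to_sites[code] else 0 for code in codes]
        for site in sites
    ]
    return sites, codes, matrix


# Hypothetical clusters: two codes, with one site carrying both.
example_clusters = {
    "GTM-AAAA": {"one.example", "two.example"},
    "G-BBBB": {"two.example", "three.example"},
}

sites, codes, matrix = build_adjacency_matrix(example_clusters)
print("site".ljust(16), *codes)
for site, row in zip(sites, matrix):
    print(site.ljust(16), *row)
```

Summing a column gives the size of the cluster for that code, and a row with more than one '1' marks a site that bridges two codes.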

Figure 3: Screenshot of the 'adjacency matrix' for the distinct sites and tracking codes present within any of the clusters in the dataset

This matrix can then be used as the basis for creating further visualisations of the data. For example, a number of standard Python libraries[9] can be used for the creation of visual 'networks' showing the connections within the dataset (Figures 4 and 5). These types of clusters show us that the websites in question in each case (represented by the nodes in blue) are all associated with each other, and could potentially be addressed in single bulk enforcement actions, thereby building efficiencies into the takedown process.
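The grouping that underlies such network views can be sketched as a search for connected components in a graph whose nodes are sites and tracking codes, with an edge wherever a code appears on a site. The example below is not the code behind the figures: it uses a plain breadth-first search from the Python standard library so that it runs without third-party packages, and the function name `connected_clusters` and the example data are hypothetical. NetworkX's `connected_components` function should give the same groupings on an equivalent graph.

```python
from collections import defaultdict, deque


def connected_clusters(code_to_sites: dict[str, set[str]]) -> list[list[str]]:
    """Group sites that are linked, directly or indirectly, by shared codes."""
    # Undirected graph with two kinds of node: ("site", name) and ("code", id).
    neighbours: dict[tuple[str, str], set[tuple[str, str]]] = defaultdict(set)
    for code, members in code_to_sites.items():
        for site in members:
            neighbours[("code", code)].add(("site", site))
            neighbours[("site", site)].add(("code", code))

    seen: set[tuple[str, str]] = set()
    groups: list[list[str]] = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from 'start'.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        groups.append(sorted(name for kind, name in component if kind == "site"))
    return groups


# Hypothetical data: the first two codes are bridged by 'b.example'.
example = {
    "GTM-AAA": {"a.example", "b.example"},
    "G-BBB": {"b.example", "c.example"},
    "UA-CCC": {"d.example", "e.example"},
}

for group in connected_clusters(example):
    print(group)
```

Treating both sites and codes as nodes means that sites with no code in common can still fall into the same group when a third site carries both codes, which is the kind of interconnection that a single-code view would miss.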

Figure 4: (Obfuscated[10]) visualisation of the cluster of 14 sites from which the examples in Figure 1 were taken (websites shown as blue nodes, tracking codes as green nodes)

Figure 5: (Obfuscated) examples of other clusters within the dataset which are of particular interest because of the presence of multiple interconnections between the sites / tracking codes in question (websites shown as blue nodes, tracking codes as green nodes)

Conclusion

The concept of clustering is a key component of the analysis process for websites and other results identified through a programme of brand monitoring. As part of a holistic brand protection initiative, it can help identify key infringers for prioritised action and enforcement, and help identify other linked content, through which a fuller picture of the underlying entities and their associated activities can be established.

The use of Google advertising tracking codes is a compelling basis for identifying connections, as they are generally specific to a particular user, are frequently utilised across multiple different sites in an infringer's portfolio, appear to be relatively ubiquitous across web content generally, can be readily extracted from the source code of webpages, and can often be tied to additional related material through the use of insights drawn from open-source databases.

References

[1] https://www.linkedin.com/pulse/brand-protection-data-beautiful-david-barnett-c66be/

[2] https://www.linkedin.com/pulse/brand-protection-data-still-beautiful-part-1-year-domains-barnett-juwhe/

[3] https://www.linkedin.com/pulse/brand-monitoring-data-niblet-5-law-firm-scam-websites-david-barnett-ap5de/

[4] https://www.iamstobbs.com/insights/notorious-ip-addresses-and-initial-steps-towards-the-formulation-of-an-overall-threat-score-for-websites

[5] https://www.iamstobbs.com/insights/e-mail-address-extraction-from-webpages-a-quick-case-study-in-result-clustering

[6] https://circleid.com/posts/braive-new-world-part-1-brand-protection-clustering-as-a-candidate-task-for-the-application-of-ai-capabilities

[7] https://support.google.com/tagmanager/answer/6102821?hl=en

[8] https://www.analyticsmania.com/post/google-tag-manager-vs-google-analytics/

[9] The figures in this study utilise the Python libraries NetworkX (https://networkx.org/; A.A. Hagberg, D.A. Schult and P.J. Swart (2008). "Exploring network structure, dynamics, and function using NetworkX". In: Proceedings of the 7th Python in Science Conference (SciPy2008), G. Varoquaux, T. Vaught and J. Millman (Eds.) (Pasadena, CA USA), pp. 11–15.) and Matplotlib (https://matplotlib.org/). 

[10] In the visualisations of the clusters, the brand name (as it appears in the domain names) is replaced by the string '[brand]', and an encoded ('hashed') form of the tracking codes (which generally exist in the raw data in the form 'GTM-XXXXX', 'G-XXXXX', 'AW-XXXXX' or 'UA-XXXXX', where 'XXXXX' is an alphanumeric string) is shown in each case.

This article was first published on 14 August 2025 at:

https://www.iamstobbs.com/insights/further-explorations-in-clustering-use-of-google-advertising-tracking-links
