Revisiting 2 of the 5 docs from the Snowden leaks that mention 'cookies'.

GCHQ 2009 on 'target detection identifiers':
https://snowden.glendon.yorku.ca/items/show/188/

NSA 2011 on 'selector types':
https://snowden.glendon.yorku.ca/items/show/172

...featuring cookie/browser IDs from Google/Doubleclick, Facebook, Microsoft and many more.

It's breathtaking that the surveillance marketing industry managed to claim for so many years that unique personal identifiers processed in the web browser are 'anonymous', and that it sometimes still does.


Browser-based personal identifiers aka 'target detection identifiers' are 'unique and persistent for a user/machine', and they are a 'SIGINT standardised code', according to GCHQ (2009).

Ryan Gallagher reported on Snowden docs that mention browser/cookie IDs aka 'target detection identifiers' in 2014 and 2015:
https://theintercept.com/2015/09/25/gchq-radio-porn-spies-track-web-users-online-identities/
https://theintercept.com/2014/12/13/belgacom-hack-gchq-inside-story/

From Radio to Porn, British Spies Track Web Users’ Online Identities

Top-secret documents from whistleblower Edward Snowden expose UK eavesdropping agency GCHQ's attempts to create world's largest mass surveillance system.


This slide shows that the GCHQ stored 'all TDIs seen in last 6 months' in 'bulk' in their 'MUTANT BROTH' system.

'TDI <-> website correlations' were stored in the 'KARMA POLICE' system, which was widely reported on and even has its own Wikipedia entry:
https://en.wikipedia.org/wiki/Karma_Police_(surveillance_programme)

They also stored 'TDI correlations' in another system they referred to as 'AUTO ASSOC'. I wonder whether anyone has ever reported on this. Was this about linking identifiers associated with the same person/entity to each other?


Not sure whether this report was referring to data on a single person?

In addition to Facebook's famous 'datr' identifier, it shows 'utma' and 'utmz' identifiers processed by Google Analytics on YouPorn.
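For context on why 'utma' counts as a persistent identifier: the classic Google Analytics __utma cookie packs a per-domain visitor ID plus visit timestamps into one dot-separated value, so the same visitor ID shows up on every request. A tiny parser for the well-documented legacy layout (the example value below is made up):

```python
# Legacy Google Analytics __utma cookie layout (widely documented):
#   domain_hash.visitor_id.first_visit.previous_visit.current_visit.session_count
def parse_utma(value):
    f = value.split(".")
    return {
        "domain_hash": f[0],
        "visitor_id": f[1],
        "first_visit": int(f[2]),
        "previous_visit": int(f[3]),
        "current_visit": int(f[4]),
        "session_count": int(f[5]),
    }

# Made-up example value:
parsed = parse_utma("173272373.1219924515.1246996031.1247509890.1247581160.5")
```

The visitor_id/first_visit pair stays constant across visits, which is exactly the 'unique and persistent for a user/machine' property the GCHQ slide describes.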

It also shows non-persistent session IDs, which were possibly not very useful.

The GCHQ even specified in detail whether browser-based personal identifiers refer to a user account or device, and whether they are processed in a web request or in a browser cookie.

Ironically, I also spent quite some time doing the same work - categorizing browser and cookie identifiers - much later, starting in 2018/19. And I spent less than 250 hours per ID 🤖

In 2009, they 'discovered' only '70 distinct TDI types', though. Today, there are thousands of distinct browser/cookie IDs.

Another slide explains that they collected 18 billion 'target detection identifiers' in the period between 25 Dec 2007 and 20 Jun 2008.

This was certainly a large number in 2007/8. There's a much larger number of digital identifiers per person available today.

And yep, these revelations led to most of today's HTTP traffic being encrypted. Still, many entities have access to browser/cookie IDs, and some of those IDs are now accessible to many parties via trillions of RTB bid requests in digital marketing.

And then there's this perhaps better known 2011 NSA slide on 'selector types'.

A few major cookie IDs.

'Browser tags': was this related to malware/adware?

For mobile phones, they accessed only IMEIs and 'Apple UDID' in 2011.

Google's AAID and Apple's IDFA were introduced later (2014/16) and then became the most pervasive digital identifiers for tracking and profiling across a myriad of entities. Google and Apple have never been held accountable to this day.

And there was Bluetooth, already.

Not least, the GCHQ had so much fun discussing 'target detection identifiers' (TDIs) and mass surveillance, oh my 🥴

There's another Snowden doc from 2011 that provides more information about the GCHQ's AUTO ASSOC system, which calculates correlations between 'target detection identifiers' (TDIs):
https://maths.ed.ac.uk/~tl/docs/Problem-Book-Redacted.pdf

I think the doc was first published by @pluralistic in 2016:
https://boingboing.net/2016/02/02/doxxing-sherlock-3.html

Seems like the GCHQ operated a kind of probabilistic ID graph that aimed to link cookie IDs, device IDs, email addresses and other TDIs based on communication, timing and geolocation behavior.
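The doc doesn't spell out the linking algorithm, but the general idea of scoring TDI correlations can be sketched in a few lines. Everything below is invented for illustration, and I'm assuming a single crude signal ('seen from the same IP within a short time window'); the real system reportedly combined communication, timing and geolocation behavior:

```python
from collections import defaultdict
from itertools import combinations

# Invented example events: (identifier, source_ip, unix_timestamp)
events = [
    ("cookie:ga=123", "10.0.0.7", 1000),
    ("cookie:datr=abc", "10.0.0.7", 1004),
    ("email:alice@example.org", "10.0.0.7", 1010),
    ("cookie:ga=999", "10.0.0.9", 1002),
]

def correlate(events, window=30):
    """Count how often two identifiers co-occur on the same IP within `window` seconds."""
    by_ip = defaultdict(list)
    for tdi, ip, ts in events:
        by_ip[ip].append((ts, tdi))
    scores = defaultdict(int)
    for seen in by_ip.values():
        seen.sort()
        for (t1, a), (t2, b) in combinations(seen, 2):
            if a != b and t2 - t1 <= window:
                scores[frozenset((a, b))] += 1
    return scores
```

Pairs that accumulate a high score across many observations would then become candidate edges in the ID graph.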

In the doc, they think about how to improve the ID graph by adding further data sources ("SIGINT truthed datasets").

This includes potential internal collaborations that aim to "automatically determine the relationship between entities based on communication content" and even based on finding "triggers in audio content".

Btw, what inspired me to revisit these docs is @byrontau's excellent book Means of Control, which not only details how US defense, intelligence and law enforcement buy commercial data from digital marketing but also provides deep historical context, tracing back to early-2000s debates on Total Information Awareness (TIA).

While I was casually following debates on mass surveillance from Echelon and TIA to Snowden, I didn't seriously start to investigate commercial data practices until 2014/15.

Unfortunately, I discovered that there are many more Snowden docs that address TDIs and data from digital marketing/advertising.

I really shouldn't dive deeper into this rabbit hole of historical docs, probably most or all of their contents have been reported already, but... 😬🤯🤖

This table on 'query focused datasets' from a 2012 GCHQ doc confirms that the AUTO ASSOC system served as an ID graph, helping to find "other TDIs" belonging to a "target".

It also addresses other TDI queries, e.g. for traffic and similar-timed website visits:
https://assets.aclu.org/live/uploads/document/foia/GCHQAnalyticCloudChallenges.pdf
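Mechanically, a 'find other TDIs for a target' query is just reachability over such a correlation graph. A toy sketch with invented identifiers (nothing here reflects the actual QFD implementation):

```python
from collections import deque

# Invented correlation edges between identifiers (not from the actual docs):
edges = {
    ("cookie:ga=123", "cookie:datr=abc"),
    ("cookie:datr=abc", "email:alice@example.org"),
    ("imei:356938035643809", "email:alice@example.org"),
    ("cookie:ga=999", "cookie:xyz=1"),
}

def other_tdis(target, edges):
    """All identifiers reachable from `target` in the undirected ID graph."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {target}, deque([target])
    while queue:
        for nxt in adj.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {target}
```

Starting from one cookie, the query transitively picks up the linked email address and device ID, while unrelated identifiers stay out.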

Another slide from the same 'GCHQ Analytic Cloud' doc explains that they collected 50bn 'events' per day. I guess this refers to or includes HTTP requests?

35% had a TDI present.

The doc also lists example analyses including website visits and search terms per user/device.

Another GCHQ 2009 doc explains how TDI and HTTP data is extracted from the 'data feed', flows into BLACK HOLE flat file storage and is then put into 'query focused datasets' (QFD), which can be used e.g. to find 'other' identifiers for a person via AUTO ASSOC.

Once again, the doc emphasizes that TDIs include cookie identifiers.

(link to the doc cited above: https://christopher-parsons.com/wp-content/uploads/2023/01/gchq-black-hole-analytics.pdf)

Next, a joint NSA/GCHQ doc from 2011 addresses the threat of HTTPS/TLS encryption (a threat that really did materialize after the Snowden leaks):

"Which of our presence and communications metadata (including TDIs) will be lost under encrypted channels?
...
Are there other TDIs or "TDI-like" identifiers we can develop that can identify users or machines of interest, despite our targets using https?"

https://eff.org/files/2015/02/06/20141228-spiegel-nsa_-_gchq_crypt_discovery_joint_collaboration_activity_0.pdf

What's really interesting is revisiting this 2010/11 doc on how the GCHQ was exploiting data from Blackberry, Windows, iOS and Android smartphones, with a focus on how to identify IDs transmitted from apps to mobile advertising firms like Flurry and AdMob.
https://cdn.prod.www.spiegel.de/media/716e3462-0001-0014-0000-000000035670/media-35670.pdf

The Intercept reported on the doc in 2015 (https://theintercept.com/2015/01/26/secret-badass-spy-program/), although without addressing the TDI graph context, and it has been cited by mobile app security researchers.

It's hilarious to see how the GCHQ in 2010/11 made fun of Flurry, an early mobile ad/data company later acquired by Yahoo, for claiming to collect 'anonymous usage statistics' and data that cannot 'identify an individual', only to point out how several device IDs in the traffic can actually serve as TDIs for government surveillance.

Note the "how they know that" bubble 🤖

Here's how the GCHQ aimed to identify device identifiers (IMEI, Android ID) and GPS coordinates in data sent from 15,000 different apps to the early mobile ad exchange Mobclix.

The GCHQ emphasizes that Mobclix collects phone IDs, tracks location and matches its data with Nielsen data.

Similarly, the GCHQ identified (hashed) identifiers for iOS, Android and Windows phones in data sent from apps to Google's mobile ad company AdMob and others.
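Worth noting: hashing a device ID doesn't anonymize it. A deterministic hash is itself a stable identifier, and any party holding the raw ID can recompute it and match records across datasets. A minimal sketch (the docs don't say which hash functions were involved; SHA-1 and the IMEI value here are my assumptions):

```python
import hashlib

# A deterministic hash of a device ID is itself a stable pseudonymous ID:
# anyone holding the raw ID (or a dictionary of candidate IDs) can recompute
# the hash and match records.
def hashed_id(device_id, algo="sha1"):
    return hashlib.new(algo, device_id.encode()).hexdigest()

imei = "356938035643809"
record_from_app_a = hashed_id(imei)  # collected by one ad network
record_from_app_b = hashed_id(imei)  # collected independently by another
# Both records carry the same hash, so they can be joined across datasets.
```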

Back in 2010/11, all the data was being transmitted unencrypted, and I guess there was no mobile app 'SDK' but just apps including code snippets that triggered HTTP requests.

(but just because passive listening to HTTP traffic doesn't work anymore doesn't mean this commercial data cannot be exploited. Many companies have access...)

The doc is all about pseudonymous device IDs that can or might be used as 'target detection identifiers' (TDIs).

Windows Phone is not a thing anymore, but they tried to exploit not only Windows phone IDs but also app IDs, MSN ad IDs and appstore/marketplace IDs, for example, in order to match them as 'mobile TDIs' to Microsoft Live accounts.

@wchr i think only 1% got released - something like that

@wchr @byrontau
I don’t care how silly I look in a #mask, because I understand respiratory contagion.

I don’t care how tinfoil I appear for using #tor #linux and #monero, because I know how #internet protocols work.

@wchr I got this a few days ago, about my phone soon (already ?) scanning for local BT devices and reporting them to Google "as a service to me" and I thought "no fucking thanks"
@jab01701mid Also seen that, still didn't have time to look into it.
@wchr If I read it correctly, the German government is about to install a central system "to manage cookie consent", which would mean the same data you talk about in this thread will be commercially collected by law 🥳🤢
@lennybacon I'm very critical about the law, but AFAIK it does not mandate the centralized collection of actual IDs that are used for tracking/profiling in any way. The IDs maintained to store consent 'choices' could be exploited, though.
@wchr If it's helpful to understand why companies (and agencies) are so motivated to do this and how much money is at stake for them, there are publications on why matching people across multiple devices matters, especially to the advertising industry. It comes down to the fact that they want to measure the impact of ads and such on people, not on browser/app instances, so matching a person to all the devices they use makes a huge difference in revenue. Here's one example:
https://research.facebook.com/publications/people-and-cookies-imperfect-treatment-assignment-in-online-experiments/
People and Cookies: Imperfect Treatment Assignment in Online Experiments

Identifying the same internet user across devices or over time is often infeasible. This presents a problem for online experiments, as it precludes person-level randomization. Randomization must instead be done using imperfect proxies for people, like cookies, email addresses or device identifiers.

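The paper's core point, that cookie-level randomization dilutes measured effects because one person's cookies can land in both test and control, can be illustrated with a toy simulation (all parameters invented; this is my sketch, not the paper's model):

```python
import random

# Toy model (invented parameters): each person has `cookies_per_person`
# cookies. A person converts with probability `base`, plus `lift` if ANY of
# their cookies was assigned to the ad (exposure leaks across cookies).
def simulate(n_people=20000, cookies_per_person=2, base=0.05, lift=0.05,
             person_level=False, seed=0):
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_people):
        if person_level:
            # one coin flip per person: all cookies land in the same arm
            arms = [rng.random() < 0.5] * cookies_per_person
        else:
            # one coin flip per cookie: arms can disagree within a person
            arms = [rng.random() < 0.5 for _ in range(cookies_per_person)]
        exposed = any(arms)
        converted = rng.random() < base + (lift if exposed else 0.0)
        for in_treatment in arms:
            (treated if in_treatment else control).append(converted)
    return sum(treated) / len(treated) - sum(control) / len(control)

ideal = simulate(person_level=True)   # close to the true lift
naive = simulate(person_level=False)  # attenuated estimate
```

With person-level assignment the estimate recovers the true lift; with two independently assigned cookies per person, roughly half the lift disappears, because control cookies often belong to people who were exposed via their other cookie.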