Earlier today I learned that pip includes a bunch of telemetry data in the HTTP User-Agent header for every request it makes, and has for >10 years (with increasing amounts of info): https://github.com/pypa/pip/blob/545eda389c41478e2f99d23212254d757d8c2cef/src/pip/_internal/network/session.py#L109
Not only is this not opt-in (as any telemetry should be), but there isn't even an opt-out. I'm still shocked and not sure what conclusions to draw from this, except: This is not okay!

I remember there was quite an uproar when Go tried to add opt-out telemetry a while back, and rightly so. How did I never hear about Python doing this before? Sure, it sends fewer details, but it's still sending telemetry without ever asking for consent.

I like #Python, I want to keep using it, but can I if core tooling ignores user consent like this? And what other key development tools (Python or otherwise) have things like that and I just haven't noticed yet?
Somehow wrote Referer instead of User-Agent, fixed now.
Oh "great", uv does the same thing, referencing the pip code: https://github.com/astral-sh/uv/blob/1723ed00d6e6961abcf05d09abe59aaee005a6af/crates/uv-client/src/linehaul.rs#L61-L63
Added after someone who seems to be a #PyPA member filed an issue requesting it: https://github.com/astral-sh/uv/issues/1958

This seems to run deep…
#Python

@airtower This is in line with the PyPI Privacy Notice that's part of the terms of service.

https://policies.python.org/pypi.org/Privacy-Notice/


@ambv That says they collect User-Agent strings and evaluate them for statistical purposes. It makes no mention that pip and uv (and possibly other tools?) add a lot of telemetry data to their User-Agent strings. That's definitely not what a technically knowledgeable reader without knowledge of pip/uv code would expect. And the documentation for those tools doesn't mention it either.

Definitely a breach of user consent & trust (like any telemetry without consent), regardless of legality (which seems questionable to me under GDPR, but I'm no lawyer).
@airtower The scary telemetry you’re mentioning is OS, Python version, and core native libs that Python depends on. Similar to what browsers put in their User Agent.
@ambv I have seen what's in there. Regardless of how sensitive, it's collecting telemetry without user consent. That's a fundamental issue, even if the specific data isn't that sensitive.

@airtower @ambv

Thank you for raising this discussion, even if we don't agree.

Do you agree that the data is useful for PyPI to have and openly share with the community (for many different reasons)?

If so, how do you propose it could be collected?

If not, would your opinion on this data collection change if we shared some important use cases for the data? Or is it more of a principles issue where usefulness isn't part of the equation?

@danzin @ambv As I wrote above: Any telemetry without user consent is not okay, no matter how useful the data may be. So, collection that I would consider acceptable would necessarily include all of the following:

* Consent: Make sending telemetry or not configurable, defaulting to off. The usual: A config file option (virtual environments should inherit from user config, if any), environment variable, not sure if a command line option would make sense (not that it'd hurt either, but I assume most people would want their decision, whichever it is, to stick until further notice, not set it per run).

* Information: Clearly document what will be sent if telemetry is enabled. Any additions to the dataset must be mentioned in release notes. If enabled, log telemetry data being sent at appropriate log levels.

Things that might additionally make sense:

* Show a prompt when running interactively with telemetry not explicitly configured, and store the answer in user config (yes/no, not the Silicon Valley disease of "yes/maybe later"). The prompt, if any, should include what will be sent if telemetry is enabled (or at least the option to show it) and how to find documentation (obvious option: include a link).

* Explaining in detail what the gathered data is used for and who benefits how would likely increase the probability of people agreeing. So far the best explanation I've seen (somewhere in a discussion on Github) is "helps decide what goes into manylinux images".

* Put the telemetry data (if enabled) into a dedicated header (e.g. X-PyPI-Telemetry) so it's clear and obvious, and documentation is easy to find by searching for the name. Adding undocumented telemetry data to the User-Agent feels like an attempt to hide it, whether that actually was the intention or not.
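
A minimal sketch of how the consent and dedicated-header points above could fit together. The environment variable name, the config key, and the helper functions are hypothetical and exist in neither pip nor uv; only the header name is taken from the suggestion above.

```python
import os

# Hypothetical names for illustration; no real tool reads these.
TELEMETRY_ENV_VAR = "EXAMPLE_TELEMETRY"
TELEMETRY_HEADER = "X-PyPI-Telemetry"


def telemetry_enabled(environ=None, user_config=None):
    """Return True only if the user explicitly opted in; the default is off."""
    environ = os.environ if environ is None else environ
    user_config = {} if user_config is None else user_config
    # The environment variable, if set, overrides the stored user config.
    raw = environ.get(TELEMETRY_ENV_VAR)
    if raw is None:
        raw = user_config.get("telemetry", "off")
    return str(raw).strip().lower() in {"1", "true", "yes", "on"}


def build_headers(tool, version, telemetry_payload, environ=None, user_config=None):
    """Build request headers; telemetry goes into its own header, if enabled."""
    headers = {"User-Agent": f"{tool}/{version}"}
    if telemetry_enabled(environ, user_config):
        headers[TELEMETRY_HEADER] = telemetry_payload
    return headers
```

With nothing configured, `build_headers("pip", "25.3", "{}", environ={}, user_config={})` yields only the plain User-Agent; the extra header appears only after an explicit opt-in.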

Opt-in telemetry is what Go ended up doing, by the way. And they are collecting a lot more details from those who decide to opt in. Maybe that could make opt-in more appealing to the PyPA people, too: Maybe you'll get fewer data points with opt-in, but you can get more details from the people who opt in without ethical or legal issues.
Go Telemetry - The Go Programming Language

@airtower @ambv

I see, thank you for your response.

I think it might be the case that the information sent doesn't even register as telemetry for the developers involved. For example, my browser UA, "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:147.0) Gecko/20100101 Firefox/147.0", has a lot of information that isn't usually considered telemetry.

What, besides the tool version, would be acceptable information in the User-Agent that isn't considered telemetry, if any? Python version would be a must IMO.

@danzin @ambv In principle, the User-Agent is supposed to describe the tool used to make the request, e.g. curl will give something like curl/8.18.0. That's the expected thing I consider reasonable to do by default.

Rust's Cargo does just that, with the option for the user to override it. A Cargo version is bound to a Rust version, so effectively it's equivalent to including the Python version in a pip UA.

So: something like pip/25.3 python/3.13 seems like a reasonable default to me. Whether to omit the patch version is debatable; most modern browsers do it (your example has "Firefox/147.0", but unless you still haven't gotten last Friday's update you're probably actually running 147.0.1). It's a bit of security-by-obscurity (which is good to question); on the other hand, unnecessarily handing details to would-be attackers isn't good security practice either. The Cargo example, which includes a shortened commit ID, goes all the way in the other direction.
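
A short sketch of what building such a minimal User-Agent could look like. The function name and the `include_patch` switch are made up for illustration and are not part of pip.

```python
import sys


def minimal_user_agent(tool, tool_version, python_version=None, include_patch=False):
    """Format a UA naming only the tool and Python, e.g. 'pip/25.3 python/3.13'."""
    if python_version is None:
        # Default to the running interpreter's (major, minor, micro) version.
        python_version = sys.version_info[:3]
    parts = python_version[:3] if include_patch else python_version[:2]
    python = ".".join(str(p) for p in parts)
    return f"{tool}/{tool_version} python/{python}"
```

Whether the patch version is included is left as a parameter here, since that is exactly the debatable part.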

Browser UAs are a historically grown mistake, as you can see by the "Mozilla/5.0" for example. Pretty much any modern browser has that, because at some point too many websites started serving different content by UA. An unfortunate remnant of the past that's hard to change now, because if browsers were to cut down now badly made sites might break and users would complain (or maybe all that looked for "Mozilla/5.0" are gone by now, but who dares to try?).

And while (at least in Firefox) you can override the UA, from a privacy point of view that's tricky, too: Sure, you could remove information no website should need (like OS and architecture), but unless a lot of people do the exact same thing you'd make your browser more unique and easier to track. If you want to hide UA information in your browser, it's probably better to copy a very common one. But there are other fingerprinting means to consider, too. It's a mess, definitely not a good example to follow.
Registry Web API - The Cargo Book

@airtower @ambv

Thank you again for your response.

I'm not a privacy maximalist nor against collection of useful data to improve the freely provided services that empower much of the community.

But I must agree that the amount and kind of information pip sends seems to err towards telemetry.

I support discussing whether we could at least minimize the amount of info collected, being explicit in privacy notices, making part of the collection opt-in, and allowing opting out of even basic reporting.

@airtower @ambv

I believe the Python community at large wouldn't demand any changes to the data collection currently happening.

The amount and kind of data collected seem reasonable to me, even if a bit unexpected and telemetry-y.

However, I understand that the point is supporting those who do have concerns about this data being collected from their systems.

Python is for everyone, so IMO this is a valid concern. Thank you for bringing it up.

@airtower @ambv There has been some discussion and proposals to both clarify and make information sent in the User-Agent header configurable. Here are some related links:

https://github.com/astral-sh/uv/issues/15464

https://github.com/astral-sh/uv/issues/8474

https://github.com/pypa/pip/issues/11506

https://github.com/pypa/pip/issues/13559

https://github.com/pypa/pip/pull/13651

https://github.com/psf/policies/pull/47

https://pip.pypa.io/en/latest/

The gathered data allows important usage analysis by the community, so getting rid of it or making it opt-in seems like a no-go. How about a DPO post?


@airtower @ambv

Also it occurs to me that, unlike with Go (not sure about Rust), downloading wheels from PyPI already gives some information about Python version and OS. I think the libc information is used to select wheels too.

I've found an interesting discussion on DPO:

https://discuss.python.org/t/pre-pep-user-agent-schema-for-http-requests-against-remote-package-indices/104006

@danzin @ambv I found that discussion while initially digging into the topic, and quickly got frustrated with it: While the initial post mentions the consent issue and at least proposes an opt-out (at this point you won't be surprised I consider that insufficient, but it would be an improvement), and the author later mentions it can be a security issue too (malicious index serving targeted payloads), that quickly got dropped from the discussion. Very few of the replies pick up the question at all, and none of them seem to like the idea of providing even an opt-out. That does not paint a pretty picture either.

Sure, which binary wheels someone downloads might let someone with access to the logs infer some information about what kind of system they're probably running. But:

a) If anything, that's an argument for putting effort into minimizing what's sent, and making it as hard as possible to infer more. There's a difference between "someone malicious could infer some of that information" and "we're actively collecting and sending a lot more than they could infer".

b) If logs are really collected in anonymized form only (the PyPI privacy policy suggests IPs are replaced with country by geo IP), it should not be possible to correlate multiple downloads from the same system/network from the public data set (raw logs are likely a different matter).

And keep in mind: With the current telemetry none of the data points is necessarily identifying information as such (though it could be, the kernel version can contain strings chosen in the build config – probably few people who identify their confidential prototype builds that way or simply have fun with cutesy names expect it to end up in PyPI telemetry!), but the specific combination can be a kind of fingerprint, depending on how common or unusual someone's setup is (standard containers in CI probably not, someone's customized dev system might well be).
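
To illustrate the fingerprint point: a rough sketch that counts how many records share the same combination of fields. The records and field names are invented for the example and do not come from any real data set.

```python
from collections import Counter


def anonymity_sets(records, fields):
    """Count how many records share each combination of the given fields."""
    return Counter(tuple(r.get(f) for f in fields) for r in records)


# Made-up records for illustration only.
records = [
    {"distro": "debian-12", "kernel": "6.1.0-18-amd64", "openssl": "3.0.11"},
    {"distro": "debian-12", "kernel": "6.1.0-18-amd64", "openssl": "3.0.11"},
    {"distro": "gentoo", "kernel": "6.12.0-prototype-x", "openssl": "3.4.0"},
]
counts = anonymity_sets(records, ["distro", "kernel", "openssl"])
# Combinations seen exactly once single out one system in this data set.
unique = [combo for combo, n in counts.items() if n == 1]
```

A combination that occurs only once is effectively an identifier, even though no single field in it is identifying on its own.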

@danzin @ambv Those are less than soothing to me. The uv ones simply deflect to pip ("We're following them here", "iirc pip doesn't allow opt-out either, it's standard telemetry pypi expects"). The pip and policy PRs you linked are very interesting in what they avoid saying:

https://github.com/psf/policies/pull/47 mentions collecting User-Agent info for statistics, which sounds innocent enough, but carefully avoids mentioning PyPA-controlled tools adding telemetry data to the UA.

And https://github.com/pypa/pip/pull/13651 matches that perfectly:

"Pip does not collect any telemetry, however, it will send non-identifying environment information (Python version, OS, etc.) to any remote indices used, who may choose to retain such information."

The "etc" carries a lot of weight here: whether a CI environment was detected, CPU architecture, distro & its version, C library and version, OpenSSL version, rustc version, setuptools version, even the exact kernel version (with pip 25.3). Why not write "will send telemetry information […] to any remote indices used"? Sure, pip doesn't collect that information, but it facilitates others doing it. Was someone worried people might understand what's getting sent?
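
To check what a given User-Agent actually carries, it can be split into the tool part and the extra payload. This sketch assumes the format is `tool/version`, a space, and then a JSON object, which is what the linked session.py and the "JSON user-agent" commit suggest; the sample string and its field names are made up, not captured from a real request.

```python
import json


def split_user_agent(ua):
    """Split 'tool/version {json}' into (tool, version, payload dict)."""
    product, _, payload = ua.partition(" ")
    tool, _, version = product.partition("/")
    # A plain UA such as 'curl/8.18.0' has no payload at all.
    data = json.loads(payload) if payload.strip() else {}
    return tool, version, data


# Illustrative sample with invented values.
sample = (
    'pip/25.3 {"cpu":"x86_64","python":"3.13.1",'
    '"system":{"name":"Linux","release":"6.12.0-custom"}}'
)
tool, version, data = split_user_agent(sample)
extra_fields = sorted(data)  # everything beyond the tool name and version
```

Anything that ends up in `extra_fields` goes beyond the tool name and version that a User-Agent is usually expected to contain.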

Frankly, that's the kind of phrasing I'd expect from a Silicon Valley company that's trying not to tell people what they're doing with their data while having a not-obviously-unlawful privacy policy.
"The gathered data allows important usage analysis by the community, so getting rid of it or making it opt-in seems like a no-go."

If it's so important to the community, it shouldn't be hard to convince the community (or at least a statistically sufficient part of it) to opt in with a good explanation, right?

@airtower You really are a nice and kind person for not publicly shaming the responsible person. I truly like and respect that about you.

But I am not that nice and kind and I do believe in… publicly crediting people for their work in cases like this 😈:

It was this commit where this kind of stuff was added for the first time, submitted by a Donald Stufft, a software dev in Philadelphia who seems to be a major contributor to PyPI and a bunch of other projects; “crates” is one of the orgs that he lists on his GitHub, and he still seems to list Xitter as his main account (@ dstufft).

Maybe it’s time to ask him what the fuck he was thinking there…

(And before anyone says anything: No, this isn’t doxxing, all of this information is publicly available in the git-history, his github-profile and his own website to which he links from his github. I didn’t even have to throw his name into a search engine!)

#pip #python
Switch to a JSON user-agent with more information · pypa/pip@6cbc08d


@airtower From what I can tell this is parsed here: https://github.com/pypi/linehaul/blob/main/linehaul/ua/parser.py (via https://github.com/pypi/warehouse/blob/main/warehouse/events/models.py#L197)

Not sure if that's the only place where anything is done with this, but at least in this instance it seems to ignore any of the more privacy invasive and non UA-fitting info anyway. 🤔

(Like a UA saying "I'm pip x.y on python 3.z" seems somewhat reasonable to me, the rest not so much.)

The place to ask for more info/clarification about this would probably be https://discuss.python.org/t/about-the-packaging-category/365?

linehaul/linehaul/ua/parser.py at main · pypi/linehaul

ARCHIVED, replaced by https://github.com/pypa/linehaul-cloud-function/ - pypi/linehaul

GitHub
@airtower (Maybe rather just "pip vX.Y" would be enough. The python version this is running under isn't really of anyone's concern that would encounter the UA...)
@Bubu Yeah, tool and version should be fine, that's the common expectation of what's in a User-Agent. Certainly debatable if it was ever a good idea, but it's from times when the internet seemed less dangerous (whether it was… not sure).

I've seen some documentation on how to access the data (requires a Google account, though), but the thing that's bothering me is that they collect it without consent in the first place. And there isn't even an opt-out (which I'd still consider problematic, but at least it'd show they considered user privacy).
Analyzing PyPI package downloads - Python Packaging User Guide

@airtower Don't have a google account and don't really want to make one just to look at this. 🫠
@Bubu Me neither! That documentation is accessible without, but if you click the link there to see the database schema (let alone data) you get a login prompt, so that's where I stopped.
@airtower This just confirms that you should always just fork all of the external dependencies into one of your internal tools where you can version and track them yourself. (You know the basics for software supply chain management)
@agowa338 That's another topic I have a lot of opinions and ideas on. I really don't want to fetch from PyPI with rare exceptions, but common Python tooling does not make this easy (lacks an equivalent to Cargo's [patch] for example).

But what upsets me here is that PyPA members (with their key roles in Python in general) apparently see no problem with this kind of sneaky data gathering, and this apparently hasn't gotten much attention otherwise.
Configuration - The Cargo Book

@airtower

It's not really another topic. That's why I have not encountered this problem so far. I work a lot in air gapped environments. So none of this telemetry gathering would be possible there...

@agowa338 In terms of immediate security impact I agree. I'm focusing on trust in the people who develop key tools for Python, which is damaged by this either way, even if no telemetry User-Agent had ever been sent from any of my systems (probably not true).
@agowa338 And: Can I still in good conscience recommend Python to people who don't yet understand all these intricacies?

@airtower

Yep, don't get me wrong. I do NOT disagree. It's just - you know - at some point you start to expect shit like this...

@agowa338 I'd expect this kind of thing from Google, Microsoft, Apple, etc. Where there is no trust it can't be broken. The PSF is supposed to act in public interest, and so far I was under the impression they broadly do. Not that I agree with everything they do, but unlike those companies I didn't consider them user-hostile. Hence the shock and anger.
@airtower You know, after almost 10 years of this shit I do not expect anything else from any tech project that has a significant number of programmers from the US...
@agowa338 I guess I'm not quite that jaded yet, even though I've been in the business for longer than that…

But that makes me wonder if there might really be a cultural difference and many developers from the US are so used to intrusive systems they don't even think to question it? I hope not, but maybe?

@airtower

Well sadly not anymore. All of the "move fast and break things" generations of programmers are basically indoctrinated with all of this overarching shit.

Many do not even know how little information the metrics even provide and how relying upon them without asking the users WHY they do certain things can lead you astray...