Improving the account creation process of any FOSS tool would require simple data collection about where people drop out during the sign-up process. There is no account yet at that point, so there is nothing to log that is attached to a person.

Yet.... Most FOSS people are adamant that this data should not be collected. Why? There is zero personal info. It's meant to help the project.

Is the reluctance basically a "slippery slope"? E.g. never give an inch?

People have brought up IP but that shouldn't be logged either. The point is to do this in a squeaky clean manner, with the source code inspectable so the process is safe, transparent, and verifiable.

But too many projects would never consider this for fear of the rabble hordes coming after them.

This needs to be possible, it's how you get better UX.

By reframing this as "site stats" and not user stats, I hope we can move forward. If there are zero stats "of any kind" about the user collected, maybe we could do this without freaking out?

If there are some hidden user stats collected, then we should remove them. The point is to figure out how to do this safely, not assume it can never be done.

@scottjenson ethically it feels like any kind of stats collection, even anonymized, should be opt-in. Each user might have a very different risk tolerance. E.g. an anonymized list of apps on your phone can pretty much uniquely identify you as a user; for some that's OK, for others not acceptable. Similar issues might arise in the aggregate collection of individually safe bits. Some tools just have no expectation of "phoning home"; an explicit opt-in makes that clear.

@Aurimas but that's my point: if there is zero risk, there is nothing to ask of the user. I appreciate you may be skeptical but that's the whole point of this conversation.

Assume we can only collect "site stats". That significantly changes the ask.

@scottjenson what would be an example of a zero risk stat? It might help understand your point of view.
@Aurimas which web pages were visited. No time stamps no users no IPs. Basically a heat map of the site
@scottjenson that does seem like an innocent stat. Then the remaining question is the infrastructure of collection, and users' trust that the receiving end will not put time stamps and IPs back. I think maybe that's where some might get hung up.

@Aurimas that's why foss can actually make inroads here as the code can be inspected. Anyone doing something sneaky can't hide.

You can't trust companies of course because it's only the honor system. But foss is foundationally different.

@scottjenson you still have to trust FOSS is deployed without any diff.
@Aurimas but if you really believe that you couldn't use ANY foss software

@scottjenson I'm not saying this is my position, more that I can empathize with this.

Having no "phone home" makes it non ambiguous.

I do agree with you that the best way to improve tools is to know how they are being used. E.g. when I was on chrome one thing we found is that folks using chrome for Android use open tabs as bookmarks, which is not something we expected at all.

@Aurimas but even that "phone home" framing is incorrect. NOTHING from my machine is going home. What I'm logging is no different than disk utilization numbers. These are server stats that have zero connection to any user.

We have to break out of these assumptions that all data is dangerous. I get it, this has been abused for decades, we should be skeptical but in such clear cut cases, we shouldn't be afraid of our shadow.

@scottjenson I did misunderstand you. The phrasing "any FOSS tool" made it sound to me like this is happening in a client-side FOSS tool (e.g. git) collecting anonymous stats and sending them to a FOSS server side for analysis.

If you are talking about purely server side tool, I am 100% in agreement that stat collection there for safe stats makes a lot of sense and should be done to improve the project.

@scottjenson we had a lot of push back to add analytics to the php.net website for those same reasons. But we did manage to get it accepted and voted through via our usual RFC process. [1] And the main result from that was to tell us that one of the most visited, and inadequate pages was the download page. So some community members spent some time afterwards to improve it, and everyone benefits from it!

[1] https://wiki.php.net/rfc/phpnet-analytics

@Girgias exactly my point! Good for you!

@scottjenson Part of a good experience is trusting that my privacy is respected without having to personally read the code myself.

Trust erodes when every piece of FOSS is collecting. I'm talking systemic collapse. We've mostly reached that stage for the world wide web, it's barely usable due to data collection. Nobody wants this, so there's pushback.

@txtx but that's the whole point, your privacy would be protected as NOTHING about you is being stored.

We have to get past this belief that "any data is bad". You wouldn't be upset at all if my server collected disk utilization stats for my server, right? What I'm suggesting is no different. It's not "your" data I'm talking about.

@scottjenson Maybe I misinterpreted your last question ('Is the reluctance basically a "slippery slope"?') as being more general in nature?

I'm coming from a HCI design & research background since the late 1990s. But only since I started to design and write my own open source software did I come to appreciate a more stubborn reluctance in design decisions.

I think we need new UX design approaches that are a better fit for FOSS. So much UX thinking is built around profit oriented engineering.

@txtx We seem to be agreeing? My point is that this "slippery slope" concern is misplaced. As FOSS projects we can 'prove' that we're not collecting personal data (unlike corps, which you just have to trust). But more importantly, we shouldn't gather individual personal data, to minimize the risk further.

From a UX POV, this data is extremely helpful to our users. But with this community, it historically has had a 'no f'ing way' attitude. We can transcend that.

@scottjenson UX design has suffered because of an over-reliance on metrics.

All the big corporate software is bloated junk because of it.

@scottjenson Are there folks pushing back on structuring this? Mastodon already logs tons of stuff and of course there is the regular inbound web server log.
@Lee_Holmes this is a generic question as it comes up so frequently
@scottjenson Yes, I think it's a lot about the remaining ability of correlation. Any kind of log would leave timestamps behind, potentially with an IP or other identifier that keeps track of a single user session. (These are automatically pseudonyms and can therefore count as personal data.) That, in combination with companies and organisations dropping the ball on this too often and messing things up for profit, results in this behaviour of being cautious about any kind of tracking.

@sheogorath fair enough but storing the IP isn't necessary. This data could be nothing more than a heat map of the sign up process. It would be noisy and not as helpful but if you wanted to be extra safe, it would still be useful.

But I think you're correct, too many companies have botched this so there is a general distrust, which I appreciate.

The fact that a FOSS project's code would be visible should help alleviate these concerns though?

@scottjenson If it runs on my machine and requires me to set up an account with an external party, isn’t the whole construct more a service than a tool? And then we have a contract/business relationship in play as well, with an arbitrarily vague privacy policy. And when we’re there it feels not far off from having trust issues with telemetry in general.

I’m struggling to find examples of FOSS tools with account setups that are not service connected. So my thoughts might be going in a totally wrong direction.

@smndtrl you're making lots of assumptions there! I was imagining something very simple, likely built on top of local web logs: simple, transparent, and identifier free.

The purpose of this exercise isn't to say "log all the things!" but to ask what is safe, simple, and possible.
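
One way the "built on top of local web logs" idea could look, as a minimal sketch: read access-log lines and keep only the requested path, so the resulting counts carry no IP, timestamp, user agent, or query string. The function name `count_page_views`, the sample paths, and the Common Log Format input are assumptions for illustration, not something any project in this thread is known to use. The raw log itself would still need to be discarded or never written for the "identifier free" claim to hold.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the request part of a Common Log Format line,
# e.g. "GET /signup/email HTTP/1.1", and captures the request target.
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')


def count_page_views(log_lines):
    """Return a Counter of {path: views}.

    Everything else on the line (IP, timestamp, status, size) is ignored,
    and the query string is dropped so tokens in URLs are not kept either.
    """
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # skip lines that do not look like a request
        path = urlsplit(match.group(1)).path
        counts[path] += 1
    return counts


# Hypothetical sample lines (documentation-range IPs, made-up paths).
sample = [
    '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /signup HTTP/1.1" 200 2326',
    '198.51.100.2 - - [10/Oct/2024:13:56:01 +0000] "GET /signup?ref=abc HTTP/1.1" 200 2326',
    '203.0.113.7 - - [10/Oct/2024:13:57:12 +0000] "GET /signup/email HTTP/1.1" 200 1811',
]
print(dict(count_page_views(sample)))
```

The output is only a table of paths and totals, which is the "site stats" framing used earlier in the thread.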

@scottjenson with the parameters that were given I was trying to find examples of tools I had used before and somehow all of them fit to what I wrote. Especially the sign-up/account process led to it. If I’d ignore that and look at just telemetry for local installable FOSS I can say from experience that the definition of “no personal data is transmitted” was often just incorrect and proven wrong by looking at the data.

Taking a step back and looking at what I would be able to add to my software and consider “simple” (to implement) and transparent: presenting the data before explicitly letting the user decide to submit. That is ofc not a very easy approach for the end user, who would need to be able to make sense of the data so they can judge if it’s appropriate for them.

I think it’s incredibly hard for local software to transmit any data out of the system while being transparent and trustworthy without being explicit. Even reaching out to external systems when it’s not necessary for a user action is something I’d consider a violation of trust.

@smndtrl my point is that we're overthinking this. If we're gathering "site stats" that have nothing to do with the user, we don't need to scare them by asking them with a link to log data. That's crazy, as literally no user would know how to navigate that. Just do it!

I realize this scares the forking shirtballs of many people but that's why I want to have the conversation.

Do it right, do it safe.

@scottjenson Let’s say that we come up with a set of “events” for which we know they don’t need anything that can be used to identify a user or their system. Let’s say I procure that software from a repo of my choice whose policy it is to only distribute software leveraging said events for telemetry, and we solve the IP address issue with something like OHTTP. Then we would be able to extend the trust people put into their repo/distribution provider onto the collection of the metrics.

The repo would compile or configure their public OHTTP gateway as target into the applications they distribute and the source IP leakage is addressed. For non-repo based distribution the trust relationship is already different and maybe the software using the “events” takes a user configurable gateway from the system.

In the end it all comes down to: What’s there to make me believe product x uses telemetry in a privacy preserving way? Just the word of the creator, a party I already trust to deliver me safe software… or do I have the means to prevent (potentially) non-privacy preserving telemetry completely.

@scottjenson Sort of, yes. Also, an IP address is considered PII in many jurisdictions, e.g. under the GDPR.
@troed but IP isn't required to be stored...

@scottjenson Agree, but the person who has had bad experiences from corp. surveillance has a distrust towards everyone. Not having yet created an account doesn't then matter since their IP already (potentially) identifies them.

Stating up front that the IP isn't saved would likely help, but the notification itself is signup friction ...

@troed I want to be sensitive to these issues but it's hard not to feel this is an impossible situation. This type of non ip statistics has literally zero risk, the project is open source so the logging method can be verified.

I totally appreciate the world is a shitty place and corporations did horrible things but can't we do better?

@scottjenson I think you're phrasing the solution. Collect the statistics, the/a link mentioning that it's done also links to the relevant source where it can be verified that nothing that can connect to a person is saved.

The image of the project does play a role though. Even if Firefox is open source I'm quite confident Mozilla aren't seen as fully trusted.

@troed and that scepticism is fair. I'm bringing this up as I've talked to a few popular foss projects that would like to gather this info but they are scared shirtless that the community will turn on them, no matter how careful they are.

@scottjenson FOSS non-profit auditing organization it is then. "Our data collection is audited by ...". Has to be set up with some well known figures with lots of existing clout in relevant communities.

Not sure it's _doable_ - but it would go a long way.

@troed that would be the gold standard but I'd argue making the code available goes a very long way to building trust and avoids a bottleneck. The community would eviscerate any project that tried to be sneaky.
@scottjenson I'd say it's mostly paranoia and ignorance. Any and all telemetry discourse is poisoned by people shouting "PLEASE DONT SPY ON ME PLEASE DO NOT COLLECT MY PERSONAL INFORMATION" no matter what is being proposed, even if it's opt-in and collects only general information. Fedora's telemetry proposal is a perfect example.

@tragivictoria that is exactly why I'm having this conversation. People incorrectly assume any data is personal data which is, as you say, ignorant.

I get it, data collection has been absolutely abused and people should be mad and skeptical. The whole point of FOSS is to be the antidote not more of the same. We can do better.

@scottjenson It’s a long learned and reinforced trust issue. Whenever commercial software asks for consent it’s mostly to fuck users over. The choice used to be spelled out fairly clearly but the industry gradually went for stealthier copy and darker patterns so users are now conditioned to refuse anything even remotely like that and raise a stink. How many times have we been told the data is perfectly anonymous only to find it a lie? Not sure what the solution is in your legitimate case. Trust?
research!rsc: Transparent Telemetry

@scottjenson As someone else pointed out, the constant siege of surveillance capitalism has encouraged defensive defaults. I also suspect that there's a free rider mindset that assumes that someone else will do it. Oh, and "How can I trust that PII isn't going to leak?" is going to lurk in the back of my mind. It's also generally unbounded, in that I say yes today but can expect it to continue forever.

To those ends, I wonder how the consent request might be framed to affect opt-in rates?

@scottjenson Example 1, rotating shifts:

It’s your turn to help us improve. To avoid mass surveillance, we only collect data from a small, rotating group of users. You’ve been randomly selected for a 30-day shift.

The Goal: We need 500 active participants this month.
The Promise: Data is PII-free, externally audited, and the toggle automatically flips OFF in 30 days.

[ Start my 30-day shift ] [ Skip this rotation ]
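
The "toggle automatically flips OFF in 30 days" promise could be backed by an expiry time stored on the user's own machine rather than by a reminder to switch it off. A minimal sketch, assuming hypothetical helper names `start_shift` and `telemetry_enabled` that are not taken from any real project:

```python
from datetime import datetime, timedelta, timezone

# Length of one participation shift, taken from the example copy above.
SHIFT_LENGTH = timedelta(days=30)


def start_shift(now):
    """Called when the user clicks "Start my 30-day shift".

    Returns the moment the opt-in expires; the caller stores it locally.
    """
    return now + SHIFT_LENGTH


def telemetry_enabled(expires_at, now):
    """Collection is on only while an unexpired shift exists.

    expires_at is None if the user skipped or never answered.
    """
    return expires_at is not None and now < expires_at


# Usage with fixed, made-up dates so the behaviour is easy to follow.
opted_in_at = datetime(2025, 1, 1, tzinfo=timezone.utc)
expires_at = start_shift(opted_in_at)
print(telemetry_enabled(expires_at, datetime(2025, 1, 15, tzinfo=timezone.utc)))
print(telemetry_enabled(expires_at, datetime(2025, 2, 15, tzinfo=timezone.utc)))
```

Because the check compares against a stored expiry, doing nothing is enough for collection to stop; no background job has to flip the toggle.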

@earth2marsh but my point is that if it is only collecting "site stats" with ZERO user data, then don't ask for opt-in, for the simple reason that there is nothing to opt in to! They aren't giving away anything!

The default assumption is that ANY data is personal data and that's simply not true.

@scottjenson I think ANY data coming from you is by definition personal data.

That includes data created locally by trusted software. It gets transmitted at some point, right? The transmission itself would be problematic even if it was just a simple ping.

It would be ignorant to ignore this, just based on how quickly others promise to throw away that data (time, IP, + more...). How can anybody possibly claim they are the only ones having it, and nobody else ever will?

I see your point, unfortunately this looks like an almost unsolvable problem consisting of two components: physics & trust.

@mray

If a website tracks each page load and creates a heat map of the site it's not tracking anyone. This is just "site tracking".

That's my point. We assume if the app is collecting data it must be personal data. That isn't always the case.

It's not a risk if the data isn't "your data". Is this a slippery slope? Of course. I'm just saying there's lots of reasonable things to track that are 100% user data free and we should be able to talk about it.

@scottjenson your example sounds like the data collected is generated on the server. Similar to how you can track CPU load, except that you measure "page load" instead. That would be fine, I guess.

But, my understanding is that even for counting page loads on a server you need to compare IPs, meaning you have to have them.

Let's assume you actually succeed to generate 100% anonymous data on a client - how would you

A ) make sure it gets transmitted without exposing data on you. (like IP, time, ...)

B ) protect from de-anonymizing by cross referencing with other data.

A+B might be possible, but way too hard.

Do you have an example of "reasonable things to track that are 100% user data free"?

Because I actually LOVE the idea behind your goals, as that data could be really very useful.

@mray Good questions, that's the whole point of this thread. But just because my server knows your IP doesn't mean we MUST log it.

I'm not saying "we can log user data"

I'm saying lots of helpful data isn't user data at all. So can we log IP-less page views (as an example) and not get knives in the back?

@scottjenson Any data (even non-user data) will have to be sent at some point to make sense to anybody.

If it makes sense to anybody it tells something about me.

Here is the tragic issue: I can TRUST in projects to behave well, and believe they try to discard all identifying data immediately. But I know humans make errors, and I have to assume mistakes happen and things will get hacked/leaked at some point.

That is why in conclusion I TRUST in free software to always act in my interest and not take a risk without consent. If they stop doing that – they lose my trust, and I won't even consent in giving anything.

The only projects I trust are the ones that understand I can't trust them.

That's a total bummer, and I support your insisting nature.

I'm on your side. 😩

@mray I'm glad we're having this conversation. I can see you're discussing this in good faith. What I keep believing is that this data can be gathered in a way that has NOTHING to do with you.

I understand there is some element of trust, but the strong belief that data always gives away something personal is something that needs to be debunked.

The problem seems to be it's far too easy to grab the wrong data (e.g. IP). My point is that OS projects make it very easy to trust.

@mray
What I'm proposing is twofold:
* Make it clear what non-user data you're using
* Your source code acts as a confirmation.

It's not perfect but if you really think any OS project would run nefarious software then you couldn't use ANY software, ever! Worse, having a strong stance against data collection provides zero protection (as the project would just lie).

I'm trying to create standards (w/ verification) to increase trust

@scottjenson Here is the problem in my eyes: All data coming from you is personal. Just to a varying degree. You seem to disagree, which is why I was honestly curious for examples that are 100% harmless.

I share your expectation that OS projects tend to gather "less" personal data in general, but you can't draw a line easily.

Who interprets the line and can vouch that data crossing it gets discarded? One may trust many OS projects' intentions, but their infrastructure and security practices might be sub-par. And that is OK!! Breaches clearly happen with multi-million-dollar companies all the time, so who can you blame?

Handling user data is a bit like handling toxic but valuable chemicals: a liability, or an asset.

That is why OS avoids it, and big tech seeks it.

@scottjenson In order to make progress I would love to find even one single exception to the rule and invite people to share that! But the way I see it, even a metric like "how often do you start the application?" clearly hits my threshold of: personal data.

@mray Sure, that's a reasonable request. My example was "page views" on a website. I don't want IP, I don't want timestamps, I just want a count to know which pages are used.

This is VERY helpful as it allows creating a "heat map" of the web site, and it would show clearly where people (in aggregate) abandon any sequential process.

This should have zero data about any user. There is no personal information gathered: no IP and no time stamps
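
To make the "heat map of a sequential process" concrete, here is a rough sketch of how drop-off could be read from nothing but per-page totals. The step names and numbers are invented, and `funnel_drop_off` is a hypothetical helper. Since there are no sessions, reloads and bots are counted too, so the result is noisy and only meaningful in aggregate.

```python
def funnel_drop_off(step_counts):
    """Estimate the share of views lost between consecutive steps.

    step_counts is an ordered list of (page, total_views) for a
    sequential flow such as sign-up. Only totals are used: no IPs,
    no timestamps, no per-user trails.
    """
    report = []
    for (page, views), (next_page, next_views) in zip(step_counts, step_counts[1:]):
        # Guard against division by zero for pages nobody reached.
        lost = 0.0 if views == 0 else 1 - next_views / views
        report.append((page, next_page, round(lost, 2)))
    return report


# Made-up totals for a three-step sign-up flow.
steps = [("/signup", 1000), ("/signup/email", 600), ("/signup/confirm", 450)]
for page, next_page, lost in funnel_drop_off(steps):
    print(f"{page} -> {next_page}: {lost:.0%} of views did not continue")
```

A large loss between two steps points at the page worth redesigning, without knowing anything about who left.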

@scottjenson As long as it is plain "page views" on a server I can actually imagine how this could be implemented in a safe way, so I agree there!

Anything beyond that (like what page was visited before) easily gets problematic I think. My guess would be that all tooling to gather data was designed to do more, inviting more criticism. Tools for collecting server-side information certainly exist, not sure if there are any dedicated to only collect "safe" data.

It would probably be like using a full blown word processor to create a plain txt file :P

I feel I'm not educated enough to make assumptions about what is technically feasible on the server side, though.

Just a clarification question: Are you looking into server info exclusively or are you generally interested in clients as a source?

@mray at this point I'm just starting a conversation. I'm choosing page views as a starting example as it's so obviously "personal data free" and even then, people reject the idea.

My point isn't to suggest anything specific but to say "we can talk about this!" And find reasonable patterns to follow

@scottjenson well for an honest conversation I think it is inevitable to address the elephant in the room: where are we heading? Gathering data not on client but on server side ignores transmission of data, and limiting yourself to just see how often pages have been served is fair game.

This is describing where the line starts, but where does it end in your eyes?

If we can't define a "harmless" way of gathering data, everybody should remain suspicious of what is going on. A key value in the FOSS world is that "just trust me" won't cut it. For us it is a problem, but in general this is AMAZING.

If only we could just do it and let PR department deal with the backlash 😂

@scottjenson Example 2, frame it as an opportunity to support the project:

Open source projects are more sustainable when users support them. Two ways you could help today:

Option A: Share Insight. Enable anonymous telemetry so we can build better features.
Option B: Contribute financially with a one-time $5 donation.

[ Share Anonymous Data ] [ Donate $5 instead ] [ Use without contributing ]

@scottjenson Fedi has gone through waves of high sign-up activity in the past (see the fedi.db stats). That suggests people can complete the signups but then stop using their fedi accounts for whatever reason (I could list a few).

I will grant there is a FOSS/Linux dynamic at work here that prevents recognition of systemic UX issues. Some of the issues ought to be obvious to (capable) designers without resorting to user tracking.

@tasket fortunately, what I'm asking for doesn't require user tracking