On Monday I'm presenting "Mining Software Repositories While Respecting Privacy" at the Educational Track of #MSR2020 #MSREducation https://2020.msrconf.org/details/msr-2020-Education/1/Mining-Software-Repositories-While-Respecting-Privacy Video & slides already available. Short teaser thread ahead...
Mining Software Repositories While Respecting Privacy (MSR 2020 - Education) - MSR 2020

This year educational track will feature three kinds of submissions. In addition to tutorials and collection of educational resources launched last year, this year we also introduce educational posters. Tutorials: the track chairs will invite several researchers to address topics of broad interest for community. Shared educational resources. The goal of this activity is to create a hub of community educational collaboration and curation of educational resources relevant to Mining Software Repositories. Educational resources can be lessons, MOOCs, tools, educational datasets, tutorial ...

If you're retrieving data from repositories related to software development (@github, @gitlab, Bugzilla, Gerrit, @StackOverflow, or even plain git repos), you're collecting identities of real persons, and information linked to them. That has implications for their privacy.
Implications are ethical & legal. From the legal point of view, you are subject to #GDPR, and similar regulations in California, Brazil, Japan, South Korea & other countries. #GDPR affects you if you're in the EU, but also if your dataset includes people in the EU https://gdpr.eu/companies-outside-of-europe/
Does the GDPR apply to companies outside of the EU? - GDPR.eu

Under certain conditions, the GDPR applies to companies that are not in Europe. In this article, we’ll explain when and how the GDPR applies outside the EU. The European...

GDPR.eu
From the ethical point of view, read "Ethical Mining – A Case Study on MSR Mining Challenges", also in #MSR2020, by Nicolas Gold & @JensKrinke
---
RT @JensKrinke
Nicolas Gold and I (@uclcs) will be discussing our work on Ethical Mining of Software Repositories in the Visions & Reflections Session on Tuesday at 16:00 UTC at @msrconf. We should discuss ethics openly and often! Talk/paper at http://bit.ly/msr-ethics #msr20trailers #msr2
https://twitter.com/JensKrinke/status/1276138988840108037
Ethical Mining – A Case Study on MSR Mining Challenges (MSR 2020 - Technical Papers) - MSR 2020

The Mining Software Repositories (MSR) conference is the premier conference for data science, machine learning, and artificial intelligence in software engineering. The goal of the conference is to improve software engineering practices by uncovering interesting and actionable information about software systems and projects using the vast amounts of software data such as source control systems, defect tracking systems, code review repositories, archived communications between project personnel, question-and-answer sites, CI build servers, and run-time telemetry. Mining this information can ...

Ethics is not only important by itself (which of course it is). It is also enforced by our institutions and research funding bodies. For example, if you're applying to be funded by the European Commission, have a look at https://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/ethics/h2020_hi_ethics-data-protection_en.pdf
And we're in high-risk territory: there are many risks associated with the data we're collecting, processing and publishing. Look at this checklist (from https://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/ethics/h2020_hi_ethics-data-protection_en.pdf ) Yellow: likely in your data, red: almost certainly in your data
When mining software repositories, one of the main problems is that we don't have a-priori informed consent from "participants" (all persons with identities in our dataset), and usually we cannot get it a-posteriori. So, we need to find other mechanisms. Which ones?
First of all, we need a lawful reason for the analysis. Usually, research is in the public interest, and that gives us a path. But we still need to carefully deal with identities, minimizing the risks as much as possible. Some useful techniques are...
Techniques for minimizing risks in data from software repositories can be summarized in an approach: "Data protection by design (DPbD)"
In short: remove identities (anonymize data) as soon as possible in the pipeline, pseudo-anonymize if you need to (but be careful), keep real identities only for a very good reason, and have a data processing policy in place
Remember: *anonymizing* means removing personal data, *pseudo-anonymizing* means hiding it, so that persons are not re-identifiable. Anonymization should resist re-identification (differential privacy is one way to get guarantees here), pseudo-anonymization should be really hard to reverse... but both are tricky
Main techniques for pseudonymization are coding and hashing. But both have their problems.
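To make "coding" concrete, here is a minimal sketch, assuming identities are email addresses and that a random code per identity is good enough. The `Coder` class and the example addresses are hypothetical, made up for illustration. The point is that the lookup table is the only link back to real persons, so it has to be stored apart from the dataset (or destroyed).

```python
import secrets


class Coder:
    """Hypothetical helper: maps each identity to a random, meaningless code."""

    def __init__(self):
        # identity -> code; this table is what must be protected or deleted
        self._table = {}

    def code(self, identity: str) -> str:
        # Normalize so trivial variants of the same address get the same code
        key = identity.strip().lower()
        if key not in self._table:
            self._table[key] = secrets.token_hex(8)  # 16 hex characters
        return self._table[key]


coder = Coder()
a = coder.code("Jane.Doe@example.org")
b = coder.code("jane.doe@example.org ")
c = coder.code("john@example.org")
```

Here `a` and `b` are the same code, and `c` is a different one. Unlike a hash, the code is not derived from the identity, so it cannot be recomputed from a guess without the table.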
An example: MD5 is a common technique for hashing identities, thus pseudonymizing them. But it can be easily reversed if you have a pool of email addresses to check. Think of testing 450 billion hashes/sec on an EC2 instance https://freedom-to-tinker.com/2018/04/09/four-cents-to-deanonymize-companies-reverse-hashed-email-addresses/
Four cents to deanonymize: Companies reverse hashed email addresses

  Your email address is an excellent identifier for tracking you across devices, websites and apps. Even if you clear cookies, use private browsing

Freedom to Tinker
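Why a pool of addresses is enough to reverse plain MD5 can be sketched in a few lines. This is only an illustration with made-up addresses: `published` stands for a hypothetical dataset where identities were replaced by unsalted MD5 hashes, and `candidates` for whatever list of addresses an attacker can gather.

```python
import hashlib


def md5_hex(email: str) -> str:
    # Unsalted MD5 of the address, as a naive pseudonymization would do
    return hashlib.md5(email.encode("utf-8")).hexdigest()


# Hypothetical "pseudonymized" dataset: only the hashes are published
published = {md5_hex("alice@example.org"), md5_hex("bob@example.org")}

# Attacker's pool of candidate addresses (for example, from public commits)
candidates = ["carol@example.org", "alice@example.org", "dave@example.org"]

# Hash every candidate and keep the ones that appear in the dataset
recovered = {md5_hex(c): c for c in candidates if md5_hex(c) in published}
```

After this runs, `recovered` maps one published hash back to alice@example.org. Nothing is "broken" in MD5 here: the attacker just recomputes the same function over guesses, which is why any fast, unkeyed hash has the same weakness.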
MD5 with salt, HMAC, coding (or other encryption techniques) are more difficult to reverse, but have their own problems. Attackers can also use related public information to help in reversing the process (like the @SWHeritage API, with all public commits, ever)
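A keyed hash such as HMAC can be sketched like this, assuming SHA-256 as the underlying hash and a random key that never leaves your premises. `SECRET_KEY` and `pseudonymize` are names chosen for this example only.

```python
import hashlib
import hmac
import secrets

# Assumption: the key is generated once and stored apart from the dataset
SECRET_KEY = secrets.token_bytes(32)


def pseudonymize(identity: str, key: bytes = SECRET_KEY) -> str:
    # Normalize, then compute HMAC-SHA256 of the identity under the secret key
    normalized = identity.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()
```

Without the key, an attacker cannot recompute pseudonyms from a pool of addresses. With the key (if it leaks, or if you publish it), the dictionary attack works again. And the key does nothing against linking through other fields, like commit timestamps that can be matched against public archives.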
The fact that we are dealing with public data (open source, public data about software development, etc.) is not a waiver. For legal and ethical purposes, @github @gitlab @StackOverflow etc. are quite similar to social networks for developers, so regulations on their data apply
Respecting privacy is in fact a major problem for reproducible research. Sharing datasets so that other researchers can reproduce a study may break privacy, to some extent. Plain open data is maybe not possible. We need tested standards for sharing our datasets...
On the other hand, in the case of free, open source software, data about development is usually shared with the full intention that it can be analyzed (open development). Researchers & developers should talk, to learn what is acceptable use.
The pre-recorded video of the presentation is available on YouTube. Have a look at it for more details on mining software repositories while respecting privacy https://www.youtube.com/watch?v=O6er2YpE8XQ #MSR2020 Slides (with even more details): https://jgbarah.github.io/presentations/research-privacy/slides.pdf
If you are attending #MSR2020, come to the session in the Educational Track https://2020.msrconf.org/details/msr-2020-Education/1/Mining-Software-Repositories-While-Respecting-Privacy on Monday 29th, 14:30 - 15:00 CEST. We'll discuss all of this & more ;-)