If you had code on GitHub at any point it looks like it might be included in a large dataset called “The Stack” — If you want your code removed from this massive “ai” training data go here:

https://huggingface.co/spaces/bigcode/in-the-stack

I found two of my old Github repos in there. Both were deleted last year and both were private. This is a serious breach of trust by Github and @huggingface.

Remove all your code from Github.

CONSENT IS NOT OPT-OUT.

Edit — thanks for all the replies. More context here: https://hachyderm.io/@joeyh/112105744123363587

Also the repos i found of mine i’m sure were private, but even if they were public at some point, for a brief time, in the past that isn’t my consent to use them for purposes beyond their intent.

---
Edit 2 -- I see this made it to HN, which is a level of attention I do not want nor appreciate....

For all those wondering about the private repo issue -- No, I am not 100% sure that these ancient repos weren't at some point public for a split second before I changed it. I do know that they were never meant for this and that one of them didn't even contain any code.

If my accidentally making a repo public for a moment just so happened to overlap with this scraping, then I guess that's possible. But it in no way invalidates the issues, and the anger that i feel about it.

Am I in The Stack? - a Hugging Face Space by bigcode

This app lets you check if your GitHub repositories are part of the The Stack dataset. Enter your GitHub username and select the dataset version to see if your code is included. If you want your da...

@emenel ROFL. In their project description: "The licenses we consider permissive are listed <here>."

The "here" link:

@galaxis @emenel Appropriate, considering my private repo they ingested had no (F)OSS license at all.
@galaxis @emenel found most of my public repos on there, no private one tho
@Zylann I found none of my private repos.
@emenel oh wow, there are 5 repos from the account I had and deleted like four years ago
@compudanzas yep, exactly. the data pre-dates co-pilot and other similar things... which means there was no chance to opt-out. which is already infuriating since anything like this should absolutely unequivocally be opt-in.
bigcode/the-stack-v2 · Datasets at Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

@emenel @huggingface

I work for github. This post was shared in our slack, on a non work related channel. We don't think it's us.

https://huggingface.co/datasets/bigcode/the-stack-v2

They say they get data from SoftwareHeritage, a website that archives repos from github. If your repos were ever open, they might have been archived there even after you deleted from github.

That's my best guess.

@correcthorse @emenel I think this is accurate, this thread has links to where SWH talks about their involvement

https://hachyderm.io/@joeyh/112105744123363587

see shy jo (@joeyh@hachyderm.io)

I am disappointed in Software Heritage. They made this statement on using their archive as an AI training dataset: https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code/?ref=openml.fyi These seem like good principles. But they are not actually sufficient to respect our work. And the third is too weak, and appears to be providing a figleaf for extractive behavior.

Hachyderm.io
@technomancy @correcthorse thanks for the additional info. I’m pretty sure the repos of mine were private. If it were only me i could be misremembering something, but i have heard from a number of people that they also found repos in this dataset that were private. So i’m not sure what to think or how to explain it.

@emenel @technomancy

Me neither. But if something shady like that was happening here me and many other people wouldn't be working here.

I can't vouch for everything the company does. But me and my coworkers would be very surprised if private repos were going public, and most of us wouldn't stand for it.

I'm not in sales, I don't care what you use. But I used gh before working for them and I'm comfortable using them even after I leave.

@correcthorse @emenel @huggingface it listed repos i had that were initially created as private, and i *paid* to have them as private. this is a *massive fucking breach of trust*.

update edit: it appears the repo was set to public somehow later on. see later replies for info.

@dietinghippo @emenel @huggingface

The code couldn't have leaked from anywhere else? I'm telling you, if this was a thing there would be a riot at github.

@correcthorse @emenel @huggingface hey apologies about the initial harshness, for some reason that repo is now public and i don't recall setting it as public ever. i'm now trying to figure out when that happened because it was a private collaboration repo.

i'm still not ecstatic that any of my code was included in this dataset, but github's off the hook in my ledger on directly providing code.

@dietinghippo @emenel @huggingface

No apology needed im just a random person who has a little more info. I just wanted to shed some light.

I can't promise the company I work for is not shady, I can only say that if it turned out they were, a lot of us would leave.

@dietinghippo @correcthorse @huggingface as i mentioned in my additions to the post (via edits) -- i can't be 100% certain that my repos at-issue weren't for some moment public either by accident or because i needed to share something momentarily. they are also now long gone, and both are quite old. that doesn't make them fair game for use beyond their intent and license. especially since one of them wasn't even code ....
@correcthorse Is there any way to see if a repo was made public at some point, and if so, how long? One of mine is on there, and I do not recall ever making it public. I certainly didn't put an open license on it.

@correcthorse @emenel @huggingface Do we know if Software Heritage just uses regular public access for their archives, or do they have some special arrangement for bulk archiving? If the latter, one could imagine unintentional permission breakage. Leaking private repos would still be BAD, but in that case an SWH / Github problem rather than the AI thing

Github is apparently an SWH sponsor, and mentions partnering with them in https://archiveprogram.github.com/

@correcthorse @emenel @huggingface this is not true, the repos for my company are private and have always been. Even though the repos have an Apache Licence in them, they are contractually required to be private. They have never been public, because that would be bad.

@emenel @huggingface

Hell yes !
I found some crappy repositories of mine (with and without this username) that are fully cursed (not finished, with security breaches for training, etc.)

Good luck for the people with generated code inspired by that.

@Aedius @emenel @huggingface I'm not saying anyone should create a bunch of repos of non-functional or vulnerable code specifically for the "benefit" of these models, but public repos are free so… ¯\_(ツ)_/¯
Ah yes, @emenel, because GitHub itself totally doesn’t use all hosted code to train its large language model. TBH given its parent corporation and the hypocrisy of providing vendor locked-in service based on proprietary software for free software hosting and project management, I’m surprised there’s any trust to be breached to begin with.
GitHub Copilot litigation · Joseph Saveri Law Firm & Matthew Butterick

@emenel @huggingface

"Remove all your code from Github"

I strongly agree with this statement, but how will that help this particular situation? You said a repository which was private and deleted still got included in this dataset, so removing the code clearly doesn't do much.

I'm also guessing this is completely legal for #GitHub to do, as you waive all rights to your code the moment you push anything there.

I’m also guessing this is completely legal for #GitHub to do, as you waive all rights to your code the moment you push anything there.

Not everything you push there, @tyil: you can’t waive rights that aren’t yours, but laws don’t apply to Microsoft anyway. Remember when the mofo casually blackmailed the 6th largest economy?

Cc: @emenel

Microsoft’s UK Blackmail Showcases Big Tech’s Threat to Democracies Worldwide

The American Economic Liberties Project released the following statement after Microsoft CEO Brad Smith lambasted UK regulators for their decision to block Microsoft’s $69 billion acquisition of Activision Blizzard, warning the decision would likely undermine Microsoft’s willingness to grow its business — which includes cybersecurity support — in the United Kingdom. 

American Economic Liberties Project
@emenel @huggingface there are issues that have been open for a year and yet they're doing nothing. is there something more effective than opening an issue that they'll just ignore?
@emenel @huggingface this is serious: “private” is being redefined.
@emenel @huggingface @chockenberry Important to note: if you ever changed your github username you need to check them all. I found different repos based on whether I checked my old name vs my new one. (look out @lisamelton !)

@donw Thanks for the heads up! 😬

@emenel @huggingface @chockenberry

@lisamelton I'm doing a bunch of DVD ripping right now so you and your repo were already near the top of my mind :)

Someday I will get around to organizing my thoughts/issues about bluray rips and captions so I can harass you for a little assistance.

@donw Ha! Sounds good and I look forward to it. 🥰
@donw @emenel @huggingface @chockenberry @lisamelton I love that the only thing I have in this is when I was messing around with exporting recipes from Paprika. Will the AI tell people to knead their code in a stand mixer with the dough hook for six minutes?
@donw @emenel @huggingface @chockenberry @lisamelton Also FWIW, if you want to see if they've used your code, it looks like you need to check all the versions in the "Stack version" dropdown. I had one repo that appeared only in 1.0, not any of the later ones. Though you can only opt out of *future* use, so it's not clear whether this means much
@donw @emenel thanks for that, will do so now although I think I should be fine anyways
@donw @emenel yeah I had 11 repos under my old username and 9 under my new one 😱

@emenel @huggingface How...

Excuse me... how did somebody figure out HOW TO ENSHITTIFY BEING SCROOGLED?!

@emenel @huggingface good to know it only includes my shitty old repos from like 2019 🔥 🔥 🔥
@emenel it’s worth noting the username field is case-sensitive on their search tool!
@emenel @huggingface@mas.to When I put code on GitHub, it's quite clear that I'm consenting for it to be cloned/scraped by others - indeed that's the whole point of the platform. So long as 'The Stack' aren't violating the licensing terms and they make a point of telling users of the dataset not to they aren't actually doing anything different from me using your code. Open source disclosure of IP runs with it the risk of unanticipated use cases you don't like.

The concerning part here is the use of private repos, though what I can't find is any evidence of GitHub themselves being involved in the project which would be necessary for that to happen without illegal hacking of their platform. So this needs to be clarified and is potentially very serious.
@alastair @emenel @huggingface if your repo was public and cloned at some point, and then switched to private… you are out of luck. But that’s on you. It could be in the Internet Archive or any other archive location.

@emenel @huggingface Disappointing that private repos aren't. I could imagine all kinds of things were potentially exposed in this way.

I try to remind myself, whenever I'm using Google or anything that has no cost to me, that I'm the product and not the customer. Whoever is actually footing the bills is in charge.

@emenel @huggingface

It's helpful to know that the search-by-GitHub-username is case sensitive. I searched all-lowercase and it found nothing. I searched by my case-sensitive GH username and all my public repos were in The Stack.
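
Several replies in this thread report that the search is case-sensitive, that old usernames return different repos, and that each "Stack version" can list different repos. A minimal sketch of a checklist of lookups to try by hand in the search tool follows. The helper names, the example usernames, and the version labels are all hypothetical; the real version labels are whatever the dropdown shows.

```python
from itertools import product


def username_variants(*usernames: str) -> list[str]:
    """Return distinct spellings worth trying, in first-seen order."""
    seen: dict[str, None] = {}
    for name in usernames:
        name = name.strip()
        if not name:
            continue
        # As typed, all lowercase, and first-letter-capitalized
        # (the form a phone keyboard tends to produce).
        for variant in (name, name.lower(), name.capitalize()):
            seen.setdefault(variant, None)
    return list(seen)


def lookup_checklist(usernames: list[str], versions: list[str]) -> list[tuple[str, str]]:
    """Pair every username spelling with every dataset version label."""
    return list(product(username_variants(*usernames), versions))


# Hypothetical usernames and version labels, for illustration only.
for username, version in lookup_checklist(["ExampleUser", "oldname"], ["v1.0", "v2.0"]):
    print(f"check {username!r} against {version}")
```

This only builds the list of combinations; it does not query the search tool itself.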

@emenel @huggingface It's even worse than that: opting out doesn't even seem to work. I filed https://github.com/bigcode-project/opt-out-v2/issues/46 (as per their own process) almost 6 months ago. My repos still show up in "Am I in The Stack?" if I type my GitHub username, and as you can tell, there has been no visible activity on the issue in all that time.
Opt-out request for babolivier · Issue #46 · bigcode-project/opt-out-v2

I request that the following data is removed from The Stack: Commits GitHub issue babolivier/clouzytweet babolivier/clouzytweets babolivier/cozy-pass babolivier/cozy-tweets babolivier/infra baboliv...

GitHub
@emenel @huggingface it's worse. If you had a repo and it was forked all separate forks need to ask for removal.
@emenel @huggingface I expected better from you, hugging face, considering some of the people in your employ.
@emenel @huggingface I mean, I don’t know why I expected better.
@emenel @huggingface “ One of our goals in this project is to give people agency over their source code by letting them decide whether or not it should be used to develop and evaluate machine learning models, as we acknowledge that not all developers may wish to have their data used for that purpose.”

… “and if anyone is dead, or lost access to their account, or just plain old not paying attention,
yoink”.
@emenel @huggingface Are you able to find your repos in the Software Heritage archive using the search feature here? https://archive.softwareheritage.org/browse/search/?q=simonw%2F&with_visit=true&with_content=true
Search software origins to browse – Software Heritage archive
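
For anyone who wants to build the same Software Heritage search link for their own account, here is a small sketch. It only reproduces the URL pattern from the link above (the username followed by a slash, plus the `with_visit` and `with_content` flags); the function name is hypothetical, and whether the search matches on case is not something this thread establishes.

```python
from urllib.parse import urlencode

SWH_SEARCH = "https://archive.softwareheritage.org/browse/search/"


def swh_search_url(github_username: str) -> str:
    """Build a search URL shaped like the one shared in the thread."""
    query = urlencode(
        {
            "q": f"{github_username}/",  # trailing slash, as in the shared link
            "with_visit": "true",
            "with_content": "true",
        }
    )
    return f"{SWH_SEARCH}?{query}"


print(swh_search_url("simonw"))
```

Open the printed URL in a browser and repeat for any old usernames.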

@emenel @huggingface they scraped some of my bottom-of-the-barrel throw-away shell & python scripts and I'm not gonna remove them. I want AI to learn the bullshit I write when I'm in a hurry. It'll be great.

@gabrielesvelto @emenel @huggingface

"This dataset is derived from the Software Heritage archive, the largest public archive of software source code and accompanying development history. Software Heritage is an open, non profit initiative to collect, preserve, and share the source code of all publicly available software, launched by Inria, in partnership with UNESCO. We acknowledge Software Heritage for providing access to this invaluable resource."

@emenel you have to check each of the stack versions too. I have different repo lists for each show up.
@emenel another thing, many of the repos in my list are forks, so even if you request deletion they’ve still got any forks of your repos
@emenel @huggingface According to their site my code isn’t included, but can I trust them to tell me that? Possibly not, if they think consent is opt-out.
@orbitalmartian just a comment here from myself that on my first search my repos were not in there. But, then I realized my phone auto-capitalized my username. I retried with the username all lowercase and then all of my repos showed up. The search was literally case-sensitive on the username.
@hugo Yeah, I couldn’t remember whether my username was capitalised or not so tried with and without capitalisation and still nothing.
@emenel @huggingface Only see some old public repos with basic and crappy code.
@emenel @huggingface I just built a tool you can use to search the GH Archive of events to help check if your repository was ever public in the past: https://observablehq.com/@simonw/github-public-repo-history
GitHub Public repo history

This tool looks up events from your account that were stored in the GH Archive (dating back to around 2011) that may indicate a public repository - either pushes you made to a public repo, or times when you switched a repo from private to public. For background: https://til.simonwillison.net/clickhouse/github-public-history

Observable
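
The idea described above (look for pushes to a public repo, or a private-to-public switch) can be sketched against already-downloaded GH Archive events. This is not the linked tool's code. It assumes each event is a dict shaped like the GitHub events API records (`type`, `actor.login`, `repo.name`, `created_at`), and the function name and sample data are hypothetical.

```python
from typing import Iterable, Iterator

# GH Archive only records public activity, so a PushEvent implies the repo
# was public at that moment; a PublicEvent marks a private-to-public switch.
PUBLIC_SIGNAL_TYPES = {"PushEvent", "PublicEvent"}


def public_signals(events: Iterable[dict], login: str) -> Iterator[tuple[str, str, str]]:
    """Yield (created_at, event type, repo name) for one account's events."""
    wanted = login.lower()
    for event in events:
        actor = (event.get("actor") or {}).get("login", "")
        if actor.lower() != wanted:
            continue
        if event.get("type") in PUBLIC_SIGNAL_TYPES:
            yield (event["created_at"], event["type"], event["repo"]["name"])


# Hypothetical sample events, for illustration only.
sample = [
    {"type": "PushEvent", "actor": {"login": "ExampleUser"},
     "repo": {"name": "exampleuser/notes"}, "created_at": "2015-03-01T12:00:00Z"},
    {"type": "WatchEvent", "actor": {"login": "exampleuser"},
     "repo": {"name": "other/project"}, "created_at": "2015-04-01T08:00:00Z"},
    {"type": "PublicEvent", "actor": {"login": "exampleuser"},
     "repo": {"name": "exampleuser/secret"}, "created_at": "2016-07-09T21:30:00Z"},
]

for created_at, kind, repo in public_signals(sample, "exampleuser"):
    print(created_at, kind, repo)
```

An empty result does not prove a repo was never public; it only means no such event was found in the data you fed in.
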
@simon @huggingface thanks, this seems very helpful
@emenel @huggingface Remember folks: Github is microsoft's bitch now. You couldn't REALLY expect anything else from them at this point...