If you had code on GitHub at any point, it looks like it might be included in a large dataset called “The Stack”. If you want your code removed from this massive “ai” training data, go here:

https://huggingface.co/spaces/bigcode/in-the-stack

I found two of my old GitHub repos in there. Both were deleted last year and both were private. This is a serious breach of trust by GitHub and @huggingface.

Remove all your code from GitHub.

CONSENT IS NOT OPT-OUT.

Edit — thanks for all the replies. More context here: https://hachyderm.io/@joeyh/112105744123363587

Also, i’m sure the repos of mine that i found were private, but even if they were public at some point in the past, for a brief time, that isn’t my consent to use them for purposes beyond their intent.

---
Edit 2 -- I see this made it to HN, which is a level of attention I do not want nor appreciate....

For all those wondering about the private repo issue -- No, I am not 100% sure that these ancient repos weren't at some point public for a split second before I changed it. I do know that they were never meant for this and that one of them didn't even contain any code.

If my accidentally making a repo public for a moment just so happened to overlap with this scraping, then I guess that's possible. But that in no way invalidates the issues, or the anger that i feel about it.

@emenel @huggingface

I work for GitHub. This post was shared in our Slack, on a non-work-related channel. We don't think it's us.

https://huggingface.co/datasets/bigcode/the-stack-v2

They say they get data from Software Heritage, a website that archives repos from GitHub. If your repos were ever open, they might have been archived there even after you deleted them from GitHub.

That's my best guess.

@correcthorse @emenel @huggingface it listed repos i had that were initially created as private, and i *paid* to have them as private. this is a *massive fucking breach of trust*.

update edit: it appears the repo was set to public somehow later on. see later replies for info.

@dietinghippo @emenel @huggingface

The code couldn't have leaked from anywhere else? I'm telling you, if this was a thing there would be a riot at GitHub.

@correcthorse @emenel @huggingface hey apologies about the initial harshness, for some reason that repo is now public and i don't recall setting it as public ever. i'm now trying to figure out when that happened because it was a private collaboration repo.

i'm still not ecstatic that any of my code was included in this dataset, but github's off the hook in my ledger on directly providing code.

@dietinghippo @emenel @huggingface

No apology needed, I'm just a random person who has a little more info. I just wanted to shed some light.

I can't promise the company I work for is not shady, I can only say that if it turned out they were, a lot of us would leave.

@dietinghippo @correcthorse @huggingface as i mentioned in my additions to the post (via edits) -- i can't be 100% certain that my repos at issue weren't public for some moment, either by accident or because i needed to share something momentarily. they are also now long gone, and both are quite old. that doesn't make them fair game for use beyond their intent and license. especially since one of them wasn't even code ....