If you had code on GitHub at any point, it looks like it might be included in a large dataset called “The Stack”. If you want your code removed from this massive “AI” training data, go here:

https://huggingface.co/spaces/bigcode/in-the-stack

I found two of my old GitHub repos in there. Both were deleted last year and both were private. This is a serious breach of trust by GitHub and @huggingface.

Remove all your code from GitHub.

CONSENT IS NOT OPT-OUT.

Edit -- Thanks for all the replies. More context here: https://hachyderm.io/@joeyh/112105744123363587

Also, I’m sure the repos of mine that I found were private. But even if they were public at some point in the past, for a brief time, that isn’t my consent to use them for purposes beyond their intent.

---
Edit 2 -- I see this made it to HN, which is a level of attention I neither want nor appreciate...

For all those wondering about the private repo issue -- No, I am not 100% sure that these ancient repos weren't at some point public for a split second before I changed them to private. I do know that they were never meant for this and that one of them didn't even contain any code.

If my accidentally making a repo public for a moment just so happened to overlap with this scraping, then I guess that's possible. But that in no way invalidates the issues or the anger that I feel about them.

@emenel ROFL. In their project description: "The licenses we consider permissive are listed <here>."

The "here" link:

@galaxis @emenel I found most of my public repos on there, no private ones though.
@Zylann I found none of my private repos.