Mastodawn

emenel Mar 17, 2024

If you had code on GitHub at any point, it looks like it might be included in a large dataset called “The Stack”. If you want your code removed from this massive “ai” training data, go here:

https://huggingface.co/spaces/bigcode/in-the-stack

I found two of my old GitHub repos in there. Both were deleted last year and both were private. This is a serious breach of trust by GitHub and @huggingface.

Remove all your code from GitHub.

CONSENT IS NOT OPT-OUT.

Edit — thanks for all the replies. More context here: https://hachyderm.io/@joeyh/112105744123363587

Also, I’m sure the repos of mine that I found were private. But even if they were public for a brief time at some point in the past, that isn’t my consent to use them for purposes beyond their intent.

---
Edit 2 -- I see this made it to HN, which is a level of attention I do not want nor appreciate....

For all those wondering about the private repo issue -- No, I am not 100% sure that these ancient repos weren't at some point public for a split second before I changed it. I do know that they were never meant for this and that one of them didn't even contain any code.

If my accidentally making a repo public for a moment just so happened to overlap with this scraping, then I guess that's possible. But it in no way invalidates the issues, or the anger that I feel about it.

Am I in The Stack? - a Hugging Face Space by bigcode

This app lets you check if your GitHub repositories are part of The Stack dataset. Enter your GitHub username and select the dataset version to see if your code is included. If you want your da...


correcthorse Mar 20, 2024

@emenel @huggingface

I work for GitHub. This post was shared in our Slack, in a non-work-related channel. We don't think it's us.

https://huggingface.co/datasets/bigcode/the-stack-v2

They say they get data from Software Heritage, a website that archives repos from GitHub. If your repos were ever open, they might have been archived there, and would remain there even after you deleted them from GitHub.

That's my best guess.



technomancy Mar 20, 2024

@correcthorse @emenel I think this is accurate. This thread has links to where SWH talks about their involvement:

https://hachyderm.io/@joeyh/112105744123363587

see shy jo (@joeyh@hachyderm.io)

I am disappointed in Software Heritage. They made this statement on using their archive as an AI training dataset: https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code/?ref=openml.fyi These seem like good principles. But they are not actually sufficient to respect our work. And the third is too weak, and appears to be providing a figleaf for extractive behavior.

Hachyderm.io


emenel Mar 20, 2024

@technomancy @correcthorse Thanks for the additional info. I’m pretty sure the repos of mine were private. If it were only me I could be misremembering something, but I have heard from a number of people that they also found repos in this dataset that were private. So I’m not sure what to think or how to explain it.


correcthorse

@emenel @technomancy

Me neither. But if something shady like that was happening here me and many other people wouldn't be working here.

I can't vouch for everything the company does. But me and my coworkers would be very surprised if private repos were going public, and most of us wouldn't stand for it.

I'm not in sales, I don't care what you use. But I used gh before working for them and I'm comfortable using them even after I leave.