"GitHub’s Copilot will use you as AI training data, but you can opt out"

"...if you’ve used the code completion in Visual Studio Code, asked Copilot a question on the GitHub website, or used another related AI feature, your interactions and code snippets could be harvested...."

https://www.howtogeek.com/githubs-copilot-will-use-you-as-ai-training-data-but-you-can-opt-out/

#ai #microsoft #copilot

@ai6yr I’ve assumed all along that ALL the code stored in GitHub has been used to train their LLM. Does anyone believe that is not the case?

@patmikemid
@ai6yr
Research shows it doesn't take a lot of documents to poison the model. If they are training on all code, then doesn't that sound like the model is very risky?

https://arxiv.org/abs/2510.07192

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

Poisoning attacks can compromise the safety of large language models (LLMs) by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming adversaries control a percentage of the training corpus. However, for large models, even small percentages translate to impractically large amounts of data. This work demonstrates for the first time that poisoning attacks instead require a near-constant number of documents regardless of dataset size. We conduct the largest pretraining poisoning experiments to date, pretraining models from 600M to 13B parameters on Chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data. We also run smaller-scale experiments to ablate factors that could influence attack success, including broader ratios of poisoned to clean data and non-random distributions of poisoned samples. Finally, we demonstrate the same dynamics for poisoning during fine-tuning. Altogether, our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed as the number of poisons required does not scale up with model size, highlighting the need for more research on defences to mitigate this risk in future models.

@evilotto @patmikemid So you're saying I should check all my buggy code into Github? 🤪
@patmikemid @ai6yr it was always the case; early versions of Copilot would happily suggest other people's API keys etc. if they had been accidentally committed
@SecureOwl @patmikemid LOL there are so many keys in github. I imagine people are already automatically scraping them for nefarious purposes.

@ai6yr @SecureOwl @patmikemid

furiously writes new rules to always discard API keys when it pushes

@ai6yr @patmikemid oh yeah 100% - when i was running security for an IoT platform (yes, we had security), i used to scrape defensively as well and reach out to people who committed api keys to our platform by accident before they could be used by bad actors

GitHub has a program that will autodetect them too, but you have to commit to using a unique key format so they can use a more reliable regex
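A rough illustration of why a unique key format helps (the "acme_sk_" prefix below is hypothetical, not a real partner format): once every key starts with a distinctive prefix, detection collapses to a single low-false-positive regex.

```shell
# Hypothetical key format: "acme_sk_" followed by 32 hex characters.
# A unique prefix means a plain regex can flag leaks reliably, which is
# what format-based secret scanning relies on.
dir=$(mktemp -d)
printf 'API_KEY = "acme_sk_%s"\n' 'deadbeefdeadbeefdeadbeefdeadbeef' > "$dir/config.py"
# Recursively scan the tree for the prefixed key pattern
grep -rEn 'acme_sk_[0-9a-f]{32}' "$dir"
rm -rf "$dir"
```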

@SecureOwl @ai6yr @patmikemid

It is called "secret scanning" on Github (and Azure DevOps)

You can get a similar (and arguably better) result with gitleaks and a Git pre-commit hook.

If your environment has pipelines/runners, ALSO add a job (or whatever your CI variant calls it) that triggers on commits and runs gitleaks.

That won't stop secrets from being committed, but you'll get a warning that secrets are being stored.
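The pipeline-side check can be as small as one step, assuming gitleaks is installed on the runner and the checkout is not shallow:

```shell
# Run inside a CI job after checking out the full history (a shallow
# clone hides older commits from the scan). "gitleaks detect" walks
# every commit and exits non-zero if it finds a secret, failing the job.
gitleaks detect --source . --verbose
```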

@ai6yr
@SecureOwl @patmikemid
They are, and they're very fast, sometimes faster than the alerts telling you that you inadvertently pushed keys.

https://www.helpnetsecurity.com/2024/12/02/revoke-exposed-aws-keys/

The shocking speed of AWS key exploitation - Help Net Security

Publicly exposed AWS access keys are being scraped and misused by attackers and organizations are failing to revoke them in time.
