AI Loophole #1: Your GitHub README.md

https://lemmy.world/post/16572074


I used to be the Security Team Lead for Web Applications at one of the largest government data centers in the world, but now I do mostly “source available” security, mainly focusing on BSD. I’m on GitHub, but I run a self-hosted Gogs (from which Gitea was forked) git repo at Quadhelion Engineering Dev (https://quadhelion.dev). On that server I tried to deny AI with Suricata, robots.txt, “NO AI” licenses, Human Intelligence (HI) license links in the software, and “NO AI” comments in posts everywhere on the Internet where my software was posted.

Here is what I found today after correlating all my logs of git clones and scrapes and tracing them back to IP/company/server. Having formerly been loath to give my thinking pattern to a potential enemy, I asked Perplexity AI questions specifically about BSD security, a very niche topic. Although there is a huge general data pool here spanning many decades, my type of software is pretty unique: it is buried, not appearing in the first two pages of a GitHub search for BSD security (https://github.com/search?q=bsd%20security&type=repositories), which is all most users will click through; it is very recent compared to the “dead pool” of old knowledge; and it is fairly well received yet not generally popular, so GitHub traffic analysis is very useful.

The traceback and AI result analysis shows the following:

1. GitHub cloning vs. visitor activity in the Traffic tab DOES NOT MATCH any useful pattern for me, the engineer. Rough estimate of the likelihood of AI training on my own repositories: 60% of clones are AI/automata.
2. A GitHub README.md is not licensable material; it is a public document able to be trained on no matter what software license, copyright, statements, or technical measures are used to dissuade or defeat it.
   a. I’m trying to work out whether determining if any README.md, no matter the context, is trainable is a solvable engineering project, considering my life constraints.
3. Plagiarism of technical writing: probable.
4. Theft of programming “snippets,” or perhaps “single lines of code,” and of the overall logic design pattern for that solution: probable.
5. A supremely interesting choice of datasets used vs. those available, in summary use, but also checking for validation against other software and weighting by reputation factors: “Coq”-like proofing, GitHub stars, employer history?
6. Even though I can see my own writing and formatting right out of my README.md, the citation was to “Phoronix Forum,” but that isn’t true. That’s like saying your post is what “Tick Tock” said. I wrote that; a real flesh-and-blood human being took comparatively massive amounts of time to do it. My birth name is there in the post twice, in the repo, in the comments, all over the Internet.

You should test this out for yourself, as I’m not going to take days or a week making a great presentation of a technical case. Check your own niche code or a specific application-code question, or make a mock repo with super-niche stuff and lots of code in the README.md, and then check it against AI every day until you see it.

P.S. I pulled up Tabnine and tried to write Ruby so complicated and magically mashed that AI could offer me nothing, just as an AI obfuscation/smartness test. You should try something similar and see what results you get.
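The clone-log correlation described above could be sketched roughly like this. This is a minimal illustration under my own assumptions, not OP's actual tooling: it assumes a plain-text log of `<timestamp> <ip> clone` lines and a precomputed reverse-DNS map, and the hostname keywords used to flag "automata" are arbitrary examples.

```python
import re
from collections import Counter

# Hostname keywords that suggest a datacenter or crawler rather than a human
# developer on a residential/office connection. Purely illustrative.
BOT_HINTS = ("amazonaws", "googlebot", "crawl", "scrape", "azure", "hetzner")

def classify(hostname: str) -> str:
    """Label a reverse-DNS hostname as 'automata' or 'human' (crude heuristic)."""
    host = hostname.lower()
    return "automata" if any(h in host for h in BOT_HINTS) else "human"

def clone_breakdown(log_lines, rdns):
    """Count clones per class, given log lines of the form
    '<timestamp> <ip> clone' and an ip -> hostname reverse-DNS map."""
    counts = Counter()
    for line in log_lines:
        m = re.match(r"\S+\s+(\S+)\s+clone", line)
        if m:
            counts[classify(rdns.get(m.group(1), ""))] += 1
    return counts

# Hypothetical sample data (addresses from the RFC 5737 documentation ranges).
log = [
    "2024-06-10T12:00:00Z 203.0.113.7 clone",
    "2024-06-10T12:05:00Z 198.51.100.2 clone",
    "2024-06-10T13:00:00Z 192.0.2.9 clone",
]
rdns = {
    "203.0.113.7": "ec2-203-0-113-7.compute.amazonaws.com",
    "198.51.100.2": "crawl-198-51-100-2.example-bot.net",
    "192.0.2.9": "cpe-192-0-2-9.home-isp.example.net",
}

counts = clone_breakdown(log, rdns)
print(counts)  # Counter({'automata': 2, 'human': 1})
```

In practice you would populate the reverse-DNS map with something like `socket.gethostbyaddr`, and a percentage like OP's "60% of clones" estimate would come from `counts["automata"] / sum(counts.values())`.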

The comments so far aren’t real people posting how they really feel; it’s an agenda, or automata. Does that tell you I’m over the target or what?

Look, my post is doing really well on the cybersecurity exchanges. So, to all real developers and program managers out there:

I recommend removing any “primary logic” functional code examples from your README.md; that’s it.
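If you want to audit your own repos for that, here is one rough way to flag READMEs that still contain fenced code blocks. This is my own sketch, not a tool OP mentions; the sample README content is hypothetical, and its fence strings are assembled programmatically only so this example is easy to embed.

```python
import re

# Matches a fenced block: opening ``` (with optional language tag), body, closing ```.
FENCE_RE = re.compile(r"```[^\n]*\n(.*?)```", re.DOTALL)

def find_code_blocks(markdown: str):
    """Return the bodies of fenced code blocks found in a Markdown string."""
    return FENCE_RE.findall(markdown)

# Hypothetical README content for a demo run.
fence = "`" * 3
readme = "\n".join([
    "# my-tool",
    "Usage notes.",
    "",
    fence + "ruby",
    "def primary_logic(x)",
    "  x * 42  # the actual algorithm: a candidate for moving out",
    "end",
    fence,
])

blocks = find_code_blocks(readme)
print(len(blocks))                   # 1
for body in blocks:
    print(f"code block, {len(body.splitlines())} lines: consider moving it out of README.md")
```

Pointing this at every `README.md` in your org and reviewing the hits is a quick way to act on the recommendation without manually rereading each file.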

PSA, Here to help, Elias

Lmao you got some criticism and now you’re saying everyone else is a bot or has an agenda. I am a software engineer and my organization does not gain any specific benefits for promoting AI in any way. They don’t sell AI products and never will. We do publish open source work however, and per its license anyone is free to use it for any purpose, AI training included. It’s actually great that our work is in training sets, because it means our users can ask tools like ChatGPT questions and it can usually generate accurate code, at least for the simple cases. Saves us time answering those questions ourselves.

I think that the anti-AI hysteria is stupid virtue signaling for luddites. LLMs are here, whether or not they train on your random project isn’t going to affect them in any meaningful way, there are more than enough fully open source works to train on. Better to have your work included so that the LLM can recommend it to people or answer questions about it.

> you got some criticism and now you’re saying everyone else is a bot or has an agenda

Please look up ad hominem, and stop doing it. Yes, their responses are a distraction from the topic at hand, but so were the random posts calling OP paranoid. I'd have been on the defensive too.

> [Our company] publish[es] open source work ... anyone is free to use it for any purpose, AI training included

Great, I hope this makes the models better. But you made that decision. OP clearly didn't. In fact, they attempted to use several methods to explicitly block it, and the model trainers did it anyway.

> I think that the anti-AI hysteria is stupid virtue signaling for luddites

Many loudly outspoken figures against the use of stolen data for the training of generative models work in the tech industry, myself included (I've been in the industry for over two decades). We're far from Luddites.

> LLMs are here

I've heard this used as a justification for using them, and reasonable people can discuss the merits of the technology in various contexts. However, this is not a justification for defending the blatant theft of content to train the models.

> whether or not they train on your random project isn’t going to affect them in any meaningful way

And yet, they did it while ignoring explicit instructions to the contrary.

> there are more than enough fully open source works to train on

I agree, and model trainers should use that content, instead of whatever they happen to grab off every site they happen to scrape.

> Better to have your work included so that the LLM can recommend it to people or answer questions about it

I agree, if you give permission for model trainers to do so. That's not what happened here.

Why do you think they need your permission to use information you posted publicly to train their models? Copyright isn’t unlimited, and model training is probably fair use.

"Your honor, we can use whatever data we want because model training is probably fair use, or whatever".

I don't know what's worse, the fact that you think creators don't have the right to dictate how their works are used, or that you apparently have no idea what fair use is.

This might help: https://copyright.gov/fair-use/

U.S. Copyright Office Fair Use Index

The goal of the Index is to make the principles and application of fair use more accessible and understandable to the public by presenting a searchable database of court opinions, including by category and type of use (e.g., music, internet/digitization, parody).

> authors should have no say in how published works are used.

I already replied to the essence of this in my reply to your other post about how "illegal downloads aren't theft because it's a copy," but I'll mention here that this is even more evidence that you aren't a creator, and I suggest that your opinions on this subject aren't relevant and that you should avoid subjecting other people to them.
Your attacks on my identity don’t undercut my claims at all.

"evidence suggests that you probably aren't a creator"
"As a result, I suggests that your opinions aren't relevant"

Aside from the fact that these are not character attacks, I encourage you to refute my assumptions. Otherwise, my points will stand on their own.

On the internet, no one knows you’re a dog. Whether I have or not, saying so doesn’t prove it. What I said stands on its own merits, and your inability to make an argument without attacking identity speaks to the strength of your argument, your understanding of the subject, and your ability (or willingness) to engage in good faith.