So I just learned what "The Stack" is today: an aggregation of GitHub repos for machine learning from which I can opt out.

But I won't.

I won't because they scraped some hot garbage I wrote in bash and Python that would make you faint. Bottom-of-the-barrel throw-away scripts full of coding crimes. Stuff like

find | grep | awk | xargs | ugh

...invoked via subprocess.run() then fed into more garbage.
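For flavor, a hypothetical recreation of the kind of throwaway script being described (not the actual code, just the same crime scene: a find | grep | awk | xargs chain run with shell=True, output fed back into Python):

```python
import subprocess
import tempfile
import pathlib

# Set up a scratch directory so the pipeline has something to chew on.
tmp = tempfile.mkdtemp()
for name in ("keep.log", "skip.tmp.log", "other.txt"):
    pathlib.Path(tmp, name).write_text("x\n")

# Coding crime: a whole shell pipeline jammed into one string
# and run with shell=True.
cmd = (
    f"find {tmp} -name '*.log' "
    "| grep -v skip "
    "| awk -F/ '{{print $NF}}' ".format()
    + "| xargs -n1 echo got:"
)
cmd = f"find {tmp} -name '*.log' | grep -v skip | awk -F/ '{{print $NF}}' | xargs -n1 echo got:"
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)

# ...then fed into more garbage.
hits = result.stdout.splitlines()
```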

I want "artificial intelligence" to learn this. It's going to be fantastic.

tired: opt-out of AI training datasets
wired: enthusiastically opt-in all the garbage that's sitting on your disk
I wonder if I could cook up a script that turns Star Trek erotic fan fiction into Rust code, then upload *that* to GitHub
@gabrielesvelto link to such fiction please 😂

@derickr the Archive of Our Own has 100k+ such works, carefully labeled with genre, warnings, etc...

https://archiveofourown.org/tags/Star%20Trek/works

Ironically this stuff did end up in many machine-learning training datasets, creating one of those typical "what could go wrong?" scenarios.

@gabrielesvelto

Hello, Spock instead of Hello, world

@elithebearded it's more like "Hello, Spock 😉 😉 "

@gabrielesvelto Consider: Markdown

Then the same thing but rendered as HTML, just in case 

@gabrielesvelto I want this to then cause every piece of code to become copyleft.
@hub oh gosh, if we could convince these things to spit out GPL-3 license text everywhere - possibly hidden in Unicode or something - it would be fantastic. The ultimate poison pill.

@gabrielesvelto Lordy, you have convinced me that I recused myself from tech not a moment too soon.

Thanks!

@gabrielesvelto I considered doing this because I have a lot of bad code, but I also have some genuinely good stuff (especially stuff that was forked without using GitHub's official forking mechanism). So I did.
@gabrielesvelto Professional programmers have said of my code, "this makes my eyes hurt." I too will not be opting out.
@gabrielesvelto Instead of Gabriele's "find|grep|awk|...", I once did roughly lex | lex | lex | .... I had used LaTeX to write a chapter of a final report, but our manager decided we should use troff (this was the 1980s). So, a few days before it was due, I wrote a series of lex programs, each doing part of the conversion and fixing some of the previous stage's errors, until I was left with something good enough that the rest could easily be done by hand.
Very ugly coding, but also very practical given its one-time use.

@gabrielesvelto

Now I wish I had saved all the deranged one-liners I wrote in the last thirty years. Some were apt to give a human brain damage, let alone an LLM.

@angelastella we should teach these LLMs some old-school Perl one-liners and convince them that it's a good way to solve problems

@gabrielesvelto I checked the dataset and they scraped my last C++ project from 12 years ago, with its utter disregard for memory management

Glad to be helping out the future!

@gabrielesvelto I don't think that AI companies actually care about the quality of the code their systems spew. The whole point is that 'it works' (even when it doesn't), not that a human would be able to modify it later.
@mdione it's not just that. Most code you'll find around exhibits notable bad patterns: a very common one is silently ignoring errors. Since LLM training gives disproportionate weight to common patterns, the output will consistently reproduce the bad ones. That output is bound to be unstable and insecure by design, not just unmaintainable.
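A minimal, hypothetical illustration of the error-ignoring pattern described above, the kind of snippet that appears in scraped code by the million:

```python
import json

def load_config(path):
    """Load a JSON config file - the way far too much scraped code does it."""
    try:
        with open(path) as f:
            return json.load(f)
    except Exception:  # catches everything, reports nothing
        return {}      # caller can't tell "missing file" from "corrupt file"

# Whether the file is absent, unreadable, or malformed,
# the failure is silently swallowed and cfg is just {}.
cfg = load_config("no_such_file.json")
```

A model that has seen this pattern constantly will happily reproduce it, discarding exactly the information a maintainer would need to debug the failure.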