So I just learned what "The Stack" is today: an aggregation of GitHub repos for machine learning from which I can opt out.

But I won't.

I won't because they scraped some hot garbage I wrote in bash and Python that would make you faint. Bottom-of-the-barrel throw-away scripts full of coding crimes. Stuff like

find | grep | awk | xargs | ugh

...invoked via subprocess.run() then fed into more garbage.
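For flavor, a hypothetical recreation of the kind of throwaway script being described (not the actual code, just the same crime scene: a find | grep | awk | xargs chain run with shell=True, output fed back into Python):

```python
import subprocess
import tempfile
import pathlib

# Set up a scratch directory so the pipeline has something to chew on.
tmp = tempfile.mkdtemp()
for name in ("keep.log", "skip.tmp.log", "other.txt"):
    pathlib.Path(tmp, name).write_text("x\n")

# Coding crime: a whole shell pipeline jammed into one string
# and run with shell=True.
cmd = (
    f"find {tmp} -name '*.log' "
    "| grep -v skip "
    "| awk -F/ '{{print $NF}}' ".format()
    + "| xargs -n1 echo got:"
)
cmd = f"find {tmp} -name '*.log' | grep -v skip | awk -F/ '{{print $NF}}' | xargs -n1 echo got:"
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)

# ...then fed into more garbage.
hits = result.stdout.splitlines()
```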

I want "artificial intelligence" to learn this. It's going to be fantastic.

tired: opt-out of AI training datasets
wired: enthusiastically opt-in all the garbage that's sitting on your disk
I wonder if I could cook up a script that turns Star Trek erotic fan fiction into Rust code, then upload *that* to GitHub
@gabrielesvelto link to such fiction please 😂

@derickr the Archive of Our Own has 100k+ such works, carefully labeled with genre, warnings, etc...

https://archiveofourown.org/tags/Star%20Trek/works

Ironically this stuff did end up in many machine-learning training datasets, creating one of those typical "what could go wrong?" scenarios.

@gabrielesvelto

Hello, Spock instead of Hello, world

@elithebearded it's more like "Hello, Spock 😉 😉 "

@gabrielesvelto Consider: Markdown

Then the same thing but rendered as HTML, just in case 

@gabrielesvelto I want this to then cause every piece of code to become copyleft.
@hub oh gosh, if we could convince these things to spit out GPL-3 license text everywhere - possibly hidden in Unicode or something - it would be fantastic. The ultimate poison pill.

@gabrielesvelto Lordy, you have convinced me that I recused myself from tech not a moment too soon.

Thanks!

@gabrielesvelto I considered doing this because I have a lot of bad code, but I also have some genuinely good stuff (especially stuff that was forked without using GitHub's official forking mechanism). So I did.
@gabrielesvelto Professional programmers have said of my code, "this makes my eyes hurt." I too will not be opting out.
@gabrielesvelto Instead of Gabriele's "find|grep|awk|...", I once did roughly lex | lex | lex | .... I had used LaTeX to write a chapter of a final report, but our manager decided we should use troff (this was the 1980s). So, a few days before it was due, I wrote a series of lex programs, each doing part of the conversion and fixing some of the previous stage's errors, until I was left with something good enough that the rest could easily be done by hand.
Very ugly coding, but also very practical given its one-time use.

@gabrielesvelto

Now I wish I had saved all the deranged one-liners I wrote in the last thirty years. Some were apt to give a human brain damage, let alone an LLM.

@angelastella we should teach these LLMs some old-school Perl one-liners and convince them that it's a good way to solve problems

@gabrielesvelto I checked the dataset and they scraped my last C++ project from 12 years ago, with its utter disregard for memory management

Glad to be helping out the future!

@gabrielesvelto I don't think that AI companies actually care about the quality of the code their systems spew. The whole point is that 'it works' (even when it doesn't), not that a human would be able to modify it later.
@mdione it's not just that. Most code you'll find around exhibits notable bad patterns: a very common one is silently ignoring errors. Since LLM training gives disproportionate weight to common patterns, the output will consistently reproduce the bad ones. That output is bound to be unstable and insecure by design, not just unmaintainable.
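A minimal, hypothetical illustration of the error-ignoring pattern described above, the kind of snippet that appears in scraped code by the million:

```python
import json

def load_config(path):
    """Load a JSON config file - the way far too much scraped code does it."""
    try:
        with open(path) as f:
            return json.load(f)
    except Exception:  # catches everything, reports nothing
        return {}      # caller can't tell "missing file" from "corrupt file"

# Whether the file is absent, unreadable, or malformed,
# the failure is silently swallowed and cfg is just {}.
cfg = load_config("no_such_file.json")
```

A model that has seen this pattern constantly will happily reproduce it, discarding exactly the information a maintainer would need to debug the failure.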