Good morning folks
It's a public holiday here, so I've done a smidge of gardening and now: time to clean some code a little

As expected, weeding has made my back hurt.

Eep.

Let me start up #Barkscrolling for ya

(edit: it already has posts? Nice!)

And some paperbark

#Barkscrolling

Putting together my own LLM dataset(s) has been verrrry educational in a lot of directions.

First, the importance of not letting anything time-sensitive in (X TV show is on now, Y thing just happened, even things like 'you can buy Z for $4'). This is waaaay more pervasive than you might expect.
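
For anyone curious what catching that kind of thing could look like, here's a rough sketch, not what I actually run. The pattern list and the names (`TIME_SENSITIVE_PATTERNS`, `is_time_sensitive`, `split_samples`) are made up for illustration, and a regex pass like this will both miss things and flag innocent lines, so it only sorts samples into "keep" and "look at this by hand".

```python
import re

# Hypothetical starter patterns; assumed for illustration, nowhere near exhaustive.
TIME_SENSITIVE_PATTERNS = [
    # Phrases that pin a sample to "now"
    r"\b(?:right now|just happened|is on now|tonight|today|yesterday|tomorrow)\b",
    r"\b(?:currently|recently|latest|upcoming|breaking)\b",
    # Prices go stale quickly
    r"[$£€]\s?\d",
    # Explicit four-digit years
    r"\b(?:19|20)\d{2}\b",
]
_TIME_SENSITIVE_RE = re.compile("|".join(TIME_SENSITIVE_PATTERNS), re.IGNORECASE)


def is_time_sensitive(text: str) -> bool:
    """Return True if the text matches any of the assumed patterns."""
    return _TIME_SENSITIVE_RE.search(text) is not None


def split_samples(samples):
    """Split samples into (keep, review) lists; nothing is deleted outright."""
    keep, review = [], []
    for sample in samples:
        (review if is_time_sensitive(sample) else keep).append(sample)
    return keep, review


if __name__ == "__main__":
    keep, review = split_samples([
        "X tv show is on now",
        "you can buy Z for $4",
        "The paperbark tree sheds its bark in layers.",
    ])
    print("keep:", keep)
    print("review:", review)
```

Sending matches to a review pile instead of dropping them is deliberate: a year or a "today" is sometimes perfectly fine in context, and a human pass is the only way to tell.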

(No, I'm not importing any conversational data, etc. from people; it's either handwritten or gen'ed.)
There are a few BIG erotica datasets, but... from what I can tell, they're formatted badly, the content mix swings harrrrd, and they're a nightmare to work with, so: probably no on those.

Also: 700+ MB of raw text, all in one file?

Fuuuuuuuuuck thaaaaaaaaat