Good morning folks

Time for a #poll.

Reviews are adverts.

Yes: 0%
*makes a 'maybe' gesture*: 100%
No: 0%
Rebuilding ALL the derived files for the dataset and training from scratch without the headers to see if it's the 'date' field that's causing this 'I will reply with a number, 20% of the time' behaviour in the model.
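For anyone playing along at home, the rebuild step can be as dumb as rewriting the JSONL with the suspect field dropped. A minimal sketch, assuming the derived files are JSONL with a top-level 'date' key (the actual pipeline here may look nothing like this):

```python
import json

def strip_field(record: dict, field: str = "date") -> dict:
    """Return a copy of the record with the given field removed."""
    return {k: v for k, v in record.items() if k != field}

def rebuild(in_path: str, out_path: str, field: str = "date") -> None:
    """Rewrite a JSONL dataset, dropping `field` from every record."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(json.dumps(strip_field(json.loads(line), field)) + "\n")
```

Then retrain from scratch on the stripped files and see if the number-babble goes away.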
I'm also wondering if, ultimately, my finetune dataset is way too broad. Perhaps I should pull it back to just the one Persona and see if that helps.
If it did, I'd do a training run on each persona, so I'd basically have one model per persona. That wouldn't be bad, since I *can* run 6 or 7 of these models concurrently.
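Splitting the dataset per persona could look something like this. A sketch only, assuming each record carries a 'persona' key (hypothetical field name, going by the `persona:` tag in the sample generations below):

```python
import json
from collections import defaultdict

def split_by_persona(records):
    """Group records by their 'persona' field (hypothetical key name)."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec.get("persona", "unknown")].append(rec)
    return buckets

def write_splits(buckets: dict, prefix: str = "train") -> None:
    """One JSONL file per persona -> one finetune run per persona."""
    for persona, recs in buckets.items():
        safe = persona.replace(" ", "_")
        with open(f"{prefix}_{safe}.jsonl", "w") as dst:
            for rec in recs:
                dst.write(json.dumps(rec) + "\n")
```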

THIS IS NEW (and counts):

🗣️ Prompt: user: [The user is an older man named {user}], scenario: [{char} has spent the day reading a book, {user} has just arrived.], conversation: [{user}: good evening, {char} good to see you.
{char}: hello, {user}] External data: [] Prompt: [What's up?]

[temp: 0.8 | persona: sarcastic teen] 🤖: The next bookmark

🗣️ Prompt: Do you ever dream in color?

[temp: 0.2 | persona: hivequeen] 🤖: The swarm does.

... this... also counts!

On the plus side: yes, I 'fixed' the random number bullshit it was giving me.

On the weird side, it seems obsessed with the words 'anarchy' and 'animal'.

*reads*

*looks up at camera*

*reads again*

*crumples up paper and throws it over shoulder*

*picks up imaginary phone*

Hello, Irony? Yes, I'd like to direct your attention...

Lemme just hop on the OpenData subreddit, surely I can find some good datasets.

Nope, it's just people jerking off about how they believe this shit should be open. Not providing shit. Even the stuff they SAY is open and available, they're not linking.

Genuinely, I've been doing some pretty severe interrogating of various corners of the internet lately (lots of news sites, blogs, etc) outside my usual milieu, and all I'm fucking finding is people talking ABOUT things and not actually showing these things/linking to resources.

Now, to counterbalance my bitching:
https://exoplanetarchive.ipac.caltech.edu/

WOOO EXOPLANETS

#exoplanets #datasets

NASA Exoplanet Archive

Here's a hint: if you put this into your LLM dataset, alllllll you're gonna fuckin' get is random numbers as your output.
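To see why, a rough illustration. The row below is made up to look like a planet-parameters table (name, period, radius, mass, and so on), not real archive data, but the point stands: once you drop the name, the text is almost all digits, so next-token training on it mostly teaches the model to emit numbers:

```python
def digit_fraction(row: str) -> float:
    """Fraction of non-separator characters that are digits or decimal points."""
    chars = [c for c in row if c not in ", "]
    return sum(c.isdigit() or c == "." for c in chars) / len(chars)

# Made-up row shaped like a confirmed-planets table entry.
row = "Kepler-22 b,289.9623,2.38,9.1,0.849,587,2011"
print(f"{digit_fraction(row):.0%} of this row is numeric")
```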

No, this is NOT what I did, I did something ELSE, thank you.

But I can learn from my mistakes, kthx.

@DarkestKale When newspapers were on paper, there was some excuse for this. Not now that you can just put in "and here is a link to the publication we're summarising".
@RogerBW here's the fucking worst thing about news websites: they change, and links break, constantly and consistently.

@RogerBW Our ABC (not the yank one) will CONSTANTLY change the URLs of articles, the headlines, etc - up to eight times in one day, just republishing with slight (unflagged) 'updates'.

There's no such thing as being able to reference an ABC article. It's fucking quicksand.

@DarkestKale Even without human fuckery, there is the sort of CMS that makes the master index to an article a bit of text based on its title, and the sort that makes it an arbitrary number.
@DarkestKale The Swindon of datasets? (Supposedly it was so close to the mean demographic of all England that it was used for test marketing all sorts of new products. After a few years they noticed that lots of things were doing well there but failing nationwide, and eventually realised that word had got round and people who liked trying new products had moved to Swindon to do so.)

@RogerBW huh. Our 'proving ground' used to be Tasmania. Especially for telecoms.

Smaller population, bounded by sea, so you can basically run good tests over there, etc - while keeping the same currency, language and legal system.

@DarkestKale I've seen reviews that're such blatant adverts you can see the maker's fingers when the reviewer opens their mouth too wide.
I've also seen reviews that're warnings against toxic waste.