my stupid llm research is absofuckinglutely not going the way i was hoping.

ive spent like a fucking week trying to set up a testing harness to get local models to run the same test 100 times, aperture science style, to measure how much their results drift (rough sketch of what i mean after the list below)

but 100% of the time, the model:
- emits tool calls incorrectly, so i see the raw calls in the output
- ignores instructions
- falls into a loop
- says its gonna do stuff, then .. just doesnt
- intentionally deviates from instructions even when explicitly told not to
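for context, the harness idea is basically this kind of loop - a rough sketch only, not my actual code, and it assumes an ollama-style openai-compatible endpoint on localhost plus a placeholder model name:

```python
# minimal drift-harness sketch -- NOT the actual research code.
# assumes a local OpenAI-compatible server (e.g. what Ollama exposes at
# http://localhost:11434/v1) and a placeholder model name.
from collections import Counter

import requests

ENDPOINT = "http://localhost:11434/v1/chat/completions"  # assumed local server
MODEL = "llama3"                                          # placeholder model name
PROMPT = "List the first five prime numbers, nothing else."
RUNS = 100


def ask_once() -> str:
    """Send the same prompt once and return the raw completion text."""
    resp = requests.post(
        ENDPOINT,
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": PROMPT}],
            "temperature": 0,  # even at temperature 0, local models can drift
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()


def main() -> None:
    # run the identical test RUNS times and see how many distinct answers come back
    answers = [ask_once() for _ in range(RUNS)]
    counts = Counter(answers)
    print(f"{len(counts)} distinct answers across {RUNS} runs")
    for answer, n in counts.most_common(5):
        print(f"{n:3d}x  {answer[:80]!r}")


if __name__ == "__main__":
    main()
```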

@Viss I've begrudgingly got to admit people who are really good at crafting queries can get better results than me. I guess it's kinda like how I Google and generally computer better than others. So LLMs are still terrible, but there are elements of "just because I get bad results out of a tool doesn't necessarily mean NOBODY can use it better than me".

@JessTheUnstill sure - but like, you gotta see some of the shit that comes out of this research im doing. i can admit that on a few occasions an llm (gpt in most cases, anthropic in like 1 or 2) will give me something surprisingly useful

but it will take me HOURS OR DAYS to get to that result. im no front-end dev, so having it do my css is helpful - it can do that heavy lifting, then i can tweak it after.

gpt taught me a cool trick with bash that i now use all over the place

@JessTheUnstill but it took me literally days to get it to do that for me in a remotely functional way.

so the amount of effort is 'roughly the same' between 'me scouring search results to stumble across a thing' vs 'me hitting gpt in the neck over and over again with a tire iron until a zelda1 rupee comes out' - so its ... i dont wanna say 'feature parity', but like, its kinda the same amount of "work for me", but one is free and the other costs 20 bux a month