my stupid llm research is absofuckinglutely not going the way i was hoping.

ive spent like a fucking week trying to set up a testing harness to get local models to run the same test 100 times, aperture science style, so i can measure the drift in their results

but on every single run, the model does at least one of these:
- emits tool calls incorrectly, so i see them
- ignores instructions
- falls into a loop
- says its gonna do stuff, then .. just doesnt
- intentionally deviates from instructions even when explicitly told not to
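for what it's worth, the scoring half of a harness like that is pretty simple even when the model side is a dumpster fire. this is a minimal sketch of the repeated-run drift test described above; `call_model` is a hypothetical stub with canned outputs so the scoring logic actually runs — swap in your real local-model call (names, prompt, and run count are all assumptions, not anything from the thread):

```python
# Sketch: run the same prompt N times, then measure how much the answers drift.
# call_model() is a STUB standing in for a real local-model call.
from collections import Counter
from difflib import SequenceMatcher

def call_model(prompt: str, run: int) -> str:
    # stand-in for hitting a local model endpoint; replace with your own
    canned = ["42", "42", "forty-two", "42\n", "I'll compute that for you."]
    return canned[run % len(canned)]

def drift_report(prompt: str, runs: int = 100) -> dict:
    outputs = [call_model(prompt, i).strip() for i in range(runs)]
    counts = Counter(outputs)
    mode, _ = counts.most_common(1)[0]  # the most frequent answer
    # similarity of every output to the modal answer, 0.0..1.0
    sims = [SequenceMatcher(None, mode, o).ratio() for o in outputs]
    return {
        "distinct": len(counts),                       # how many unique answers
        "mode_share": counts[mode] / runs,             # fraction matching the mode
        "mean_similarity_to_mode": sum(sims) / runs,   # average textual closeness
    }

report = drift_report("what is 6 * 7?", runs=100)
print(report)
```

a "no drift" model would give `distinct == 1` and `mode_share == 1.0`; anything less quantifies exactly the flakiness the list above complains about.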

@Viss I've begrudgingly got to admit that people who are really good at crafting queries can get better results than me. I guess it's kinda like how I Google and generally computer better than most people. The results are still terrible, but there's an element of "just because I get bad results out of a tool doesn't necessarily mean NOBODY can use it better than me"

@JessTheUnstill sure - but like, you gotta see some of the shit that comes out of this research im doing. i can admit that on a few occasions an llm (gpt in most cases, anthropic in like 1 or 2) will give me something surprisingly useful

but it will take me HOURS OR DAYS to get to that result. im no front-end dev, so having it do my css is helpful; it can do the heavy lifting, then i can tweak it after.

gpt taught me a cool trick with bash that i now use all over the place

@Viss Yeah I spent a couple months arguing with the code completion in VSCode and finally gave up on it as useless and went back to intellisense. Sometimes the chat window helps, other times it doesn't. But I do figure if me, a smart engineer, can't figure this stuff out, it's going to have a rough adoption curve for other smart engineers
@JessTheUnstill its tricky. but even still, blindly taking shit out of these things and plopping it into an important place is an exercise in learning how wearing clothing made of c4 works
@Viss yeah, to make a bad analogy, it's wearing power armor before you learn how to fight and shoot. Or flying with autopilot without knowing how to do it with the stick.