my stupid llm research is absofuckinglutely not going the way i was hoping.

ive spent like a fucking week trying to set up a testing harness to get local models to do the same test 100 times, aperture science style, to measure the drift of their results

but 100% of the time, the model does at least one of these:
- emits tool calls incorrectly, so they show up as raw text in the output
- ignores instructions
- falls into a loop
- says its gonna do stuff, then .. just doesnt
- intentionally deviates from instructions even when explicitly told not to

this is nuts. keeping these things on track is a sisyphean task of pure crystalline futility.

how people think they're gonna lose their jobs to this fucking crap is beyond me. the only way to make it work is to put it in trillion-dollar datacenters, so that the 1% of output that isnt absolute trash actually makes it out of the system.

@Viss it's worse than running a child daycare
@morb @Viss Depends. So far LLMs haven't successfully painted the walls with their own shit.
On the other hand babies rarely burn billions of dollars.