my stupid llm research is absofuckinglutely not going the way i was hoping.
ive spent like a fucking week trying to set up a testing harness to run local models through the same test 100 times, aperture science style, and measure how much their results drift (rough sketch of the harness below)
but on 100% of runs, the model does at least one of the following:
- emits tool calls with broken formatting, so they leak into the visible output instead of being parsed
- ignores instructions
- falls into a loop
- says its gonna do stuff, then... just doesnt
- intentionally deviates from instructions even when explicitly told not to
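
for context, the harness itself is the trivial part. heres a minimal sketch of the kind of thing i mean, assuming an openai-compatible local endpoint (ollama, llama.cpp server, whatever); the url, model name, prompt, and run count here are placeholders, not my actual test:

```python
# minimal drift harness sketch: same prompt, N runs, count distinct outputs.
# assumes an openai-compatible local endpoint; everything below is a placeholder.
import collections
import requests

URL = "http://localhost:11434/v1/chat/completions"  # placeholder: ollama's default
MODEL = "some-local-model"                          # placeholder model name
PROMPT = "the test prompt goes here"                # placeholder prompt
RUNS = 100

def run_once() -> str:
    # fire the same request every time and return the raw completion text
    resp = requests.post(URL, json={
        "model": MODEL,
        "messages": [{"role": "user", "content": PROMPT}],
        "temperature": 0,  # a knob to tune; 0 makes any residual drift more damning
    }, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

outputs = [run_once() for _ in range(RUNS)]
counts = collections.Counter(outputs)
print(f"{len(counts)} distinct outputs across {RUNS} runs")
for text, n in counts.most_common(5):
    print(f"{n:3d}x  {text[:60]!r}")
```

counting distinct outputs is the crudest possible drift metric, but its enough to show the problem: the models fall over before the measurement even matters.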