my stupid llm research is absofuckinglutely not going the way i was hoping.

i've spent like a fucking week trying to set up a testing harness to get local models to do the same test 100 times, aperture science style, to test the drift of their results

but 100% of the time, the model:
- emits tool calls incorrectly, so i see them
- ignores instructions
- falls into a loop
- says it's gonna do stuff, then... just doesn't
- intentionally deviates from instructions even when explicitly told not to
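
for reference, the harness idea is roughly this. it's a minimal sketch, not the actual setup: `run_model` is a hypothetical stand-in for whatever call hits the local model, and `measure_drift` / `drift_ratio` are made-up names for illustration. it runs the same prompt N times and tallies the distinct outputs, so drift shows up as a spread of counts.

```python
from collections import Counter
from typing import Callable


def measure_drift(run_model: Callable[[str], str], prompt: str, runs: int = 100) -> Counter:
    """Run the same prompt `runs` times and count each distinct (normalized) output."""
    counts: Counter = Counter()
    for _ in range(runs):
        # run_model is a hypothetical stand-in for the real local-model call
        output = run_model(prompt)
        # collapse whitespace and lowercase so trivial differences don't count as drift
        counts[" ".join(output.split()).lower()] += 1
    return counts


def drift_ratio(counts: Counter) -> float:
    """Fraction of runs that did not match the most common output (0.0 = no drift)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    most_common_count = counts.most_common(1)[0][1]
    return 1.0 - most_common_count / total
```

exact-match counting is the crudest possible drift measure. it only makes sense when the test has a short, fixed expected answer, and it says nothing about loops or botched tool calls, which would need their own checks.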

@Viss well, there’s your problem. This is very much the expected result for Aperture Science. Everyone’s worried about GLaDOS but you’re running Wheatley.
@kaced oh it's not wheatley, it's the spaaaaaaaaace orb
@Viss You’re absolutely right! This is a common problem when referencing Portal, there are too many good ones. Would you like me to suggest some more topical video game references? What’s your favorite thing about space? Mine is space.