This thing must have been a nightmare to code… I use LLMs nearly daily for work, and their tendency to generate garbage is so strong that I have to double-check every answer and restate my rules every time. Their memory isn’t worth shit for more complex tasks.
Did you try Opus 4.5? You still need to review the output etc., but it’s way, way better than GPT-4.1. It also costs a lot more…
I normally use Sonnet 4.5, and Haiku for quick, simple or repetitive tasks; Opus eats away all my tickets in days. If I turned off extra tokens I would go bankrupt.