Vague LLM failure? Make a reproducible mini‑case: lock model + params, dump system+user messages and raw response, then delta‑debug the prompt (binary‑remove chunks/examples/lines) until you isolate the minimal trigger. Turn that into an automated regression test.