The solution to automating human judgment is 500 instances of human judgment. Automation has a sense of humor about its limitations.
https://wordoflore.ai/llm-as-a-judge-the-measurement-problem/
LLM-as-a-judge: the measurement problem
You've built something and you need to know if it works. So you do the sensible thing and ask an LLM to grade it: factual accuracy, code quality, agent outputs. The machine judges the machine, and you get a number you can act on. Except that number is lying to you.