It's not a flaw in the test. Nor is it an error of perception.
After Searle's Chinese Room thought experiment, "AI" was split into two categories: "weak AI" and "strong AI." Weak AI merely behaves intelligently (it can pass the Turing test), while strong AI is actually self-aware. We have no meaningful test for strong AI at the moment.
This line of argument overlooks the useful question. People confuse weak AI with strong AI, and current LLMs with weak AI. Since strong AI would be universally useful, the reasoning goes, so must LLMs be. But it is perfectly possible to fall below weak AI and still be useful: quicksort, despite being very dumb, is very useful. It just means that LLMs, like weak AI, are not universally useful.
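To make the quicksort point concrete, here is a minimal sketch (Python chosen for brevity; the function name and example input are purely illustrative): the algorithm understands nothing, yet it reliably solves the one problem it was built for.

    def quicksort(items):
        # No intelligence involved: just deterministic partitioning and recursion.
        if len(items) <= 1:
            return items
        pivot, rest = items[0], items[1:]
        smaller = [x for x in rest if x < pivot]
        larger = [x for x in rest if x >= pivot]
        return quicksort(smaller) + [pivot] + quicksort(larger)

    print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))  # -> [1, 1, 2, 3, 4, 5, 6, 9]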
The discussion should not be about whether AI is useful, or whether LLMs are weak or strong AI, but about which cases LLMs are useful for. They have already proven themselves excellent at transcribing spoken language and pretty good at translation, but they also have serious limits (and costs) that should not be ignored.