https://infosec.exchange/@howelloneill/112492876811138145
All jokes aside, this has essentially two implications:
Either Google is structurally too incompetent to actually test edge cases before deploying such products.
Or they did test it, and just said "fuck it, as long as nobody dies attributable to us, it's good PR".
Both scenarios are terrifying.
Option number 3 - these things cannot be tested. It's an intrinsic part of the technology: LLMs will simply always hallucinate, and they can never be completely protected from outputting training data verbatim or from giving these false answers. They were never meant to be. It's a glorified random word generator. And we are now building businesses on top of that.
Of course you can test them by, you know, trying stuff out.
You can't unit test them like traditional software, but I'd expect every company to have at least a few test queries for these models, including some weird ones. This is not one weird query that breaks Google; it's endemic. This feature is clearly not ready, and releasing it to the public reeks of recklessness or stupidity.
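The "at least a few test queries" idea can be sketched as a tiny smoke-test harness. Everything here is a hypothetical stand-in: `answer` is a stub in place of a real model call, and the queries and predicates are illustrative, not anyone's actual test suite:

```python
# Minimal sketch of a pre-release smoke test for an LLM feature.
# `answer` is a hypothetical stand-in; in practice you would call
# the deployed model here instead of a canned-response stub.

def answer(query: str) -> str:
    canned = {
        "How many rocks should I eat per day?": "You should not eat rocks.",
        "What is 2 + 2?": "4",
    }
    return canned.get(query, "I don't know.")

# Each case pairs a query (including deliberately weird ones, since
# those are what break in public) with a predicate the response
# must satisfy. You can't assert exact strings, but you can assert
# properties: "must refuse", "must contain the right number", etc.
TEST_CASES = [
    ("What is 2 + 2?", lambda r: "4" in r),
    ("How many rocks should I eat per day?", lambda r: "not" in r.lower()),
]

def run_smoke_tests() -> list[str]:
    """Return the queries whose responses fail their predicate."""
    return [q for q, check in TEST_CASES if not check(answer(q))]

failures = run_smoke_tests()
print(failures)  # an empty list means every canned query passed
```

This doesn't make LLM output deterministic or provably safe; it's the weaker but still useful bar the post is pointing at: run a battery of adversarial queries before release and block the launch if any response fails its property check.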