I think people really underestimate how fragile LLMs are for auto-translation. You can put complete garbage into it where none of the words are real words and still get out plausible-sounding "translations" just because the LLM sees it as "close" to a real sentence, and then translates what it thinks that "close" sentence is based, once again, on what seems "close".
The whole benchmarking approach really does not help with this since benchmarks rarely include testing for failures. You need to test that garbage-in is recognized as garbage, otherwise you get garbage-out too.
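To make the "test that garbage-in is recognized as garbage" idea concrete, here is a minimal sketch of a negative test in Python. Everything in it is hypothetical: `KNOWN_WORDS` is a toy stand-in for real language identification, `always_confident` mimics a system that never refuses, and `guarded_translate` is just one naive way a refusal could be bolted on. It is not how any real translation service works.

```python
from typing import Callable, Optional

# Hypothetical toy vocabulary; a real system would use proper language ID.
KNOWN_WORDS = {"il", "gatto", "dorme", "sul", "divano"}


def known_word_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that appear in KNOWN_WORDS."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in KNOWN_WORDS for t in tokens) / len(tokens)


def always_confident(text: str) -> Optional[str]:
    """Stand-in for a translator that always produces fluent output."""
    return "The cat sleeps on the sofa."


def guarded_translate(
    text: str,
    translate: Callable[[str], Optional[str]],
    min_ratio: float = 0.5,  # assumed cutoff, chosen arbitrarily
) -> Optional[str]:
    """Return None (a refusal) when the input does not look like real words."""
    if known_word_ratio(text) < min_ratio:
        return None
    return translate(text)


def rejects_garbage(system: Callable[[str], Optional[str]]) -> bool:
    """Negative test: the system must refuse every made-up sentence."""
    garbage = ["blorf wixtel nub quandrizzle", "zzgh prrt oolamba"]
    return all(system(g) is None for g in garbage)
```

Running `rejects_garbage(always_confident)` fails the check, while wrapping the same translator in `guarded_translate` passes it. The point is only that a benchmark needs cases where the correct answer is "no translation".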
@endrift @joe that reminds me of this recent example i encountered where i came across a quote in someone's bio and wanted to know what it meant. google translate confidently decided it was italian and gave this response. i dont know italian but was pretty sure this looked nothing like italian so ended up looking harder and found that its a quote of a made up language from a song in a game.
if u turn off gemini then it more sensibly determines that it cant find a translation, but its on now by default
This is a fundamental problem with the "solve everything" model.
Just from a high-level perspective, how do you test "Everything"? And the follow-up question: how do you regression test "Everything"?
Folks have been developing AI for decades, and testing models meant to solve a single narrow, hard problem already requires heroic amounts of work. Testing "Everything" is basically impossible.
@endrift Not that previous approaches did any better though. Google Translate has been doing stupid stuff since its inception and it has never had a "this doesn't make any sense" flag.
The main difference is LLMs are more likely to come up with grammatically correct, plausible sounding text instead of something clearly broken, when given clearly broken input.
@endrift I honestly don't know why they didn't have, like, a confidence flag in old models that could actually refuse to translate or warn.
"Standard" LLMs (the text completion kind), if that's what they're using now, probably can't implement that reliably, but you can probably architect a translation model that can (I forget what it's called but I think there's a model architecture better suited to translation).
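A rough sketch of what such a confidence flag could look like, assuming the model exposes per-token log-probabilities for its output (an assumption; many hosted services do not). The names `TranslationResult`, `sequence_confidence`, and `with_confidence_flag` and the 0.5 threshold are made up for illustration. Note the caveat from this thread: these scores mostly measure how fluent the output is, so this only helps if they actually drop on nonsense input.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TranslationResult:
    text: Optional[str]  # None when the system refuses to translate
    confidence: float    # geometric mean of token probabilities
    refused: bool


def sequence_confidence(token_logprobs: List[float]) -> float:
    """Geometric mean of token probabilities, computed from log-probs."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def with_confidence_flag(
    text: str,
    token_logprobs: List[float],
    threshold: float = 0.5,  # assumed cutoff; would need tuning on real data
) -> TranslationResult:
    """Refuse (text=None) when the averaged confidence is below threshold."""
    conf = sequence_confidence(token_logprobs)
    if conf < threshold:
        return TranslationResult(text=None, confidence=conf, refused=True)
    return TranslationResult(text=text, confidence=conf, refused=False)
```

The length normalization (dividing by the token count) is there so long sentences are not refused just for being long.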
@lina @endrift
I'd like it if machine translations added (?) markers and footnotes to explain areas of low confidence in the translation. That'd help the person using them to know they should look for more context, and deepen their understanding.
Answers without showing the working are just *much less useful*, in any field of study. Other machine translation methods could show their working. LLMs can't, there isn't really any.
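For what the (?) markers and footnotes might look like in practice, here is a small sketch. It assumes the translation system can hand back a per-word confidence and a list of alternatives, which is a big assumption; the `Token` type, the `annotate` function, and the 0.6 threshold are all invented for this example.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    word: str
    prob: float              # assumed per-word confidence in [0, 1]
    alternatives: List[str]  # assumed alternative renderings, may be empty


def annotate(tokens: List[Token], threshold: float = 0.6) -> Tuple[str, List[str]]:
    """Mark low-confidence words with (?) and collect numbered footnotes."""
    words: List[str] = []
    notes: List[str] = []
    for tok in tokens:
        if tok.prob < threshold:
            number = len(notes) + 1
            alts = ", ".join(tok.alternatives) if tok.alternatives else "none"
            notes.append(f"[{number}] '{tok.word}' is uncertain; alternatives: {alts}")
            words.append(f"{tok.word}(?)[{number}]")
        else:
            words.append(tok.word)
    return " ".join(words), notes
```

Given words `the`, `bank` (confidence 0.4, alternatives `shore` and `bench`), `is`, `closed`, this produces the text `the bank(?)[1] is closed` plus one footnote listing the alternatives.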
@petealexharris @endrift LLMs do actually have the ability to show alternate terms/synonyms, that's like the one bit of data you can get out of them relatively easily.
But that doesn't really help with overall confidence in the translation, just identifying word alternatives.
@endrift Yeah it's kind of always been a thing. See also the endless posts on Reddit of people getting ridiculous Japanese/Chinese tattoos that don't make any sense ^^;;
Of all things I think better MT is important for society and something LLMs are actually arguably good at (though there's still plenty of room for improvement), so I'm reluctant to clown on them for this use case. Plus I'm pretty sure MT-optimized models don't need to be of the huge energy-guzzling variety. And translation, *in general*, doesn't have significant copyright concerns.
Of course, they in no way invalidate the need for human translators, but that market had already been going downhill for decades before LLMs because... people simply don't care, and primitive MT was already good enough for them not to care.
I think with LLMs it's important to focus on the ways in which they can be uniquely dangerous and wasteful. In just the sense of "they're a crappy machine replacement for something humans do better" they aren't unique and we've gone through such technologies before.
@lina @endrift
*syntactically correct
Yeah, never failing to give an answer destroys confidence that *any* apparently suitable translation came from what was in the source text. If an output can come from literally nowhere, how do we know when that's happening as part of translating real input? The algorithm is the same in both cases.
@petealexharris @endrift Edited (whoops, that could certainly be read wrong).
But yeah, machine translation has always been like this, so this is nothing new.
This does not make MT (LLM or not) useless, it's just not a crystal ball that works in every case. You need to either be okay with the error rate, or have enough familiarity with the source language to at least be able to catch glaring errors or sense when something is off (and obviously you should not use it to produce professional output without a professional translator in the loop).
@endrift this is the problem with LLMs for transcription too. They do ok, but they MAKE SHIT UP. A more specific transformer-based model does a great job! But the chatbot is more convenient.
I'm disconcerted they're putting the LLM "I know what you mean" thing into Google Translate, the original showcase for transformers. I mean, it's obvious they think that's helpful. But still, urgh.
My favourite example was actually you! A post containing pivot-to-ai.com was translated (I can't remember the source language) and it decided to replace the domain name with 'pineapple.com'.
It wasn't allowed to touch the HTML, so this ended up with a link that showed 'pineapple.com' but went to 'pivot-to-ai.com'.
I strongly suspect that there are some neat ways of sneaking malicious links past existing email filters that rely on this.
@grumpybozo @davidgerard @endrift
Current spam filters notice when the link target doesn’t match the text. The attack I’m proposing is where they do match as they go through the filter, but then local translation makes the target the victim sees appear innocuous.
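A minimal sketch of the check that would have to run after translation, on the HTML the reader actually sees, to catch this. It uses only the Python standard library (`html.parser`, `urllib.parse`); the `LinkCollector` class, the `mismatched_links` function, and the simple domain regex are assumptions made for illustration, not what any real mail filter does.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.parse import urlparse

# Naive pattern for "text that looks like a domain name" (an assumption).
DOMAIN_RE = re.compile(r"\b((?:[a-z0-9-]+\.)+[a-z]{2,})\b", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collects (href, visible text) pairs for every <a href=...> element."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def mismatched_links(html: str) -> List[Tuple[str, str]]:
    """Return links whose visible text names a different domain than the href."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    bad = []
    for href, text in parser.links:
        host = (urlparse(href).hostname or "").lower()
        match = DOMAIN_RE.search(text)
        if not host or not match:
            continue
        shown = match.group(1).lower()
        # Allow subdomains of the shown domain, e.g. www.example.com.
        if host != shown and not host.endswith("." + shown):
            bad.append((href, text))
    return bad
```

On the original post this returns nothing, but on the translated version, where the link text became `pineapple.com` and the target stayed `pivot-to-ai.com`, it flags the link. A filter that only sees the pre-translation message never gets the chance.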
@endrift
I see
I played with it a little bit, trying to stop at different steps, revert, retry, change the end. The translation self-corrects immediately if the ending guess is wrong.
Could it be that the system has very high confidence that 1) your sentence is not finished yet, but you will finish it, and 2) the possible endings are so close to what it expects... so in that position, it's proactive?
Not saying it's "THE good" choice. More like there is a trade-off between fast and accurate.