@baldur @wikiyu Color me really surprised; paragon and derivatives (at least for German) were definitely worse; I remember the (2008-ish?) surprise when ANN-based translations started to achieve higher rankings than purely Bayes-based/statistical methods.
(Oh, and I do personally remember things like the Babylon spyware thing, which wasn't really good. IBM Watson didn't work as well as Google Translate when that came out, for German<->English at least. I had played with Apertium in its earlier …
@funkylab @wikiyu So, around the time the LLM bubble first began, there was a sharp, noticeable decline in the performance of publicly available translation services (i.e. Google Translate and the like) when it came to translating most Nordic languages, and it's generally gotten worse, not better, over time. It's become a running joke.
An important note here is that there is much, much less text available for these languages in machine-readable form than even for German or French.
@UkeleleEric @baldur @wikiyu I don't know whether that's a good example, because the difference is clear even devoid of context, PLUS existing LLMs have no problem with that difference at all. The two phrases are only similar to a human reader. You're projecting mistakes that are easy for humans to make onto machine translation! (See attached DeepL.)
I'm also not sure rule-based vs. Bayesian translation makes much difference when it comes to sarcasm. That's sentiment detection!
@abucci @wikiyu @baldur I feel like we're arguing based on perceptions here – I certainly am, and can only vaguely remember the press echo when neural (not LLM) translators came out. So I might need to shut up here and say: I don't have enough data to base my claims on. Do you?
Do we have any qualitative analysis in the literature that I could read? So far we've got four people claiming things; that's not a great discussion :)
@qgustavor @wikiyu That's what I'd argue, too, but theory and reality, especially the reality of actually available implementations, might diverge here.
Thing is, @baldur is actually someone from the field, so his word does weigh heavily with me, even if it doesn't reflect my own experience with translation quality.
(EDIT: way->weigh. Human in-mind translations are not perfect, either :D)
So, AFAICT, LLMs are in general sensitive to the size of the training data set. Only a few languages have a collection of machine-readable texts big enough for these models.
IIRC, in the pre-LLM days they used to compensate for this with language-specific adjustments.
Once everybody began to migrate to approaches that require large data sets, performance on all of those tasks (translation, summarisation, correction) began to suffer, especially for smaller languages.
Though it should be noted that in a lot of third-party, neutral testing, specialised models outperform LLMs on many language tasks such as summarisation, even in English. And even where they underperform, they're at least in the same ballpark, while costing orders of magnitude less.