I guess this deserves to be posted on a regular cadence for the benefit of anyone who hasn't seen it before: https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_are_switching_written_numbers_when_scanning
Xerox scanners/photocopiers randomly alter numbers in scanned documents

Please see the “condensed time line” section (the next one) for a time line of how the Xerox saga unfolded. It shows, for example, that I did not push the matter to the public right away, but gave Xerox a lot of time before doing so.

https://www.youtube.com/embed/c0O6UXrOZJo

D. Kriesel
@pervognsen I was thinking about this on the LLM-based OCR models thread. LLM-based OCR models have a tendency to mistranscribe things as different sentences that make sense in context, instead of garbling characters like one might expect from traditional OCR, which feels like a very similar failure mode. Unfortunately in this case the "world knowledge" that makes them good at OCR is the source of the problem, so there isn't a simple "don't use JBIG2" style solution.
@dougall @pervognsen Do you have examples of such mistranscriptions?

@wolfpld @pervognsen I don't speak Polish, so you can tell me how plausible they are, but on your handwriting sample:
* "Nie ma innych węzłów!" (Qwen2.5-VL 7B)
* "Nie ma już więcej!" (Gemma3 12B)
* "Nice use of punctuation!" (DeepSeek-OCR 3B)

Qwen3-VL 8B managed to replace "tracy::" with "tray::" on every single line.

"MemAllocCallstack" -> "MemoryAllocCallstack" (MiniCPM-V 4.5)
"ProfilerData::ProfilerData()" -> "Profiler::Data()" (Gemma3 12B)
"ThreadData" -> "PinnedData" (DeepSeek-OCR 3B)

@wolfpld @pervognsen DeepSeek-OCR 3B also hallucinated "llvm.pl.so.2", it was clearly the worst I tested.

Some counter-examples, which Opus tells me are wrong, but garbled:
* "Nie ma innego wątków!" (Qwen3-VL 8B)
* "Nie ma imnych wetków!" (MiniCPM-V 4.5)

Bigger models are obviously *way* better, but I suspect you would see similar failure modes on borderline-legible text.

RE: https://mastodon.gamedev.place/@wolfpld/116088970554232592

@wolfpld @pervognsen For anyone reading along, the expected text is "Nie ma innych wątków!", and the test image and thread are here:

https://mastodon.social/@wolfpld@mastodon.gamedev.place/116088970688566418

@wolfpld @pervognsen Heh, I only just noticed the "^" "Start" insertion myself. Your huge Qwen3.5 nailed that. It's maybe also worth noting that none of the models I tested produced anything related to that. No extra "Start" on its own line, no "^", nothing.

@wolfpld @pervognsen Oh, and I see "SystemTracing" -> "SystemTraining" here. They're surprisingly hard to find by eye.

https://mastodon.social/@pervognsen/116098703237535154

@dougall @pervognsen Good examples!

Now, the philosophical question is: do these mistakes mean these new models shouldn't be used? People were quite happy to use OCR for things, even though half the time you would typically get output like my "{-Diata () |".

I can see a human reading that "SystemTracing" as "SystemTraining" very easily. Example: without checking on your system, is frame #15 libvpl, or librpl?

The model actually reasoned about this a lot, as well as the "Start" insertion.

@wolfpld @dougall Doesn't that apply to the JBIG2 example as well? The false confidence it engenders is the problem in both cases, and people were happy using those Xerox scanners and JBIG2 compressors too, until they weren't. More traditional ML models seem better equipped to communicate confidence/probability estimates with their results than LLM-based models.
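To make the contrast concrete, here is a toy sketch (not any particular OCR engine's API; the scores, labels, and threshold are made up for illustration) of how a traditional per-character classifier can surface its own uncertainty: each position gets a softmax probability, and low-confidence characters are flagged for human review instead of being silently committed.

```python
import math

def softmax(scores):
    """Turn raw classifier scores into probabilities."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def transcribe(char_scores, labels, review_below=0.9):
    """For each character position, pick the argmax label, report its
    probability, and flag it for review when confidence is below threshold."""
    result = []
    for scores in char_scores:
        probs = softmax(scores)
        best = max(range(len(labels)), key=lambda i: probs[i])
        result.append((labels[best], probs[best], probs[best] < review_below))
    return result

# A confident "6" vs. an ambiguous 6-or-8 that gets flagged for a human.
out = transcribe([[5.0, 0.1], [1.1, 1.0]], labels=["6", "8"])
```

An end-to-end LLM transcription has no such per-character hook: it hands back fluent text with the uncertainty already smoothed over.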

@pervognsen @dougall Not at all. Copying documents was already a solved problem. No computer was even needed, you could just project the scanned surface on the drum to get it replicated. Like laser printing, but with bright light, mirrors and lenses.

JBIG2 seems like an unwanted bean-counter solution. It was apparently used in stationary machines, where payload size was not an issue at all. Maybe halving fax transmission time would make sense, but doing it just to put less RAM in the machine?

@pervognsen @dougall Anyways, the failure is spectacularly bad, and if you wanted to maximize cost savings, you could just scan everything as white pages.

- Much easier to implement.
- Gives you output that is just as useful.
- Doesn't give users false confidence that the copy is exact.

Now, where's my paycheck?

@pervognsen @dougall Going back to the LLM OCR question, I stand by what I wrote in Tracy release notes.

We made a computer program that is supposed to be simulating humans, and suddenly the expectation is that the computer will always give correct answers? No, what we got instead is human-like behavior.

@pervognsen @dougall The attached image is a scan of some notes my friend made when we were in high school. This is not a joke, this is how he made his notes, and the absurdity of it is why I scanned it.

*This* is the problem the LLM OCRs are solving. Have fun trying to decipher what's going on around here.

Spoiler alert: there's "głowa, tułów i ogon" in there – "head, body and tail".

@pervognsen @dougall Per posted a fantastic little test case here: https://mastodon.social/@pervognsen/116145896478562434

This is how the big Qwen3.5 manages. The prompt is "Produce a formatted output that reflects the image contents."

Now, I was not able to decipher what the "(??? + deletes)" contained. I read "keep" as "kap", and assumed it was a typo for "cap". The "(say)" I read as "(sg)" or "(sq)".

The model did better than me in all these cases. I wonder what that says in the context of the discussion we're having.

@wolfpld @pervognsen @dougall it's interesting in general, the cursive there is very readable by cursive standards, but obviously if you're not a native speaker, or if you haven't read cursive before, it's going to be a mess. The issue, as always, is how you build tools whose limits people understand, and which don't make things practically worse by ignoring the impact on the user. Safety 101 stuff - the outcome is what matters.
@wolfpld @pervognsen @dougall realistically speaking nobody's going to put confidence interval style things on their ocr anyway because it wouldn't sell as well as the hypothetical magical tool. (And the people most likely to use it are also those least likely able to validate the output) For anything which matters you'd want a person who understands the context as well as the writing, but those are hard to come by oftentimes.
@wolfpld @pervognsen @dougall practically this style of ocr seems very valuable for indexing first and foremost - an already naturally lossy process. The problem with transcription and "summarization" is that it often ends up consumed in a way completely divorced from the source material, and I just have zero trust that anybody even read the original contents when dealing with machine-translated output.

@dotstdy @pervognsen @dougall These are not behaviors I have observed, in general.

You said you try to stay away from AI. Have you looked at how the current solutions (not: models) work, in depth? I mean, it's easy to look at a random model a year or two ago and stay in that mindset, but things do advance. Despite the "just try another model, just try another random seed bro" meme, there are true advancements.

@wolfpld I don't use them at all, but I read a lot of the code and output that people generate with the latest technology (chat transcripts, commit history, etc) and I'm continually unimpressed. In my experience people are so excited by the fact that it works at all they don't do much in the way of comparison with anything other than other LLMs. (And of course there's the usual tech thing of "it's just tech, don't worry about its impacts", which I find pretty odious at the best of times)
@wolfpld broadly in this particular domain it's quite interesting though - it's very valuable to have the ability as a user to translate a document quickly even if you're not able to really verify it. So ocr in that context is super valuable to fill gaps in a pinch. However ocr and translation for "data at rest" has *much* higher requirements, since you're creating something which will potentially replace the original. That distinction in use case implies a distinction in tooling as well.
@wolfpld also there's of course a huge amount of OCR systems in deployment across the world, including in domains where correctness is important - e.g. ocr for mail routing, internal production line stuff. But in situations like that you have use-case specific advantages, like e.g. a database of all valid addresses. So fallback on weird input can be relatively graceful, or dropped to human intervention.
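The "database of all valid addresses" point can be sketched in a few lines (a hypothetical example; the addresses, cutoff, and `route` helper are invented, and real mail-routing systems are far more involved): snap the OCR output to a known-valid address when there is exactly one close match, and escalate to a human otherwise.

```python
import difflib

# Hypothetical stand-in for a real database of deliverable addresses.
VALID_ADDRESSES = {
    "12 Oak Street",
    "12 Oak Avenue",
    "7 Elm Road",
}

def route(ocr_text, cutoff=0.8):
    """Snap OCR output to a known-valid address when there is a single
    close match; return None (escalate to a human) when the text is
    unreadable or matches more than one candidate."""
    matches = difflib.get_close_matches(ocr_text, VALID_ADDRESSES,
                                        n=2, cutoff=cutoff)
    if len(matches) == 1:
        return matches[0]
    return None  # ambiguous or garbled: human intervention
```

The domain constraint is doing the heavy lifting here: a misread like "12 Oak Stret" degrades gracefully, and anything the constraint can't disambiguate falls back to a person instead of a confident guess.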