I guess this deserves to be posted on a regular cadence for the benefit of anyone who hasn't seen it before: https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_are_switching_written_numbers_when_scanning
Xerox scanners/photocopiers randomly alter numbers in scanned documents

Please see the “condensed time line” section (the next one) for a time line of how the Xerox saga unfolded. It for example depicts that I did not push the thing to the public right away, but gave Xerox a lot of time before I did so.

D. Kriesel
@pervognsen I was thinking about this on the thread about LLM-based OCR models. They have a tendency to mistranscribe things as different sentences that make sense in context, rather than garbling characters like one might expect from traditional OCR, which feels like a very similar failure mode. Unfortunately, in this case the "world knowledge" that makes them good at OCR is also the source of the problem, so there isn't a simple "don't use JBIG2" style solution.
@dougall @pervognsen Do you have examples of such mistranscriptions?

@wolfpld @pervognsen I don't speak Polish, so you can tell me how plausible they are, but on your handwriting sample:
* "Nie ma innych węzłów!" (Qwen2.5-VL 7B; "There are no other nodes!")
* "Nie ma już więcej!" (Gemma3 12B; "There is no more!")
* "Nice use of punctuation!" (DeepSeek-OCR 3B)

Qwen3-VL 8B managed to replace "tracy::" with "tray::" on every single line.

"MemAllocCallstack" -> "MemoryAllocCallstack" (MiniCPM-V 4.5)
"ProfilerData::ProfilerData()" -> "Profiler::Data()" (Gemma3 12B)
"ThreadData" -> "PinnedData" (DeepSeek-OCR 3B)

@wolfpld @pervognsen DeepSeek-OCR 3B also hallucinated "llvm.pl.so.2", it was clearly the worst I tested.

Some counter-examples, which Opus tells me are wrong, but garbled rather than plausible:
* "Nie ma innego wątków!" (Qwen3-VL 8B)
* "Nie ma imnych wetków!" (MiniCPM-V 4.5)

Bigger models are obviously *way* better, but I suspect you would see similar failure modes on borderline-legible text.

RE: https://mastodon.gamedev.place/@wolfpld/116088970554232592

@wolfpld @pervognsen For anyone reading along, the expected text is "Nie ma innych wątków!" ("There are no other threads!"), and the test image and thread are here:

https://mastodon.social/@wolfpld@mastodon.gamedev.place/116088970688566418
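One way to make the "plausible but wrong" vs. "garbled" distinction concrete is edit distance against the expected text. A minimal sketch in Python (the strings are taken from this thread; the function is a standard Levenshtein implementation, not anything the models themselves use):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

expected = "Nie ma innych wątków!"
outputs = {
    "Qwen2.5-VL 7B": "Nie ma innych węzłów!",   # plausible Polish, wrong words
    "Gemma3 12B": "Nie ma już więcej!",          # plausible Polish, wrong words
    "MiniCPM-V 4.5": "Nie ma imnych wetków!",    # garbled characters
}
for model, text in outputs.items():
    print(model, levenshtein(expected, text))
```

Note that edit distance alone can't tell you which errors are dangerous: "węzłów" for "wątków" is only three substitutions away but silently changes the meaning, while a garbled "wetków" is obvious to a reader.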

@wolfpld @pervognsen Heh, I only just noticed the "^" "Start" insertion myself. Your huge Qwen3.5 nailed that. It's maybe also worth noting that none of the models I tested produced anything related to that. No extra "Start" on its own line, no "^", nothing.

@wolfpld @pervognsen Oh, and I see "SystemTracing" -> "SystemTraining" here. They're surprisingly hard to find by eye.

https://mastodon.social/@pervognsen/116098703237535154

@dougall @pervognsen Good examples!

Now, the philosophical question is: do these mistakes mean these new models shouldn't be used? People were quite happy to use OCR for things, even though you would typically get things like my "{-Diata () |" half of the time.

I can see a human reading that "SystemTracing" as "SystemTraining" very easily. Example: without checking on your system, is frame #15 libvpl, or librpl?

The model actually reasoned about this a lot, as well as the "Start" insertion.

@wolfpld @dougall Doesn't that apply to the JBIG2 example as well? The false confidence it engenders is the problem in both cases, and people were happy using those Xerox scanners and JBIG2 compressors too, until they weren't. More traditional ML models also seem better equipped to communicate confidence/probability estimates with their results than LLM-based models.
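The point about confidence estimates can be illustrated with a toy sketch: a traditional OCR decoder emits a probability distribution per character position, so a caller can flag uncertain characters (like the libvpl/librpl ambiguity above) instead of silently committing to a plausible guess. The probability values below are invented for illustration, not from any real OCR engine:

```python
# Invented per-character distributions, illustrating the libvpl/librpl case.
per_char = [
    {"l": 0.95},
    {"i": 0.90, "l": 0.05},
    {"b": 0.97},
    {"v": 0.48, "r": 0.45},   # ambiguous: libvpl vs librpl
    {"p": 0.93},
    {"l": 0.96},
]

def decode(dists, threshold=0.6):
    """Pick the argmax character per position, marking anything below threshold."""
    out = []
    for dist in dists:
        ch, p = max(dist.items(), key=lambda kv: kv[1])
        out.append(ch if p >= threshold else f"[{ch}?]")
    return "".join(out)

print(decode(per_char))  # the ambiguous fourth character gets flagged
```

An LLM-based model, by contrast, emits the finished sentence, so the "v vs. r" uncertainty is resolved internally and never surfaces to the user.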

@pervognsen @wolfpld Yeah, but the LLMs can do something that can't be done without them. They're good for indexing images for search, rather than replacing the original copies with something more compact. Or providing a guess at handwriting in historical documents with a human in the loop.

(I'm not sure what you're actually using OCR for; mostly I'm pasting text from screenshots, where I'd prefer Apple's OCR. It works very well, and its errors are unlikely to mislead me.)

@pervognsen @wolfpld I think the main thing is just to be aware of the risks, and to not rely on them in "high-risk domains":
https://apnews.com/article/ai-artificial-intelligence-health-business-90020cdf5fa16c79ca2e5b6c4c9bbb14
Researchers say AI transcription tool used in hospitals invents things no one ever said

Whisper is a popular transcription tool powered by artificial intelligence, but it has a major flaw. It makes things up that were never said. Whisper was created by OpenAI. It's being used in many industries worldwide to translate and transcribe interviews, generate text in popular consumer technologies and create subtitles for videos. OpenAI has promoted Whisper as having near “human level robustness and accuracy." But more than a dozen computer scientists and software developers tell The Associated Press that isn’t always the case and that it's prone to making up chunks of text and even entire sentences. An OpenAI spokesperson says the company studies how to reduce that and updates its models incorporating feedback received.

AP News