fucking ocr

For reference, I'm running #tesseract_ocr on the subtitles of an episode of SG-1.

(For those unaware, because Unicode wasn't really a widely-deployed thing when the DVD format was standardized, but the people standardizing it still wanted DVDs to be able to display subtitles in every language, the subtitles on a DVD aren't encoded as text, they're encoded as images. This is why different DVDs have subtitles in different fonts. Blu-Rays kept this decision, because I guess they didn't want to ship a font with 100% Unicode coverage on every Blu-Ray player. I wrote a script that takes an MKV file with PGS subtitles and spits out a folder full of PNGs.)
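(For the curious: a PGS stream, the .sup format those Blu-Ray subtitles live in, is just a flat sequence of segments, each with a 13-byte header. This isn't my actual script, just a minimal sketch of walking the segment headers in Python; the bitmap itself lives RLE-compressed inside the ODS segments.)

```python
import struct

SEGMENT_TYPES = {
    0x14: "PDS",  # Palette Definition Segment
    0x15: "ODS",  # Object Definition Segment (the RLE-compressed bitmap)
    0x16: "PCS",  # Presentation Composition Segment
    0x17: "WDS",  # Window Definition Segment
    0x80: "END",  # End of display set
}

def iter_pgs_segments(data: bytes):
    """Yield (pts_seconds, segment_name, payload) for each segment in a .sup stream."""
    pos = 0
    while pos + 13 <= len(data):
        # 13-byte header: "PG" magic, 32-bit PTS and DTS (90 kHz clock),
        # 8-bit segment type, 16-bit payload size; all big-endian.
        magic, pts, dts, seg_type, size = struct.unpack_from(">2sIIBH", data, pos)
        if magic != b"PG":
            raise ValueError(f"bad segment magic at offset {pos}")
        payload = data[pos + 13 : pos + 13 + size]
        yield pts / 90000.0, SEGMENT_TYPES.get(seg_type, hex(seg_type)), payload
        pos += 13 + size
```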

Here are the files it's looking at. They're bright white font-rendered text on a transparent background.

Why does #OCR struggle with this?

A while ago I saw a Tumblr post of someone trying to transcribe a screenshot of a data: URL, and they remarked that traditional OCR programs tend to struggle with this for some reason. I remember they wrote their own OCR algorithm from scratch that knew the font a priori, compared the characters pixel for pixel, and simply failed unless they exactly matched.

I could do that. I would rather not do that, but I could do that.

It would also be complicated by the fact that I would need to extract the font from every single different subtitle file. And by the fact that the font in this particular subtitle file seems to support aligning characters not exactly on a pixel boundary.
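(For reference, the pixel-for-pixel idea boils down to something like this toy sketch — the templates here are made up for illustration, and subpixel alignment is exactly what would break it, since a glyph shifted half a pixel no longer matches its template exactly.)

```python
# Exact-match "OCR": a glyph either matches a known template pixel for pixel,
# or recognition fails. Bitmaps are tuples of strings ('#' = lit pixel),
# purely for illustration; a real version would use the extracted font.

KNOWN_GLYPHS = {  # hypothetical pre-extracted templates
    ("##", "# ", "##"): "E",
    ("##", "##", "# "): "P",
}

def match_glyph(bitmap):
    """Return the character whose template matches exactly, or None."""
    return KNOWN_GLYPHS.get(tuple(bitmap))
```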

@jhwgh1968 @nycki @argv_minus_one I have just discovered a program called SubtitleEdit that does exactly this by splitting an image into characters and then prompting the user for each unique character it identifies. It handles accents and italics *flawlessly.* I'm seriously impressed.

This might be the strat, gamers. I might not have to write any code at all.

Although I am tempted to pop the hood on that algorithm and more tightly integrate it with the program I'm making.
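(My guess at the shape of that algorithm — this is a reconstruction, not SubtitleEdit's actual code: segment the image into glyph bitmaps, then prompt the human exactly once per unique bitmap and reuse the answer everywhere else.)

```python
import hashlib

def transcribe(glyphs, ask_user, cache=None):
    """OCR by exact glyph identity: prompt once per unique bitmap, then reuse.

    glyphs   -- iterable of glyph bitmaps as bytes (already segmented)
    ask_user -- callback shown a bitmap, returns the character the human types
    cache    -- optional dict mapping bitmap hash -> character (persist this
                across files and the prompts get rarer and rarer)
    """
    cache = {} if cache is None else cache
    out = []
    for bitmap in glyphs:
        key = hashlib.sha256(bitmap).hexdigest()
        if key not in cache:
            cache[key] = ask_user(bitmap)  # human labels each new glyph once
        out.append(cache[key])
    return "".join(out)
```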

@AVincentInSpace

Well now, there's an interesting strategy.

Subtitle images are stored losslessly unless I'm mistaken, so yeah, every instance of a given glyph should be pixel-for-pixel identical.

OCR is designed to solve a different problem (reading text from scanned paper) which has very different requirements (recognizing glyphs despite sensor noise, printing imperfections, paper texture, differences in paper alignment, etc).

@jhwgh1968 @nycki

@argv_minus_one @AVincentInSpace @jhwgh1968 not sure why i got tagged, but this reminds me a lot of an article i just read about reverse-engineering kindle's drm to convert a pile of svgs back into a font.

https://blog.pixelmelt.dev/kindle-web-drm/

How I Reversed Amazon's Kindle Web Obfuscation Because Their App Sucked

As it turns out they don't actually want you to do this (and have some interesting ways to stop you)


@AVincentInSpace

I don't suppose it would help to re-color the images as black text on white background?
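(If anyone wants to try the same trick, the recolor pass could be sketched like this — composite each pixel onto black so the alpha mask becomes brightness, then invert, turning opaque white glyphs into black text on a white page. Dependency-free for illustration; in practice you'd do this with Pillow point operations.)

```python
def recolor(rows):
    """Turn white-on-transparent RGBA pixels into black-on-white grayscale.

    rows -- list of rows of (r, g, b, a) tuples, each channel 0-255.
    Returns rows of grayscale values: 0 = black text, 255 = white page.
    Antialiased edge pixels land on intermediate grays, which is what you want.
    """
    return [
        [255 - (((r + g + b) // 3) * a) // 255 for (r, g, b, a) in row]
        for row in rows
    ]
```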

@argv_minus_one i swear to god if this works

@AVincentInSpace

It's long-shot AF, but you may as well try. 🤷‍♂️

@argv_minus_one oh my god it worked perfectly

the joy i feel is greatly outweighed by disappointment

@argv_minus_one okay, not quite, there are still a couple of errors, like pipe characters instead of capital I's and missing spaces, but it's 99% there

if i just took @jhwgh1968's idea and postprocessed the output by replacing all pipe characters with I's (much as I might delude myself, they're never going to put a pipe character in a TV subtitle) and called it a day, this might be just about done

i'd still *rather* fix it on the OCR level just on principle but that sounds like Effort and this is more than good enough.
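(the "call it a day" version really is this small — the missing spaces would need a dictionary pass instead, which is where the Effort lives:)

```python
def postprocess(line: str) -> str:
    """Replace every pipe with a capital I.

    Rests on the assumption above: a literal '|' never appears in a TV
    subtitle, so any pipe Tesseract emits is a misread uppercase I.
    """
    return line.replace("|", "I")
```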

@AVincentInSpace

To be fair to Tesseract, the pipe and uppercase I are basically indistinguishable in most sans-serif fonts.

For example, you probably can't reliably distinguish these on Mastodon: Il| (uppercase india, lowercase lima, pipe)

Humans infer the correct character from context, and even that only works when the human is looking at a known word. I suppose Tesseract could do the same using a dictionary. 🤔
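(A toy version of that dictionary idea — not what Tesseract actually does, just a sketch: for each confusable position, try the candidate letters and keep the first spelling the dictionary recognizes.)

```python
from itertools import product

CONFUSABLE = "I|l"  # uppercase india, pipe, lowercase lima

def disambiguate(word, dictionary):
    """Resolve I/l/| confusions by dictionary lookup.

    Tries every combination of 'I' and 'l' at the ambiguous positions and
    returns the first candidate found in the dictionary (lowercased for
    lookup), else the word unchanged.
    """
    positions = [i for i, c in enumerate(word) if c in CONFUSABLE]
    for combo in product("Il", repeat=len(positions)):
        candidate = list(word)
        for pos, ch in zip(positions, combo):
            candidate[pos] = ch
        candidate = "".join(candidate)
        if candidate.lower() in dictionary:
            return candidate
    return word
```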

@jhwgh1968

@AVincentInSpace

That actually worked?! You're fing kidding me.