fucking ocr

For reference, I'm running #tesseract_ocr on the subtitles of an episode of SG-1.

(For those unaware: Unicode wasn't really a widely-deployed thing when the DVD format was standardized, but the people standardizing it still wanted DVDs to be able to display subtitles in every language. So the subtitles on a DVD aren't encoded as text, they're encoded as images. This is why different DVDs have subtitles in different fonts. Blu-Rays kept this decision, because I guess they didn't want to ship a font with 100% Unicode coverage on every Blu-Ray player. I wrote a script that takes an MKV file with PGS subtitles and spits out a folder full of PNGs.)

Here are the files it's looking at. They're bright white font-rendered text on a transparent background.

Why does #OCR struggle with this?

@AVincentInSpace

I don't suppose it would help to re-color the images as black text on white background?
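
Something like this is what I have in mind. It's only a rough sketch in plain Python, not anything tested on your files: the function name `to_black_on_white` is made up, and I'm assuming each pixel arrives as an `(r, g, b, a)` tuple of 0-255 values. It flattens each pixel onto black and then inverts, so bright opaque text comes out dark and the transparent background comes out white.

```python
def to_black_on_white(rgba_pixels):
    """Map RGBA pixels (bright text, transparent background) to 8-bit
    grayscale values with dark text on a white background.

    Hypothetical helper: assumes every pixel is an (r, g, b, a) tuple
    with channel values in the range 0-255.
    """
    out = []
    for r, g, b, a in rgba_pixels:
        # Integer approximation of perceived brightness (BT.601 weights).
        luma = (299 * r + 587 * g + 114 * b) // 1000
        # Flatten onto a black background: fully transparent pixels go to 0.
        on_black = luma * a // 255
        # Invert: bright text becomes dark, empty background becomes white.
        out.append(255 - on_black)
    return out


if __name__ == "__main__":
    sample = [
        (255, 255, 255, 255),  # opaque white text pixel
        (0, 0, 0, 0),          # fully transparent background pixel
        (255, 255, 255, 128),  # half-transparent anti-aliased edge
    ]
    print(to_black_on_white(sample))
```

Reading the PNGs in and writing the results back out would need an image library, which I've left out. A side effect of flattening onto black first is that any dark outline around the glyphs ends up white along with the background, which may or may not be what you want.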

@argv_minus_one i swear to god if this works

@AVincentInSpace

It's long-shot AF, but you may as well try. 🤷‍♂️

@argv_minus_one oh my god it worked perfectly

the joy i feel is greatly outweighed by disappointment

@AVincentInSpace

That actually worked?! You're f'ing kidding me.