
asmw

I have an interesting issue with the #tesseract #OCR command line tool on Ubuntu 24.04.

The tool detects text more reliably if I convert my JPG images to TIFF first.

Simply running ImageMagick's convert orig.jpg ocr.tiff improves the results reliably.

Anyone know why?


PhasonMatrix Feb 3

@asmw Very interesting. Have you tried experimenting with different page segmentation modes (PSMs)? Maybe Tesseract picks one over another based on the file extension? It might also be worth searching the code base (on GitHub) for mentions of file formats.


asmw Feb 3

@PhasonMatrix Yeah, I tried fixing PSM and DPI.

The legacy OCR engine is unavailable anyways.

The TIFF is generated from the JPG, so there can't be more information in the image.

So weird.
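
For anyone who wants to reproduce the comparison, here is a rough sketch of the setup described above: convert the JPG to TIFF with ImageMagick, then run tesseract on both files with the same pinned PSM and DPI. The helper names, the default values (PSM 6, 300 DPI), and the file names are assumptions for illustration, not the settings actually used in the thread. It also assumes the convert and tesseract binaries are on the PATH.

```python
import subprocess
from pathlib import Path


def convert_cmd(src, dst):
    """Build the ImageMagick command from the thread: convert orig.jpg ocr.tiff."""
    return ["convert", str(src), str(dst)]


def tesseract_cmd(image, psm=6, dpi=300):
    """Build a tesseract command with PSM and DPI pinned.

    The psm/dpi defaults are illustrative assumptions. Using "stdout" as the
    output base makes tesseract print the recognized text instead of writing
    a file.
    """
    return [
        "tesseract", str(image), "stdout",
        "--psm", str(psm),
        "--dpi", str(dpi),
    ]


def ocr(image, psm=6, dpi=300):
    """Run tesseract on one image and return the recognized text."""
    result = subprocess.run(
        tesseract_cmd(image, psm, dpi),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def compare(jpg, psm=6, dpi=300):
    """OCR the JPG, convert it to TIFF, OCR the TIFF, and return both texts."""
    jpg = Path(jpg)
    tiff = jpg.with_suffix(".tiff")
    subprocess.run(convert_cmd(jpg, tiff), check=True)
    return ocr(jpg, psm, dpi), ocr(tiff, psm, dpi)
```

Calling compare("orig.jpg") returns the two OCR outputs as a tuple, so they can be diffed directly. If the outputs still differ with both settings pinned, the difference comes from something other than PSM or DPI.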