Mastodawn

Benjamin Rosemann Jul 6, 2024

@[email protected] As far as I understand, you want to implement a PDF -> Text -> PDF workflow. Using plaintext as the intermediate format is problematic, as you may lose a lot of layout information.

For high-quality full text you may need a more sophisticated intermediate format like #PageXML or #AltoXML. But they also require a more sophisticated editing tool, such as #OCR4All.
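
To illustrate the point about lost layout, here is a minimal Python sketch. The ALTO fragment is hand-written for illustration (not real OCR output), and the helper names `words_with_boxes` and `plain_text` are made up for this example. It reads the `String` elements with their position attributes, then flattens them to plaintext, at which point the coordinates are gone.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO fragment (not output from a real OCR run).
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1">
            <String CONTENT="Hello" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="30"/>
            <SP/>
            <String CONTENT="world" HPOS="190" VPOS="200" WIDTH="85" HEIGHT="30"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

# Namespace of the sample above; real files may use a different ALTO version.
NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}


def words_with_boxes(alto_xml):
    """Return (text, hpos, vpos, width, height) for every ALTO String element."""
    root = ET.fromstring(alto_xml)
    return [
        (
            s.get("CONTENT"),
            int(s.get("HPOS")),
            int(s.get("VPOS")),
            int(s.get("WIDTH")),
            int(s.get("HEIGHT")),
        )
        for s in root.iterfind(".//a:String", NS)
    ]


def plain_text(alto_xml):
    """Flatten to plaintext: only the words survive, the boxes are dropped."""
    return " ".join(word[0] for word in words_with_boxes(alto_xml))


if __name__ == "__main__":
    print(words_with_boxes(ALTO_SAMPLE))  # words plus their page coordinates
    print(plain_text(ALTO_SAMPLE))        # words only
```

Once only the output of `plain_text` is kept, there is nothing left to place the words back on the page, which is why a Text -> PDF step cannot reproduce the original layout.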

Stefan Weil Jun 6, 2024

Just in time for #BiblioCON24, there is the new release 5.4.0 of #TesseractOCR, our standard solution for automated text recognition, and not only for #Zeitungsdigitalisierung (newspaper digitization). Tesseract can now also produce #PAGEXML and generates nicer PDF files.

Janne Mar 4, 2023

@einerseits Interesting project! I'm experimenting with #eScriptorium and #kraken #OCR for recognition of German Kurrent. Can you possibly shed light on your transcription process? And do you provide the OCR full text files in #ALTO or #PageXML as well or plan on doing so in the future?