Mastodawn

Comparing PDF files (in the (e)Lucubración capsule)

How to compare two PDF files

<gemini://itan.pollux.casa/elucubracion/comparando_dos_pdf.gmi>

#GemistaciónItan #PDF #comparar #GNU-Linux #DiffPDF #OCRmyPDF #CápsulaLucubración #geminiSpace #geminiProtocol #geminiEspacio
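
One way to compare two PDFs is to diff their extracted text. The sketch below is an assumption about how that could look, not necessarily the method the linked capsule describes: it assumes Poppler's `pdftotext` is installed, and the helper names `pdf_text` and `diff_text` are hypothetical.

```python
import difflib
import subprocess


def pdf_text(path: str) -> str:
    """Extract text with Poppler's pdftotext (assumed installed); "-" writes to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def diff_text(old: str, new: str,
              old_name: str = "old.pdf", new_name: str = "new.pdf") -> list[str]:
    """Return unified-diff lines between two extracted texts (empty if identical)."""
    return list(difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    ))
```

Calling `diff_text(pdf_text("a.pdf"), pdf_text("b.pdf"))` with two real files would list the changed lines. A scanned PDF without a text layer yields no text to diff, so it would need OCR first, for example with OCRmyPDF.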

Jeff Fortin T. (風の庭園のNekohayo) Mar 20

Hey, it turns out that GNOME's "Document Scanner" application (Simple Scan) actually _can_ do Optical Character Recognition, running a post-processing script. It's just really, really, really not obvious (nor easy to set up): https://gitlab.gnome.org/GNOME/simple-scan/-/issues/1#note_2713733

As a stopgap, here's my proposed UI lipstick fix just so that the existing UI's purpose can be understood: https://gitlab.gnome.org/GNOME/simple-scan/-/merge_requests/322

I'm hoping to see a built-in implementation someday.

#SimpleScan #OCR #scanning #productivity #GNOME #UX #OCRmyPDF



Jim Spath Feb 5

Old fuzzy pages, still tricky at 1600 dpi with #XSane and #OCRmyPDF on unix.
=
Hooper Ranch Bookkeeper ............... cobhebgeneaneen Lisa Salkov, Mana Diaz (alt.)
=
1. Mana should have been Maria.

2. And a bunch of dots got hallucinated into random letters. Same as it ever was, back to encoded Bacon wrote Shakespeare gibberish.

Otherwise, damned decent!


Liane M. Dubowy Jan 6

@WorziArmin A colleague once presented tools for #Dokumentenmanagement. But I fear that requires even more discipline. #OCRmyPDF can't solve the problem; it only scans things in and does the text recognition. For everyone who doesn't feel like sorting, I actually recommend #Recoll. Index the hard drive, and it finds almost everything. But the chaos on the hard drive would drive me crazy.

eWe Nov 20, 2025

¯\_(ツ)_/¯ *meh
The Homebrew pillow 12.0.0 upgrade breaks my PDF workflow :(
But I can't downgrade to 11.3.0 because of dependencies
And because Homebrew doesn't list the old version?

Hmph

#homebrew #python #ocrmypdf

Schlaf ist überbewertet Nov 17, 2025

I'm not usually the type for #Software and recommendations....

But this one is an absolute must if you want to add a text layer to masses of PDF files after the fact.

Bulk-scanning into a single file and having it split automatically during text recognition is just one highlight...

A must-have!
GitHub:
https://github.com/digidigital/OCRthyPDF-Essentials

#ocrthypdf #ocr #ocrmypdf #ubuntu #foss


Tim Schlotfeldt ⚓🏳️‍🌈 Oct 28, 2025

@Martin Seeger Ah, naming really is an issue. And then again it isn't. My naming scheme for files is date-type-creator.

However, I don't use #paperless; I do it by hand with #ocrmypdf. I sort the files into a directory structure. And thanks to OCR, #Recoll then finds everything again for me. @Bastian

The Hubzilla @ tschlotfeldt.de
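
A date-type-creator naming scheme like the one described above could be sketched as follows. The exact separators, the lowercasing, and the helper name `build_name` are assumptions for illustration; the post does not say how the author formats the parts.

```python
import re
from datetime import date


def build_name(day: date, doc_type: str, creator: str, suffix: str = ".pdf") -> str:
    """Build a file name of the form <ISO date>-<type>-<creator><suffix>."""

    def slug(text: str) -> str:
        # Lowercase and collapse anything that is not a word character into "_".
        return re.sub(r"[^\w]+", "_", text.strip().lower()).strip("_")

    return f"{day.isoformat()}-{slug(doc_type)}-{slug(creator)}{suffix}"


# Hypothetical example values:
print(build_name(date(2025, 10, 28), "Invoice", "ACME GmbH"))
# prints 2025-10-28-invoice-acme_gmbh.pdf
```

Putting the ISO date first means a plain alphabetical listing of a directory is also chronological.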


Victor Forberger Oct 11, 2025

@D_J_Nathanson

#pdftk for terminal
@libreoffice draw
#masterpdf v4 is free; current version is paid
#ocrmypdf
#pdfunite etc

I can send you various aliases I have created. Also, see various pdf posts at linuxatty.wordpress.com.


Jonathan Kamens 86 47 Sep 9, 2025

Editing or redacting a #PDF using #LibreOffice Draw is far superior to the commonly used method of converting the PDF's pages into images and editing the images, because the latter results in a PDF that is many times larger and doesn't render as well. Also, text copy and paste is lost, which you can recover from to some extent with a tool like #OCRmyPDF, but you'll never get the text quality back to as high as it was before you converted the PDF to images.
#FOSS

Samuel Plumppu Sep 5, 2025

Have you ever needed to extract text from images embedded in a #PDF? I can highly recommend the open source #CLI tool #OCRmyPDF, which is easy to automate in, for example, a #DataPipeline.

It uses #Tesseract #OCR under the hood and has many options to experiment with to get the best possible accuracy for your language and PDF content.

You can get started with just a few commands:

https://samuelplumppu.se/blog/automated-text-extraction-from-pdf-images-with-ocrmypdf
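
As a rough idea of what automating OCRmyPDF in a pipeline might look like, here is a minimal sketch, not taken from the linked blog post. It assumes the `ocrmypdf` command is installed and on the PATH; the helper names `ocr_command` and `ocr_folder` and the folder layout are hypothetical. The flags used (`--language`, `--skip-text`, `--sidecar`) are standard OCRmyPDF options.

```python
import subprocess
from pathlib import Path


def ocr_command(src: Path, dest: Path, sidecar: Path, language: str = "eng") -> list[str]:
    """Build the ocrmypdf command line for one file."""
    return [
        "ocrmypdf",
        "--language", language,     # Tesseract language code, e.g. "eng" or "deu"
        "--skip-text",              # leave pages that already contain text untouched
        "--sidecar", str(sidecar),  # also write the recognized text to a .txt file
        str(src),
        str(dest),
    ]


def ocr_folder(in_dir: Path, out_dir: Path, language: str = "eng") -> list[Path]:
    """Run ocrmypdf on every PDF in in_dir and return the output paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    done = []
    for src in sorted(in_dir.glob("*.pdf")):
        dest = out_dir / src.name
        sidecar = dest.with_suffix(".txt")
        # check=True raises CalledProcessError if ocrmypdf exits with an error.
        subprocess.run(ocr_command(src, dest, sidecar, language), check=True)
        done.append(dest)
    return done
```

Calling `ocr_folder(Path("scans"), Path("ocr"))` would process every PDF in a `scans` directory. Keeping the command construction in its own function makes it easy to experiment with other options without touching the loop.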
