Mastodawn

#OpenDataLoader PDF — #opensource #PDF parser for #AI & #accessibility 🚀 #1 in benchmarks (0.90 overall), runs 100% locally, zero cloud, no GPU needed. #RAG #LLM #Python #NodeJS #Java

📄 Extracts #Markdown, #JSON (with bounding boxes) & #HTML from any PDF — correct reading order, heading hierarchy, list & image detection with XY-Cut++ algorithm

🧵👇#pdf

Show thread

michabbb 4d ago

🏆 Ranked #1 overall (0.90) across 200 real-world PDFs — 0.93 table accuracy, 0.94 reading order in hybrid mode. Beats docling (0.86), marker (0.83), pymupdf4llm (0.57)

⚡ Two modes: Fast local (0.05s/page, CPU-only) for standard PDFs + Hybrid mode (0.43s/page) for complex pages — AI backend runs on your machine, zero cloud dependency

Show thread

michabbb 4d ago

🔍 Hybrid unlocks: borderless table extraction (0.49→0.93 TEDS), #OCR in 80+ languages, #LaTeX formula extraction from scientific papers, AI chart & image descriptions

🛡️ Built-in #AI safety: filters hidden prompt injection attacks — transparent fonts, off-page content & invisible layers stripped before your #LLM ever sees the data

Show thread

michabbb 4d ago

♿ First #opensource end-to-end PDF accessibility tool: layout analysis → auto-tagging → Tagged PDF (Apache 2.0, Q2 2026). Built with PDF Association & veraPDF devs

🔗 #LangChain integration, #Python/#NodeJS/#Java SDKs. JSON output with bounding boxes + page numbers enables click-to-source citation UX in #RAG pipelines
https://github.com/opendataloader-project/opendataloader-pdf

History for .claude - opendataloader-project/opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source. - History for .claude - opendataloader-project/opendataloader-pdf

GitHub