#OpenDataLoader PDF: #opensource #PDF parser for #AI & #accessibility 🚀 #1 in benchmarks (0.90 overall), runs 100% locally, zero cloud, no GPU needed. #RAG #LLM #Python #NodeJS #Java

📄 Extracts #Markdown, #JSON (with bounding boxes) & #HTML from any PDF: correct reading order, heading hierarchy, list & image detection with the XY-Cut++ algorithm

🧵👇 #pdf
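
For readers curious how layout-based reading order can work, here is a minimal sketch of the classic recursive XY-cut idea. It is not the XY-Cut++ variant and not the project's code; the `Box` layout, `xy_cut`, `_split`, and the `min_gap` threshold are all illustrative assumptions.

```python
from typing import List, Tuple

# Hypothetical box layout: (x0, y0, x1, y1), with y growing downward.
Box = Tuple[float, float, float, float]


def _split(boxes: List[Box], axis: int, min_gap: float) -> List[List[Box]]:
    """Group boxes separated by whitespace gaps along one axis (0 = x, 1 = y)."""
    lo, hi = axis, axis + 2
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # farthest extent covered by the current group
    for b in ordered[1:]:
        if b[lo] - reach >= min_gap:
            groups.append([b])  # a clear gap: start a new group
        else:
            groups[-1].append(b)
        reach = max(reach, b[hi])
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> List[Box]:
    """Return boxes in an approximate reading order via recursive cuts."""
    if len(boxes) <= 1:
        return list(boxes)
    # Try horizontal cuts (gaps in y) first, then vertical cuts (gaps in x).
    for axis in (1, 0):
        groups = _split(boxes, axis, min_gap)
        if len(groups) > 1:
            return [b for g in groups for b in xy_cut(g, min_gap)]
    # No clean cut exists: fall back to top-to-bottom, left-to-right.
    return sorted(boxes, key=lambda b: (b[1], b[0]))
```

On a page with a full-width title above two columns, this sketch emits the title, then the whole left column, then the right column.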

🏆 Ranked #1 overall (0.90) across 200 real-world PDFs: 0.93 table accuracy, 0.94 reading order in hybrid mode. Beats docling (0.86), marker (0.83), pymupdf4llm (0.57)

⚡ Two modes: Fast local (0.05s/page, CPU-only) for standard PDFs + Hybrid mode (0.43s/page) for complex pages. The AI backend runs on your machine, zero cloud dependency

🔍 Hybrid unlocks: borderless table extraction (0.49→0.93 TEDS), #OCR in 80+ languages, #LaTeX formula extraction from scientific papers, AI chart & image descriptions

🛡️ Built-in #AI safety: filters hidden prompt injection attacks. Transparent fonts, off-page content & invisible layers are stripped before your #LLM ever sees the data
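
To make the filtering idea concrete, here is a rough sketch of dropping off-page and near-transparent text before it reaches an LLM. The element schema (`bbox`, `opacity`, `text`), the `drop_hidden` helper, and the `min_opacity` threshold are hypothetical, not the project's actual filter or output format.

```python
from typing import Dict, List


def drop_hidden(elements: List[Dict], page_width: float, page_height: float,
                min_opacity: float = 0.05) -> List[Dict]:
    """Keep only elements that overlap the page and are not near-transparent."""
    kept = []
    for el in elements:
        x0, y0, x1, y1 = el["bbox"]  # assumed (x0, y0, x1, y1) in page units
        # Entirely outside the page rectangle means a reader never sees it.
        off_page = x1 <= 0 or y1 <= 0 or x0 >= page_width or y0 >= page_height
        # A missing opacity is treated as fully opaque in this sketch.
        transparent = el.get("opacity", 1.0) < min_opacity
        if not (off_page or transparent):
            kept.append(el)
    return kept
```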

♿ First #opensource end-to-end PDF accessibility tool: layout analysis → auto-tagging → Tagged PDF (Apache 2.0, Q2 2026). Built with PDF Association & veraPDF devs

🔗 #LangChain integration, #Python/#NodeJS/#Java SDKs. JSON output with bounding boxes + page numbers enables click-to-source citation UX in #RAG pipelines
https://github.com/opendataloader-project/opendataloader-pdf
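
As a sketch of how click-to-source citations might be wired up, the snippet below maps a quoted passage back to page numbers and bounding boxes. The JSON shape (a list of objects with `page`, `bbox`, `text`) and the `find_citations` helper are assumptions for illustration, not the tool's documented schema.

```python
import json
from typing import Dict, List


def find_citations(doc_json: str, quote: str) -> List[Dict]:
    """Return the page and bbox of every element whose text contains the quote."""
    elements = json.loads(doc_json)  # assumed: a list of element objects
    needle = quote.casefold()        # case-insensitive substring match
    return [
        {"page": el["page"], "bbox": el["bbox"]}
        for el in elements
        if needle in el["text"].casefold()
    ]
```

A viewer could then scroll to the returned page and highlight the returned box.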
