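Docling vs MarkItDown: Which Is the Best Tool for Document Processing for GenAI? - Qiita

Introduction: When you build a GenAI (generative AI) project or a RAG (retrieval-augmented generation) system, cleansing and preparing the data is a very important step. In practice, though, a company's internal documents are almost never in clean text form. Multi-column PDFs, complex tables...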

Qiita

RE: https://fedi.simonwillison.net/@simon/116457708120212477

I'm going to test #liteparse locally against #Docling; so far I've only used it via the web through #llamaindex.

Join Ming and me for a #Docling workshop at @pycon_austria this weekend! It's a free event with a wide range of talks, hands-on workshops, and networking opportunities.

"Workshop: Learn to Unlock Document Intelligence with Open-Source AI" will be on Sunday, April 19, at 10:00-12:00 in room E.HG 209. More details including venue & registration: https://2026.pycon.at/

#PyCon #PyConAT #PyConAT26 #opensource

Google for Developers (@googledevs)

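A method for building more refined AI agents by optimizing the RAG pipeline has been presented. It covers techniques for developing retrieval-augmented generation agents, such as structuring documents with Docling, improving efficiency with the dot product, and improving accuracy with re-ranking.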

https://x.com/googledevs/status/2042331722298060929

#rag #aiagents #docling #reranking #llm

Google for Developers (@googledevs) on X

Build more refined AI agents by optimizing your RAG pipeline with GDE, Glen Yu → https://t.co/uR9hJ0LKy0 Glen Yu shows how to: 📄 Use Docling for structured formats 🔢 Apply dot product for efficiency 🎯 Implement re-ranking for accuracy

X (formerly Twitter)
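
The dot product and re-ranking steps named in the post can be illustrated with a small two-stage retrieval sketch. This is not Glen Yu's implementation: the toy vectors, the helper names (`build_index`, `retrieve`, `rerank`), and the word-overlap re-ranker are all assumptions made for illustration. In a real pipeline the vectors would come from an embedding model and the second stage would typically be a cross-encoder or similar model.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_index(doc_vecs):
    """Normalize document vectors once, at indexing time (hypothetical helper)."""
    return [normalize(v) for v in doc_vecs]


def retrieve(query_vec, index, k):
    """First stage: rank documents by dot product against pre-normalized vectors."""
    q = normalize(query_vec)
    scored = sorted(((dot(q, d), i) for i, d in enumerate(index)), reverse=True)
    return [i for _, i in scored[:k]]


def rerank(query_terms, candidates, texts):
    """Second stage: toy re-ranker that orders candidates by shared words.

    A stand-in for a real re-ranking model, used here only to show the flow.
    """
    def overlap(i):
        return len(set(texts[i].lower().split()) & query_terms)

    return sorted(candidates, key=overlap, reverse=True)


# Toy data: made-up texts and 3-dimensional "embeddings".
texts = [
    "Docling converts PDF tables to markdown",
    "OpenSearch stores vectors for search",
    "Re-ranking improves retrieval accuracy",
]
index = build_index([[1.0, 0.0, 0.2], [0.1, 1.0, 0.0], [0.9, 0.1, 0.4]])

candidates = retrieve([1.0, 0.0, 0.3], index, k=2)
reranked = rerank({"retrieval", "accuracy"}, candidates, texts)
print(candidates, reranked)
```

Normalizing the document vectors once means each query costs only one dot product per document, which is the efficiency argument usually made for this approach. The cheap first stage narrows the candidate set so the more expensive re-ranker only has to look at a few items.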

It was such a pleasure to share a stage with @cybette at #OpenSearchCon, and even more so to share the work of the #Docling team and how it can be integrated with #OpenSearch.

Check out the video of the talk here: https://www.youtube.com/watch?v=IqUJVGyI5to

Build AI-Ready Search: Integrating Docling with OpenSearch for Advanced RA... Carol Chen & Phil Nash

YouTube

Here's the presentation @philnash and I gave at #OpenSearchCon China about integrating #Docling with OpenSearch for advanced RAG: https://www.youtube.com/watch?v=IqUJVGyI5to

Our slides are available on sessionize: https://opensearchcon-china-2026.sessionize.com/session/1115191

Thanks to @OpenSearchProject, @krisfreedain for having us in #Shanghai!

Are you going to KubeCon EU? Come find me and get one of these awesome (limited in number) Duckling stickers. #duckling #kubeconeu #docling https://twp.ai/E6HlHZ

Contrary to what I'm reading, #Claude is far worse than I remember. It often doesn't fully understand simple questions and needs nudging - and as in this case, can't solve even trivial problems which only require careful reading of the log.

It took #Kimi 10 seconds to solve a problem with a #docling server that crashed when called over its API. I used Kimi in the browser. #ClaudeCode Sonnet 4.6 had full access to the actual server and couldn't solve the issue after 30 minutes of repeated attempts.

````
The root cause is clear: **Docling workers are crashing due to missing EasyOCR dependency**, causing the async jobs to fail. OpenWebUI polls for results, gets a 404 because the job died, and reports "Task result not found."

**Fix the Docling installation:**

```bash
# Stop the service
sudo launchctl unload /Library/LaunchDaemons/com.docling-serve.plist

# Install EasyOCR in the correct uv tool environment
uv tool install --upgrade docling-serve --with easyocr

# Or if that doesn't work, run the specific Python interpreter:
/Users/urisharf/.local/share/uv/tools/docling-serve/bin/python -m pip install easyocr

# Reload
sudo launchctl load /Library/LaunchDaemons/com.docling-serve.plist
```

**Alternative (if you don't need OCR):** Configure Docling to use a different OCR engine or disable it. Check Docling's env vars for `DOCLING_OCR_ENGINE` or similar.

**Verify the fix:**
```bash
curl -X POST http://mbp.lan:5001/v1/convert/source \
  -H "Content-Type: application/json" \
  -d '{"source": "https://arxiv.org/pdf/2206.01062.pdf"}'
```

The duplicate `libavdevice` dylibs warning is harmless noise from Gradio/CV2 conflicts, but the EasyOCR error is fatal.
````

#AIHype #TheAICON #Anthropic
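
The verification step in the quoted answer could also be scripted in Python with only the standard library. The sketch below mirrors the `curl` call from the post: the endpoint path and the `{"source": ...}` payload are copied from that quoted answer and have not been checked against the docling-serve API, so the request schema may differ depending on the server version. The function names are hypothetical.

```python
import json
import urllib.request


def build_convert_request(base_url, source_url):
    """Build the POST request from the quoted answer without sending it.

    Assumption: path and payload shape are taken as-is from the post.
    """
    payload = json.dumps({"source": source_url}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/convert/source",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_server(base_url, source_url, timeout=60):
    """Send the request; returns (HTTP status, raw body).

    urllib raises HTTPError for 4xx/5xx responses and URLError when the
    host cannot be reached, so a failure surfaces as an exception.
    """
    req = build_convert_request(base_url, source_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read()
```

Calling `check_server("http://mbp.lan:5001", "https://arxiv.org/pdf/2206.01062.pdf")` against the host from the post would then repeat the check. Keeping request construction separate from sending makes it possible to inspect the request without touching the network.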

Build Agent-Ready RAG Systems in Java with Quarkus and Docling https://www.the-main-thread.com/p/enterprise-rag-quarkus-docling-pgvector-tutorial

#java #docling

Build Agent-Ready RAG Systems in Java with Quarkus and Docling

This hands-on tutorial shows how to ingest complex PDFs with layout-aware parsing, store embeddings in PostgreSQL, retrieve context reliably, and add guardrails for safe AI responses.

The Main Thread

@karstenpe I've now saved two variants of the notebooks from the Remarkable locally: one as a PDF with a bitmap inside and one as a PDF with vectors.

Which CLI tool would you recommend for #OCR? #Tesseract?

While I'm at it, I'll also try #Docling with its OCR option, though I don't think it has its own engine.

Does this also work with #Ollama, directly from a PDF with a local LLM? Does anyone have ideas?