Tech with Mak (@techNmak)

LangExtract, a free, open-source document extraction tool, has been introduced. It claims to extract structured data from unstructured text, map every entity to its exact location in the source, and handle documents of 100+ pages. The post touts it as better than enterprise tools costing tens of thousands of dollars, so it could have a significant impact on the document extraction market.
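The headline feature here is "source grounding": every extracted entity comes back with its exact character offsets in the original text. Below is a minimal, self-contained sketch of that idea in plain Python. It is not LangExtract's actual API; the function name and approach are illustrative only.

```python
# Illustration of the "source grounding" idea: map each extracted
# entity string to its exact (start, end) character offsets in the
# source text. NOT LangExtract's real API -- a toy stand-in.

def ground_entities(text, entities):
    """Return [(entity, (start, end)) or (entity, None)] for each entity."""
    grounded = []
    cursor = 0  # scan forward so offsets follow document order
    for entity in entities:
        start = text.find(entity, cursor)
        if start == -1:
            start = text.find(entity)  # fall back to a full-text search
        if start == -1:
            grounded.append((entity, None))  # entity not literally present
            continue
        end = start + len(entity)
        grounded.append((entity, (start, end)))
        cursor = end
    return grounded

doc = "Alice paid Bob $500 on March 3."
print(ground_entities(doc, ["Alice", "Bob", "$500"]))
```

Grounding by literal string search only works when the model emits entities verbatim from the source; tools like LangExtract advertise exactly that behavior, which is what makes offset mapping feasible at all.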

https://x.com/techNmak/status/2020867240753819983

#langextract #documentextraction #nlp #opensource


Google just killed the document extraction industry. LangExtract: Open-source. Free. Better than $50K enterprise tools. What it does: → Extracts structured data from unstructured text → Maps EVERY entity to its exact source location → Handles 100+ page documents with high


**Hello developers! 🚀 Introducing extractous-go – a fast document extraction library with OCR support!**

🔹 Extracts from PDF, DOCX, XLSX, HTML…
🔹 OCR via Tesseract.
🔹 Streaming API (memory-efficient).
🔹 Works on Windows/macOS/Linux.

Try it out and share your feedback! https://github.com/rahulpoonia29/extractous-go

#documentextraction #OCR #Golang
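The memory-saving claim above rests on streaming: processing a document in bounded-size chunks rather than loading it whole. The sketch below shows that general pattern in Python; it is not extractous-go's actual interface (that library is Go), just the idea its streaming API is built on.

```python
# Generic sketch of streaming extraction (illustrative, not the
# extractous-go API): read a large text file in fixed-size chunks so
# peak memory stays bounded regardless of document size.

def stream_text(path, chunk_size=64 * 1024):
    """Yield a file's contents chunk by chunk instead of all at once."""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# Usage: consume the stream incrementally, e.g. count characters
# without ever holding the whole file in memory:
#   total = sum(len(chunk) for chunk in stream_text("big_document.txt"))
```

The same shape applies to PDF or DOCX pipelines: the extractor emits text incrementally and the consumer processes each piece before the next arrives.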

https://www.reddit.com/r/SideProject/comments/1oakd6

🤖🎉 Breaking news: A budget model outsmarts the #AI giants in document extraction! Apparently, $196 is the price of embarrassing #OpenAI with a "fine-tuned" solution that sounds like a model from a sci-fi B movie. Who knew cutting-edge tech was one step away from being outclassed by a bargain bin special? 😂🔍
https://arxiv.org/abs/2509.22906 #Outsmarted #BudgetModel #DocumentExtraction #TechNews #HackerNews #ngated
Extract-0: A Specialized Language Model for Document Information Extraction

This paper presents Extract-0, a 7-billion parameter language model specifically optimized for document information extraction that achieves performance exceeding models with parameter counts several orders of magnitude larger. Through a novel combination of synthetic data generation, supervised fine-tuning with Low-Rank Adaptation (LoRA), and reinforcement learning via Group Relative Policy Optimization (GRPO), Extract-0 achieves a mean reward of 0.573 on a benchmark of 1,000 diverse document extraction tasks, outperforming GPT-4.1 (0.457), o3 (0.464), and GPT-4.1-2025 (0.459). The training methodology employs a memory-preserving synthetic data generation pipeline that produces 280,128 training examples from diverse document sources, followed by parameter-efficient fine-tuning that modifies only 0.53% of model weights (40.4M out of 7.66B parameters). The reinforcement learning phase introduces a novel semantic similarity-based reward function that handles the inherent ambiguity in information extraction tasks. This research demonstrates that task-specific optimization can yield models that surpass general-purpose systems while requiring substantially fewer computational resources.
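The "0.53% of weights" figure in the abstract follows directly from the parameter counts it gives, and is the crux of why LoRA fine-tuning is cheap. A quick arithmetic check:

```python
# Verify the abstract's parameter-efficiency figure: LoRA trains
# 40.4M adapter parameters against a 7.66B-parameter base model.
trainable = 40.4e6   # LoRA-updated parameters (from the abstract)
total = 7.66e9       # full model parameter count (from the abstract)

fraction = trainable / total * 100
print(f"{fraction:.2f}% of weights trained")  # -> 0.53% of weights trained
```

Because only the low-rank adapter matrices receive gradients, optimizer state and gradient memory scale with the 40.4M trainable parameters rather than the full 7.66B, which is what keeps this kind of task-specific fine-tuning within modest compute budgets.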
