A 3.5 MB C++ engine for deterministic RAG deduplication hitting 30 GB/s

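Merlin Community Edition provides a lightweight C++ engine and integration tools that cut token usage by deduplicating LLM context. The open-source project integrates with a VSCode extension, Claude Code, and other tools without MITM, and the community version has daily and monthly usage limits. A high-performance multithreaded C++ enterprise engine is offered as a separate paid product. Deduplication removes up to 71% of the duplication in RAG pipelines, which yields significant cost savings. The project is currently in pre-release and was published alongside an arXiv paper, so AI developers can try it directly.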

https://github.com/corbenicai/merlin-community

#rag #deduplication #llm #cpp #vscodeextension

GitHub - corbenicai/merlin-community: Merlin Community Edition — free dedup engine + integrations. Saves LLM tokens. No telemetry.
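
To make the idea of deterministic context deduplication concrete, here is a minimal sketch in Python. It is not Merlin's algorithm or API (the post does not describe either); it assumes the simplest possible scheme, exact-match deduplication of retrieved chunks by content hash after whitespace normalization. The function names `normalize`, `dedupe_chunks`, and `duplicate_ratio` are hypothetical.

```python
import hashlib


def normalize(chunk: str) -> str:
    # Assumption: collapse whitespace runs so trivially reformatted
    # copies of the same text hash to the same value.
    return " ".join(chunk.split())


def dedupe_chunks(chunks):
    """Drop exact duplicates (after normalization), keeping first occurrences in order."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(normalize(chunk).encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


def duplicate_ratio(chunks):
    """Fraction of chunks that were redundant; 0.0 for an empty input."""
    if not chunks:
        return 0.0
    return 1 - len(dedupe_chunks(chunks)) / len(chunks)
```

Because the result depends only on the chunk contents and their order, the same input always yields the same output, which is what "deterministic" would mean in this simplified setting. A real engine would likely also handle near-duplicates and overlapping spans.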


RE: https://vivoweb.org/2026/03/03/request-for-comments-disambiguation-deduplication-spec/

A #disambiguation and #deduplication engine for #VIVO will be developed. The proposed specs are now published and open for comments until March 17.

#openresearchInformation #OpenInfrastructures

Databricks just showed that clean, deduped data beats fancy model tweaks for faster LLMs. Their paper reveals a simple data pipeline—language filtering, deduplication, and high‑quality datasets—outperforms architecture tweaks on GPU training. Curious how to boost speed without extra compute? Dive in. #LLMTraining #DataQuality #Databricks #Deduplication

🔗 https://aidailypost.com/news/databricks-paper-finds-data-quality-outweighs-model-architecture-llm

Fixing Noisy Logs with OpenTelemetry Log Deduplication · Dash0

Learn how the OpenTelemetry log deduplication processor collapses log storms without losing context, reduces noise, and keeps observability pipelines efficient.

I've completely rewritten my PyHardLinkBackup. Originally started in 2015 and used until 2020, it then lay dormant for almost 6 years...

But when I stumbled across old backups created with it, I thought the concept is actually quite useful.

So, a complete rewrite: https://github.com/jedie/PyHardLinkBackup

#backup #OpenSource #Python #deduplication #hardlinks
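
The concept behind hard-link-based deduplication can be sketched as follows. This is not PyHardLinkBackup's actual implementation; it is a minimal illustration that assumes files are compared by SHA-256 of their contents and that everything lives on one filesystem (hard links cannot cross filesystems). The names `file_digest` and `hardlink_duplicates` are hypothetical.

```python
import hashlib
import os
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    # Hash the file in blocks so large files do not need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Replace duplicate regular files under root with hard links to the
    first copy seen. Returns the number of files replaced."""
    first_seen = {}
    replaced = 0
    for path in sorted(Path(root).rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue
        digest = file_digest(path)
        original = first_seen.setdefault(digest, path)
        # Skip the first copy itself and files that are already linked to it.
        if original is path or os.path.samefile(original, path):
            continue
        # Link under a temporary name, then rename over the duplicate so the
        # path never disappears if the process is interrupted.
        tmp = path.with_name(path.name + ".dedup-tmp")
        os.link(original, tmp)
        os.replace(tmp, path)
        replaced += 1
    return replaced
```

Note that linked files share one inode, so permissions and timestamps are shared too, and editing one path in place changes all of them. That is acceptable for read-only backup snapshots but not for working directories.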

TIL: #XFS supports #Snapshots but not #Compression, though it does offer #deduplication, albeit still experimental
Anyone looking for a #snapshot-style backup for #Linux could take a look at #kopia.
It can be configured in very fine detail through rules.
It did take me almost a week to get it running the way I wanted, though. But with a lot of #scripting, everything worked out.
#deduplication and #compression, fast and easy.
Highly recommended.

And once in a while I clean up the external libraries with #Czkawka

This is amazing software for #deduplication of image folders.

https://github.com/qarmin/czkawka

GitHub - qarmin/czkawka: Multi functional app to find duplicates, empty folders, similar images etc.


Sick: Indexed deduplicated binary storage for JSON-like data structures

https://github.com/7mind/sick

#HackerNews #Sick #Indexed #Binary #Storage #JSON #Deduplication #DataStructures

GitHub - 7mind/sick: Streams of Independent Constant Keys

Borg: The Memory That Never Forgets

The machine forgets. The Ghost does not.