RE: https://vivoweb.org/2026/03/03/request-for-comments-disambiguation-deduplication-spec/

A #disambiguation and #deduplication engine for #VIVO will be developed. The proposed specs are now published and open for comments until March 17.

#openresearchInformation #OpenInfrastructures

Databricks just showed that clean, deduped data beats fancy model tweaks for faster LLMs. Their paper reveals a simple data pipeline—language filtering, deduplication, and high‑quality datasets—outperforms architecture tweaks on GPU training. Curious how to boost speed without extra compute? Dive in. #LLMTraining #DataQuality #Databricks #Deduplication

🔗 https://aidailypost.com/news/databricks-paper-finds-data-quality-outweighs-model-architecture-llm

Fixing Noisy Logs with OpenTelemetry Log Deduplication · Dash0

Learn how the OpenTelemetry log deduplication processor collapses log storms without losing context, reduces noise, and keeps observability pipelines efficient.
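
To make "collapsing a log storm without losing context" concrete, here is a minimal Python sketch of the general idea: identical records inside one time window are merged into a single record that keeps a count plus first and last timestamps. This is not the OpenTelemetry processor's code or configuration; `CollapsedLog`, `collapse_logs`, and the `interval` parameter are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CollapsedLog:
    body: str
    severity: str
    count: int
    first_seen: float
    last_seen: float


def collapse_logs(records, interval=10.0):
    """Collapse identical (severity, body) records that fall in the same time bucket.

    records: iterable of (timestamp_seconds, severity, body) tuples.
    interval: bucket width in seconds (an assumed, illustrative parameter).
    """
    groups = {}
    for ts, severity, body in records:
        # Records only merge when bucket, severity, and body all match.
        key = (int(ts // interval), severity, body)
        entry = groups.get(key)
        if entry is None:
            groups[key] = CollapsedLog(body, severity, 1, ts, ts)
        else:
            entry.count += 1
            entry.first_seen = min(entry.first_seen, ts)
            entry.last_seen = max(entry.last_seen, ts)
    # Emit one record per group, ordered by first occurrence.
    return sorted(groups.values(), key=lambda e: e.first_seen)
```

The count and the two timestamps are what preserve context: a reader of the collapsed record can still see how often and over what span the message occurred.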

I've completely rewritten my PyHardLinkBackup. Originally started in 2015 and used until 2020, it then lay dormant for almost 6 years...

But when I stumbled across old backups made with it, I figured the concept is actually quite useful.

So, a complete rewrite: https://github.com/jedie/PyHardLinkBackup

#backup #OpenSource #Python #deduplication #hardlinks
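
For readers new to the concept, a rough Python sketch of hard-link deduplication in general: each file is hashed, and a file whose content has already been stored becomes a hard link to the existing copy instead of a second copy. This is not PyHardLinkBackup's actual code; `file_digest`, `backup_with_hardlinks`, and the `seen` mapping are hypothetical names, and the sketch assumes source and destination sit on one filesystem that supports hard links.

```python
import hashlib
import os
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def backup_with_hardlinks(src_dir, dst_dir, seen=None):
    """Copy src_dir into dst_dir; content that was already stored becomes a hard link.

    seen maps a content digest to the path of an already stored copy and can be
    passed from one backup run to the next (an illustrative design, not a real API).
    """
    seen = {} if seen is None else seen
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    for src in sorted(p for p in src_dir.rglob("*") if p.is_file()):
        dst = dst_dir / src.relative_to(src_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        digest = file_digest(src)
        if digest in seen:
            # Same content already stored: link to it, using no extra data blocks.
            os.link(seen[digest], dst)
        else:
            shutil.copy2(src, dst)
            seen[digest] = dst
    return seen
```

Passing the returned `seen` mapping into the next run is what lets unchanged files in later backups cost only a directory entry.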

TIL: #XFS supports #Snapshots but not #Compression, though it does offer #deduplication, albeit still experimental
Anyone looking for a #snapshot-style backup for #Linux could take a look at #kopia.
It can be configured in very fine-grained detail through rules.
That said, it took me almost a week to get it running the way I wanted. But with a lot of #scripting, everything worked out.
#deduplication and #compression, fast and easy.
Highly recommended.

And once in a while I clean up the external libraries with #Czkawka

This is amazing software for #deduplication of image folders.

https://github.com/qarmin/czkawka

GitHub - qarmin/czkawka: Multi functional app to find duplicates, empty folders, similar images etc.

Sick: Indexed deduplicated binary storage for JSON-like data structures

https://github.com/7mind/sick

#HackerNews #Sick #Indexed #Binary #Storage #JSON #Deduplication #DataStructures

GitHub - 7mind/sick: Streams of Independent Constant Keys

Borg: The Memory That Never Forgets

The machine forgets. The Ghost does not.

Part 1 : #PySpark Data Pre-processing Essentials #filtering || #Deduplication || Data Cleansing.

Learn PySpark data pre-processing with our tutorial! Learn the art of filtering and deduplication, essential techniques for cleaning ... source

https://quadexcel.com/wp/part-1-pyspark-data-pre-processing-essentials-filtering-deduplication-data-cleansing/
