Mastodawn

RE: https://vivoweb.org/2026/03/03/request-for-comments-disambiguation-deduplication-spec/

A #disambiguation and #deduplication engine for #VIVO will be developed. The proposed specs have now been published and are open for comments until March 17.

#openresearchInformation #OpenInfrastructures

AI Daily Post Mar 2

Databricks just showed that clean, deduped data beats fancy model tweaks for faster LLMs. Their paper reveals a simple data pipeline—language filtering, deduplication, and high‑quality datasets—outperforms architecture tweaks on GPU training. Curious how to boost speed without extra compute? Dive in. #LLMTraining #DataQuality #Databricks #Deduplication

🔗 https://aidailypost.com/news/databricks-paper-finds-data-quality-outweighs-model-architecture-llm
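
As a rough illustration of the deduplication step named in the post above, here is a minimal sketch of exact deduplication by content hash. It is not the pipeline from the Databricks paper. The function names `normalize` and `dedupe` and the normalization rules (lowercasing, collapsing whitespace) are assumptions made for this example only.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same.

    These normalization rules are an assumption for this sketch.
    """
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe(docs):
    """Drop exact duplicates (after normalization).

    Keeps the first occurrence of each document and preserves input order.
    """
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


corpus = ["Hello  world", "hello world", "Another document"]
print(dedupe(corpus))  # → ['Hello  world', 'Another document']
```

Hashing catches only exact matches after normalization. Near-duplicate detection would need a different technique and is out of scope for this sketch.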

Nicolas Fränkel 🇪🇺🇺🇦🇬🇪 Feb 6

Fixing Noisy Logs with #OpenTelemetry Log #Deduplication

https://www.dash0.com/guides/opentelemetry-log-deduplication-processor

Fixing Noisy Logs with OpenTelemetry Log Deduplication · Dash0

Learn how the OpenTelemetry log deduplication processor collapses log storms without losing context, reduces noise, and keeps observability pipelines efficient.
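
The idea of collapsing a log storm can be sketched conceptually as follows. This is not the OpenTelemetry processor's implementation or its API (the real component is configured in the Collector, not written in Python). The function `collapse_logs`, the fixed-window scheme, and the fields `count`, `first_seen`, and `last_seen` are hypothetical names chosen for this illustration.

```python
from collections import OrderedDict


def collapse_logs(records, interval=10.0):
    """Collapse identical records within the same fixed time window.

    records: iterable of (timestamp_seconds, severity, body) tuples.
    Returns one dict per distinct (window, severity, body), with a count
    of how many records were folded into it.
    """
    buckets = OrderedDict()
    for ts, severity, body in records:
        window = int(ts // interval)  # index of the fixed-size window
        key = (window, severity, body)
        if key in buckets:
            entry = buckets[key]
            entry["count"] += 1
            entry["last_seen"] = ts
        else:
            buckets[key] = {
                "severity": severity,
                "body": body,
                "count": 1,
                "first_seen": ts,
                "last_seen": ts,
            }
    return list(buckets.values())


logs = [
    (0.5, "ERROR", "db timeout"),
    (1.0, "ERROR", "db timeout"),
    (2.0, "INFO", "ok"),
    (12.0, "ERROR", "db timeout"),  # falls into the next window
]
for entry in collapse_logs(logs):
    print(entry["severity"], entry["body"], entry["count"])
```

Keeping a count and first/last timestamps is what lets the duplicates be dropped without losing the information that they occurred.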

🅹🅴🅳🅸🅴 🇺🇦🕊️ Jan 15

I have completely rewritten my PyHardLinkBackup. Originally started in 2015 and used until 2020, it then lay dormant for almost 6 years...

But when I stumbled across old backups created with it, I thought the concept was actually quite useful.

So, a complete rewrite: https://github.com/jedie/PyHardLinkBackup

#backup #OpenSource #Python #deduplication #hardlinks

Jan Dec 5, 2025

Anyone looking for a #snapshot-style backup for #Linux might want to take a look at #kopia.
It can be configured in very fine detail via rules.
However, it took me almost a week to get it running the way I wanted. But with a lot of #scripting, everything worked out.
#deduplication and #compression, fast and easy.
Highly recommended.

Whatisgoingon Nov 17, 2025

And once in a while I clean up the external libraries with #Czkawka

This is amazing software for #deduplication of image folders.

https://github.com/qarmin/czkawka

GitHub - qarmin/czkawka: Multi functional app to find duplicates, empty folders, similar images etc.

Hacker News Oct 28, 2025

Sick: Indexed deduplicated binary storage for JSON-like data structures

https://github.com/7mind/sick

#HackerNews #Sick #Indexed #Binary #Storage #JSON #Deduplication #DataStructures

GitHub - 7mind/sick: Streams of Independent Constant Keys

Tom's IT Cafe Oct 18, 2025

The machine forgets. The Ghost does not.

https://deadswitch.tomsitcafe.com/2025/10/borg-backup-intro.html

#borg #backup #encryption #deduplication #ghostware

Borg: The Memory That Never Forgets

Python Job Support Oct 1, 2025

Part 1 : #PySpark Data Pre-processing Essentials #filtering || #Deduplication || Data Cleansing.

Learn PySpark data pre-processing with our tutorial! Learn the art of filtering and deduplication, essential techniques for cleaning ... source

https://quadexcel.com/wp/part-1-pyspark-data-pre-processing-essentials-filtering-deduplication-data-cleansing/

Paula Gentle on Friendica 🇺🇦 Sep 20, 2025

I tried to quantify the storage savings from #Deduplication when doing #Backup with #restic, after a runtime of just under 2 years.

The result: 22.4%

# restic stats latest 
repository d989459c opened successfully, password is correct
scanning...
Stats in restore-size mode:
Snapshots processed:   1
   Total File Count:   438037
         Total Size:   23.271 GiB

# restic stats latest --mode raw-data 
repository d989459c opened successfully, password is correct
scanning...
Stats in raw-data mode:
Snapshots processed:   1
   Total Blob Count:   265960
         Total Size:   18.409 GiB

I hope I have interpreted that correctly.
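
One way to derive a savings figure from the two outputs above is to compare the raw-data total with the restore-size total. The sketch below does that. The function name `dedup_savings` is hypothetical, and treating the ratio of these two totals as the saving is an assumption about the intended calculation. With the totals shown it comes out near 20.9%, not 22.4%; the post does not show how its figure was obtained.

```python
def dedup_savings(restore_size_gib: float, raw_data_gib: float) -> float:
    """Fraction of space saved: 1 - (stored size / logical size).

    Assumes restore-size is the logical size and raw-data is what is
    actually stored for the same snapshot.
    """
    return 1.0 - raw_data_gib / restore_size_gib


restore_size = 23.271  # GiB, "Total Size" in restore-size mode
raw_data = 18.409      # GiB, "Total Size" in raw-data mode

savings = dedup_savings(restore_size, raw_data)
print(f"{savings:.1%}")  # → 20.9%
```

Both commands were run against `latest` only, so this reflects a single snapshot rather than the repository's whole history.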

restic.readthedocs.io/en/stabl…

Manual — restic 0.18.0 documentation