Estimating Levenshtein Distance for Large Documents Using Compact Signatures

이 논문은 대용량 문서 간의 레벤슈타인 거리(Levenshtein Distance, LD)를 효율적으로 추정하는 새로운 기법을 제안한다. 원본 문서에서 슬라이딩 윈도우 해시를 통해 생성한 짧은 서명(signature)을 사용하여 LD를 계산함으로써, 수십만 자에 달하는 문서도 일반 하드웨어에서 실용적으로 처리할 수 있다. 제안된 방법은 개인정보 보호가 필요한 환경에서도 원본 문서 노출 없이 유사도 비교가 가능하며, 웹 스케일 중복 제거, 콘텐츠 보안, 디지털 포렌식 등 다양한 응용에 활용될 수 있다.

https://zenodo.org/records/20125438

#levenshtein #distanceestimation #documentsimilarity #hashing #privacypreserving

Estimating Levenshtein Distance With Signatures

Levenshtein Distance (LD) is an intuitive measure of lexical similarity, but computing it exactly runs in time proportional to the product of the string lengths, limiting practical use to strings of about a thousand characters. This paper describes a technique for estimating LD between much larger texts by applying LD to compact signatures---short strings generated by a sliding-window hash that function as thumbnails of the originals. Two parameters control the trade-off: a compression factor C determines signature length (approximately file_size/C), and a neighborhood size n controls sensitivity to dense    character-level differences. Signatures are two to three orders of magnitude shorter than the source documents, making LD estimation on documents of hundreds of thousands of characters practical on commodity hardware. At 25KB with C=50, normalized estimation error stays below 13% even for completely unrelated files, and the estimator reliably distinguishes identical, near-duplicate, modified, and unrelated documents across all tested compression factors. Because signatures are self-contained artifacts that support all subsequent operations without access to the source document, they enable privacy-preserving architectures in which neither party to a comparison need expose its original content. Applications include web-scale deduplication, content security and leak detection, double-blind similarity search, digital forensics, and scholarly analysis of manuscript traditions.   

Zenodo

I have not used my #functionalprogramming brain muscle for a while. Yesterday and today tried writing a #Levenshtein string distance function in GNU #guile . Looked up the naive version on wiki. That came easy. But then I decided, that I wanted an FP solution, but with caching/memoization ... I knew from past experience, that I would probably have to use continuations for that. #TheLittleSchemer strikes again.

#programming #coding #scheme #algorithm

[1/3]

Aufgrund der aktuell neuen Überprüfung von #Empfänger|#namen scheitern gerade massenhaft #Überweisungen, weil das Zahlungssystem der #Bank|en 'Unsicherheit' vermutet.

Anscheinend ist der Begriff der #Levenshtein-#Distanz für manche Software-Entwickler:innen ein Fremdwort.

https://www.faz.net/aktuell/finanzen/empfaengerueberpruefung-bei-banken-kunden-berichten-von-chaos-accg-110730141.html

Empfängerüberprüfung bei Banken: Kunden berichten von Chaos

Fehlerhafte Namensangaben und Sonderzeichen: Die neue Empfängerüberprüfung für Überweisungen verursacht bei vielen Bankkunden massive Probleme. Mehr als 200 F.A.Z.-Leser schildern ihre Erfahrungen.

Frankfurter Allgemeine Zeitung
It occured to me today that the words #rust and #c have the same #levenshtein distance to the word #safe. #rustlang #exherbo #security

Levenshtein Word Distance in Python. This algorithm calculates the number of steps required to change one word into another and has various uses including spell checking suggestions.

https://codedrome.substack.com/p/levenshtein-word-distance-in-python

#python

#pythonprogramming

#algorithms

#levenshtein

Levenshtein Word Distance in Python

I recently wrote an implementation of the Soundex Algorithm which attempts to assign the same encoding to words which are pronounced the same but spelled differently.

CodeDrome

Looks like it is now a choice between two wrongs. At least one is #OpenSource and does not collect data. Kind of.

#DeepSeek #ChatGPT #AI #LLM #Levenshtein

#ChatGPT seems confused here. Sometimes this bot seems to have a double personality.

#Levenshtein

#syd's killer feature is its disturbingly close #levenshtein distance to #bsd! Check out SegvGuard, and Force Sandboxing which are largely based on #HardenedBSD's implementation: http://man.exherbolinux.org/syd.7.html#SegvGuard #sydbox #exherbo
SYD(7)

🚀 Découvrez notre nouveau tutoriel sur la comparaison de textes avec la fonction Levenshtein en PHP ! Apprenez comment cette technique peut vous aider à améliorer vos projets web. Visionnez la vidéo ici : https://youtu.be/JhiBbqYrBn8 #PHP #Levenshtein #WebDevelopment #Programmation
- YouTube

Enjoy the videos and music you love, upload original content, and share it all with friends, family, and the world on YouTube.

For this reason, edit distance is called #Levenshtein distance or #Ulam distance (but strangely, never "#Vintsyuk distance"). (j e)