Estimating Levenshtein Distance for Large Documents Using Compact Signatures
이 논문은 대용량 문서 간의 레벤슈타인 거리(Levenshtein Distance, LD)를 효율적으로 추정하는 새로운 기법을 제안한다. 원본 문서에서 슬라이딩 윈도우 해시를 통해 생성한 짧은 서명(signature)을 사용하여 LD를 계산함으로써, 수십만 자에 달하는 문서도 일반 하드웨어에서 실용적으로 처리할 수 있다. 제안된 방법은 개인정보 보호가 필요한 환경에서도 원본 문서 노출 없이 유사도 비교가 가능하며, 웹 스케일 중복 제거, 콘텐츠 보안, 디지털 포렌식 등 다양한 응용에 활용될 수 있다.
https://zenodo.org/records/20125438
#levenshtein #distanceestimation #documentsimilarity #hashing #privacypreserving
Estimating Levenshtein Distance With Signatures
Levenshtein Distance (LD) is an intuitive measure of lexical similarity, but computing it exactly runs in time proportional to the product of the string lengths, limiting practical use to strings of about a thousand characters. This paper describes a technique for estimating LD between much larger texts by applying LD to compact signatures---short strings generated by a sliding-window hash that function as thumbnails of the originals. Two parameters control the trade-off: a compression factor C determines signature length (approximately file_size/C), and a neighborhood size n controls sensitivity to dense character-level differences. Signatures are two to three orders of magnitude shorter than the source documents, making LD estimation on documents of hundreds of thousands of characters practical on commodity hardware. At 25KB with C=50, normalized estimation error stays below 13% even for completely unrelated files, and the estimator reliably distinguishes identical, near-duplicate, modified, and unrelated documents across all tested compression factors. Because signatures are self-contained artifacts that support all subsequent operations without access to the source document, they enable privacy-preserving architectures in which neither party to a comparison need expose its original content. Applications include web-scale deduplication, content security and leak detection, double-blind similarity search, digital forensics, and scholarly analysis of manuscript traditions.



