Estimating Levenshtein Distance for Large Documents Using Compact Signatures

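This paper proposes a new technique for efficiently estimating the Levenshtein distance (LD) between large documents. By computing LD on short signatures generated from the original documents with a sliding-window hash, even documents of hundreds of thousands of characters can be processed practically on commodity hardware. The proposed method allows similarity comparison without exposing the original documents, even in environments that require privacy protection, and can be applied to web-scale deduplication, content security, digital forensics, and other uses.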

https://zenodo.org/records/20125438

#levenshtein #distanceestimation #documentsimilarity #hashing #privacypreserving

Estimating Levenshtein Distance With Signatures

Levenshtein Distance (LD) is an intuitive measure of lexical similarity, but computing it exactly runs in time proportional to the product of the string lengths, limiting practical use to strings of about a thousand characters. This paper describes a technique for estimating LD between much larger texts by applying LD to compact signatures: short strings generated by a sliding-window hash that function as thumbnails of the originals. Two parameters control the trade-off: a compression factor C determines signature length (approximately file_size/C), and a neighborhood size n controls sensitivity to dense character-level differences. Signatures are two to three orders of magnitude shorter than the source documents, making LD estimation on documents of hundreds of thousands of characters practical on commodity hardware. At 25KB with C=50, normalized estimation error stays below 13% even for completely unrelated files, and the estimator reliably distinguishes identical, near-duplicate, modified, and unrelated documents across all tested compression factors. Because signatures are self-contained artifacts that support all subsequent operations without access to the source document, they enable privacy-preserving architectures in which neither party to a comparison need expose its original content. Applications include web-scale deduplication, content security and leak detection, double-blind similarity search, digital forensics, and scholarly analysis of manuscript traditions.
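
To make the idea concrete, here is a rough Python sketch of how a signature-based estimate could work. It is not the paper's construction: the hash (BLAKE2b), the 64-symbol alphabet, the rule "emit a symbol when the window hash is divisible by C", and the scaling of the signature distance by C are all assumptions chosen for illustration. Only the roles of C (signature length of roughly file_size/C) and n (window size) come from the abstract.

```python
import hashlib

# Hypothetical 64-symbol output alphabet; the paper may use a different one.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def signature(text: str, c: int = 50, n: int = 8) -> str:
    """Slide an n-character window over text and emit a symbol for about 1 in c windows."""
    out = []
    for i in range(len(text) - n + 1):
        window = text[i:i + n].encode("utf-8")
        h = int.from_bytes(hashlib.blake2b(window, digest_size=8).digest(), "big")
        if h % c == 0:  # assumed selection rule: keeps roughly 1/c of the positions
            out.append(ALPHABET[(h // c) % len(ALPHABET)])
    return "".join(out)


def levenshtein(a: str, b: str) -> int:
    """Exact Levenshtein distance using the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def estimate_ld(sig_a: str, sig_b: str, c: int) -> int:
    """Assumed estimator: scale the signature distance back up by c."""
    return levenshtein(sig_a, sig_b) * c
```

Under these assumptions, two parties could each compute `signature(document)` locally and exchange only the signatures, then call `estimate_ld` on them, so the quadratic-time step runs on strings about C times shorter than the originals.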

Zenodo

Scientists discover the brain's 'mileage clock' that helps estimate distance

This new discovery could completely change how we look at Alzheimer's. If this 'mileage clock' is disrupted in patients, it could give us new insights on both diagnosing and treating memory loss.


Scientists discover the brain's 'mileage clock' that helps estimate distance

Researchers have uncovered a key mechanism in the brain that helps us estimate distance as we navigate our environment. This 'mileage clock' was found in a part of the brain important for navigation and memory, specifically in 'grid cells' that fire in patterns to track the distance traveled. The di...

Scientists discover the brain's 'mileage clock' that helps estimate distance

How might this discovery of the brain's 'mileage clock' influence our understanding of memory-related diseases like Alzheimer's, @aibot, and could it lead to new approaches in diagnosing or treating these conditions?


Fun with distance estimation

A 10x10 mosaic of some Mandelbrot/Burning Ship-like fractals, colored using black/white distance estimation.

#Mandelbrot #BurningShip #Fractal #DistanceEstimation #Digitalart #Mathart #Art #TilingTuesday #Mosaic

Klein bottle

* Slight generalization *

While browsing the fractal forums, I came across this formula, which the author called the "Klein bottle".
I don't know whether the function described below actually has anything to do with the Klein bottle. However, I find it fascinating.

Formula: \(z_{n+1} = f(z_n^2+c)\)

where f(z) is defined as (pseudocode)

f(z) {
    dist = abs(re(z)) - 1.2
    if (re(z) > 1.2) {
        re(z) = -1.2 + dist
        im(z) = -im(z)
    } else if (re(z) < -1.2) {
        re(z) = 1.2 - dist
        im(z) = -im(z)
    }
    dist = abs(im(z)) - 1.2
    if (im(z) > 1.2) {
        re(z) = -re(z)
        im(z) = -1.2 + dist
    } else if (im(z) < -1.2) {
        re(z) = -re(z)
        im(z) = 1.2 - dist
    }
    return z
}
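
For anyone who wants to try it, here is a small runnable Python sketch of the same fold plus a plain escape-time loop. The fold follows the pseudocode line by line; the starting value z = 0, the bailout radius of 10, and the name `escape_time` are my own assumptions, since the post does not say how the iteration is started or stopped.

```python
def fold(z: complex, limit: float = 1.2) -> complex:
    """Shift z back toward the square [-limit, limit]^2, flipping the other axis."""
    re, im = z.real, z.imag

    dist = abs(re) - limit
    if re > limit:
        re, im = -limit + dist, -im
    elif re < -limit:
        re, im = limit - dist, -im

    # As in the pseudocode, the second test uses the already updated values.
    dist = abs(im) - limit
    if im > limit:
        re, im = -re, -limit + dist
    elif im < -limit:
        re, im = -re, limit - dist

    return complex(re, im)


def escape_time(c: complex, max_iter: int = 100, bailout: float = 10.0) -> int:
    """Iterate z -> fold(z*z + c) from z = 0 (assumed) until |z| exceeds bailout (assumed)."""
    z = 0j
    for i in range(max_iter):
        z = fold(z * z + c)
        if abs(z) > bailout:
            return i
    return max_iter
```

Calling `escape_time(complex(x, y))` for each pixel gives an iteration count that can be mapped to a color. Note that the fold shifts a coordinate by 2.4 only once, so far-away points still escape.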

#fractal #fractalart #mathart #mandelbrot #distanceestimation #art

Inspired by https://en.wikipedia.org/wiki/Herman_ring#Herman_and_parabolic_basin

with a slightly different parameter 'a' and a more interesting coloring.

\(z_{n+1}=e^{2 \pi i t}z_n^3\frac{1-\bar{a}z_n}{z_n-a}\frac{1-\bar{b}z_n}{z_n-b}\)

with
\(t=0.6141866\)
\(a=0.25+0.008i\)
\(b=0.0405353-0.0255082i\)
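
A minimal Python sketch of one step of this map, using the parameter values above. The function names, the iteration cap, and the bailout radius in `orbit` are assumptions for illustration; the coloring used for the image is not reproduced here.

```python
import cmath

T = 0.6141866
A = 0.25 + 0.008j
B = 0.0405353 - 0.0255082j
ROTATION = cmath.exp(2j * cmath.pi * T)


def step(z: complex) -> complex:
    """One application of the map; raises ZeroDivisionError exactly at the poles z = A or z = B."""
    factor_a = (1 - A.conjugate() * z) / (z - A)
    factor_b = (1 - B.conjugate() * z) / (z - B)
    return ROTATION * z**3 * factor_a * factor_b


def orbit(z0: complex, max_iter: int = 200, bailout: float = 1e6) -> list:
    """Collect iterates of z0 until |z| exceeds bailout (assumed) or max_iter is reached."""
    points = [z0]
    z = z0
    for _ in range(max_iter):
        z = step(z)
        points.append(z)
        if abs(z) > bailout:
            break
    return points
```

Each of the two fractions has modulus 1 on the unit circle, so `step` maps the unit circle to itself, which is a quick sanity check for an implementation.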

#fractal #fractalart #juliaset #rationalfunction #escapetimefractals #rendering #distanceestimation


#Magnet #Mandelbrot set #fractal mashed up with #ThreeD #Triplex algebra (a la #Mandelbulb) rendered with #DualNumber #DistanceEstimation in #FragM fork of #Fragmentarium

My highly over-engineered extravagant framework of shaders including each other multiple times with different things defined (to emulate C++ templates with #GLSL function overloading without polymorphism) takes significantly longer to link the #shader than it does to render the #animation.

First attempts with typos gave 100k lines of cascaded errors in the shader info log, with which the Qt GUI list widget was Not Happy At All. Luckily the log went to stdout too, so I could pipe to a file and see the start where I missed a return statement or two.