N-gram similarity and distance
In many applications, it is necessary to algorithmically quantify the
similarity exhibited by two strings composed of symbols from a finite
alphabet. Numerous string similarity measures have been proposed.
Particularly well-known measures are based are edit distance
and the length of the longest common subsequence.
We develop a notion of n-gram similarity and distance.
We show that
edit distance and the length of the longest common subsequence are
special cases of n-gram distance and similarity, respectively.
We provide formal, recursive definitions of n-gram similarity and distance,
together with efficient algorithms for computing them.
We formulate a family of word similarity measures based on n-grams,
and report the results of experiments that suggest that the new measures
outperform their unigram equivalents.