1 research outputs found
Various improvements to text fingerprinting
Let s = s_1 .. s_n be a text (or sequence) on a finite alphabet \Sigma of
size \sigma. A fingerprint in s is the set of distinct characters appearing in
one of its substrings. The problem considered here is to compute the set {\cal
F} of all fingerprints of all substrings of s in order to answer efficiently
certain questions on this set. A substring s_i .. s_j is a maximal location for
a fingerprint f in F (denoted by ) if the alphabet of s_i .. s_j is f and
s_{i-1}, s_{j+1}, if defined, are not in f. The set of maximal locations ins is
{\cal L} (it is easy to see that |{\cal L}| \leq n \sigma). Two maximal
locations and such that s_i .. s_j = s_k .. s_l are named {\em
copies}, and the quotient set of {\cal L} according to the copy relation is
denoted by {\cal L}_C. We present new exact and approximate efficient
algorithms and data structures for the following three problems: (1) to compute
{\cal F}; (2) given f as a set of distinct characters in \Sigma, to answer if f
represents a fingerprint in {\cal F}; (3) given f, to find all maximal
locations of f in s