Abstract

B. Stiller; Ela Hunt; Thomas Bocek

Fast similarity search is important for time-sensitive applications. Those include both enterprise and web scenarios, where typos, misspellings, and noise need to be removed in an efficient way, in order to improve data quality, or to find all information of interest to the user. This paper presents a new algorithm called Fast Similarity Search (FastSS) that performs an exhaustive similarity search in a dictionary, based on the edit distance model of string similarity. The algorithm uses deletions to model the edit distance. For a dictionary containing n words of average length m, and given a maximum number of spelling errors k, FastSS uses a deletion dictionary of size O(nm^k). At search time each query is mutated to generate a deletion neighborhood of size O(m^k), which is compared to the indexed deletion dictionary. As a deletion neighborhood is smaller than a neighborhood using deletions, insertions and replacements, this contributes to a faster search. FastSS looks up misspellings in a time which is independent of n fo
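The deletion-dictionary idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`deletion_neighborhood`, `build_index`, `search`, `edit_distance`) are hypothetical, and the final verification step with a standard Levenshtein distance is an assumption added here, because two words that share a deletion variant can be up to 2k edits apart.

```python
from collections import defaultdict
from itertools import combinations


def deletion_neighborhood(word, k):
    """Return all strings obtained by deleting at most k characters from word."""
    variants = {word}
    for d in range(1, min(k, len(word)) + 1):
        for positions in combinations(range(len(word)), d):
            drop = set(positions)
            variants.add("".join(c for i, c in enumerate(word) if i not in drop))
    return variants


def build_index(words, k):
    """Map every deletion variant to the dictionary words that produce it."""
    index = defaultdict(set)
    for w in words:
        for v in deletion_neighborhood(w, k):
            index[v].add(w)
    return index


def edit_distance(a, b):
    """Standard Levenshtein distance (assumed verification step, not from the abstract)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # replace or match
        prev = cur
    return prev[-1]


def search(index, query, k):
    """Look up the query's deletion variants, then keep words within k edits."""
    candidates = set()
    for v in deletion_neighborhood(query, k):
        candidates |= index.get(v, set())
    return {w for w in candidates if edit_distance(query, w) <= k}


if __name__ == "__main__":
    words = ["test", "text", "toast", "rest", "banana"]
    print(sorted(search(build_index(words, 2), "tesst", 2)))
```

The number of index lookups per query depends only on the size of the query's deletion neighborhood, not on the number of dictionary words, which is the property the abstract highlights. The candidate filter is needed only because the sketch indexes plain deleted strings without recording which positions were removed.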

Available Versions

CiteSeerX

oai:CiteSeerX.psu:10.1.1.90.73

Last time updated on 22/10/2014