Abstract

Fast similarity search is important for time-sensitive applications. These include both enterprise and web scenarios, where typos, misspellings, and noise must be removed efficiently in order to improve data quality or to find all information of interest to the user. This paper presents a new algorithm called Fast Similarity Search (FastSS) that performs an exhaustive similarity search in a dictionary, based on the edit distance model of string similarity. The algorithm uses deletions to model the edit distance. For a dictionary containing n words of average length m, and given a maximum number of spelling errors k, FastSS uses a deletion dictionary of size O(nm^k). At search time each query is mutated to generate a deletion neighborhood of size O(m^k), which is compared to the indexed deletion dictionary. Because a deletion neighborhood is smaller than a neighborhood built from deletions, insertions, and replacements, this contributes to a faster search. FastSS looks up misspellings in a time which is independent of n fo
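
The deletion-based lookup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`deletion_neighborhood`, `build_index`, `search`) and the sample word list are hypothetical, and the final filtering step uses a standard dynamic-programming edit distance as an assumed stand-in for whatever verification the paper itself uses. Filtering is needed because two words that share a deletion variant are only guaranteed to be within edit distance 2k of each other, not k.

```python
from collections import defaultdict
from itertools import combinations


def deletion_neighborhood(word, k):
    """Return every string obtained by deleting at most k characters from word."""
    variants = set()
    for d in range(min(k, len(word)) + 1):
        for positions in combinations(range(len(word)), d):
            skip = set(positions)
            variants.add("".join(c for i, c in enumerate(word) if i not in skip))
    return variants


def edit_distance(a, b):
    """Standard Levenshtein distance (assumed verification step, not from the paper)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or replace
        prev = cur
    return prev[-1]


def build_index(words, k):
    """Map each deletion variant to the dictionary words that produce it."""
    index = defaultdict(set)
    for w in words:
        for v in deletion_neighborhood(w, k):
            index[v].add(w)
    return index


def search(index, query, k):
    """Return the dictionary words within edit distance k of query."""
    candidates = set()
    for v in deletion_neighborhood(query, k):
        candidates |= index.get(v, set())
    # Shared deletion variants only bound the distance by 2k, so verify each one.
    return sorted(w for w in candidates if edit_distance(query, w) <= k)


# Hypothetical sample dictionary, indexed for at most one spelling error.
index = build_index(["test", "text", "tent", "toast", "best"], k=1)
print(search(index, "tesst", 1))  # ['test']
print(search(index, "tet", 1))    # ['tent', 'test', 'text']
```

In this sketch the query is expanded only through deletions, so the number of index probes depends on the query length and k rather than on the number of dictionary words, which is the property the abstract attributes to the deletion neighborhood.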
