5 research outputs found

    A practical index for approximate dictionary matching with few mismatches

    Get PDF
    Approximate dictionary matching is a classic string matching problem (checking if a query string occurs in a collection of strings) with applications in, e.g., spellchecking, online catalogs, geolocation, and web searchers. We present a surprisingly simple solution called a split index, which is based on the Dirichlet principle, for matching a keyword with few mismatches, and experimentally show that it offers competitive space-time tradeoffs. Our implementation in the C++ language is focused mostly on data compaction, which is beneficial for the search speed (e.g., by being cache friendly). We compare our solution with other algorithms and we show that it performs better for the Hamming distance. Query times in the order of 1 microsecond were reported for one mismatch for the dictionary size of a few megabytes on a medium-end PC. We also demonstrate that a basic compression technique consisting in qq-gram substitution can significantly reduce the index size (up to 50% of the input text size for the DNA), while still keeping the query time relatively low

    Simple, compact and robust approximate string dictionary

    Full text link
    This paper is concerned with practical implementations of approximate string dictionaries that allow edit errors. In this problem, we have as input a dictionary DD of dd strings of total length nn over an alphabet of size σ\sigma. Given a bound kk and a pattern xx of length mm, a query has to return all the strings of the dictionary which are at edit distance at most kk from xx, where the edit distance between two strings xx and yy is defined as the minimum-cost sequence of edit operations that transform xx into yy. The cost of a sequence of operations is defined as the sum of the costs of the operations involved in the sequence. In this paper, we assume that each of these operations has unit cost and consider only three operations: deletion of one character, insertion of one character and substitution of a character by another. We present a practical implementation of the data structure we recently proposed and which works only for one error. We extend the scheme to 2≤k<m2\leq k<m. Our implementation has many desirable properties: it has a very fast and space-efficient building algorithm. The dictionary data structure is compact and has fast and robust query time. Finally our data structure is simple to implement as it only uses basic techniques from the literature, mainly hashing (linear probing and hash signatures) and succinct data structures (bitvectors supporting rank queries).Comment: Accepted to a journal (19 pages, 2 figures

    Efficient Error-Correcting Geocoding

    Get PDF
    We study the problem of resolving a perhaps misspelled address of a location into geographic coordinates of latitude and longitude. Our data structure solves this problem within a few milliseconds even for misspelled and fragmentary queries. Compared to major geographic search engines such as Google or Bing we achieve results of significantly better quality

    Building Blocks for Mapping Services

    Get PDF
    Mapping services are ubiquitous on the Internet. These services enjoy a considerable user base. But it is often overlooked that providing a service on a global scale with virtually millions of users has been the playground of an oligopoly of a select few service providers are able to do so. Unfortunately, the literature on these solutions is more than scarce. This thesis adds a number of building blocks to the literature that explain how to design and implement a number of features
    corecore