10,864 research outputs found

    Querying and Merging Heterogeneous Data by Approximate Joins on Higher-Order Terms

    Get PDF

    A practical index for approximate dictionary matching with few mismatches

    Get PDF
    Approximate dictionary matching is a classic string matching problem (checking if a query string occurs in a collection of strings) with applications in, e.g., spellchecking, online catalogs, geolocation, and web searchers. We present a surprisingly simple solution called a split index, which is based on the Dirichlet principle, for matching a keyword with few mismatches, and experimentally show that it offers competitive space-time tradeoffs. Our implementation in the C++ language is focused mostly on data compaction, which is beneficial for the search speed (e.g., by being cache friendly). We compare our solution with other algorithms and we show that it performs better for the Hamming distance. Query times in the order of 1 microsecond were reported for one mismatch for the dictionary size of a few megabytes on a medium-end PC. We also demonstrate that a basic compression technique consisting in qq-gram substitution can significantly reduce the index size (up to 50% of the input text size for the DNA), while still keeping the query time relatively low

    String Indexing for Patterns with Wildcards

    Get PDF
    We consider the problem of indexing a string tt of length nn to report the occurrences of a query pattern pp containing mm characters and jj wildcards. Let occocc be the number of occurrences of pp in tt, and Ļƒ\sigma the size of the alphabet. We obtain the following results. - A linear space index with query time O(m+Ļƒjlogā”logā”n+occ)O(m+\sigma^j \log \log n + occ). This significantly improves the previously best known linear space index by Lam et al. [ISAAC 2007], which requires query time Ī˜(jn)\Theta(jn) in the worst case. - An index with query time O(m+j+occ)O(m+j+occ) using space O(Ļƒk2nlogā”klogā”n)O(\sigma^{k^2} n \log^k \log n), where kk is the maximum number of wildcards allowed in the pattern. This is the first non-trivial bound with this query time. - A time-space trade-off, generalizing the index by Cole et al. [STOC 2004]. We also show that these indexes can be generalized to allow variable length gaps in the pattern. Our results are obtained using a novel combination of well-known and new techniques, which could be of independent interest

    Upper and lower bounds for dynamic data structures on strings

    Get PDF
    We consider a range of simply stated dynamic data structure problems on strings. An update changes one symbol in the input and a query asks us to compute some function of the pattern of length mm and a substring of a longer text. We give both conditional and unconditional lower bounds for variants of exact matching with wildcards, inner product, and Hamming distance computation via a sequence of reductions. As an example, we show that there does not exist an O(m1/2āˆ’Īµ)O(m^{1/2-\varepsilon}) time algorithm for a large range of these problems unless the online Boolean matrix-vector multiplication conjecture is false. We also provide nearly matching upper bounds for most of the problems we consider.Comment: Accepted at STACS'1

    siEDM: an efficient string index and search algorithm for edit distance with moves

    Full text link
    Although several self-indexes for highly repetitive text collections exist, developing an index and search algorithm with editing operations remains a challenge. Edit distance with moves (EDM) is a string-to-string distance measure that includes substring moves in addition to ordinal editing operations to turn one string into another. Although the problem of computing EDM is intractable, it has a wide range of potential applications, especially in approximate string retrieval. Despite the importance of computing EDM, there has been no efficient method for indexing and searching large text collections based on the EDM measure. We propose the first algorithm, named string index for edit distance with moves (siEDM), for indexing and searching strings with EDM. The siEDM algorithm builds an index structure by leveraging the idea behind the edit sensitive parsing (ESP), an efficient algorithm enabling approximately computing EDM with guarantees of upper and lower bounds for the exact EDM. siEDM efficiently prunes the space for searching query strings by the proposed method, which enables fast query searches with the same guarantee as ESP. We experimentally tested the ability of siEDM to index and search strings on benchmark datasets, and we showed siEDM's efficiency.Comment: 23 page
    • ā€¦
    corecore