18 research outputs found

    The k-mismatch problem revisited

    Get PDF
    We revisit the complexity of one of the most basic problems in pattern matching. In the k-mismatch problem we must compute the Hamming distance between a pattern of length m and every m-length substring of a text of length n, as long as that Hamming distance is at most k. Where the Hamming distance is greater than k at some alignment of the pattern and text, we simply output "No". We study this problem in both the standard offline setting and also as a streaming problem. In the streaming k-mismatch problem the text arrives one symbol at a time and we must give an output before processing any future symbols. Our main results are as follows: 1) Our first result is a deterministic O(nk2log⁥k/m+npolylogm)O(n k^2\log{k} / m+n \text{polylog} m) time offline algorithm for k-mismatch on a text of length n. This is a factor of k improvement over the fastest previous result of this form from SODA 2000 by Amihood Amir et al. 2) We then give a randomised and online algorithm which runs in the same time complexity but requires only O(k2polylogm)O(k^2\text{polylog} {m}) space in total. 3) Next we give a randomised (1+Ï”)(1+\epsilon)-approximation algorithm for the streaming k-mismatch problem which uses O(k2polylogm/Ï”2)O(k^2\text{polylog} m / \epsilon^2) space and runs in O(polylogm/Ï”2)O(\text{polylog} m / \epsilon^2) worst-case time per arriving symbol. 4) Finally we combine our new results to derive a randomised O(k2polylogm)O(k^2\text{polylog} {m}) space algorithm for the streaming k-mismatch problem which runs in O(klog⁥k+polylogm)O(\sqrt{k}\log{k} + \text{polylog} {m}) worst-case time per arriving symbol. This improves the best previous space complexity for streaming k-mismatch from FOCS 2009 by Benny Porat and Ely Porat by a factor of k. We also improve the time complexity of this previous result by an even greater factor to match the fastest known offline algorithm (up to logarithmic factors)

    Which Regular Expression Patterns are Hard to Match?

    Full text link
    Regular expressions constitute a fundamental notion in formal language theory and are frequently used in computer science to define search patterns. A classic algorithm for these problems constructs and simulates a non-deterministic finite automaton corresponding to the expression, resulting in an O(mn)O(mn) running time (where mm is the length of the pattern and nn is the length of the text). This running time can be improved slightly (by a polylogarithmic factor), but no significantly faster solutions are known. At the same time, much faster algorithms exist for various special cases of regular expressions, including dictionary matching, wildcard matching, subset matching, word break problem etc. In this paper, we show that the complexity of regular expression matching can be characterized based on its {\em depth} (when interpreted as a formula). Our results hold for expressions involving concatenation, OR, Kleene star and Kleene plus. For regular expressions of depth two (involving any combination of the above operators), we show the following dichotomy: matching and membership testing can be solved in near-linear time, except for "concatenations of stars", which cannot be solved in strongly sub-quadratic time assuming the Strong Exponential Time Hypothesis (SETH). For regular expressions of depth three the picture is more complex. Nevertheless, we show that all problems can either be solved in strongly sub-quadratic time, or cannot be solved in strongly sub-quadratic time assuming SETH. An intriguing special case of membership testing involves regular expressions of the form "a star of an OR of concatenations", e.g., [a∣ab∣bc]∗[a|ab|bc]^*. This corresponds to the so-called {\em word break} problem, for which a dynamic programming algorithm with a runtime of (roughly) O(nm)O(n\sqrt{m}) is known. We show that the latter bound is not tight and improve the runtime to O(nm0.44
)O(nm^{0.44\ldots})

    Secure Search via Multi-Ring Fully Homomorphic Encryption

    Get PDF
    Secure search is the problem of securely retrieving from a database table (or any unsorted array) the records matching specified attributes, as in SQL ``SELECT...WHERE...\u27\u27 queries, but where the database and the query are encrypted. Secure search has been the leading example for practical applications of Fully Homomorphic Encryption (FHE) since Gentry\u27s seminal work in 2009, attaining the desired properties of a single-round low-communication protocol with semantic security for database and query (even during search). Nevertheless, the wide belief was that the high computational overhead of current FHE candidates is too prohibitive in practice for secure search solutions (except for the restricted case of searching for a uniquely identified record as in SQL UNIQUE constrain and Private Information Retrieval). This is due to the high degree in existing solutions that is proportional at least to the number of database records m. We present the first algorithm for secure search that is realized by a polynomial of logarithmic degree (log m)^c for a small constant c>0. We implemented our algorithm in an open source library based on HElib, and ran experiments on Amazon\u27s EC2 cloud with up to 100 processors. Our experiments show that we can securely search to retrieve database records in a rate of searching in millions of database records in less than an hour on a single machine. We achieve our result by: (1) Designing a novel sketch that returns the first strictly-positive entry in a (not necessarily sparse) array of non-negative real numbers; this sketch may be of independent interest. (2) Suggesting a multi-ring evaluation of FHE -- instead of a single ring as in prior works -- and leveraging this to achieve an exponential reduction in the degree

    A survey on tree matching and XML retrieval

    Get PDF
    International audienceWith the increasing number of available XML documents, numerous approaches for retrieval have been proposed in the literature. They usually use the tree representation of documents and queries to process them, whether in an implicit or explicit way. Although retrieving XML documents can be considered as a tree matching problem between the query tree and the document trees, only a few approaches take advantage of the algorithms and methods proposed by the graph theory. In this paper, we aim at studying the theoretical approaches proposed in the literature for tree matching and at seeing how these approaches have been adapted to XML querying and retrieval, from both an exact and an approximate matching perspective. This study will allow us to highlight theoretical aspects of graph theory that have not been yet explored in XML retrieval

    Locality of Distributed Graph Problems

    Full text link
    Locality is one of the central themes in distributed computing. Suppose in a network each node only has direct communication with its local neighbors, how efficiently can a global task be solved? We aim to investigate the locality of fundamental distributed graph problems. Toward this goal, we consider the following three basic abstract models of distributed computing. ‱ LOCAL: each device has direct communication links with its neighbors, there is no message size constraint. ‱ CONGEST: each device has direct communication links with its neighbors, the size of each message is at most O(log n) bits. ‱ CONGESTED-CLIQUE: each device has direct communication links with all other devices, the size of each message is at most O(log n) bits. A brief summary of our results is as follows. 1. Complexity Theory for the LOCAL Model: We study the spectrum of natural problem complexities that can exist in the LOCAL model. We provide answers to the following fundamental questions regarding the nature of the LOCAL model: (i) How to classify the distributed problems according to their complexities? (ii) How much does randomness help? (iii) Can we solve more problems given more time? 2. Complexity of Distributed Coloring: The coloring problem is a classical and well-studied problem in distributed computing. We devise distributed algorithms for the edge-coloring problem and the vertex-coloring problem in the LOCAL model that improve upon the previous state of the art. 3. Bandwidth Constraint: We develop a new framework for algorithm design based on expander decompositions that allows us to apply CONGESTED-CLIQUE techniques to the CONGEST model. Using this approach, we provide improved algorithms for the triangle detection and enumeration problem in CONGEST.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/149872/1/cyijun_1.pd
    corecore