    Approximating Approximate Pattern Matching

    Given a text $T$ of length $n$ and a pattern $P$ of length $m$, the approximate pattern matching problem asks for computation of a particular distance function between $P$ and every $m$-substring of $T$. We consider a $(1\pm\varepsilon)$ multiplicative approximation variant of this problem, for the $\ell_p$ distance function. In this paper, we describe two $(1+\varepsilon)$-approximate algorithms with a runtime of $\widetilde{O}(\frac{n}{\varepsilon})$ for all (constant) non-negative values of $p$. For constant $p \ge 1$ we show a deterministic $(1+\varepsilon)$-approximation algorithm. Previously, such a runtime was known only for the case of $\ell_1$ distance, by Gawrychowski and Uznański [ICALP 2018], and only with a randomized algorithm. For constant $0 \le p \le 1$ we show a randomized algorithm for the $\ell_p$ distance, thereby providing a smooth tradeoff between the algorithms of Kopelowitz and Porat [FOCS 2015, SOSA 2018] for Hamming distance (the case $p=0$) and of Gawrychowski and Uznański for $\ell_1$ distance.
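    To make the problem statement above concrete, here is a minimal sketch of the exact quadratic-time baseline that the approximate algorithms improve on. It is not the method of the paper. The function name `lp_distances`, the integer inputs, the convention $(\sum_j |T_{i+j} - P_j|^p)^{1/p}$ for $p > 0$, and the treatment of $p = 0$ as the mismatch count are all assumptions made for illustration.

```python
def lp_distances(text, pattern, p):
    """Exact l_p distance between `pattern` and every m-substring of `text`.

    Naive O(n * m) baseline (hypothetical helper, not from the paper).
    Assumes numeric symbols; p == 0 is taken to mean Hamming distance.
    """
    n, m = len(text), len(pattern)
    distances = []
    for i in range(n - m + 1):
        if p == 0:
            # Count positions where the window and the pattern differ.
            d = float(sum(1 for j in range(m) if text[i + j] != pattern[j]))
        else:
            # Sum of p-th powers of absolute differences, then the p-th root.
            s = sum(abs(text[i + j] - pattern[j]) ** p for j in range(m))
            d = float(s) ** (1.0 / p)
        distances.append(d)
    return distances


if __name__ == "__main__":
    print(lp_distances([1, 2, 3, 4], [1, 2], 1))
    print(lp_distances([1, 2, 3, 4], [1, 2], 0))
```

    A $(1+\varepsilon)$-approximate algorithm would be allowed to return, for each alignment, any value within a $(1\pm\varepsilon)$ factor of the corresponding entry of this list.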

    Pattern Matching in Multiple Streams

    We investigate the problem of deterministic pattern matching in multiple streams. In this model, one symbol arrives at a time and is associated with one of s streaming texts. The task at each time step is to report if there is a new match between a fixed pattern of length m and a newly updated stream. As is usual in the streaming context, the goal is to use as little space as possible while still reporting matches quickly. We give almost matching upper and lower space bounds for three distinct pattern matching problems. For exact matching we show that the problem can be solved in constant time per arriving symbol and O(m+s) words of space. For the k-mismatch and k-difference problems we give O(k) time solutions that require O(m+ks) words of space. In all three cases we also give space lower bounds which show our methods are optimal up to a single logarithmic factor. Finally we set out a number of open problems related to this new model for pattern matching. Comment: 13 pages, 1 figure

    Generalised Pattern Matching Revisited

    In the problem of Generalised Pattern Matching (GPM) [STOC'94, Muthukrishnan and Palem], we are given a text $T$ of length $n$ over an alphabet $\Sigma_T$, a pattern $P$ of length $m$ over an alphabet $\Sigma_P$, and a matching relationship $\subseteq \Sigma_T \times \Sigma_P$, and must return all substrings of $T$ that match $P$ (reporting) or the number of mismatches between each substring of $T$ of length $m$ and $P$ (counting). In this work, we improve over all previously known algorithms for this problem for various parameters describing the input instance: $\mathcal{D}$, the maximum number of characters that match a fixed character; $\mathcal{S}$, the number of pairs of matching characters; and $\mathcal{I}$, the total number of disjoint intervals of characters that match the $m$ characters of the pattern $P$. At the heart of our new deterministic upper bounds for $\mathcal{D}$ and $\mathcal{S}$ lies a faster construction of superimposed codes, which solves an open problem posed in [FOCS'97, Indyk] and can be of independent interest. To conclude, we demonstrate the first lower bounds for GPM. We start by showing that any deterministic or Monte Carlo algorithm for GPM must use $\Omega(\mathcal{S})$ time, and then proceed to show higher lower bounds for combinatorial algorithms. These bounds show that our algorithms are almost optimal, unless a radically new approach is developed.
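    As an illustration of the counting variant defined above, the following is a naive sketch, not one of the algorithms from the paper. The function name `gpm_mismatch_counts` and the representation of the matching relationship as a Python set of (text character, pattern character) pairs are assumptions chosen for the example.

```python
def gpm_mismatch_counts(text, pattern, matches):
    """Counting variant of GPM by brute force, in O(n * m) time.

    `matches` is a set of (text_char, pattern_char) pairs that are
    considered matching (hypothetical representation of the relation).
    Returns, for each length-m substring of `text`, the number of
    positions whose character pair is not in `matches`.
    """
    n, m = len(text), len(pattern)
    counts = []
    for i in range(n - m + 1):
        mismatches = sum(
            1 for j in range(m) if (text[i + j], pattern[j]) not in matches
        )
        counts.append(mismatches)
    return counts


if __name__ == "__main__":
    relation = {("a", "x"), ("b", "y"), ("b", "x")}
    print(gpm_mismatch_counts("abab", "xy", relation))
```

    In this representation the size of the set corresponds to the parameter $\mathcal{S}$, and the reporting variant is the set of positions whose count is zero.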

    Topics in combinatorial pattern matching


    Linear pattern matching on sparse suffix trees

    Packing several characters into one computer word is a simple and natural way to compress the representation of a string and to speed up its processing. Exploiting this idea, we propose an index for a packed string, based on a sparse suffix tree [KU-96] with appropriately defined suffix links. Assuming, under the standard unit-cost RAM model, that a word can store up to $\log_{\sigma}n$ characters ($\sigma$ the alphabet size), our index takes $O(n/\log_{\sigma}n)$ space, i.e. the same space as the packed string itself. The resulting pattern matching algorithm runs in time $O(m+r^2+r\cdot occ)$, where $m$ is the length of the pattern, $r$ is the actual number of characters stored in a word and $occ$ is the number of pattern occurrences.

    Data Structure Lower Bounds for Document Indexing Problems

    We study data structure problems related to document indexing and pattern matching queries. Our main contribution is to show that the pointer machine model of computation can be extremely useful in proving high and unconditional lower bounds that cannot be obtained in any other known model of computation with the current techniques. Often our lower bounds match the known space-query time trade-off curve, and in fact for all the problems considered there is a very good and reasonable match between our lower bounds and the known upper bounds, at least for some choice of input parameters. The problems that we consider are set intersection queries (both the reporting variant and the semi-group counting variant), indexing a set of documents for two-pattern queries, forbidden-pattern queries, or queries with wild-cards, and indexing an input set of gapped-patterns (or two-patterns) to find those matching a document given at the query time. Comment: Full version of the conference version that appeared at ICALP 2016, 25 pages