Search CORE

10 research outputs found

The streaming $k$-mismatch problem

Author: Clifford Raphaël
Kociumaka Tomasz
Porat Ely
Publication date: 09/04/2018

We consider the streaming complexity of a fundamental task in approximate pattern matching: the $k$-mismatch problem. It asks to compute Hamming distances between a pattern of length $n$ and all length-$n$ substrings of a text for which the Hamming distance does not exceed a given threshold $k$. In our problem formulation, we report not only the Hamming distance but also, on demand, the full mismatch information, that is, the list of mismatched pairs of symbols and their indices. The twin challenges of streaming pattern matching derive from the need both to achieve small working space and also to guarantee that every arriving input symbol is processed quickly. We present a streaming algorithm for the $k$-mismatch problem which uses $O(k\log{n}\log\frac{n}{k})$ bits of space and spends \ourcomplexity time on each symbol of the input stream, which consists of the pattern followed by the text. The running time almost matches the classic offline solution and the space usage is within a logarithmic factor of optimal. Our new algorithm therefore effectively resolves and also extends an open problem first posed in FOCS'09. En route to this solution, we also give a deterministic $O(k(\log\frac{n}{k} + \log|\Sigma|))$-bit encoding of all the alignments with Hamming distance at most $k$ of a length-$n$ pattern within a text of length $O(n)$. This secondary result provides an optimal solution to a natural communication complexity problem which may be of independent interest. Comment: 27 pages
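To make the problem statement above concrete, here is a minimal offline sketch of what the $k$-mismatch problem asks for. It is a naive quadratic-time reference, not the streaming algorithm of the paper, and the function name `k_mismatch` and its output format are assumptions chosen for illustration.

```python
def k_mismatch(pattern, text, k):
    """Naive offline reference for the k-mismatch problem.

    For every alignment whose Hamming distance is at most k, report
    (start, distance, mismatches), where mismatches lists
    (pattern index, pattern symbol, text symbol).
    Hypothetical helper for illustration; uses O(n) space per alignment,
    unlike the small-space streaming algorithm described in the abstract.
    """
    n = len(pattern)
    results = []
    for start in range(len(text) - n + 1):
        mismatches = []
        for i in range(n):
            if pattern[i] != text[start + i]:
                mismatches.append((i, pattern[i], text[start + i]))
                if len(mismatches) > k:
                    break  # threshold exceeded, skip this alignment
        if len(mismatches) <= k:
            results.append((start, len(mismatches), mismatches))
    return results


print(k_mismatch("abc", "abcabd", 1))
```

Alignments whose distance exceeds the threshold are dropped entirely, which mirrors the requirement that distances are reported only when they do not exceed $k$.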

arXiv.org e-Print Archive

Crossref

Explore Bristol Research

For-all Sparse Recovery in Near-optimal Time

Author: Gilbert A.
Li Y.
Porat E.
Strauss M.
Publication date: 07/02/2014

An approximate sparse recovery system in $\ell_1$ norm consists of parameters $k$, $\epsilon$, $N$, an $m$-by-$N$ measurement matrix $\Phi$, and a recovery algorithm $\mathcal{R}$. Given a vector $\mathbf{x}$, the system approximates $\mathbf{x}$ by $\widehat{\mathbf{x}} = \mathcal{R}(\Phi\mathbf{x})$, which must satisfy $\|\widehat{\mathbf{x}}-\mathbf{x}\|_1 \leq (1+\epsilon)\|\mathbf{x}-\mathbf{x}_k\|_1$. We consider the 'for all' model, in which a single matrix $\Phi$, possibly 'constructed' non-explicitly using the probabilistic method, is used for all signals $\mathbf{x}$. The best existing sublinear algorithm by Porat and Strauss (SODA'12) uses $O(\epsilon^{-3} k\log(N/k))$ measurements and runs in time $O(k^{1-\alpha}N^\alpha)$ for any constant $\alpha > 0$. In this paper, we improve the number of measurements to $O(\epsilon^{-2} k \log(N/k))$, matching the best existing upper bound (attained by super-linear algorithms), and the runtime to $O(k^{1+\beta}\textrm{poly}(\log N,1/\epsilon))$, with a modest restriction that $\epsilon \leq (\log k/\log N)^{\gamma}$, for any constants $\beta,\gamma > 0$. When $k\leq \log^c N$ for some $c>0$, the runtime is reduced to $O(k\textrm{poly}(N,1/\epsilon))$. With no restrictions on $\epsilon$, we have an approximation recovery system with $m = O(k/\epsilon \log(N/k)((\log N/\log k)^\gamma + 1/\epsilon))$ measurements.
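The $\ell_1/\ell_1$ guarantee stated above can be checked numerically for a given signal and candidate reconstruction. The sketch below assumes $\mathbf{x}_k$ denotes the best $k$-term approximation of $\mathbf{x}$ (its $k$ largest-magnitude entries, the rest set to zero); the helper names are hypothetical, and the code only verifies the inequality, it does not implement the recovery algorithm $\mathcal{R}$.

```python
def best_k_term(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [x[i] if i in keep else 0 for i in range(len(x))]


def l1_norm(v):
    """Sum of absolute values of the entries of v."""
    return sum(abs(a) for a in v)


def satisfies_guarantee(x, x_hat, k, eps):
    """Check ||x_hat - x||_1 <= (1 + eps) * ||x - x_k||_1."""
    x_k = best_k_term(x, k)
    error = l1_norm([a - b for a, b in zip(x_hat, x)])
    tail = l1_norm([a - b for a, b in zip(x, x_k)])
    return error <= (1 + eps) * tail


x = [5, 1, -3, 2]
print(satisfies_guarantee(x, [5, 0, -3, 0], k=2, eps=0.5))
print(satisfies_guarantee(x, [0, 0, 0, 0], k=2, eps=0.5))
```

Returning exactly the two largest entries meets the bound here, while the all-zero reconstruction does not, because its error includes the heavy entries as well as the tail.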

MPG.PuRe

Approximating Properties of Data Streams

Author: Zhou Samson
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2018

In this dissertation, we present algorithms that approximate properties in the data stream model, where elements of an underlying data set arrive sequentially, but algorithms must use space sublinear in the size of the underlying data set. We first study the problem of finding all k-periods of a length-n string S, presented as a data stream. S is said to have k-period p if its prefix of length n − p differs from its suffix of length n − p in at most k locations. We give algorithms to compute the k-periods of a string S using poly(k, log n) bits of space and we complement these results with comparable lower bounds. We then study the problem of identifying a longest substring of strings S and T of length n that forms a d-near-alignment under the edit distance, in the simultaneous streaming model. In this model, symbols of strings S and T are streamed at the same time and form a d-near-alignment if the distance between them in some given metric is at most d. We give several algorithms, including an exact one-pass algorithm that uses O(d^2 + d log n) bits of space. We then consider the distinct elements and ℓ_p-heavy hitters problems in the sliding window model, where only the most recent n elements in the data stream form the underlying set. We first introduce the composable histogram, a simple twist on the exponential (Datar et al., SODA 2002) and smooth histograms (Braverman and Ostrovsky, FOCS 2007) that may be of independent interest. We then show that the composable histogram, along with a careful combination of existing techniques to track either the identity or frequency of a few specific items, suffices to obtain algorithms for both distinct elements and ℓ_p-heavy hitters that are nearly optimal in both n and c. Finally, we consider the problem of estimating the maximum weighted matching of a graph whose edges are revealed in a streaming fashion. We develop a reduction from the maximum weighted matching problem to the maximum cardinality matching problem that only doubles the approximation factor of a streaming algorithm developed for the maximum cardinality matching problem. As an application, we obtain an estimator for the weight of a maximum weighted matching in bounded-arboricity graphs and, in particular, a (48 + ε)-approximation estimator for the weight of a maximum weighted matching in planar graphs.
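The k-period definition given in this abstract translates directly into a small check. The following is a minimal sketch that stores the whole string, so it illustrates the definition only and not the poly(k, log n)-space streaming algorithms of the dissertation; the function names are assumptions.

```python
def has_k_period(s, p, k):
    """True if the prefix of length n - p and the suffix of length n - p
    of s differ in at most k positions (the k-period definition)."""
    n = len(s)
    prefix, suffix = s[: n - p], s[p:]
    return sum(a != b for a, b in zip(prefix, suffix)) <= k


def k_periods(s, k):
    """All p in 1..n-1 such that s has k-period p (naive quadratic scan)."""
    return [p for p in range(1, len(s)) if has_k_period(s, p, k)]


print(has_k_period("abcabcabd", 3, 0))
print(has_k_period("abcabcabd", 3, 1))
print(k_periods("abab", 0))
```

With k = 0 this reduces to the usual exact notion of a period, which is why "abab" has the single 0-period 2.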

Purdue E-Pubs

From coding theory to efficient pattern matching

Author: Amir Rothschild
Clifford R
Ely Porat
Klim Efremenko
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2009

Explore Bristol Research

From coding theory to efficient pattern matching

Author: Amir Rothschild
Ely Porat
Klim Efremenko
Raphaël Clifford
Publication date: 19/05/2012

We consider the classic problem of pattern matching with few mismatches in the presence of promiscuously matching wildcard symbols. Given a text t of length n and a pattern p of length m with optional wildcard symbols and a bound k, our algorithm finds all the alignments for which the pattern matches the text with Hamming distance at most k and also returns the location and identity of each mismatch. The algorithm we present is deterministic and runs in Õ(kn) time, matching the best known randomised time complexity to within logarithmic factors. The solutions we develop borrow from the tool set of algebraic coding theory and provide a new framework in which to tackle approximate pattern matching problems.
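The input/output behaviour described in this abstract can be illustrated with a brute-force reference. This sketch runs in O(nm) time rather than the Õ(kn) of the paper and uses no coding theory; the wildcard character `?`, the choice to let wildcards match in either string, and the function name are all assumptions made for illustration.

```python
WILDCARD = "?"  # hypothetical wildcard symbol; matches any character


def k_mismatch_wildcards(pattern, text, k):
    """Report every alignment with at most k mismatches, ignoring
    positions where either symbol is a wildcard.

    Returns a list of (start, mismatches), where mismatches lists
    (text index, pattern symbol, text symbol).
    """
    m = len(pattern)
    results = []
    for start in range(len(text) - m + 1):
        mismatches = [
            (start + i, pattern[i], text[start + i])
            for i in range(m)
            if pattern[i] != WILDCARD
            and text[start + i] != WILDCARD
            and pattern[i] != text[start + i]
        ]
        if len(mismatches) <= k:
            results.append((start, mismatches))
    return results


print(k_mismatch_wildcards("a?c", "abcaxd", 1))
```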

CiteSeerX

Crossref