
    Which Regular Expression Patterns are Hard to Match?

    Regular expressions constitute a fundamental notion in formal language theory and are frequently used in computer science to define search patterns. A classic algorithm for matching and membership testing constructs and simulates a non-deterministic finite automaton corresponding to the expression, resulting in an $O(mn)$ running time (where $m$ is the length of the pattern and $n$ is the length of the text). This running time can be improved slightly (by a polylogarithmic factor), but no significantly faster solutions are known. At the same time, much faster algorithms exist for various special cases of regular expressions, including dictionary matching, wildcard matching, subset matching, the word break problem, etc. In this paper, we show that the complexity of regular expression matching can be characterized based on its {\em depth} (when interpreted as a formula). Our results hold for expressions involving concatenation, OR, Kleene star and Kleene plus. For regular expressions of depth two (involving any combination of the above operators), we show the following dichotomy: matching and membership testing can be solved in near-linear time, except for "concatenations of stars", which cannot be solved in strongly sub-quadratic time assuming the Strong Exponential Time Hypothesis (SETH). For regular expressions of depth three the picture is more complex. Nevertheless, we show that all problems can either be solved in strongly sub-quadratic time, or cannot be solved in strongly sub-quadratic time assuming SETH. An intriguing special case of membership testing involves regular expressions of the form "a star of an OR of concatenations", e.g., $[a|ab|bc]^*$. This corresponds to the so-called {\em word break} problem, for which a dynamic programming algorithm with a runtime of (roughly) $O(n\sqrt{m})$ is known. We show that the latter bound is not tight and improve the runtime to $O(nm^{0.44\ldots})$.
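    As a point of reference for the word break problem mentioned above, the sketch below shows the classic dynamic programming approach to membership testing for an expression of the form $[w_1|w_2|\cdots|w_k]^*$ (a star of an OR of concatenations). It is a minimal illustration only; the function name and example dictionary are assumptions, and the paper's improved $O(nm^{0.44\ldots})$ bound requires considerably more machinery than what is shown here.

```python
def word_break(text: str, words: set[str]) -> bool:
    """Classic DP for the word break problem: can `text` be written as a
    concatenation of strings drawn (with repetition) from `words`?
    This is membership testing for the expression (w1|w2|...|wk)*.
    Cost is roughly O(n * L) substring lookups, where n = len(text) and
    L is the longest dictionary word; the paper's faster bounds are not
    obtained this way."""
    n = len(text)
    max_len = max((len(w) for w in words), default=0)
    reachable = [False] * (n + 1)   # reachable[i]: text[:i] is expressible
    reachable[0] = True
    for i in range(1, n + 1):
        # Only suffixes no longer than the longest dictionary word can match.
        for j in range(max(0, i - max_len), i):
            if reachable[j] and text[j:i] in words:
                reachable[i] = True
                break
    return reachable[n]


print(word_break("aabbc", {"a", "ab", "bc"}))  # True: "a" + "ab" + "bc"
print(word_break("acb", {"a", "ab", "bc"}))    # False
```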

    Nearly Optimal Deterministic Algorithm for Sparse Walsh-Hadamard Transform

    For every fixed constant $\alpha > 0$, we design an algorithm for computing the $k$-sparse Walsh-Hadamard transform of an $N$-dimensional vector $x \in \mathbb{R}^N$ in time $k^{1+\alpha} (\log N)^{O(1)}$. Specifically, the algorithm is given query access to $x$ and computes a $k$-sparse $\tilde{x} \in \mathbb{R}^N$ satisfying $\|\tilde{x} - \hat{x}\|_1 \leq c \|\hat{x} - H_k(\hat{x})\|_1$, for an absolute constant $c > 0$, where $\hat{x}$ is the transform of $x$ and $H_k(\hat{x})$ is its best $k$-sparse approximation. Our algorithm is fully deterministic and only uses non-adaptive queries to $x$ (i.e., all queries are determined and performed in parallel when the algorithm starts). An important technical tool that we use is a construction of nearly optimal and linear lossless condensers which is a careful instantiation of the GUV condenser (Guruswami, Umans, Vadhan, JACM 2009). Moreover, we design a deterministic and non-adaptive $\ell_1/\ell_1$ compressed sensing scheme based on general lossless condensers that is equipped with a fast reconstruction algorithm running in time $k^{1+\alpha} (\log N)^{O(1)}$ (for the GUV-based condenser) and is of independent interest. Our scheme significantly simplifies and improves an earlier expander-based construction due to Berinde, Gilbert, Indyk, Karloff, Strauss (Allerton 2008). Our methods use linear lossless condensers in a black box fashion; therefore, any future improvement on explicit constructions of such condensers would immediately translate to improved parameters in our framework (potentially leading to $k (\log N)^{O(1)}$ reconstruction time with a reduced exponent in the poly-logarithmic factor, and eliminating the extra parameter $\alpha$). Finally, by allowing the algorithm to use randomness, while still using non-adaptive queries, the running time of the algorithm can be improved to $\tilde{O}(k \log^3 N)$.
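    To make the approximation guarantee concrete, the sketch below computes a dense Walsh-Hadamard transform in $O(N \log N)$ time and then takes the best $k$-sparse approximation $H_k(\hat{x})$. This is only a reference baseline under an assumed unnormalised transform convention; it is not the paper's sublinear-time algorithm, which reads only a small number of queried entries of $x$.

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Fast Walsh-Hadamard transform, O(N log N) for N a power of two.
    Normalisation conventions vary; this uses the unnormalised +/-1 transform."""
    x = x.astype(float).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):      # butterfly over blocks of size 2h
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def best_k_sparse(v: np.ndarray, k: int) -> np.ndarray:
    """H_k(v): keep the k largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

# Dense baseline: transform the whole vector, then sparsify.
x = np.random.randn(16)
xhat = fwht(x)
print(best_k_sparse(xhat, 3))
```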

    Parallel Processing of Large Graphs

    Ever larger data collections are gathered worldwide in various IT systems. Many of them are networked in nature and need to be processed and analysed as graph structures. Due to their size, they very often require parallel processing for efficient computation. Three parallel techniques are compared in this paper: MapReduce, its map-side-join extension, and Bulk Synchronous Parallel (BSP). They are implemented for two different graph problems: calculation of single-source shortest paths (SSSP) and collective classification of graph nodes by means of relational influence propagation (RIP). The methods and algorithms are applied to several network datasets differing in size and structural profile, originating from three domains: telecommunication, multimedia and microblog. The results reveal that iterative graph processing with the BSP implementation consistently and significantly outperforms MapReduce, by up to a factor of 10, especially for algorithms with many iterations and sparse communication. The MapReduce extension based on map-side joins also usually delivers noticeably better efficiency, although not as much as BSP. Nevertheless, MapReduce remains a good alternative for enormous networks whose data structures do not fit in local memory. Comment: Preprint submitted to Future Generation Computer Systems.
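    The following toy, single-process sketch illustrates the BSP (Pregel-style) superstep pattern for SSSP discussed above: vertex state persists across supersteps and only messages cross the synchronisation barrier, which is what avoids the per-iteration job setup and data shuffling that MapReduce pays. The graph layout and function names are illustrative assumptions, not the paper's distributed implementation.

```python
from collections import defaultdict

def bsp_sssp(edges, source):
    """Toy single-process sketch of BSP-style SSSP.
    `edges` maps node -> list of (neighbour, weight). In each superstep,
    active vertices send tentative distances to their neighbours; a vertex
    that improves its distance becomes active in the next superstep.
    Illustrative only; a real BSP system runs this over graph partitions."""
    dist = defaultdict(lambda: float("inf"))
    dist[source] = 0.0
    messages = {source: 0.0}          # vertex -> best incoming distance
    supersteps = 0
    while messages:                   # barrier between supersteps
        outbox = defaultdict(lambda: float("inf"))
        for v, d in messages.items():
            if d <= dist[v]:          # v is still active
                for u, w in edges.get(v, []):
                    outbox[u] = min(outbox[u], dist[v] + w)
        messages = {u: d for u, d in outbox.items() if d < dist[u]}
        for u, d in messages.items():
            dist[u] = d
        supersteps += 1
    return dict(dist), supersteps

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": []}
print(bsp_sssp(graph, "a"))   # ({'a': 0.0, 'b': 1.0, 'c': 2.0}, 3)
```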

    Pattern Matching for Sets of Segments

    In this paper we present algorithms for a number of problems in geometric pattern matching where the input consists of collections of segments in the plane. Our work consists of two main parts. In the first, we address problems and measures that relate to collections of orthogonal line segments in the plane. Such collections arise naturally from problems in mapping buildings and robot exploration. We propose a new measure of segment similarity called a \emph{coverage measure}, and present efficient algorithms for maximising this measure between sets of axis-parallel segments under translations. Our algorithms run in time $O(n^3\,\mathrm{polylog}\,n)$ in the general case, and in time $O(n^2\,\mathrm{polylog}\,n)$ when all segments are horizontal. In addition, we show that when restricted to translations that are only vertical, the Hausdorff distance between two sets of horizontal segments can be computed in time roughly $O(n^{3/2}\,\mathrm{polylog}\,n)$. These algorithms are significant improvements over the general algorithm of Chew et al. that takes time $O(n^4 \log^2 n)$. In the second part of this paper we address the problem of matching polygonal chains. We study the well-known Fréchet distance, and present the first algorithm for computing the Fréchet distance under general translations. Our methods also yield algorithms for computing a generalization of the Fréchet distance, and we also present a simple approximation algorithm for the Fréchet distance that runs in time $O(n^2\,\mathrm{polylog}\,n)$. Comment: To appear in the 12th ACM Symposium on Discrete Algorithms, Jan 2001.
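    For intuition about the segment-similarity measures discussed above, here is a small brute-force sketch of a sampled Hausdorff distance between two sets of horizontal segments at a fixed translation. It is an illustrative approximation only; the segment representation and sampling density are assumptions, and the paper computes such distances exactly and optimises over translations.

```python
import numpy as np

def point_to_hseg(px, py, seg):
    """Distance from point (px, py) to a horizontal segment (x1, x2, y)."""
    x1, x2, y = seg
    cx = min(max(px, min(x1, x2)), max(x1, x2))   # clamp x onto the segment
    return ((px - cx) ** 2 + (py - y) ** 2) ** 0.5

def hausdorff_hsegs(A, B, samples=50):
    """Sampled two-sided Hausdorff distance between two sets of horizontal
    segments, each given as (x1, x2, y). Brute force, O(|A||B| * samples);
    illustrative only, not the paper's exact algorithm."""
    def directed(S, T):
        worst = 0.0
        for (x1, x2, y) in S:
            for px in np.linspace(x1, x2, samples):
                worst = max(worst, min(point_to_hseg(px, y, t) for t in T))
        return worst
    return max(directed(A, B), directed(B, A))

A = [(0.0, 2.0, 0.0)]
B = [(0.0, 1.0, 1.0), (1.0, 2.0, 1.0)]
print(hausdorff_hsegs(A, B))   # approximately 1.0
```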

    On the Power of Adaptivity in Sparse Recovery

    The goal of (stable) sparse recovery is to recover a $k$-sparse approximation $x^*$ of a vector $x$ from linear measurements of $x$. Specifically, the goal is to recover $x^*$ such that $\|x - x^*\|_p \leq C \min_{k\text{-sparse } x'} \|x - x'\|_q$ for some constant $C$ and norm parameters $p$ and $q$. It is known that, for $p=q=1$ or $p=q=2$, this task can be accomplished using $m = O(k \log(n/k))$ non-adaptive measurements [CRT06] and that this bound is tight [DIPW10, FPRU10, PW11]. In this paper we show that if one is allowed to perform measurements that are adaptive, then the number of measurements can be considerably reduced. Specifically, for $C = 1+\varepsilon$ and $p=q=2$ we show:
    - A scheme with $m = O((1/\varepsilon)\, k \log\log(n\varepsilon/k))$ measurements that uses $O(\log^* k \cdot \log\log(n\varepsilon/k))$ rounds. This is a significant improvement over the best possible non-adaptive bound.
    - A scheme with $m = O((1/\varepsilon)\, k \log(k/\varepsilon) + k \log(n/k))$ measurements that uses two rounds. This improves over the best possible non-adaptive bound.
    To the best of our knowledge, these are the first results of this type. As an independent application, we show how to solve the problem of finding a duplicate in a data stream of $n$ items drawn from $\{1, 2, \ldots, n-1\}$ using $O(\log n)$ bits of space and $O(\log\log n)$ passes, improving over the best possible space complexity achievable using a single pass. Comment: 18 pages; appearing at FOCS 2011.
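    For context on the streaming application, the sketch below shows the folklore multi-pass duplicate finder for a stream of $n$ items from $\{1, \ldots, n-1\}$: binary search over the value range with one counting pass per level, using $O(\log n)$ bits and $O(\log n)$ passes. It is only the classic baseline (the function and stream names are illustrative); the paper improves the number of passes to $O(\log\log n)$.

```python
def find_duplicate(stream_factory, n):
    """Classic multi-pass, O(log n)-bit duplicate finder for a stream of n
    items drawn from {1, ..., n-1}; a duplicate must exist by pigeonhole.
    Each pass counts how many items fall in [lo, mid] and recurses on the
    half whose count exceeds its capacity. Uses O(log n) passes; the paper
    reduces the number of passes to O(log log n)."""
    lo, hi = 1, n - 1
    while lo < hi:
        mid = (lo + hi) // 2
        count = sum(1 for x in stream_factory() if lo <= x <= mid)  # one pass
        if count > mid - lo + 1:          # pigeonhole: a duplicate lies in [lo, mid]
            hi = mid
        else:                             # otherwise it lies in [mid+1, hi]
            lo = mid + 1
    return lo

data = [3, 1, 4, 2, 4]                    # n = 5 items from {1, ..., 4}
print(find_duplicate(lambda: iter(data), len(data)))   # 4
```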