Which Regular Expression Patterns are Hard to Match?
Regular expressions constitute a fundamental notion in formal language theory
and are frequently used in computer science to define search patterns. A
classic algorithm for these problems constructs and simulates a
non-deterministic finite automaton corresponding to the expression, resulting
in an $O(mn)$ running time (where $m$ is the length of the pattern and $n$ is
the length of the text). This running time can be improved slightly (by a
polylogarithmic factor), but no significantly faster solutions are known. At
the same time, much faster algorithms exist for various special cases of
regular expressions, including dictionary matching, wildcard matching, subset
matching, and the word break problem.
In this paper, we show that the complexity of regular expression matching can
be characterized based on its {\em depth} (when interpreted as a formula). Our
results hold for expressions involving concatenation, OR, Kleene star and
Kleene plus. For regular expressions of depth two (involving any combination of
the above operators), we show the following dichotomy: matching and membership
testing can be solved in near-linear time, except for "concatenations of
stars", which cannot be solved in strongly sub-quadratic time assuming the
Strong Exponential Time Hypothesis (SETH). For regular expressions of depth
three the picture is more complex. Nevertheless, we show that all problems can
either be solved in strongly sub-quadratic time, or cannot be solved in
strongly sub-quadratic time assuming SETH.
An intriguing special case of membership testing involves regular expressions
of the form "a star of an OR of concatenations", e.g., $[a|ab|bc]^*$. This
corresponds to the so-called {\em word break} problem, for which a dynamic
programming algorithm with a runtime of (roughly) $O(n\sqrt{m})$ is known. We
show that the latter bound is not tight and improve the runtime to
$O(nm^{0.44\ldots})$.
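As a rough illustration of the word break problem named in this abstract, the following Python sketch implements the textbook dynamic program: decide whether a text is a concatenation of dictionary words. It is not the paper's improved algorithm and is not tuned to match either running time the abstract discusses. The function name `word_break` and the sample dictionary are assumptions made for this example.

```python
def word_break(text, words):
    """Return True if `text` is a concatenation of zero or more words.

    Textbook dynamic program (illustrative only, not the paper's method).
    """
    word_set = set(words)
    # Distinct non-zero word lengths, tried from shortest to longest.
    lengths = sorted({len(w) for w in word_set if w})

    # reachable[i] is True when the prefix text[:i] splits into words.
    reachable = [False] * (len(text) + 1)
    reachable[0] = True  # the empty prefix is trivially splittable

    for end in range(1, len(text) + 1):
        for length in lengths:
            if length > end:
                break  # remaining lengths are even longer
            start = end - length
            if reachable[start] and text[start:end] in word_set:
                reachable[end] = True
                break
    return reachable[len(text)]


if __name__ == "__main__":
    dictionary = ["a", "ab", "bc"]  # hypothetical example dictionary
    print(word_break("abbc", dictionary))
    print(word_break("ac", dictionary))
```

Membership of a text in the language of a star over an OR of these words is exactly the question this function answers.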
Making the Dynamic Time Warping Distance Warping-Invariant
The literature postulates that the dynamic time warping (dtw) distance can
cope with temporal variations but stores and processes time series in a form as
if the dtw-distance cannot cope with such variations. To address this
inconsistency, we first show that the dtw-distance is not warping-invariant.
The lack of warping-invariance contributes to the inconsistency mentioned above
and to a strange behavior. To eliminate these peculiarities, we convert the
dtw-distance to a warping-invariant semi-metric, called time-warp-invariant
(twi) distance. Empirical results suggest that the error rates of the twi and
dtw nearest-neighbor classifiers are practically equivalent in a Bayesian sense.
However, the twi-distance requires less storage and computation time than the
dtw-distance for a broad range of problems. These results challenge the current
practice of applying the dtw-distance in nearest-neighbor classification and
suggest the proposed twi-distance as a more efficient and consistent option.
Comment: arXiv admin note: substantial text overlap with arXiv:1808.0996
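The claim that the dtw-distance is not warping-invariant can be checked on a tiny example. The Python sketch below uses the standard dynamic program for dynamic time warping with an absolute-difference local cost summed along the warping path. That cost is an assumption made for this illustration, and the paper may use a different local cost or normalization. The sketch does not implement the twi-distance. Repeating an element of a series (a simple warping) changes the computed distance.

```python
def dtw(x, y):
    """Dynamic time warping distance between two numeric sequences.

    Local cost: absolute difference (an assumption for this sketch).
    """
    inf = float("inf")
    n, m = len(x), len(y)
    # cost[i][j] is the cheapest warping of x[:i] onto y[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # x advances, y stays
                cost[i][j - 1],      # y advances, x stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


if __name__ == "__main__":
    original = [0]
    warped = [0, 0]  # same series with its only element repeated
    other = [1]
    print(dtw(original, other))
    print(dtw(warped, other))
```

Under these assumptions the two printed values differ, although `warped` is only a time-stretched copy of `original`.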
Dynamic and Internal Longest Common Substring
Given two strings S and T, each of length at most n, the longest common substring (LCS) problem is to find a longest substring common to S and T. This is a classical problem in computer science with an $O(n)$-time solution. In the fully dynamic setting, edit operations are allowed in either of the two strings, and the problem is to find an LCS after each edit. We present the first solution to the fully dynamic LCS problem requiring sublinear time in n per edit operation. In particular, we show how to find an LCS after each edit operation in $\tilde{O}(n^{2/3})$ time, after $\tilde{O}(n)$-time and space preprocessing. This line of research has been recently initiated in a somewhat restricted dynamic variant by Amir et al. [SPIRE 2017]. More specifically, the authors presented an $\tilde{O}(n)$-sized data structure that returns an LCS of the two strings after a single edit operation (that is reverted afterwards) in $\tilde{O}(1)$ time. At CPM 2018, three papers (Abedin et al., Funakoshi et al., and Urabe et al.) studied analogously restricted dynamic variants of problems on strings; specifically, computing the longest palindrome and the Lyndon factorization of a string after a single edit operation. We develop dynamic sublinear-time algorithms for both of these problems as well. We also consider internal LCS queries, that is, queries in which we are to return an LCS of a pair of substrings of S and T. We show that answering such queries is hard in general and propose efficient data structures for several restricted cases.
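For orientation, the static LCS problem described in this abstract can be sketched with a simple quadratic-time dynamic program. This is a baseline only: it is neither the linear-time solution the abstract refers to nor the dynamic data structure the paper develops. The function name `longest_common_substring` is an assumption made for this example.

```python
def longest_common_substring(s, t):
    """Return one longest substring common to `s` and `t`.

    Quadratic-time baseline (illustrative only, not the paper's method).
    """
    best_len, best_end = 0, 0
    # prev[j] is the length of the longest common suffix of
    # s[:i-1] and t[:j]; curr[j] is the same for s[:i] and t[:j].
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return s[best_end - best_len:best_end]


if __name__ == "__main__":
    s, t = "abcdef", "zcdemf"
    print(longest_common_substring(s, t))
    # Naive handling of an edit: substitute one character in t, then
    # recompute from scratch. The dynamic setting aims to avoid this cost.
    t_edited = t[:4] + "e" + t[5:]
    print(longest_common_substring(s, t_edited))
```

Recomputing from scratch after every edit costs quadratic time per edit with this baseline, which is the kind of cost a sublinear-time update is meant to avoid.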
They are Not Equally Reliable: Semantic Event Search Using Differentiated Concept Classifiers
© 2016 IEEE. Complex event detection on unconstrained Internet videos has seen much progress in recent years. However, state-of-the-art performance degrades dramatically when the number of positive training exemplars falls short. Since label acquisition is costly, laborious, and time-consuming, there is a real need to consider the much more challenging semantic event search problem, where no example video is given. In this paper, we present a state-of-the-art event search system that requires no example videos. Relying on the key observation that events (e.g., dog show) are usually compositions of multiple mid-level concepts (e.g., 'dog,' 'theater,' and 'dog jumping'), we first train a skip-gram model to measure the relevance of each concept to the event of interest. The relevant concept classifiers then cast votes on the test videos, but their reliability, due to the lack of labeled training videos, has been largely unaddressed. We propose to combine the concept classifiers based on a principled estimate of their accuracy on the unlabeled test videos. A novel warping technique is proposed to improve the performance, and an efficient, highly scalable algorithm is provided to quickly solve the resulting optimization. We conduct extensive experiments on the latest TRECVID MEDTest 2014, MEDTest 2013 and CCV datasets, and achieve state-of-the-art performance.