Generic Subsequence Matching Framework: Modularity, Flexibility, Efficiency
Subsequence matching has emerged as an ideal approach for solving many
problems in the fields of data mining and similarity retrieval. It has
been shown that almost any data class (audio, image, biometrics, signals) is or
can be represented by some kind of time series or string of symbols, which can
be seen as an input for various subsequence matching approaches. The variety of
data types, specific tasks and their partial or full solutions is so wide that
the choice, implementation and parametrization of a suitable solution for a
given task might be complicated and time-consuming; a possibly fruitful
combination of fragments from different research areas may be neither obvious
nor easy to realize. Leading authors in this field also mention the
implementation bias that makes a proper comparison of competing approaches
difficult. We therefore present a new generic Subsequence Matching Framework
(SMF) that tries to overcome the aforementioned problems with a uniform framework
that simplifies and speeds up the design, development and evaluation of
subsequence-matching systems. We identify several relatively separate
subtasks that are solved differently across the literature, and SMF makes it
possible to combine them in a straightforward manner, achieving new quality and
efficiency. This framework
can be used in many application domains and its components can be reused
effectively. Its strictly modular architecture and openness also enable the
integration of efficient solutions from different fields, for instance
efficient metric-based indexes. This is an extended version of a paper
published at DEXA 2012.
KV-match: A Subsequence Matching Approach Supporting Normalization and Time Warping [Extended Version]
The volume of time series data has exploded due to the popularity of new
applications, such as data center management and IoT. Subsequence matching is a
fundamental task in mining time series data. All index-based approaches only
consider raw subsequence matching (RSM) and do not support subsequence
normalization. UCR Suite can handle the normalized subsequence matching (NSM)
problem, but it needs to scan the full time series. In this paper, we propose a novel
problem, named the constrained normalized subsequence matching (cNSM) problem,
which adds some constraints to the NSM problem. The cNSM problem provides a knob to
flexibly control the degree of offset shifting and amplitude scaling, which
enables users to build the index to process the query. We propose a new index
structure, KV-index, and the matching algorithm, KV-match. With a single index,
our approach can support both RSM and cNSM problems under either ED or DTW
distance. KV-index is a key-value structure, which can be easily implemented on
local files or HBase tables. To support queries of arbitrary length, we
extend KV-match to KV-match$_{DP}$, which utilizes multiple varied-length
indexes to process the query. We conduct extensive experiments on synthetic and
real-world datasets. The results verify the effectiveness and efficiency of our
approach.
Comment: 13 pages
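To make the constrained normalized matching idea concrete, here is a minimal brute-force sketch under Euclidean distance. It is not KV-match and uses no index: it scans every window, which is exactly the cost the abstract says an index should avoid. The function names and the parameters `max_offset` (bound on the difference of means) and `max_scale` (bound on the ratio of standard deviations) are assumptions made for illustration; the abstract only states that the constraints control offset shifting and amplitude scaling.

```python
import math
from statistics import fmean, pstdev


def znorm(values):
    """Z-normalize a sequence; a constant sequence maps to all zeros."""
    mu, sigma = fmean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def constrained_normalized_matches(series, query, eps, max_offset, max_scale):
    """Return (start, distance) for windows whose z-normalized Euclidean
    distance to the query is at most eps, subject to hypothetical bounds on
    mean offset and amplitude scaling relative to the query."""
    m = len(query)
    q_mu, q_sigma = fmean(query), pstdev(query)
    q_norm = znorm(query)
    hits = []
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        w_mu, w_sigma = fmean(window), pstdev(window)
        # Offset constraint: the window mean must stay close to the query mean.
        if abs(w_mu - q_mu) > max_offset:
            continue
        # Scaling constraint: the ratio of standard deviations is bounded.
        if q_sigma == 0 or w_sigma == 0:
            continue
        ratio = w_sigma / q_sigma
        if not (1 / max_scale <= ratio <= max_scale):
            continue
        dist = math.dist(znorm(window), q_norm)
        if dist <= eps:
            hits.append((start, dist))
    return hits


series = [0, 0, 1, 2, 1, 0, 0, 10, 20, 10, 0]
query = [1, 2, 1]
tight = constrained_normalized_matches(series, query, 0.1, 1.0, 2.0)
loose = constrained_normalized_matches(series, query, 0.1, 100.0, 100.0)
```

With tight bounds only the window that matches the query in both shape and scale is reported; with loose bounds the tenfold-scaled copy is reported as well, which is the behavior of unconstrained normalized matching.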
A hybrid algorithm for the longest common transposition-invariant subsequence problem
The longest common transposition-invariant subsequence (LCTS) problem is a music information retrieval oriented variation of the classic LCS problem. There are basically only two known efficient approaches to calculate the length of the LCTS, one based on sparse dynamic programming and the other on bit-parallelism. In this work, we propose a hybrid algorithm picking the better of the two algorithms for individual subproblems. Experiments on music (MIDI), with 32-bit and 64-bit implementations, show that the proposed algorithm outperforms the faster of the two component algorithms by a factor of 1.4–2.0, depending on sequence lengths. Similar, if not better, improvements can be observed for random data with Gaussian distribution. Also for uniformly random data, the hybrid algorithm is the winner if the alphabet is neither too small (at least 32 symbols) nor too large (up to 128 symbols). Part of the success of our scheme is attributed to a quite robust component selection heuristic.
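As a point of reference for what LCTS computes, the following is a naive baseline sketch, assuming sequences of integer pitches (for example MIDI note numbers). It tries every transposition that could align at least one pair of notes and runs a plain LCS dynamic program for each. It is neither the sparse dynamic programming nor the bit-parallel approach named in the abstract, and not the proposed hybrid; the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcts_length(a, b):
    """Maximum, over all transpositions t, of LCS(a shifted by t, b)."""
    # Only shifts that map some element of a onto some element of b can match.
    shifts = {y - x for x in a for y in b}
    return max((lcs_length([x + t for x in a], b) for t in shifts), default=0)


melody = [60, 62, 64, 65]
transposed_with_noise = [67, 69, 50, 71, 72]
result = lcts_length(melody, transposed_with_noise)
```

Here the second sequence contains the first one shifted up by 7 with one extra note inserted, so the best transposition aligns all four notes.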
Multivariate Fine-Grained Complexity of Longest Common Subsequence
We revisit the classic combinatorial pattern matching problem of finding a
longest common subsequence (LCS). For strings $x$ and $y$ of length $n$, a
textbook algorithm solves LCS in time $O(n^2)$, but although much effort has
been spent, no $O(n^{2-\varepsilon})$-time algorithm is known. Recent work
indeed shows that such an algorithm would refute the Strong Exponential Time
Hypothesis (SETH) [Abboud, Backurs, Vassilevska Williams + Bringmann,
Künnemann FOCS'15].
Despite the quadratic-time barrier, for over 40 years an enduring scientific
interest continued to produce fast algorithms for LCS and its variations.
Particular attention was put into identifying and exploiting input parameters
that yield strongly subquadratic time algorithms for special cases of interest,
e.g., differential file comparison. This line of research was successfully
pursued until 1990, at which time significant improvements came to a halt. In
this paper, using the lens of fine-grained complexity, our goal is to (1)
justify the lack of further improvements and (2) determine whether some special
cases of LCS admit faster algorithms than currently known.
To this end, we provide a systematic study of the multivariate complexity of
LCS, taking into account all parameters previously discussed in the literature:
the input size $n := |x|+|y|$, the length of the shorter string
$m := \min\{|x|,|y|\}$, the length $L$ of an LCS of $x$ and $y$, the numbers of
deletions $\delta := m-L$ and $\Delta := n-L$, the alphabet size, as well as
the numbers of matching pairs $M$ and dominant pairs $d$. For any class of
instances defined by fixing each parameter individually to a polynomial in
terms of the input size, we prove a SETH-based lower bound matching one of
three known algorithms. Specifically, we determine the optimal running time for
LCS under SETH as $(n+\min\{d, \delta \Delta, \delta m\})^{1\pm o(1)}$.
[...]
Comment: Presented at SODA'18. Full Version. 66 pages
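The textbook quadratic-time algorithm that the abstract above takes as its baseline can be sketched as follows. This is the standard dynamic program with a traceback to recover one LCS; it illustrates the quadratic barrier only and is not one of the parameterized algorithms the paper analyzes. The function name is illustrative.

```python
def lcs(x, y):
    """Return one longest common subsequence of strings x and y.

    Fills a (len(x)+1) by (len(y)+1) table, so time and space are
    proportional to len(x) * len(y).
    """
    n, m = len(x), len(y)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover one optimal solution.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


example = lcs("AGGTAB", "GXTXAYB")
```

The number of characters deleted from each string to reach the result is simply its length minus `len(example)`, which is the kind of input parameter the multivariate analysis considers.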
Distributed PCP Theorems for Hardness of Approximation in P
We present a new distributed model of probabilistically checkable proofs
(PCP). A satisfying assignment $x \in \{0,1\}^n$ to a CNF formula $\varphi$ is
shared between two parties, where Alice knows $x_1, \dots, x_{n/2}$, Bob knows
$x_{n/2+1}, \dots, x_n$, and both parties know $\varphi$. The goal is to have
Alice and Bob jointly write a PCP that $x$ satisfies $\varphi$, while
exchanging little or no information. Unfortunately, this model as-is does not
allow for nontrivial query complexity. Instead, we focus on a non-deterministic
variant, where the players are helped by Merlin, a third party who knows all of
$x$.
Using our framework, we obtain, for the first time, PCP-like reductions from
the Strong Exponential Time Hypothesis (SETH) to approximation problems in P.
In particular, under SETH we show that there are no truly-subquadratic
approximation algorithms for Bichromatic Maximum Inner Product over
{0,1}-vectors, Bichromatic LCS Closest Pair over permutations, Approximate
Regular Expression Matching, and Diameter in Product Metric. All our
inapproximability factors are nearly-tight. In particular, for the first two
problems we obtain nearly-polynomial factors of $2^{(\log n)^{1-o(1)}}$; only
$(1+o(1))$-factor lower bounds (under SETH) were known before.
Sketching, Streaming, and Fine-Grained Complexity of (Weighted) LCS
We study sketching and streaming algorithms for the Longest Common Subsequence problem (LCS) on strings of small alphabet size |Sigma|. For the problem of deciding whether the LCS of strings x,y has length at least L, we obtain a sketch size and streaming space usage of O(L^{|Sigma| - 1} log L). We also prove matching unconditional lower bounds.
As an application, we study a variant of LCS where each alphabet symbol is equipped with a weight that is given as input, and the task is to compute a common subsequence of maximum total weight. Using our sketching algorithm, we obtain an O(min{nm, n + m^{|Sigma|}})-time algorithm for this problem, on strings x,y of length n,m, with n >= m. We prove optimality of this running time up to lower order factors, assuming the Strong Exponential Time Hypothesis.
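The nm term in the running time above corresponds to a straightforward dynamic program for the weighted variant, sketched below. It assumes non-negative weights supplied as a dictionary (an illustrative input format) and is only the quadratic baseline, not the sketching-based algorithm of the paper.

```python
def weighted_lcs(x, y, weight):
    """Maximum total weight of a common subsequence of x and y.

    `weight` maps each alphabet symbol to a non-negative number.
    Runs in time proportional to len(x) * len(y).
    """
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            best = max(prev[j], cur[j - 1])
            if a == b:
                # Matching this pair earns the weight of the shared symbol.
                best = max(best, prev[j - 1] + weight[a])
            cur.append(best)
        prev = cur
    return prev[-1]


heavy_b = weighted_lcs("abcab", "bacb", {"a": 1, "b": 5, "c": 2})
unit = weighted_lcs("abcab", "bacb", {"a": 1, "b": 1, "c": 1})
```

With unit weights the function reduces to the ordinary LCS length; with a heavy `b` the optimum prefers the common subsequence `bcb`.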