2,798 research outputs found
KV-match: A Subsequence Matching Approach Supporting Normalization and Time Warping [Extended Version]
The volume of time series data has exploded due to the popularity of new
applications, such as data center management and IoT. Subsequence matching is a
fundamental task in mining time series data. All index-based approaches only
consider raw subsequence matching (RSM) and do not support subsequence
normalization. UCR Suite can deal with normalized subsequence match problem
(NSM), but it needs to scan full time series. In this paper, we propose a novel
problem, named constrained normalized subsequence matching problem (cNSM),
which adds some constraints to NSM problem. The cNSM problem provides a knob to
flexibly control the degree of offset shifting and amplitude scaling, which
enables users to build the index to process the query. We propose a new index
structure, KV-index, and the matching algorithm, KV-match. With a single index,
our approach can support both RSM and cNSM problems under either ED or DTW
distance. KV-index is a key-value structure, which can be easily implemented on
local files or HBase tables. To support the query of arbitrary lengths, we
extend KV-match to KV-match, which utilizes multiple varied-length
indexes to process the query. We conduct extensive experiments on synthetic and
real-world datasets. The results verify the effectiveness and efficiency of our
approach.Comment: 13 page
Distributed PCP Theorems for Hardness of Approximation in P
We present a new distributed model of probabilistically checkable proofs
(PCP). A satisfying assignment to a CNF formula is
shared between two parties, where Alice knows , Bob knows
, and both parties know . The goal is to have
Alice and Bob jointly write a PCP that satisfies , while
exchanging little or no information. Unfortunately, this model as-is does not
allow for nontrivial query complexity. Instead, we focus on a non-deterministic
variant, where the players are helped by Merlin, a third party who knows all of
.
Using our framework, we obtain, for the first time, PCP-like reductions from
the Strong Exponential Time Hypothesis (SETH) to approximation problems in P.
In particular, under SETH we show that there are no truly-subquadratic
approximation algorithms for Bichromatic Maximum Inner Product over
{0,1}-vectors, Bichromatic LCS Closest Pair over permutations, Approximate
Regular Expression Matching, and Diameter in Product Metric. All our
inapproximability factors are nearly-tight. In particular, for the first two
problems we obtain nearly-polynomial factors of ; only
-factor lower bounds (under SETH) were known before
A generalized matrix profile framework with support for contextual series analysis
The Matrix Profile is a state-of-the-art time series analysis technique that can be used for motif discovery, anomaly detection, segmentation and others, in various domains such as healthcare, robotics, and audio. Where recent techniques use the Matrix Profile as a preprocessing or modeling step, we believe there is unexplored potential in generalizing the approach. We derived a framework that focuses on the implicit distance matrix calculation. We present this framework as the Series Distance Matrix (SDM). In this framework, distance measures (SDM-generators) and distance processors (SDM-consumers) can be freely combined, allowing for more flexibility and easier experimentation. In SDM, the Matrix Profile is but one specific configuration. We also introduce the Contextual Matrix Profile (CMP) as a new SDM-consumer capable of discovering repeating patterns. The CMP provides intuitive visualizations for data analysis and can find anomalies that are not discords. We demonstrate this using two real world cases. The CMP is the first of a wide variety of new techniques for series analysis that fits within SDM and can complement the Matrix Profile
De Novo Assembly of Nucleotide Sequences in a Compressed Feature Space
Sequencing technologies allow for an in-depth analysis
of biological species but the size of the generated datasets
introduce a number of analytical challenges. Recently, we
demonstrated the application of numerical sequence representations
and data transformations for the alignment of short
reads to a reference genome. Here, we expand out approach
for de novo assembly of short reads. Our results demonstrate
that highly compressed data can encapsulate the signal suffi-
ciently to accurately assemble reads to big contigs or complete
genomes
Event-based personal retrieval
People who work in a research, academic or business environment often have personal information collections which are large enough to need retrieval aids. A major difference between personal information retrieval and normal document retrieval is that the items to be retrieved are often associated with events in the searcher's life and can be retrieved by their relationship to other events as well as by content. This paper describes some of the background to event-based retrieval and then describes a prototype graphical event-based retrieval system
Synchronization Strings: Codes for Insertions and Deletions Approaching the Singleton Bound
We introduce synchronization strings as a novel way of efficiently dealing
with synchronization errors, i.e., insertions and deletions. Synchronization
errors are strictly more general and much harder to deal with than commonly
considered half-errors, i.e., symbol corruptions and erasures. For every
, synchronization strings allow to index a sequence with an
size alphabet such that one can efficiently transform
synchronization errors into half-errors. This powerful new
technique has many applications. In this paper, we focus on designing insdel
codes, i.e., error correcting block codes (ECCs) for insertion deletion
channels.
While ECCs for both half-errors and synchronization errors have been
intensely studied, the later has largely resisted progress. Indeed, it took
until 1999 for the first insdel codes with constant rate, constant distance,
and constant alphabet size to be constructed by Schulman and Zuckerman. Insdel
codes for asymptotically large or small noise rates were given in 2016 by
Guruswami et al. but these codes are still polynomially far from the optimal
rate-distance tradeoff. This makes the understanding of insdel codes up to this
work equivalent to what was known for regular ECCs after Forney introduced
concatenated codes in his doctoral thesis 50 years ago.
A direct application of our synchronization strings based indexing method
gives a simple black-box construction which transforms any ECC into an equally
efficient insdel code with a slightly larger alphabet size. This instantly
transfers much of the highly developed understanding for regular ECCs over
large constant alphabets into the realm of insdel codes. Most notably, we
obtain efficient insdel codes which get arbitrarily close to the optimal
rate-distance tradeoff given by the Singleton bound for the complete noise
spectrum
- …