Average-Case Optimal Approximate Circular String Matching
Approximate string matching is the problem of finding all factors of a text t
of length n that are at a distance at most k from a pattern x of length m.
Approximate circular string matching is the problem of finding all factors of t
that are at a distance at most k from x or from any of its rotations. In this
article, we present a new algorithm for approximate circular string matching
under the edit distance model with optimal average-case search time O(n(k + log
m)/m). Optimal average-case search time can also be achieved by the algorithms
for multiple approximate string matching (Fredriksson and Navarro, 2004) using
x and its rotations as the set of multiple patterns. Here we reduce the
preprocessing time and space requirements compared to that approach
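
To make the problem statement concrete, the following is a minimal Python sketch of a naive baseline, not the average-case optimal algorithm of the article. It assumes the edit distance model and simply runs the classical column-wise dynamic programme (Sellers' algorithm) once per distinct rotation of x. The function names are illustrative only and do not come from the article.

```python
def approx_end_positions(pattern, text, k):
    """Return 1-based end positions j such that some factor of text ending
    at j is within edit distance k of pattern (Sellers' dynamic programme)."""
    m = len(pattern)
    col = list(range(m + 1))  # column for the empty text prefix
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # value of D[0][j-1]
        col[0] = 0          # an occurrence may start at any text position
        for i in range(1, m + 1):
            old = col[i]    # D[i][j-1], needed as the next diagonal value
            col[i] = min(prev_diag + (pattern[i - 1] != c),  # match or substitution
                         old + 1,                            # extra text symbol
                         col[i - 1] + 1)                     # missing pattern symbol
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


def circular_approx_end_positions(pattern, text, k):
    """Naive circular variant: try every distinct rotation of pattern."""
    rotations = {pattern[i:] + pattern[:i] for i in range(len(pattern))}
    ends = set()
    for rotation in rotations:
        ends.update(approx_end_positions(rotation, text, k))
    return sorted(ends)
```

This baseline costs O(nm) time per rotation, so O(nm^2) in total, which is the kind of cost the article's O(n(k + log m)/m) average-case bound improves on.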
Sequence queries on temporal graphs
Graphs that evolve over time are called temporal graphs. They can be used to describe and represent real-world networks, including transportation networks, social networks, and communication networks, with higher fidelity and accuracy. However, research is still limited on how to manage large scale temporal graphs and execute queries over these graphs efficiently and effectively. This thesis investigates the problems of temporal graph data management related to node and edge sequence queries. In temporal graphs, nodes and edges can evolve over time. Therefore, sequence queries on nodes and edges can be key components in managing temporal graphs. In this thesis, the node sequence query decomposes into two parts: graph node similarity and subsequence matching. For node similarity, this thesis proposes a modified tree edit distance that is metric and polynomially computable and has a natural, intuitive interpretation. Note that the proposed node similarity works even for inter-graph nodes and therefore can be used for graph de-anonymization, network transfer learning, and cross-network mining, among other tasks. The subsequence matching query proposed in this thesis is a framework that can be adopted to index generic sequence and time-series data, including trajectory data and even DNA sequences for subsequence retrieval. For edge sequence queries, this thesis proposes an efficient storage and optimized indexing technique that allows for efficient retrieval of temporal subgraphs that satisfy certain temporal predicates. For this problem, this thesis develops a lightweight data management engine prototype that can support time-sensitive temporal graph analytics efficiently even on a single PC
Edit Distance: Sketching, Streaming and Document Exchange
We show that in the document exchange problem, where Alice holds and Bob holds , Alice can send Bob a message of
size bits such that Bob can recover using the
message and his input if the edit distance between and is no more
than , and output "error" otherwise. Both the encoding and decoding can be
done in time . This result significantly
improves the previous communication bounds under polynomial encoding/decoding
time. We also show that in the referee model, where Alice and Bob hold and
respectively, they can compute sketches of and of sizes
bits (the encoding), and send to the referee, who can
then compute the edit distance between and together with all the edit
operations if the edit distance is no more than , and output "error"
otherwise (the decoding). To the best of our knowledge, this is the first
result for sketching edit distance using bits.
Moreover, the encoding phase of our sketching algorithm can be performed by
scanning the input string in one pass. Thus our sketching algorithm also
implies the first streaming algorithm for computing edit distance and all the
edits exactly using bits of space.
Comment: Full version of an article to be presented at the 57th Annual IEEE
Symposium on Foundations of Computer Science (FOCS 2016)
Randomized Sliding Window Algorithms for Regular Languages
A sliding window algorithm receives a stream of symbols and has to output at each time instant a certain value which only depends on the last n symbols. If the algorithm is randomized, then at each time instant it produces an incorrect output with probability at most epsilon, which is a constant error bound. This work proposes a more relaxed definition of correctness which is parameterized by the error bound epsilon and the failure ratio phi: a randomized sliding window algorithm is required to err with probability at most epsilon at a portion of 1-phi of all time instants of an input stream. This work continues the investigation of sliding window algorithms for regular languages. In previous works a trichotomy theorem was shown for deterministic algorithms: the optimal space complexity is either constant, logarithmic or linear in the window size. The main results of this paper concern three natural settings (randomized algorithms with failure ratio zero and randomized/deterministic algorithms with bounded failure ratio) and provide natural language-theoretic characterizations of the space complexity classes.
Approximating Properties of Data Streams
In this dissertation, we present algorithms that approximate properties in the data stream model, where elements of an underlying data set arrive sequentially, but algorithms must use space sublinear in the size of the underlying data set. We first study the problem of finding all k-periods of a length-n string S, presented as a data stream. S is said to have k-period p if its prefix of length n − p differs from its suffix of length n − p in at most k locations. We give algorithms to compute the k-periods of a string S using poly(k, log n) bits of space and we complement these results with comparable lower bounds. We then study the problem of identifying a longest substring of strings S and T of length n that forms a d-near-alignment under the edit distance, in the simultaneous streaming model. In this model, symbols of strings S and T are streamed at the same time and form a d-near-alignment if the distance between them in some given metric is at most d. We give several algorithms, including an exact one-pass algorithm that uses O(d^2 + d log n) bits of space. We then consider the distinct elements and ℓ_p-heavy hitters problems in the sliding window model, where only the most recent n elements in the data stream form the underlying set. We first introduce the composable histogram, a simple twist on the exponential (Datar et al., SODA 2002) and smooth histograms (Braverman and Ostrovsky, FOCS 2007) that may be of independent interest. We then show that the composable histogram along with a careful combination of existing techniques to track either the identity or frequency of a few specific items suffices to obtain algorithms for both distinct elements and ℓ_p-heavy hitters that are nearly optimal in both n and c. Finally, we consider the problem of estimating the maximum weighted matching of a graph whose edges are revealed in a streaming fashion.
We develop a reduction from the maximum weighted matching problem to the maximum cardinality matching problem that only doubles the approximation factor of a streaming algorithm developed for the maximum cardinality matching problem. As an application, we obtain an estimator for the weight of a maximum weighted matching in bounded-arboricity graphs and in particular, a (48 + ε)-approximation estimator for the weight of a maximum weighted matching in planar graphs.
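
The k-period definition in the abstract above can be checked directly with a short Python sketch. It assumes the whole string is held in memory, so it uses linear space and is not one of the dissertation's poly(k, log n)-space streaming algorithms. The function names are illustrative only.

```python
def has_k_period(s, p, k):
    """True if the length-(n - p) prefix and the length-(n - p) suffix of s
    differ in at most k positions, i.e. p is a k-period of s."""
    n = len(s)
    if not 1 <= p <= n:
        raise ValueError("p must satisfy 1 <= p <= len(s)")
    # s[:n - p] is the prefix and s[p:] is the suffix, both of length n - p.
    mismatches = sum(a != b for a, b in zip(s[:n - p], s[p:]))
    return mismatches <= k


def k_periods(s, k):
    """All k-periods of s, found by testing every candidate p in O(n^2) time."""
    return [p for p in range(1, len(s) + 1) if has_k_period(s, p, k)]
```

For example, "abcabd" has no exact period shorter than its length, but 3 becomes a period once a single mismatch is allowed.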
Small space and streaming pattern matching with k edits
In this work, we revisit the fundamental and well-studied problem of
approximate pattern matching under edit distance. Given an integer , a
pattern of length , and a text of length , the task is to
find substrings of that are within edit distance from . Our main
result is a streaming algorithm that solves the problem in
space and amortised time per character of the text, providing
answers correct with high probability. (Hereafter, hides a
factor.) This answers a decade-old question: since the
discovery of a -space streaming algorithm for pattern
matching under Hamming distance by Porat and Porat [FOCS 2009], the existence
of an analogous result for edit distance remained open. Up to this work, no
-space algorithm was known even in the simpler
semi-streaming model, where comes as a stream but is available for
read-only access. In this model, we give a deterministic algorithm that
achieves slightly better complexity.
In order to develop the fully streaming algorithm, we introduce a new edit
distance sketch parametrised by integers . For any string of length at
most , the sketch is of size and it can be computed with an
-space streaming algorithm. Given the sketches of two strings,
in time we can compute their edit distance or certify that it
is larger than . This result improves upon -size sketches of
Belazzougui and Zhu [FOCS 2016] and very recent -size sketches
of Jin, Nelson, and Wu [STACS 2021]
An Investigation and Application of Biology and Bioinformatics for Activity Recognition
Activity recognition in a smart home context is inherently difficult due to the variable nature of human activities and tracking artifacts introduced by video-based tracking systems. This thesis addresses the activity recognition problem via introducing a biologically-inspired chemotactic approach and bioinformatics-inspired sequence alignment techniques to recognise spatial activities. The approaches are demonstrated in real world conditions to improve robustness and recognise activities in the presence of innate activity variability and tracking noise
Longest Common Subsequence with Gap Constraints
We consider the longest common subsequence problem in the context of
subsequences with gap constraints. In particular, following Day et al. 2022, we
consider the setting when the distance (i.e., the gap) between two consecutive
symbols of the subsequence has to be between a lower and an upper bound (which
may depend on the position of those symbols in the subsequence or on the
symbols bordering the gap) as well as the case where the entire subsequence is
found in a bounded range (defined by a single upper bound), considered by
Kosche et al. 2022. In all these cases, we present efficient algorithms for
determining the length of the longest common constrained subsequence between
two given strings.
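
As an illustration of the problem only, here is a minimal Python sketch of a brute-force dynamic programme for one simplified variant. It assumes a single constant lower bound lo and upper bound hi on every gap, applied to the embeddings in both strings, whereas the paper also allows bounds that depend on positions or symbols. It is not the efficient algorithm of the paper, and the function name is illustrative.

```python
def lcs_gap_constrained(u, v, lo, hi):
    """Length of the longest string that embeds into both u and v as a
    subsequence whose consecutive symbols are separated by a gap (number of
    skipped symbols) between lo and hi inclusive. Assumes 0 <= lo <= hi."""
    n, m = len(u), len(v)
    # best[i][j]: longest common constrained subsequence whose last symbol
    # is embedded at u[i] and at v[j] (0 if u[i] != v[j]).
    best = [[0] * m for _ in range(n)]
    answer = 0
    for i in range(n):
        for j in range(m):
            if u[i] != v[j]:
                continue
            longest_prev = 0
            # A predecessor at pi has gap i - pi - 1, which must lie in [lo, hi].
            for pi in range(max(0, i - hi - 1), i - lo):
                for pj in range(max(0, j - hi - 1), j - lo):
                    longest_prev = max(longest_prev, best[pi][pj])
            best[i][j] = longest_prev + 1
            answer = max(answer, best[i][j])
    return answer
```

With lo = hi = 0 the constraint forces contiguous embeddings, so the function returns the length of the longest common factor. The running time of this sketch is O(nm(hi - lo + 1)^2).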