52 research outputs found
Recovery from Non-Decomposable Distance Oracles
A line of work has looked at the problem of recovering an input from distance
queries. In this setting, there is an unknown sequence x, and one chooses a set of queries y and
receives d(x, y) for a distance function d. The goal is to make as few
queries as possible to recover x. Although this problem is well-studied for
decomposable distances, i.e., distances of the form d(x, y) = sum_i f(x_i, y_i) for some function f, which includes the important cases of
Hamming distance, l_p-norms, and M-estimators, to the best of our
knowledge this problem has not been studied for non-decomposable distances, for
which there are important special cases such as edit distance, dynamic time
warping (DTW), Frechet distance, earth mover's distance, and so on. We initiate
the study and develop a general framework for such distances. Interestingly,
for some distances such as DTW or Frechet, exact recovery of the sequence x
is provably impossible, and so we show that by allowing the characters in x to be
drawn from a slightly larger alphabet this becomes possible. In a number
of cases we obtain optimal or near-optimal query complexity. We also study the
role of adaptivity for a number of different distance functions. One motivation
for understanding non-adaptivity is that the query sequence can be fixed and
the distances of the input to the queries provide a non-linear embedding of the
input, which can be used in downstream applications involving, e.g., neural
networks for natural language processing.
Comment: This work was presented at The 14th Innovations in Theoretical Computer Science Conference (ITCS 2023) and accepted for publication in the journal IEEE Transactions on Information Theory.
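The decomposable/non-decomposable split above can be made concrete with a small sketch, assuming nothing beyond the textbook definitions: Hamming distance decomposes into a coordinate-wise sum, while edit distance is computed by a dynamic program that couples positions and admits no such decomposition.

```python
def hamming(x, y):
    # Decomposable: d(x, y) = sum_i f(x_i, y_i), with f the inequality test.
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

def edit_distance(x, y):
    # Non-decomposable: the classic dynamic program couples positions,
    # so the distance is not a coordinate-wise sum over aligned indices.
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[m][n]
```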
Efficient Nearest Neighbor Search on Metric Time Series
While deep-learning approaches outperform nearest-neighbor classifiers in an increasing number of areas, searching existing data collections remains a task for which similarity search is the tool of choice.
Numerous specific solutions exist for different types of data and queries.
This thesis aims at finding fast and general solutions for searching and indexing arbitrarily typed time series.
A time series is considered a sequence of elements where the elements' order matters but not their actual time stamps.
Since this thesis focuses on measuring distances between time series, metric spaces are the most appropriate concept for the domain from which the time series' elements are drawn.
Hence, this thesis mainly considers metric time series as data type.
Simple examples include time series in Euclidean vector spaces or graphs.
For general similarity search solutions in time series, two primitive comparison semantics need to be distinguished, the first of which compares the time series' trajectories ignoring time warping.
A ubiquitous example of such a distance function is the Dynamic Time Warping distance (DTW) developed in the area of speech recognition.
The Dog Keeper distance (DK) is another time-warping distance that, in contrast to DTW, is truly invariant under time warping and yields a metric space.
After canonically extending DTW to accept multi-dimensional time series, this thesis contributes a new algorithm computing DK that outperforms DTW on time series in high-dimensional vector spaces by more than one order of magnitude.
An analytical study of both distance functions reveals the reasons for the superiority of DK over DTW in high-dimensional spaces.
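As an illustration of the first comparison semantic, here is a minimal sketch of the classic DTW dynamic program for one-dimensional series (not the thesis's implementation; the multi-dimensional extension would substitute a vector norm for the scalar `dist`):

```python
import math

def dtw(a, b, dist=lambda p, q: abs(p - q)):
    # Classic O(len(a) * len(b)) dynamic program. Warping is realized by
    # allowing one series to "stall" while the other advances.
    n, m = len(a), len(b)
    dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],       # advance a only
                                  dp[i][j - 1],       # advance b only
                                  dp[i - 1][j - 1])   # advance both
    return dp[n][m]
```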
The second comparison semantic compares time series in Euclidean vector spaces regardless of their position or orientation.
This thesis proposes the Congruence distance, the Euclidean distance minimized over all isometric transformations; it is thus invariant under translation, rotation, and reflection, and therefore disregards the position and orientation of the time series.
A proof contributed in this thesis shows that there can be no efficient algorithm computing this distance function (unless P=NP).
Therefore, this thesis contributes the Delta distance, a metric distance function serving as a lower bound for the Congruence distance.
While the Delta distance has quadratic time complexity, the provided evaluation shows a speedup of more than two orders of magnitude against the Congruence distance.
Furthermore, the Delta distance is shown to be tight on random time series, although the tightness can be arbitrarily bad in corner-case situations.
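To illustrate the kind of minimization behind the Congruence distance, the following sketch solves a much easier special case: two-dimensional sequences aligned under translation and rotation only (no reflection), where the optimal rotation angle has a closed form. This is an illustrative toy, not the thesis's algorithm, and it does not contradict the hardness result, which concerns the general problem.

```python
import math

def centered(points):
    # Remove translation by shifting the centroid to the origin.
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    return [(x - cx, y - cy) for x, y in points]

def rotation_aligned_distance(a, b):
    # Minimize sum_i |R a_i - b_i|^2 over 2D rotations R. Expanding the
    # objective shows it suffices to maximize cos(t) * S_dot + sin(t) * S_cross,
    # so the optimal angle is t = atan2(S_cross, S_dot).
    a, b = centered(a), centered(b)
    s_dot = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a, b))
    s_cross = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(a, b))
    t = math.atan2(s_cross, s_dot)
    c, s = math.cos(t), math.sin(t)
    return math.sqrt(sum((c * ax - s * ay - bx) ** 2 +
                         (s * ax + c * ay - by) ** 2
                         for (ax, ay), (bx, by) in zip(a, b)))
```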
Orthogonally to the previously mentioned comparison semantics, similarity search on time series consists of two different types of queries: whole sequence matching and subsequence search.
Metric index structures (e.g., the M-Tree) natively provide only whole-matching queries.
This thesis contributes the concept of metric subset spaces and the SuperM-Tree for indexing metric subset spaces as a generic solution for subsequence search.
Examples for metric subset spaces include subsequence search regarding the distance functions from the comparison semantics mentioned above.
The provided evaluation shows that the SuperM-Tree outperforms a linear search by multiple orders of magnitude.
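The pruning principle that metric index structures exploit can be sketched with a single pivot and the triangle inequality (an illustrative toy, not the M-Tree or SuperM-Tree; all names are made up):

```python
def build_pivot_index(objects, dist, pivot):
    # Precompute each object's distance to a fixed pivot element.
    return [(obj, dist(obj, pivot)) for obj in objects]

def range_query(index, dist, pivot, query, radius):
    # Triangle inequality: d(query, obj) >= |d(query, pivot) - d(obj, pivot)|,
    # so objects whose lower bound exceeds the radius are skipped without
    # evaluating the (possibly expensive) distance function.
    dq = dist(query, pivot)
    hits, computed = [], 0
    for obj, dp in index:
        if abs(dq - dp) > radius:
            continue
        computed += 1
        if dist(query, obj) <= radius:
            hits.append(obj)
    return hits, computed
```

The same bound generalizes from one pivot to the routing objects stored in each inner node of a metric tree, which is what turns a linear scan into a sublinear search in practice.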
Deterministic and Probabilistic Binary Search in Graphs
We consider the following natural generalization of Binary Search: in a given
undirected, positively weighted graph, one vertex is a target. The algorithm's
task is to identify the target by adaptively querying vertices. In response to
querying a node q, the algorithm learns either that q is the target, or is
given an edge out of q that lies on a shortest path from q to the target.
We study this problem in a general noisy model in which each query
independently receives a correct answer with probability p > 1/2 (a
known constant), and an (adversarial) incorrect one with probability 1 - p.
Our main positive result is that when p = 1 (i.e., all answers are
correct), ceil(log2 n) queries are always sufficient. For general p, we give an
(almost information-theoretically optimal) algorithm that uses, in expectation,
no more than (1 - delta) log2(n) / (1 - H(p)) + o(log n) + O(log^2(1/delta)) queries, and identifies the target correctly with probability at
least 1 - delta. Here, H(p) = -(p log2 p + (1 - p) log2(1 - p)) denotes the binary
entropy function. The first bound is achieved by the algorithm that iteratively queries
a 1-median of the nodes not ruled out yet; the second bound by careful repeated
invocations of a multiplicative weights algorithm.
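The noise-free (p = 1) strategy of repeatedly querying a 1-median can be sketched as follows for unweighted graphs; the oracle interface `answer` and all names are illustrative, not from the paper:

```python
from collections import deque

def bfs_dist(adj, src):
    # All shortest-path distances from src in an unweighted graph.
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def find_target(adj, answer):
    # answer(q) returns ("found",) if q is the target, else ("edge", u)
    # with u a neighbor of q on a shortest path from q to the target.
    dists = {v: bfs_dist(adj, v) for v in adj}
    candidates = set(adj)
    queries = 0
    while True:
        # Query a 1-median: a vertex minimizing total distance to the
        # remaining candidates; this rules out a constant fraction per step.
        q = min(adj, key=lambda v: sum(dists[v][c] for c in candidates))
        queries += 1
        resp = answer(q)
        if resp[0] == "found":
            return q, queries
        u = resp[1]
        # Keep only candidates for which edge (q, u) lies on a shortest
        # path from q, i.e. d(u, c) == d(q, c) - 1.
        candidates = {c for c in candidates if dists[u][c] == dists[q][c] - 1}
```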
Even for p = 1, we show several hardness results for the problem of
determining whether a target can be found using at most K queries. Our upper bound of
ceil(log2 n) implies a quasipolynomial-time algorithm for undirected connected
graphs; we show that this is best possible under the Strong Exponential Time
Hypothesis (SETH). Furthermore, for directed graphs, or for undirected graphs
with non-uniform node querying costs, the problem is PSPACE-complete. For a
semi-adaptive version, in which one may query r nodes each in k rounds, we
show membership in Sigma_{2k-1} in the polynomial hierarchy, and hardness
for Sigma_{2k-5}.
Gaussian Processes for Text Regression
Text Regression is the task of modelling and predicting numerical indicators or response variables from textual data. It arises in a range of different problems, from sentiment and emotion analysis to text-based forecasting. Most models in the literature apply simple text representations such as bag-of-words and predict response variables in the form of point estimates. These simplifying assumptions ignore important information coming from the
data such as the underlying uncertainty present in the outputs and the linguistic structure in the textual inputs. The former is particularly important when the response variables come from human annotations while the latter can capture linguistic phenomena that go beyond simple lexical properties of a text.
In this thesis our aim is to advance the state of the art in Text Regression by improving these two aspects: better uncertainty modelling in the response variables and improved text representations. Our main workhorse to achieve these goals is Gaussian Processes (GPs), a Bayesian kernelised probabilistic framework. GP-based regression models the response variables as well-calibrated probability distributions, providing additional information in
predictions which in turn can improve subsequent decision making. They also model the data using kernels, enabling richer representations based on similarity measures between texts.
To be able to reach our main goals we propose new kernels for text which aim at capturing richer linguistic information. These kernels are then parameterised and learned from the data using efficient model selection procedures that are enabled by the GP framework. Finally
we also capitalise on recent advances in the GP literature to better capture uncertainty in the response variables, such as multi-task learning and models that can incorporate non-Gaussian variables through the use of warping functions.
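The GP regression machinery described above can be sketched in a few lines, assuming a standard squared-exponential kernel in place of the text kernels the thesis develops (all names are illustrative; a production system would use a GP library rather than this toy linear solver):

```python
import math

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel on scalars; the kernel is the slot where
    # richer similarity measures between texts would be plugged in.
    return variance * math.exp(-0.5 * ((x1 - x2) / lengthscale) ** 2)

def solve(A, b):
    # Gaussian elimination with partial pivoting (small systems only).
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gp_predict(xs, ys, x_star, noise=1e-6):
    # Posterior mean m = k*^T (K + noise I)^-1 y and posterior variance
    # v = k(x*, x*) - k*^T (K + noise I)^-1 k*: a full predictive
    # distribution, not just a point estimate.
    K = [[rbf_kernel(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_star = [rbf_kernel(x, x_star) for x in xs]
    alpha = solve(K, ys)
    beta = solve(K, k_star)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf_kernel(x_star, x_star) - sum(k * b for k, b in zip(k_star, beta))
    return mean, var
```

Note how the variance grows back to the kernel's prior variance far from the training inputs: this is the calibrated uncertainty that downstream decision making can exploit.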
Our proposed architectures are benchmarked in two Text Regression applications: Emotion Analysis and Machine Translation Quality Estimation. Overall we are able to obtain better results compared to baselines while also providing uncertainty estimates for predictions in the form of posterior distributions. Furthermore, we show how these models can be probed to obtain insights about the relation between the data and the response variables, and also how to apply predictive distributions in subsequent decision making procedures.
Semi-Lazy Learning Approach to Dynamic Spatio-Temporal Data Analysis
Ph.D. (Doctor of Philosophy) thesis.
Fine-grained complexity and algorithm engineering of geometric similarity measures
Point sets and sequences are fundamental geometric objects that arise in any application that considers movement data, geometric shapes, and many more. A crucial task on these objects is to measure their similarity. Therefore, this thesis presents results on algorithms, complexity lower bounds, and algorithm engineering of the most important point set and sequence similarity measures, such as the Fréchet distance, the Fréchet distance under translation, and the Hausdorff distance under translation. As an extension to the mere computation of similarity, the approximate near neighbor problem for the continuous Fréchet distance on time series is also considered, and matching upper and lower bounds are shown.
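The coupled-traversal idea behind the Fréchet distance can be illustrated with its discrete variant (the thesis treats the continuous case; this toy dynamic program is not from the thesis):

```python
def discrete_frechet(p, q, dist=lambda a, b: abs(a - b)):
    # Discrete Fréchet distance by dynamic programming: two walkers
    # traverse p and q monotonically; minimize the maximum pairwise
    # distance ("dog leash length") over all coupled traversals.
    n, m = len(p), len(q)
    memo = [[None] * m for _ in range(n)]
    def c(i, j):
        if memo[i][j] is None:
            d = dist(p[i], q[j])
            if i == 0 and j == 0:
                memo[i][j] = d
            elif i == 0:
                memo[i][j] = max(c(0, j - 1), d)
            elif j == 0:
                memo[i][j] = max(c(i - 1, 0), d)
            else:
                memo[i][j] = max(min(c(i - 1, j), c(i, j - 1),
                                     c(i - 1, j - 1)), d)
        return memo[i][j]
    return c(n - 1, m - 1)
```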