3,287 research outputs found
BTW: a web server for Boltzmann time warping of gene expression time series
Dynamic time warping (DTW) is a well-known quadratic-time algorithm that determines the smallest distance and an optimal alignment between two numerical sequences, possibly of different length. Originally developed for speech recognition, the method has been used in data mining, medicine and bioinformatics. For gene expression time series data, time warping distance is arguably a more flexible tool than either Euclidean distance or the correlation coefficient for finding genes with similar temporal expression, and hence possibly related biological function, especially since time warping accommodates sequences of different length. The BTW web server allows a user to upload two tab-separated text files A, B of gene expression data, each possibly having a different number of time intervals of different durations. BTW then computes the time warping distance between each gene of A and each gene of B, using a recently developed symmetric algorithm that additionally computes the Boltzmann partition function and outputs Boltzmann pair probabilities. These pair probabilities, not available in any other existing software, suggest possible biological significance of certain positions in an optimal time warping alignment. Availability:
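For reference, the classic quadratic-time recurrence the abstract mentions can be sketched in a few lines. The sketch below is plain DTW, not BTW's symmetric Boltzmann-weighted algorithm:

```python
# Minimal sketch of classic dynamic time warping (DTW); illustrative only,
# not the symmetric Boltzmann-weighted algorithm used by the BTW server.
import math

def dtw_distance(a, b):
    """Smallest cumulative cost of aligning sequences a and b (O(len(a)*len(b)))."""
    n, m = len(a), len(b)
    # dp[i][j] = cost of the best alignment of a[:i] and b[:j]
    dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend an alignment by a match, an insertion, or a deletion
            dp[i][j] = cost + min(dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

# Sequences of different length are compared directly:
print(dtw_distance([0, 1, 2, 3], [0, 0, 1, 2, 2, 3]))  # 0.0: same shape, different pacing
```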
Towards a Holistic Integration of Spreadsheets with Databases: A Scalable Storage Engine for Presentational Data Management
Spreadsheet software is the tool of choice for interactive ad-hoc data
management, with adoption by billions of users. However, spreadsheets are not
scalable, unlike database systems. On the other hand, database systems, while
highly scalable, do not support interactivity as a first-class primitive. We
are developing DataSpread, to holistically integrate spreadsheets as a
front-end interface with databases as a back-end datastore, providing
scalability to spreadsheets, and interactivity to databases, an integration we
term presentational data management (PDM). In this paper, we make a first step
towards this vision: developing a storage engine for PDM, studying how to
flexibly represent spreadsheet data within a database and how to support and
maintain access by position. We first conduct an extensive survey of
spreadsheet use to motivate our functional requirements for a storage engine
for PDM. We develop a natural set of mechanisms for flexibly representing
spreadsheet data and demonstrate that identifying the optimal representation is
NP-Hard; however, we develop an efficient approach to identify the optimal
representation from an important and intuitive subclass of representations. We
extend our mechanisms with positional access mechanisms that don't suffer from
cascading update issues, leading to constant time access and modification
performance. We evaluate these representations on a workload of typical
spreadsheets and spreadsheet operations, providing up to 20% reduction in
storage and up to 50% reduction in formula evaluation time.
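One common way to obtain positional access without cascading renumbering (a sketch of the general idea, not DataSpread's actual storage engine) is to store rows under stable hidden ids and keep the row order in a separate structure:

```python
# Hedged sketch: spreadsheet rows keyed by stable hidden ids plus a separate
# order structure, so inserting a row does not renumber (and hence rewrite)
# every subsequent stored row. Illustrative only, not DataSpread's engine.
import itertools

class PositionalStore:
    _ids = itertools.count()

    def __init__(self):
        self._order = []          # row position -> hidden id (order structure)
        self._data = {}           # hidden id -> row contents (stable storage keys)

    def insert_row(self, pos, contents):
        rid = next(self._ids)
        self._order.insert(pos, rid)   # only the order structure shifts;
        self._data[rid] = contents     # stored rows keep their keys unchanged

    def get_row(self, pos):
        return self._data[self._order[pos]]

store = PositionalStore()
store.insert_row(0, ["a1", "b1"])
store.insert_row(1, ["a2", "b2"])
store.insert_row(1, ["new", "row"])    # no stored row is rekeyed or rewritten
print(store.get_row(1))                # ['new', 'row']
```

A real engine would replace the Python list with a counted B-tree or similar order-statistic structure, so positional lookups and shifts cost far less than a linear scan.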
Route Planning in Transportation Networks
We survey recent advances in algorithms for route planning in transportation
networks. For road networks, we show that one can compute driving directions in
milliseconds or less even at continental scale. A variety of techniques provide
different trade-offs between preprocessing effort, space requirements, and
query time. Some algorithms can answer queries in a fraction of a microsecond,
while others can deal efficiently with real-time traffic. Journey planning on
public transportation systems, although conceptually similar, is a
significantly harder problem due to its inherent time-dependent and
multicriteria nature. Although exact algorithms are fast enough for interactive
queries on metropolitan transit systems, dealing with continent-sized instances
requires simplifications or heavy preprocessing. The multimodal route planning
problem, which seeks journeys combining schedule-based transportation (buses,
trains) with unrestricted modes (walking, driving), is even harder, relying on
approximate solutions even for metropolitan inputs.
Comment: This is an updated version of the technical report MSR-TR-2014-4, previously published by Microsoft Research. This work was mostly done while the authors Daniel Delling, Andrew Goldberg, and Renato F. Werneck were at Microsoft Research Silicon Valley.
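As a concrete baseline for the techniques surveyed, the textbook point-to-point Dijkstra algorithm below is the reference that preprocessing-based speedup methods (contraction hierarchies, hub labels, and the like) are measured against; the graph encoding is an assumption for illustration:

```python
# Baseline point-to-point Dijkstra with early termination at the target.
# The surveyed speedup techniques trade preprocessing effort and space for
# query times orders of magnitude below this baseline.
import heapq

def dijkstra(graph, source, target):
    """graph: dict mapping node -> list of (neighbor, edge_weight)."""
    dist = {source: 0}
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == target:
            return d                     # settled nodes have final distances
        if d > dist.get(u, float("inf")):
            continue                     # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")

roads = {"a": [("b", 4), ("c", 2)], "c": [("b", 1)], "b": []}
print(dijkstra(roads, "a", "b"))  # 3, via c
```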
On Mini-Batch Training with Varying Length Time Series
In real-world time series recognition applications, it is possible to have
data with varying length patterns. However, when using artificial neural
networks (ANN), it is standard practice to use fixed-sized mini-batches. To do
this, time series data with varying lengths are typically normalized so that
all the patterns are the same length. Normally, this is done using zero padding
or truncation without much consideration. We propose a novel method of
normalizing the lengths of the time series in a dataset by exploiting the
dynamic matching ability of Dynamic Time Warping (DTW). In this way, the time
series lengths in a dataset can be set to a fixed size while maintaining
features typical to the dataset. In the experiments, all 11 datasets with
varying length time series from the 2018 UCR Time Series Archive are used. We
evaluate the proposed method by comparing it with 18 other length normalization
methods on a Convolutional Neural Network (CNN), a Long Short-Term Memory
network (LSTM), and a Bidirectional LSTM (BLSTM).
Comment: Accepted to ICASSP 2020.
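A hedged sketch of how DTW-guided length normalization could work: warp each series onto a fixed-length reference and average the values aligned to each reference step. The paper's exact procedure may differ; this only illustrates replacing zero padding or truncation with dynamic matching.

```python
# Hedged sketch of DTW-guided length normalization, assuming a fixed-length
# reference series; not necessarily the paper's exact procedure.
import math

def dtw_path(a, b):
    n, m = len(a), len(b)
    dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = abs(a[i-1] - b[j-1]) + min(dp[i-1][j-1], dp[i-1][j], dp[i][j-1])
    # backtrack the optimal warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min(dp[i-1][j-1], dp[i-1][j], dp[i][j-1])
        if step == dp[i-1][j-1]:
            i, j = i - 1, j - 1
        elif step == dp[i-1][j]:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def normalize_length(series, reference):
    """Return a copy of `series` warped to len(reference) samples."""
    buckets = [[] for _ in reference]
    for i, j in dtw_path(series, reference):
        buckets[j].append(series[i])      # values matched to reference step j
    return [sum(b) / len(b) for b in buckets]

print(normalize_length([0, 1, 2, 3, 2, 1, 0], [0, 2, 3, 2, 0]))  # 7 samples -> 5
```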
Techniques to explore time-related correlation in large datasets
The next generation of database management and computing systems will be significantly complex, with data distributed both in functionality and operation. The complexity arises, at least in part, from the data types involved and the types of information requests posed by database users. Time sequence databases are generated in many practical applications. Detecting similar sequences and subsequences within these databases is an important research area that has recently attracted considerable interest. Previous studies in this area have concentrated on calculating similarity between (sub)sequences of equal size. Comparing (sub)sequences of unequal size has been an open problem for some time, and it is an important and non-trivial one. In this dissertation, we propose a solution to the problem of finding sequences, in a database of unequal-sized sequences, that are similar to a given query sequence. A paradigm to search for pairs of similar subsequences, of equal or unequal size, within a pair of sequences is also presented. We put forward new approaches for sequence time-scale reduction, feature aggregation and object recognition. To make the search for similar sequences efficient, we propose an indexing technique for the unequal-sized sequence database. We also introduce a unique indexing technique to index identified subsequences within a reference sequence; this index is subsequently employed to report similar pairs of subsequences when presented with a query sequence. We present several experimental results and compare the proposed framework with previous work in this area.
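As a point of reference for the equal-size case that the dissertation generalizes, the naive linear scan below slides a query over a longer sequence and reports the best-matching same-length subsequence under Euclidean distance; the proposed indexing techniques aim to avoid exactly this exhaustive scan.

```python
# Illustrative baseline for equal-size subsequence matching; the dissertation's
# indexing and unequal-size comparison go beyond this linear scan.
import math

def best_match(sequence, query):
    """Return (start, distance) of the query's closest same-length subsequence."""
    m = len(query)
    best_start, best_dist = -1, math.inf
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        dist = math.sqrt(sum((w - q) ** 2 for w, q in zip(window, query)))
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist

print(best_match([5, 0, 1, 2, 9, 0], [0, 1, 2]))  # (1, 0.0)
```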
Modelling the Galaxy in the era of Gaia
The body of photometric and astrometric data on stars in the Galaxy has been
growing very fast in recent years (Hipparcos/Tycho, OGLE-3, 2MASS, DENIS,
UCAC2, SDSS, RAVE, Pan-STARRS, HERMES, ...) and in two years ESA will launch
the Gaia satellite, which will measure astrometric data of unprecedented
precision for a billion stars. On account of our position within the Galaxy and
the complex observational biases that are built into most catalogues, dynamical
models of the Galaxy are a prerequisite for full exploitation of these catalogues.
On account of the enormous detail in which we can observe the Galaxy, models of
great sophistication are required. Moreover, in addition to models we require
algorithms for observing them with the same errors and biases as occur in real
observational programs, and statistical algorithms for determining the extent
to which a model is compatible with a given body of data.
JD5 reviewed the status of our knowledge of the Galaxy, the different ways in
which we could model the Galaxy, and what will be required to extract our
science goals from the data that will be on hand when the Gaia Catalogue
becomes available.
Comment: Proceedings of Joint Discussion 5 at IAU XXVII, Rio de Janeiro, August 2009; 31 pages.
TILDE-Q: A Transformation Invariant Loss Function for Time-Series Forecasting
Time-series forecasting has caught increasing attention in the AI research
field due to its importance in solving real-world problems across different
domains, such as energy, weather, traffic, and economy. Across many types of
data, handling the drastic changes, temporal patterns, and shapes of
sequential data, with which previous models struggle, has become a pressing
issue. One reason is that most time-series forecasting approaches minimize
norm distances, such as mean absolute error (MAE) or mean square error (MSE),
as loss functions. These loss functions neither model temporal dynamics nor
capture the shape of the signal, and they often cause models to misbehave and
return results uncorrelated with the original time series. To be effective, a
loss function has to be invariant to the set of distortions between two time
series rather than just comparing exact values. In this paper, we
propose a novel loss function, called TILDE-Q (Transformation Invariant Loss
function with Distance EQuilibrium), that not only considers the distortions in
amplitude and phase but also allows models to capture the shape of time-series
sequences. In addition, TILDE-Q supports modeling periodic and non-periodic
temporal dynamics at the same time. We evaluate the effectiveness of TILDE-Q by
conducting extensive experiments with respect to periodic and non-periodic
conditions of data, from naive models to state-of-the-art models. The
experiment results indicate that the models trained with TILDE-Q outperform
those trained with other training metrics (e.g., MSE, dynamic time warping
(DTW), temporal distortion index (TDI), and longest common subsequence (LCSS)).
Comment: 9 pages of paper, 2 pages of references, and 7 pages of appendix. Submitted as a conference paper to ICLR 2023.
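To illustrate the invariance property the paper argues for (though not the actual TILDE-Q formulation), the toy loss below compares mean-centered series and is therefore insensitive to a constant amplitude shift that would dominate plain MSE:

```python
# Toy example of a transformation-invariant training objective; NOT the
# TILDE-Q loss itself, only a minimal demonstration of shift invariance.
def shift_invariant_loss(pred, target):
    """MSE between mean-centered series; insensitive to a constant offset."""
    pm = sum(pred) / len(pred)
    tm = sum(target) / len(target)
    return sum(((p - pm) - (t - tm)) ** 2 for p, t in zip(pred, target)) / len(pred)

wave = [0.0, 1.0, 0.0, -1.0]
shifted = [x + 5.0 for x in wave]           # same shape, offset by +5
print(shift_invariant_loss(shifted, wave))  # 0.0: shapes match despite the shift
```

On this example, plain MSE would report 25.0 even though prediction and target have identical shape.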
Design and Implementation of a Middleware for Uniform, Federated and Dynamic Event Processing
In recent years, real-time processing of massive event streams has become an important topic in the area of data analytics. It will become even more important in the future due to cheap sensors, a growing number of devices and their ubiquitous inter-connection, also known as the Internet of Things (IoT). Academia, industry and the open source community have developed several event processing (EP) systems that allow users to define, manage and execute continuous queries over event streams. They achieve a significantly better performance than the traditional "store-then-process" approach, in which events are first stored and indexed in a database. Because EP systems have different roots and because of the lack of standardization, the system landscape has become highly heterogeneous. Today's EP systems differ in APIs, execution behaviors and query languages. This thesis presents the design and implementation of a novel middleware that abstracts from different EP systems and provides a uniform API, execution behavior and query language to users and developers. As a consequence, the presented middleware overcomes the problem of vendor lock-in, and different EP systems are enabled to cooperate with each other. In practice, event streams differ dramatically in volume and velocity. We therefore show how the middleware can connect not only to different EP systems, but also to database systems and a native implementation. Emerging applications such as the IoT raise novel challenges and require EP to be more dynamic. We present extensions to the middleware that enable self-adaptivity, which is needed in context-sensitive applications and those that deal with constantly varying sets of event producers and consumers. Lastly, we extend the middleware to fully support the processing of events containing spatial data and to run distributed as a federation of heterogeneous EP systems.
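A minimal sketch of the uniformity idea, assuming a hypothetical adapter interface (the thesis's concrete API and engine bindings are not shown here): each supported EP system is wrapped in an adapter implementing one common contract, so queries and event feeds are written once.

```python
# Hedged sketch of a uniform facade over heterogeneous EP systems via the
# adapter pattern. Class and method names are hypothetical placeholders.
from abc import ABC, abstractmethod

class EPAdapter(ABC):
    """Uniform interface every engine-specific adapter must implement."""
    @abstractmethod
    def register_query(self, query: str, callback):
        ...
    @abstractmethod
    def push_event(self, event: dict):
        ...

class InMemoryAdapter(EPAdapter):
    """Stand-in 'native implementation': filters events by a key=value query."""
    def __init__(self):
        self._subscriptions = []
    def register_query(self, query, callback):
        key, _, value = query.partition("=")
        self._subscriptions.append((key, value, callback))
    def push_event(self, event):
        for key, value, callback in self._subscriptions:
            if str(event.get(key)) == value:
                callback(event)

middleware = InMemoryAdapter()          # another adapter could wrap a real EP engine
middleware.register_query("type=alert", lambda e: print("matched:", e))
middleware.push_event({"type": "alert", "sensor": 7})    # matched: {...}
middleware.push_event({"type": "reading", "sensor": 7})  # no output
```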
Correlation Clustering
Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. The core step of the KDD process is the application of a Data Mining algorithm in order to produce a particular enumeration of patterns and relationships in large databases. Clustering is one of the major data mining techniques and aims at grouping the data objects into meaningful classes (clusters) such that the similarity of objects within clusters is maximized, and the similarity of objects from different clusters is minimized. This can serve to group customers with similar interests, or to group genes with related functionalities.
High-dimensional feature spaces currently pose a particular challenge for clustering techniques. Thanks to modern means of data collection, real data sets usually contain many features. These features are often noisy or exhibit correlations among each other. However, since these effects vary in relevance across different parts of the data set, irrelevant features cannot be discarded in advance. The selection of relevant features must therefore be integrated into the data mining technique.
For about ten years, specialized clustering approaches have been developed to cope with problems in high-dimensional data better than classic clustering approaches do. Often, however, problems of very different natures are not distinguished from one another. A main objective of this thesis is therefore a systematic classification of the diverse approaches developed in recent years according to their task definition, their basic strategy, and their algorithmic approach. We discern as main categories the search for clusters (i) with respect to closeness of objects in axis-parallel subspaces, (ii) with respect to common behavior (patterns) of objects in axis-parallel subspaces, and (iii) with respect to closeness of objects in arbitrarily oriented subspaces (so-called correlation clusters).
For the third category, the remaining parts of the thesis describe novel approaches. A first approach is the adaptation of density-based clustering to the problem of correlation clustering. The starting point here is the first density-based approach in this field, the algorithm 4C. Subsequently, enhancements and variations of this approach are discussed that allow for more robust, more efficient, or more effective behavior, or that even find hierarchies of correlation clusters and the corresponding subspaces. The density-based approach to correlation clustering, however, is fundamentally unable to solve some issues, since it requires an analysis of local neighborhoods, which is problematic in high-dimensional data. Therefore, a novel method is proposed that tackles the correlation clustering problem with a global approach. Finally, a method is proposed to derive models for correlation clusters, allowing for an interpretation of the clusters and facilitating more thorough analysis in the corresponding domain science. Possible applications of these models are proposed and discussed.
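A minimal sketch of the intuition behind density-based correlation clustering such as 4C: inspect the principal components of a point's local neighborhood, and treat directions with vanishing variance as evidence that the neighbors lie near an arbitrarily oriented lower-dimensional subspace. The threshold and helper below are illustrative, not the algorithm's actual parameters.

```python
# Hedged sketch of the local-PCA intuition behind 4C-style correlation
# clustering; thresholds and the helper are illustrative assumptions.
import numpy as np

def correlation_dimension(neighborhood, var_threshold=0.01):
    """Number of 'strong' principal directions in a local point set."""
    centered = neighborhood - neighborhood.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]
    # directions whose relative variance exceeds the threshold are 'strong'
    return int(np.sum(eigenvalues / eigenvalues.sum() > var_threshold))

rng = np.random.default_rng(0)
t = rng.uniform(0, 1, size=50)
line3d = np.column_stack([t, 2 * t, -t]) + rng.normal(0, 1e-3, (50, 3))
print(correlation_dimension(line3d))   # 1: the neighbors concentrate on a line
```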