
    BTW: a web server for Boltzmann time warping of gene expression time series

    Dynamic time warping (DTW) is a well-known quadratic-time algorithm for determining the smallest distance and an optimal alignment between two numerical sequences, possibly of different lengths. Originally developed for speech recognition, the method has been used in data mining, medicine and bioinformatics. For gene expression time series data, time warping distance is arguably a more flexible tool than either Euclidean distance or the correlation coefficient for finding genes with similar temporal expression, and hence possibly related biological function, especially since time warping accommodates sequences of different lengths. The BTW web server allows a user to upload two tab-separated text files A and B of gene expression data, each possibly having a different number of time intervals of different durations. BTW then computes the time warping distance between each gene of A and each gene of B, using a recently developed symmetric algorithm which additionally computes the Boltzmann partition function and outputs Boltzmann pair probabilities. These pair probabilities, not available in any other existing software, suggest possible biological significance of certain positions in an optimal time warping alignment. Availability:
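
    For reference, the quadratic-time dynamic program behind plain DTW can be sketched in a few lines of Python. This is a generic textbook implementation to illustrate the distance the abstract refers to, not the symmetric Boltzmann-weighted variant that BTW implements; the function name dtw_distance is chosen here only for illustration.

        import numpy as np

        def dtw_distance(a, b):
            """Plain O(len(a) * len(b)) dynamic time warping distance
            between two 1-D numeric sequences of possibly different length."""
            n, m = len(a), len(b)
            # cost[i, j] = best cumulative cost of aligning a[:i] with b[:j]
            cost = np.full((n + 1, m + 1), np.inf)
            cost[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    d = abs(a[i - 1] - b[j - 1])              # local distance
                    cost[i, j] = d + min(cost[i - 1, j],      # skip a step in a
                                         cost[i, j - 1],      # skip a step in b
                                         cost[i - 1, j - 1])  # match
            return cost[n, m]

        # two expression profiles sampled over different numbers of intervals
        print(dtw_distance([0.1, 0.9, 1.2, 0.3], [0.0, 1.0, 0.4]))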

    Towards a Holistic Integration of Spreadsheets with Databases: A Scalable Storage Engine for Presentational Data Management

    Spreadsheet software is the tool of choice for interactive ad-hoc data management, with adoption by billions of users. However, spreadsheets are not scalable, unlike database systems. On the other hand, database systems, while highly scalable, do not support interactivity as a first-class primitive. We are developing DataSpread to holistically integrate spreadsheets as a front-end interface with databases as a back-end datastore, providing scalability to spreadsheets and interactivity to databases, an integration we term presentational data management (PDM). In this paper, we take a first step towards this vision: developing a storage engine for PDM, studying how to flexibly represent spreadsheet data within a database and how to support and maintain access by position. We first conduct an extensive survey of spreadsheet use to motivate our functional requirements for a storage engine for PDM. We develop a natural set of mechanisms for flexibly representing spreadsheet data and demonstrate that identifying the optimal representation is NP-Hard; however, we develop an efficient approach to identify the optimal representation from an important and intuitive subclass of representations. We extend our mechanisms with positional access mechanisms that do not suffer from cascading update issues, leading to constant-time access and modification performance. We evaluate these representations on a workload of typical spreadsheets and spreadsheet operations, obtaining up to a 20% reduction in storage and up to a 50% reduction in formula evaluation time.
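
    The representation trade-off can be made concrete with a small sketch. The dense row-major layout and the (row, col, value) key-value layout below are generic illustrations of the kind of alternatives such a storage engine weighs, not DataSpread's actual schema, and choose_representation is a hypothetical helper that simply compares estimated storage costs.

        def storage_cost_dense(cells, n_rows, n_cols, cell_bytes=8):
            """Dense row-major table: every position is stored, empty or not."""
            return n_rows * n_cols * cell_bytes

        def storage_cost_sparse(cells, key_bytes=8, cell_bytes=8):
            """(row, col, value) triples: only non-empty cells are stored."""
            return len(cells) * (2 * key_bytes + cell_bytes)

        def choose_representation(cells, n_rows, n_cols):
            """Hypothetical helper: pick the cheaper of the two layouts."""
            dense = storage_cost_dense(cells, n_rows, n_cols)
            sparse = storage_cost_sparse(cells)
            return ("dense", dense) if dense <= sparse else ("sparse", sparse)

        # a mostly empty 1000 x 100 sheet with 500 filled cells
        cells = {(i, 0): float(i) for i in range(500)}
        print(choose_representation(cells, 1000, 100))   # -> ('sparse', ...)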

    Route Planning in Transportation Networks

    We survey recent advances in algorithms for route planning in transportation networks. For road networks, we show that one can compute driving directions in milliseconds or less even at continental scale. A variety of techniques provide different trade-offs between preprocessing effort, space requirements, and query time. Some algorithms can answer queries in a fraction of a microsecond, while others can deal efficiently with real-time traffic. Journey planning on public transportation systems, although conceptually similar, is a significantly harder problem due to its inherent time-dependent and multicriteria nature. Although exact algorithms are fast enough for interactive queries on metropolitan transit systems, dealing with continent-sized instances requires simplifications or heavy preprocessing. The multimodal route planning problem, which seeks journeys combining schedule-based transportation (buses, trains) with unrestricted modes (walking, driving), is even harder, relying on approximate solutions even for metropolitan inputs.
    Comment: This is an updated version of the technical report MSR-TR-2014-4, previously published by Microsoft Research. This work was mostly done while the authors Daniel Delling, Andrew Goldberg, and Renato F. Werneck were at Microsoft Research Silicon Valley.
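
    For context, the unaccelerated baseline that these speedup techniques improve on is Dijkstra's algorithm on a weighted road graph; the sketch below is a textbook implementation with an invented toy graph, included only to anchor what a "query" means here.

        import heapq

        def dijkstra(graph, source, target):
            """Textbook Dijkstra: graph maps node -> list of (neighbor, weight).
            Returns the shortest travel time from source to target, or None."""
            dist = {source: 0.0}
            heap = [(0.0, source)]
            while heap:
                d, u = heapq.heappop(heap)
                if u == target:
                    return d
                if d > dist.get(u, float("inf")):
                    continue                      # stale queue entry
                for v, w in graph.get(u, []):
                    nd = d + w
                    if nd < dist.get(v, float("inf")):
                        dist[v] = nd
                        heapq.heappush(heap, (nd, v))
            return None

        road = {"A": [("B", 4.0), ("C", 1.0)], "C": [("B", 2.0)], "B": []}
        print(dijkstra(road, "A", "B"))   # 3.0, via C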

    On Mini-Batch Training with Varying Length Time Series

    In real-world time series recognition applications, the data can contain patterns of varying length. However, when using artificial neural networks (ANN), it is standard practice to use fixed-sized mini-batches. To do this, time series with varying lengths are typically normalized so that all patterns have the same length, normally using zero padding or truncation without much consideration. We propose a novel method of normalizing the lengths of the time series in a dataset by exploiting the dynamic matching ability of Dynamic Time Warping (DTW). In this way, the time series lengths in a dataset can be set to a fixed size while maintaining features typical of the dataset. In the experiments, all 11 datasets with varying length time series from the 2018 UCR Time Series Archive are used. We evaluate the proposed method by comparing it with 18 other length normalization methods on a Convolutional Neural Network (CNN), a Long Short-Term Memory network (LSTM), and a Bidirectional LSTM (BLSTM).
    Comment: Accepted to ICASSP 202
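
    One way to read "length normalization by dynamic matching" is sketched below: each series is DTW-aligned to a fixed-length reference and resampled along the alignment path. This is an illustrative interpretation under that assumption, not necessarily the authors' exact procedure; dtw_path and normalize_length are names invented for the sketch.

        import numpy as np

        def dtw_path(a, b):
            """Return one optimal DTW warping path [(i, j), ...] between 1-D sequences."""
            n, m = len(a), len(b)
            cost = np.full((n + 1, m + 1), np.inf)
            cost[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost[i, j] = abs(a[i - 1] - b[j - 1]) + min(
                        cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
            path, (i, j) = [], (n, m)
            while i > 0 and j > 0:
                path.append((i - 1, j - 1))
                step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
                i, j = [(i - 1, j - 1), (i - 1, j), (i, j - 1)][step]
            return path[::-1]

        def normalize_length(series, reference):
            """Resample `series` to len(reference) values by averaging the points
            that the DTW path aligns to each position of the reference."""
            series = np.asarray(series, dtype=float)
            buckets = [[] for _ in range(len(reference))]
            for i, j in dtw_path(series, reference):
                buckets[j].append(series[i])
            return np.array([np.mean(b) for b in buckets])

        # a length-7 series squeezed onto a length-4 reference shape
        print(normalize_length([1, 2, 3, 4, 5, 4, 2], [1, 3, 5, 2]))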

    Proceedings of the Third International Workshop on Management of Uncertain Data (MUD2009)


    Techniques to explore time-related correlation in large datasets

    The next generation of database management and computing systems will be significantly complex, with data distributed both in functionality and operation. The complexity arises, at least in part, from the data types involved and the types of information request posed by the database user. Time sequence databases are generated in many practical applications. Detecting similar sequences and subsequences within these databases is an important research area and has generated a lot of interest recently. Previous studies in this area have concentrated on calculating similarity between (sub)sequences of equal size. The question of comparing (sub)sequences of unequal size to report similarity has been an open problem for some time. The problem is an important and non-trivial one. In this dissertation, we propose a solution to the problem of finding sequences, in a database of unequal sized sequences, that are similar to a given query sequence. A paradigm to search for pairs of similar subsequences, of equal or unequal size, within a pair of sequences is also presented. We put forward new approaches for sequence time-scale reduction, feature aggregation and object recognition. To make the search for similar sequences efficient, we propose an indexing technique to index the unequal-sized sequence database. We also introduce a unique indexing technique to index identified subsequences within a reference sequence. This index is subsequently employed to report similar pairs of subsequences when presented with a query sequence. We present several experimental results and also compare the proposed framework with previous work in this area.
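
    Time-scale reduction and feature aggregation of a sequence before indexing is commonly done by averaging consecutive segments (piecewise aggregate approximation); the sketch below shows only that generic idea and is not claimed to be the dissertation's specific method.

        import numpy as np

        def piecewise_aggregate(series, n_segments):
            """Reduce a numeric sequence to n_segments values by averaging roughly
            equal-sized consecutive segments, a common time-scale reduction used
            before indexing time series."""
            series = np.asarray(series, dtype=float)
            # np.array_split tolerates lengths not divisible by n_segments
            return np.array([chunk.mean() for chunk in np.array_split(series, n_segments)])

        query = [2.0, 2.1, 2.4, 3.0, 3.5, 3.1, 2.2, 1.8]
        print(piecewise_aggregate(query, 4))   # 4-dimensional feature vector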

    Modelling the Galaxy in the era of Gaia

    The body of photometric and astrometric data on stars in the Galaxy has been growing very fast in recent years (Hipparcos/Tycho, OGLE-3, 2MASS, DENIS, UCAC2, SDSS, RAVE, Pan-STARRS, Hermes, ...) and in two years ESA will launch the Gaia satellite, which will measure astrometric data of unprecedented precision for a billion stars. On account of our position within the Galaxy and the complex observational biases that are built into most catalogues, dynamical models of the Galaxy are a prerequisite for full exploitation of these catalogues. On account of the enormous detail in which we can observe the Galaxy, models of great sophistication are required. Moreover, in addition to models we require algorithms for observing them with the same errors and biases as occur in real observational programs, and statistical algorithms for determining the extent to which a model is compatible with a given body of data. JD5 reviewed the status of our knowledge of the Galaxy, the different ways in which we could model the Galaxy, and what will be required to extract our science goals from the data that will be on hand when the Gaia Catalogue becomes available.
    Comment: Proceedings of Joint Discussion 5 at IAU XXVII, Rio de Janeiro, August 2009; 31 pages
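
    A minimal illustration of the last point, testing whether a model is statistically compatible with data that carry known Gaussian errors, is a reduced chi-square comparison of mock observations against model predictions; all quantities below are invented for the example.

        import numpy as np

        rng = np.random.default_rng(0)

        # hypothetical model prediction and mock catalogue with Gaussian errors
        model_parallax = np.linspace(2.0, 4.0, 50)          # milliarcseconds
        sigma = 0.1 * np.ones_like(model_parallax)
        observed = model_parallax + rng.normal(0.0, sigma)

        # reduced chi-square near 1 indicates the model is compatible with the data
        chi2_red = np.sum(((observed - model_parallax) / sigma) ** 2) / observed.size
        print(f"reduced chi-square = {chi2_red:.2f}")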

    TILDE-Q: A Transformation Invariant Loss Function for Time-Series Forecasting

    Time-series forecasting has caught increasing attention in the AI research field due to its importance in solving real-world problems across different domains, such as energy, weather, traffic, and economics. As various kinds of data show, dealing with the drastic changes, temporal patterns, and shapes in sequential data that previous models predict poorly has become a pressing issue. This is largely because most work in time-series forecasting minimizes L_p-norm distances as loss functions, such as mean absolute error (MAE) or mean square error (MSE). These loss functions fall short both in modeling temporal dynamics and in capturing the shape of signals, and they often make models misbehave and return results uncorrelated with the original time series. To be effective, a loss function has to be invariant to a set of distortions between two time series instead of just comparing exact values. In this paper, we propose a novel loss function, called TILDE-Q (Transformation Invariant Loss function with Distance EQuilibrium), that not only considers distortions in amplitude and phase but also allows models to capture the shape of time-series sequences. In addition, TILDE-Q supports modeling periodic and non-periodic temporal dynamics at the same time. We evaluate the effectiveness of TILDE-Q by conducting extensive experiments with respect to periodic and non-periodic conditions of data, from naive models to state-of-the-art models. The experimental results indicate that models trained with TILDE-Q outperform those trained with other training metrics (e.g., MSE, dynamic time warping (DTW), temporal distortion index (TDI), and longest common subsequence (LCSS)).
    Comment: 9 pages paper, 2 pages references, and 7 pages appendix. Submitted as a conference paper to ICLR 202
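
    The baseline losses the abstract criticizes are simple to state; the sketch below contrasts plain MSE with an MSE computed on z-normalized sequences, one elementary way to be invariant to amplitude and offset distortions. It is only meant to make "distortion invariance" concrete and is not an implementation of TILDE-Q.

        import numpy as np

        def mse(pred, target):
            """Mean squared error: sensitive to amplitude and offset differences."""
            pred, target = np.asarray(pred, float), np.asarray(target, float)
            return np.mean((pred - target) ** 2)

        def znorm_mse(pred, target, eps=1e-8):
            """MSE after z-normalization: invariant to scaling and shifting,
            so it compares the shape of the two sequences instead."""
            z = lambda x: (x - x.mean()) / (x.std() + eps)
            return mse(z(np.asarray(pred, float)), z(np.asarray(target, float)))

        target = np.sin(np.linspace(0, 6.28, 100))
        pred = 3.0 * target + 5.0        # same shape, different amplitude and offset
        print(mse(pred, target))         # large
        print(znorm_mse(pred, target))   # ~0: the shapes match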

    Design and Implementation of a Middleware for Uniform, Federated and Dynamic Event Processing

    In recent years, real-time processing of massive event streams has become an important topic in the area of data analytics. It will become even more important in the future due to cheap sensors, a growing number of devices and their ubiquitous inter-connection, also known as the Internet of Things (IoT). Academia, industry and the open source community have developed several event processing (EP) systems that allow users to define, manage and execute continuous queries over event streams. They achieve a significantly better performance than the traditional "store-then-process" approach, in which events are first stored and indexed in a database. Because EP systems have different roots and because of the lack of standardization, the system landscape has become highly heterogeneous. Today's EP systems differ in APIs, execution behaviors and query languages. This thesis presents the design and implementation of a novel middleware that abstracts from different EP systems and provides a uniform API, execution behavior and query language to users and developers. As a consequence, the presented middleware overcomes the problem of vendor lock-in, and different EP systems are enabled to cooperate with each other. In practice, event streams differ dramatically in volume and velocity. We therefore show how the middleware can connect not only to different EP systems, but also to database systems and a native implementation. Emerging applications such as the IoT raise novel challenges and require EP to be more dynamic. We present extensions to the middleware that enable the self-adaptivity needed in context-sensitive applications and in those that deal with constantly varying sets of event producers and consumers. Lastly, we extend the middleware to fully support the processing of events containing spatial data and to be able to run distributed as a federation of heterogeneous EP systems.
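
    The core abstraction of such a middleware can be pictured as an adapter layer over heterogeneous engines; the interface and the in-memory adapter below are hypothetical names sketched for illustration and are not taken from the thesis.

        from abc import ABC, abstractmethod

        class EPAdapter(ABC):
            """Uniform interface the middleware exposes; each concrete adapter
            translates it into one engine's native API and query dialect."""

            @abstractmethod
            def register_query(self, query: str, callback):
                ...

            @abstractmethod
            def push_event(self, event: dict):
                ...

        class InMemoryAdapter(EPAdapter):
            """Hypothetical native fallback: evaluates trivial filter queries itself."""

            def __init__(self):
                self.queries = []

            def register_query(self, query, callback):
                # query is a simple field=value filter, e.g. "type=temperature"
                field, _, value = query.partition("=")
                self.queries.append((field, value, callback))

            def push_event(self, event):
                for field, value, callback in self.queries:
                    if str(event.get(field)) == value:
                        callback(event)

        # an adapter wrapping a real EP engine would implement the same two
        # methods but forward them to that engine's client library instead
        engine = InMemoryAdapter()
        engine.register_query("type=temperature", lambda e: print("match:", e))
        engine.push_event({"type": "temperature", "value": 21.5})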

    Correlation Clustering

    Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. The core step of the KDD process is the application of a Data Mining algorithm in order to produce a particular enumeration of patterns and relationships in large databases. Clustering is one of the major data mining techniques and aims at grouping the data objects into meaningful classes (clusters) such that the similarity of objects within clusters is maximized and the similarity of objects from different clusters is minimized. This can serve to group customers with similar interests, or to group genes with related functionalities. A particular current challenge for clustering techniques is high-dimensional feature spaces. Due to modern facilities of data collection, real data sets usually contain many features. These features are often noisy or exhibit correlations among each other. However, since these effects are relevant to different degrees in different parts of the data set, irrelevant features cannot be discarded in advance. The selection of relevant features must therefore be integrated into the data mining technique. For about ten years, specialized clustering approaches have been developed to cope with the problems of high-dimensional data better than classic clustering approaches. Often, however, these problems, which are of very different natures, are not distinguished from one another. A main objective of this thesis is therefore a systematic classification of the diverse approaches developed in recent years according to their task definition, their basic strategy, and their algorithmic approach. We discern as main categories the search for clusters (i) w.r.t. closeness of objects in axis-parallel subspaces, (ii) w.r.t. common behavior (patterns) of objects in axis-parallel subspaces, and (iii) w.r.t. closeness of objects in arbitrarily oriented subspaces (so-called correlation clusters). For the third category, the remaining parts of the thesis describe novel approaches. A first approach is the adaptation of density-based clustering to the problem of correlation clustering. The starting point here is the first density-based approach in this field, the algorithm 4C. Subsequently, enhancements and variations of this approach are discussed, allowing for more robust, more efficient, or more effective behavior, or even finding hierarchies of correlation clusters and the corresponding subspaces. The density-based approach to correlation clustering, however, is fundamentally unable to solve some issues, since it requires an analysis of local neighborhoods, which is problematic in high-dimensional data. Therefore, a novel method is proposed that tackles the correlation clustering problem with a global approach. Finally, a method is proposed to derive models for correlation clusters, allowing for an interpretation of the clusters and facilitating more thorough analysis in the corresponding domain science. Possible applications of these models are then proposed and discussed.
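
    The local-PCA idea underlying density-based correlation clustering in the spirit of 4C can be sketched as follows: inspect a point's eps-neighborhood, decompose its covariance matrix, and count the strong eigenvalues to estimate the local correlation dimensionality. The function name and the eps and delta thresholds below are illustrative choices, not the actual parameters of 4C.

        import numpy as np

        def local_correlation_dimension(data, point_idx, eps=0.5, delta=0.1):
            """Estimate how many directions of strong variance the eps-neighborhood
            of data[point_idx] has; a small count hints at a correlation cluster
            lying in a low-dimensional, arbitrarily oriented subspace."""
            data = np.asarray(data, dtype=float)
            p = data[point_idx]
            neighborhood = data[np.linalg.norm(data - p, axis=1) <= eps]
            if len(neighborhood) < 2:
                return data.shape[1]               # too sparse to decide
            cov = np.cov(neighborhood, rowvar=False)
            eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
            # eigenvalues above a fraction `delta` of the largest count as "strong"
            return int(np.sum(eigvals >= delta * eigvals[0]))

        # points scattered along a line in 3-D: local dimensionality should be 1
        rng = np.random.default_rng(1)
        t = rng.uniform(0, 1, 200)
        points = np.c_[t, 2 * t, -t] + rng.normal(0, 0.01, (200, 3))
        print(local_correlation_dimension(points, 0, eps=0.3))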