    6 Access Methods and Query Processing Techniques

    Bitemporal Sliding Windows

    The bitemporal data model associates two time intervals with each record, system time and application time, which denote the validity of the record from the perspective of the database and of the real world, respectively. One issue that has not yet been addressed is how to efficiently answer sliding window queries in this model. In this work, we propose and experimentally evaluate a main-memory index called BiSW that supports sliding windows on system time, on application time, and on both time attributes simultaneously. Our experimental results show that BiSW outperforms existing approaches in terms of space footprint, maintenance overhead, and query performance.
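
    To make the query semantics concrete, the following is a minimal sketch of a bitemporal sliding window query answered by a plain linear scan. It is not the BiSW index, whose structure the abstract does not describe. The names `BitemporalRecord`, `overlaps`, `sliding_window`, and `window_query`, as well as the choice of half-open integer intervals, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

# Assumption: intervals are half-open [start, end) over integer timestamps.
Interval = Tuple[int, int]


@dataclass
class BitemporalRecord:
    """A record carrying one interval per time dimension (hypothetical layout)."""
    key: str
    system: Interval       # validity from the database's perspective
    application: Interval  # validity from the real world's perspective


def overlaps(a: Interval, b: Interval) -> bool:
    """Two half-open intervals overlap if each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def sliding_window(now: int, size: int) -> Interval:
    """Window covering the last `size` time units up to, but excluding, `now`."""
    return (now - size, now)


def window_query(
    records: Iterable[BitemporalRecord],
    system_window: Optional[Interval] = None,
    application_window: Optional[Interval] = None,
) -> List[BitemporalRecord]:
    """Return records overlapping every window that is given.

    Passing only one window restricts a single time dimension; passing
    both restricts system time and application time simultaneously.
    This is a linear-scan baseline, not an index.
    """
    result = []
    for record in records:
        if system_window is not None and not overlaps(record.system, system_window):
            continue
        if application_window is not None and not overlaps(
            record.application, application_window
        ):
            continue
        result.append(record)
    return result
```

    A scan like this touches every record on every query, which is the kind of cost a dedicated index aims to avoid as the window slides.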

    AXMEDIS 2007 Conference Proceedings

    The AXMEDIS International Conference series has been held since 2005 and focuses on research, development, and applications in the cross-media domain, exploring innovative technologies to meet the challenges of the sector. AXMEDIS 2007 deals with all subjects and topics related to cross-media and digital-media content production, processing, management, standards, representation, sharing, interoperability, protection, and rights management. It addresses the latest developments and future trends of these technologies and their applications, as well as their impact and exploitation within academic, business, and industrial communities.

    Advanced Analysis on Temporal Data

    Due to the increase in CPU power and ever-increasing data storage capabilities, more and more data of all kinds is recorded, including temporal data. Time series, the most prevalent type of temporal data, arise in a broad range of application domains. Prominent examples include stock price data in economics, gene expression data in biology, the course of environmental parameters in meteorology, and data on moving objects recorded by traffic sensors. This large amount of raw data can only be analyzed by automated data mining algorithms in order to generate new knowledge. One of the most basic data mining operations is the similarity query, which computes a similarity or distance value for two objects. Two aspects of such a similarity function are of special interest: first, the semantics of the similarity function, and second, the computational cost of calculating a similarity value. The semantics is the actual notion of similarity and is highly dependent on the analysis task at hand. This thesis addresses both aspects. We introduce a number of new similarity measures for time series data and show how they can be calculated efficiently by means of index structures and query algorithms. The first of the new similarity measures is threshold-based. Two time series are considered similar if they exceed a user-given threshold during similar time intervals. Aside from formally defining this similarity measure, we show how to represent time series in such a way that threshold-based queries can be calculated efficiently. Our representation allows the threshold value to be specified at query time. This is useful, for example, for data mining tasks that try to determine crucial thresholds. The next similarity measure considers a relevant amplitude range. This range is scanned with a certain resolution, and features are extracted for each amplitude value considered. We consider the change in the feature values over the amplitude values and thus generate so-called feature sequences. Different features can finally be combined to answer amplitude-level-based similarity queries. In contrast to traditional approaches, which aggregate global feature values along the time dimension, we capture local characteristics and monitor their change for different amplitude values. Furthermore, our method enables the user to specify a relevant range of amplitude values to be considered, so that the similarity notion can be adapted to the current requirements. Next, we introduce so-called interval-focused similarity queries. A user can specify one or several time intervals that should be considered for the calculation of the similarity value. Our main focus for this similarity measure was the efficient support of the corresponding query. In particular, we try to avoid loading complete time series objects into main memory if only a relatively small portion of a time series is of interest. We propose a time series representation that can be used to calculate upper and lower distance bounds, so that only a few time series objects have to be completely loaded and refined. Again, the relevant time intervals do not have to be known in advance. Finally, we define a similarity measure for so-called uncertain time series, where several amplitude values are given for each point in time. This can be due to multiple recordings or to errors in measurement, so that no exact value can be specified. We show how to efficiently support queries on uncertain time series. The last part of this thesis shows how data mining methods can be used to discover crucial threshold parameters for the threshold-based similarity measure. Furthermore, we present a data mining tool for time series.
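
    The threshold-based idea can be sketched as follows: each time series is reduced to the time intervals during which it exceeds a threshold chosen at query time, and two series are compared through those intervals. The abstract does not give the thesis's distance function, so this sketch swaps in a Jaccard overlap of the above-threshold time steps as a stand-in. The function names and the parameter `tau` are assumptions for illustration.

```python
from typing import List, Sequence, Set, Tuple


def threshold_intervals(series: Sequence[float], tau: float) -> List[Tuple[int, int]]:
    """Return half-open index intervals [start, end) where series > tau."""
    intervals = []
    start = None
    for i, value in enumerate(series):
        if value > tau and start is None:
            start = i                      # series crosses above the threshold
        elif value <= tau and start is not None:
            intervals.append((start, i))   # series falls back to or below it
            start = None
    if start is not None:                  # still above at the end of the series
        intervals.append((start, len(series)))
    return intervals


def covered_steps(intervals: List[Tuple[int, int]]) -> Set[int]:
    """Expand intervals into the set of time steps they cover."""
    return {t for start, end in intervals for t in range(start, end)}


def threshold_similarity(a: Sequence[float], b: Sequence[float], tau: float) -> float:
    """Jaccard overlap of above-threshold time steps (assumed stand-in measure).

    Returns 1.0 when neither series exceeds tau, since their interval
    sets are then identical (both empty).
    """
    steps_a = covered_steps(threshold_intervals(a, tau))
    steps_b = covered_steps(threshold_intervals(b, tau))
    union = steps_a | steps_b
    if not union:
        return 1.0
    return len(steps_a & steps_b) / len(union)
```

    Because `tau` is an argument rather than a preprocessing constant, the same stored series can be queried with different thresholds, which mirrors the query-time threshold specification described above.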

    Biomarkers of Sudden Unexpected Death in Epilepsy (SUDEP)

    SUDEP (Sudden Unexpected Death in Epilepsy) is the most devastating outcome in epilepsy and the commonest cause of epilepsy-related premature mortality. Studies of clinical risk factors have made it possible to identify high-risk populations. However, no genomic, electrophysiological, or structural features have emerged as established biomarkers of increased SUDEP risk. To elucidate the genetic architecture of SUDEP, we used an unbiased whole-exome sequencing approach to examine the overall burden and over-representation of deleterious variants in people who died of SUDEP compared to living people with epilepsy and non-epilepsy disease controls. We found a significantly increased genome-wide polygenic burden per individual in the SUDEP cohort when compared to epilepsy and non-epilepsy disease controls. The polygenic burden was driven both by the number of variants per individual and by the over-representation of variants likely to be deleterious in the SUDEP cohort. To elucidate which brain regions may be implicated in SUDEP, we investigated whether regional abnormalities in grey matter volume appear in those who died of SUDEP, compared to subjects at high and low risk for SUDEP and to healthy controls. We identified increased grey matter volume in the right anterior hippocampus/amygdala and parahippocampus in SUDEP cases and people at high risk, when compared to those at low risk and controls. Compared to controls, grey matter volume in the posterior thalamus, an area mediating oxygen regulation, was reduced in SUDEP cases and subjects at high risk. Understanding the range of SUDEP aetiological mechanisms is fundamental. Our results suggest that both exome sequencing data and structural imaging features may contribute to generating SUDEP risk estimates. Translating this knowledge into predictive algorithms of individual risk and into preventive strategies would promote stratified medicine in epilepsy, with the aim of reducing an individual patient's risk of SUDEP.

    Tracing the Compositional Process. Sound art that rewrites its own past: formation, praxis and a computer framework

    The domain of this thesis is electroacoustic computer-based music and sound art. It investigates a facet of composition which is often neglected or ill-defined: the process of composing itself and its embedding in time. Previous research mostly focused on instrumental composition or, when electronic music was included, treated the computer as a tool which would eventually be subtracted from the equation. The aim was either to explain a resultant piece of music by reconstructing the intention of the composer, or to explain human creativity by building a model of the mind. Our aim instead is to understand composition as an irreducible unfolding of material traces which takes place in its own temporality. This understanding is formalised as a software framework that traces creation time as a version graph of transactions. The instantiation and manipulation of any musical structure implemented within this framework is thereby automatically stored in a database. Not only can it be queried ex post by an external researcher, providing a new quality for the empirical analysis of the activity of composing, but it is also an integral part of the composition environment. Therefore it can recursively become a source for the ongoing composition and introduce new ways of aesthetic expression. The framework aims to unify creation and performance time, fixed and generative composition, human and algorithmic “writing”, a writing that includes indeterminate elements which condense as concurrent vertices in the version graph. The second major contribution is a critical epistemological discourse on the question of observability and the function of observation. Our goal is to explore a new direction of artistic research which is characterised by a mixed methodology of theoretical writing, technological development and artistic practice. The form of the thesis is an exercise in becoming process-like itself, wherein the epistemic thing is generated by translating the gaps between these three levels. This is my idea of the new aesthetics: that through the operation of a re-entry one may establish a sort of process “form”, yielding works which go beyond a categorical either “sound-in-itself” or “conceptualism”. Exemplary processes are revealed by deconstructing a series of existing pieces, as well as through the successful application of the new framework in the creation of new pieces.
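
    As a rough illustration of tracing creation time as a version graph of transactions, the sketch below stores every committed change as a new vertex that points to its parent, so that earlier states remain queryable and two commits on the same parent appear as concurrent vertices. It is not the thesis's framework, whose API the abstract does not describe. The class `VersionGraph` and its methods `commit` and `value_at` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Version:
    """One vertex of the version graph: a transaction and its parent."""
    id: int
    parent: Optional[int]
    changes: Dict[str, Any] = field(default_factory=dict)


class VersionGraph:
    """Append-only store in which every transaction creates a new vertex."""

    def __init__(self) -> None:
        # Version 0 is an empty root that all histories start from.
        self._versions: Dict[int, Version] = {0: Version(0, None)}

    def commit(self, parent: int, changes: Dict[str, Any]) -> int:
        """Record `changes` as a new version derived from `parent`."""
        if parent not in self._versions:
            raise KeyError(f"unknown parent version {parent}")
        version_id = len(self._versions)
        self._versions[version_id] = Version(version_id, parent, dict(changes))
        return version_id

    def parent_of(self, version: int) -> Optional[int]:
        return self._versions[version].parent

    def value_at(self, version: int, key: str) -> Any:
        """Look up `key` as it was at `version` by walking towards the root."""
        current: Optional[int] = version
        while current is not None:
            vertex = self._versions[current]
            if key in vertex.changes:
                return vertex.changes[key]
            current = vertex.parent
        return None
```

    Since nothing is overwritten, an earlier state can be read back at any time and reused as material, which is the property that lets the recorded past feed into the ongoing work.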

    Efficient processing of large-scale spatio-temporal data

    Millions of location-aware devices, such as mobile phones, cars, and environmental sensors, constantly report their positions, often in combination with a timestamp, to a server for different kinds of analyses. While the location information of the devices and reported events is represented as points and polygons, raster data is another type of spatial data, produced for example by cameras and sensors. This big spatio-temporal data needs to be processed on scalable platforms such as Hadoop and Apache Spark. These platforms, however, are unaware of properties such as spatial neighborhood, which makes them practically impossible to use for this kind of data. The repeated execution of programs during development and by different users results in long execution times and potentially high costs on rented clusters, which can be reduced by reusing commonly computed intermediate results. Within this thesis, we tackle the two challenges described above. First, we present the STARK framework for processing spatio-temporal vector and raster data on the Apache Spark stack. For operators, we identify several possible algorithms and study how they can benefit from the underlying platform's properties. We further investigate how indexes can be realized in the distributed and parallel architecture of Big Data processing engines and compare methods for data partitioning, which differ in how well they handle data skew and data set size. Furthermore, we present an approach to reduce the amount of data to be processed at operator level. In order to reduce execution times, we introduce an approach to transparently recycle intermediate results of dataflow programs, based on operator costs. To compute the costs, we instrument the programs with profiling code that gathers the execution time and result size of the operators. In the evaluation, we first compare the various implementation and configuration possibilities in STARK and identify scenarios in which partitioning and indexing should be applied, and how. We further compare STARK to related systems and show that we can achieve significantly better execution times, not only when exploiting existing partitioning information. In the second part of the evaluation, we show that the transparent cost-based materialization and recycling of intermediate results can reduce the execution times of programs significantly.
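
    The cost-based recycling of intermediate results can be illustrated with a small sketch: each operator is profiled for execution time and result size, and its result is kept only if recomputing it is estimated to cost more than reading it back. This is not STARK's decision model, which the abstract does not specify. The class `RecyclingCache`, the parameter `seconds_per_row`, the injectable `clock`, and the rule "materialize if execution time exceeds estimated read time" are all assumptions for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class OperatorProfile:
    """Measurements gathered by the profiling wrapper."""
    seconds: float      # observed execution time of the operator
    result_size: int    # number of rows in the operator's result


class RecyclingCache:
    """Materializes operator results when recomputation looks more expensive."""

    def __init__(
        self,
        seconds_per_row: float = 1e-6,               # assumed cost of reading one row back
        clock: Callable[[], float] = time.perf_counter,
    ) -> None:
        self.seconds_per_row = seconds_per_row
        self.clock = clock
        self.profiles: Dict[str, OperatorProfile] = {}
        self._store: Dict[str, List] = {}

    def should_materialize(self, profile: OperatorProfile) -> bool:
        """Assumed rule: keep the result if recomputing costs more than re-reading."""
        return profile.seconds > profile.result_size * self.seconds_per_row

    def run(self, signature: str, operator: Callable[[], List]) -> List:
        """Run `operator`, or reuse a stored result with the same signature.

        `signature` identifies the operator together with its inputs, so
        that identical sub-plans from different programs share one entry.
        """
        if signature in self._store:
            return self._store[signature]
        start = self.clock()
        result = operator()
        profile = OperatorProfile(self.clock() - start, len(result))
        self.profiles[signature] = profile
        if self.should_materialize(profile):
            self._store[signature] = result
        return result
```

    Keying the store by a signature of the operator and its inputs is what makes the reuse transparent: a program that repeats an already materialized sub-plan receives the stored result without being changed.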
