6 Access Methods and Query Processing Techniques
The performance of a database management system (DBMS) is fundamentally dependent on the access methods and query processing techniques available to the system. Traditionally, relational DBMSs have relied on well-known access methods, such as the ubiquitous B+-tree, hashing with chaining, and, in some…
Bitemporal Sliding Windows
The bitemporal data model associates two time intervals with each record - system time and application time - denoting the validity of the record from the perspective of the database and of the real world, respectively. One issue that has not yet been addressed is how to efficiently answer sliding window queries in this model. In this work, we propose and experimentally evaluate a main-memory index called BiSW that supports sliding windows on system time, on application time, and on both time attributes simultaneously. Our experimental results show that BiSW outperforms existing approaches in terms of space footprint, maintenance overhead, and query performance.
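To make the bitemporal model concrete, a minimal sketch of a record carrying both intervals, tested against a sliding window on either or both time dimensions. This is purely illustrative (the record layout and `in_window` predicate are hypothetical, not BiSW's actual structure):

```python
from dataclasses import dataclass

# Hypothetical illustration of the bitemporal model: each record carries
# a system-time interval and an application-time interval.
@dataclass
class BitemporalRecord:
    value: str
    sys_start: int   # when the database learned about the fact
    sys_end: int
    app_start: int   # when the fact holds in the real world
    app_end: int

def in_window(rec, now, window, use_system=True, use_application=True):
    """True if the record overlaps the sliding window [now - window, now]
    on the selected time dimension(s). Intervals are half-open."""
    lo = now - window
    ok = True
    if use_system:
        ok &= rec.sys_start < now and rec.sys_end > lo
    if use_application:
        ok &= rec.app_start < now and rec.app_end > lo
    return ok

r = BitemporalRecord("price=10", sys_start=5, sys_end=20, app_start=0, app_end=15)
in_window(r, now=18, window=5)  # overlaps both windows
```

An index such as BiSW exists precisely to avoid evaluating such a predicate against every record; the sketch only fixes the query semantics.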
AXMEDIS 2007 Conference Proceedings
The AXMEDIS International Conference series has been held since 2005 and focuses on research, developments and applications in the cross-media domain, exploring innovative technologies to meet the challenges of the sector. AXMEDIS 2007 deals with all subjects and topics related to cross-media and digital-media content production, processing, management, standards, representation, sharing, interoperability, protection and rights management. It addresses the latest developments and future trends of these technologies and their applications, and their impact and exploitation within academic, business and industrial communities.
Advanced Analysis on Temporal Data
Due to the increase in CPU power and ever-growing data storage
capabilities, more and more data of all kinds is recorded, including temporal
data. Time series, the most prevalent type of temporal data, are derived in a
broad range of application domains. Prominent examples include stock price data
in economics, gene expression data in biology, the course of environmental
parameters in meteorology, and data of moving objects recorded by traffic
sensors.
This large amount of raw data can only be analyzed by automated data mining
algorithms in order to generate new knowledge. One of the most basic data
mining operations is the similarity query, which computes a similarity or
distance value for two objects. Two aspects of such a similarity function are
of special interest: first, its semantics, and second, the computational cost
of calculating a similarity value. The semantics is the actual similarity
notion and is highly dependent on the analysis task at hand.
This thesis addresses both aspects. We introduce a number of new
similarity measures for time series data and show how they can
efficiently be calculated by means of index structures and query
algorithms.
The first of the new similarity measures is threshold-based: two
time series are considered similar if they exceed a user-given
threshold during similar time intervals. Aside from formally
defining this similarity measure, we show how to represent time
series in such a way that threshold-based queries can be evaluated
efficiently. Our representation allows the threshold value to be
specified at query time, which is useful, for example, for data
mining tasks that try to determine crucial thresholds.
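The threshold-based notion above can be sketched in a few lines: extract the intervals during which a series exceeds a threshold, then compare the interval sets. The interval distance used here is a hypothetical choice for illustration, not the thesis's actual measure:

```python
def crossing_intervals(series, tau):
    """Return the (start, end) index intervals during which the series
    exceeds the threshold tau. Illustrative sketch; the thesis's actual
    representation supports an arbitrary tau chosen at query time."""
    intervals, start = [], None
    for i, v in enumerate(series):
        if v > tau and start is None:
            start = i
        elif v <= tau and start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(series)))
    return intervals

def interval_distance(a, b, tau):
    """A simple distance between matched crossing intervals: the sum of
    start- and end-point gaps (hypothetical choice for illustration)."""
    ia, ib = crossing_intervals(a, tau), crossing_intervals(b, tau)
    if len(ia) != len(ib):
        return float("inf")
    return sum(abs(s1 - s2) + abs(e1 - e2)
               for (s1, e1), (s2, e2) in zip(ia, ib))

a = [0, 2, 3, 1, 0, 4]
b = [0, 2, 2, 0, 0, 5]
crossing_intervals(a, tau=1)  # [(1, 3), (5, 6)]
```

Note that the crossing intervals depend on tau, which is why deferring the threshold to query time requires a representation richer than one precomputed interval set.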
The next similarity measure considers a relevant amplitude range.
This range is scanned with a certain resolution, and features are
extracted for each considered amplitude value. We consider how the
feature values change over the amplitude values and thus generate
so-called feature sequences. Different features can finally be
combined to answer amplitude-level-based similarity queries. In
contrast to traditional approaches, which aggregate global feature
values along the time dimension, we capture local characteristics
and monitor their change across amplitude values.
Furthermore, our method enables the user to specify the relevant
range of amplitude values to be considered, so the similarity notion
can be adapted to the current requirements.
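A feature sequence of this kind can be sketched as follows; the concrete feature (the number of intervals during which the series exceeds the amplitude level) is a hypothetical stand-in for the features used in the thesis:

```python
def feature_sequence(series, lo, hi, steps):
    """Scan the amplitude range [lo, hi] with the given resolution and,
    for each amplitude level tau, extract a local feature -- here,
    hypothetically, the number of runs during which the series
    exceeds tau."""
    def n_runs(tau):
        runs, above = 0, False
        for v in series:
            if v > tau and not above:
                runs, above = runs + 1, True
            elif v <= tau:
                above = False
        return runs
    taus = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return [n_runs(t) for t in taus]

s = [0, 3, 1, 4, 0, 2]
feature_sequence(s, lo=0.5, hi=3.5, steps=4)  # one feature value per level
```

The resulting sequence is indexed by amplitude level rather than by time, which is exactly the inversion relative to time-dimension aggregation described above.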
Next, we introduce so-called interval-focused similarity queries, where a
user can specify one or several time intervals to be
considered in the calculation of the similarity value. Our main
focus for this similarity measure was the efficient support of the
corresponding query. In particular, we try to avoid loading
complete time series objects into main memory when only a relatively
small portion of a time series is of interest. We propose a time
series representation from which upper and lower distance bounds
can be calculated, so that only a few time series objects have
to be completely loaded and refined. Again, the relevant time
intervals do not have to be known in advance.
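The filter-and-refine idea behind such bounds can be sketched with a coarse per-interval (min, max) envelope as the compact representation; this envelope is a hypothetical stand-in for the representation proposed in the thesis:

```python
import math

def dist_on_intervals(q, s, intervals):
    """Euclidean distance restricted to the user-given time intervals."""
    return math.sqrt(sum((q[i] - s[i]) ** 2
                         for a, b in intervals for i in range(a, b)))

def envelope(series, intervals):
    """Coarse per-interval (min, max) summary -- a hypothetical stand-in
    for the thesis's compact time series representation."""
    return [(min(series[a:b]), max(series[a:b])) for a, b in intervals]

def lower_bound(env_q, env_s, intervals):
    """A distance lower bound from the envelopes alone: if the value
    ranges are disjoint on an interval, every point pair differs by at
    least the gap between the ranges."""
    lb = 0.0
    for (a, b), (qlo, qhi), (slo, shi) in zip(intervals, env_q, env_s):
        gap = max(slo - qhi, qlo - shi, 0.0)
        lb += (b - a) * gap * gap
    return math.sqrt(lb)

q, s, iv = [1, 1, 1, 9, 9, 9], [5, 5, 5, 9, 9, 9], [(0, 3)]
lower_bound(envelope(q, iv), envelope(s, iv), iv)  # never exceeds the true distance
```

A candidate whose lower bound already exceeds the best distance seen so far can be discarded without loading the full series, which is where the I/O savings come from.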
Finally, we define a similarity measure for so-called uncertain time series,
where several amplitude values are given for each point in time. This can be
due to multiple recordings or to measurement errors, so that no exact value
can be specified. We show how to efficiently support queries on uncertain time
series.
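A distance over uncertain time series can be sketched by treating each time point as a set of samples; the per-point aggregation used here (average pairwise squared difference) is a hypothetical choice, not the measure defined in the thesis:

```python
import math

def uncertain_dist(q, s):
    """Distance between two uncertain time series, where each time point
    is a list of observed amplitude samples. Per time point we take,
    hypothetically, the average pairwise squared difference between the
    two sample sets."""
    total = 0.0
    for qs, ss in zip(q, s):
        total += sum((a - b) ** 2 for a in qs for b in ss) / (len(qs) * len(ss))
    return math.sqrt(total)

# Two series of length 2; the first has two recordings at time 0,
# the second has two recordings at time 1.
q = [[1.0, 1.2], [3.0]]
s = [[1.1], [2.8, 3.2]]
uncertain_dist(q, s)
```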
The last part of this thesis shows how data mining methods can be used to
discover crucial threshold parameters for the threshold-based similarity
measure. Furthermore, we present a data mining tool for time series.
Biomarkers of Sudden Unexpected Death in Epilepsy (SUDEP)
SUDEP (Sudden Unexpected Death in Epilepsy) is the most devastating outcome in
epilepsy and the commonest cause of epilepsy-related premature mortality. Studies of
clinical risk factors have allowed the identification of high-risk populations. However, no
genomic, electrophysiological or structural features have emerged as established
biomarkers of an increased SUDEP risk. To elucidate the genetic architecture of
SUDEP, we used an unbiased whole-exome sequencing approach to examine overall
burden and over-representation of deleterious variants in people who died of SUDEP
compared to living people with epilepsy and non-epilepsy disease controls. We found
significantly increased genome-wide polygenic burden per individual in the SUDEP
cohort when compared to epilepsy and non-epilepsy disease controls. The polygenic
burden was driven both by the number of variants per individual and by the
over-representation of variants likely to be deleterious in the SUDEP cohort. To elucidate
which brain regions may be implicated in SUDEP, we investigated whether regional
abnormalities in grey matter volume appear in those who died of SUDEP, compared to
subjects at high and low risk for SUDEP, and healthy controls. We identified increased
grey matter volume in the right anterior hippocampus/amygdala and parahippocampus
in SUDEP cases and people at high risk, when compared to those at low risk and
controls. Compared to controls, posterior thalamic grey matter volume, an area
mediating oxygen regulation, was reduced in SUDEP cases and subjects at high risk. It
is fundamental to understand the range of SUDEP aetiological mechanisms. Our results
suggest that both exome sequencing data and structural imaging features may contribute
to generate SUDEP risk estimates. Translation of this knowledge into predictive
algorithms of individual risk and preventive strategies would promote stratified
medicine in epilepsy, with the aim of reducing an individual patient's risk of SUDEP.
Tracing the Compositional Process. Sound art that rewrites its own past: formation, praxis and a computer framework
The domain of this thesis is electroacoustic computer-based music and sound art. It investigates
a facet of composition which is often neglected or ill-defined: the process of composing itself
and its embedding in time. Previous research mostly focused on instrumental composition or,
when electronic music was included, the computer was treated as a tool which would eventually
be subtracted from the equation. The aim was either to explain a resultant piece of music by
reconstructing the intention of the composer, or to explain human creativity by building a model
of the mind.
Our aim instead is to understand composition as an irreducible unfolding of material traces which
takes place in its own temporality. This understanding is formalised as a software framework
that traces creation time as a version graph of transactions. The instantiation and manipulation
of any musical structure implemented within this framework is thereby automatically stored
in a database. Not only can it be queried ex post by an external researcher, providing a new
quality for the empirical analysis of the activity of composing, but it is also an integral part of
the composition environment. Therefore it can recursively become a source for the ongoing
composition and introduce new ways of aesthetic expression. The framework aims to unify
creation and performance time, fixed and generative composition, and human and algorithmic
"writing", a writing that includes indeterminate elements which condense as concurrent vertices
in the version graph.
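The version graph of transactions described above can be sketched as a directed acyclic graph in which concurrent branches appear as sibling vertices and merges have several parents. The data model below is a hypothetical illustration, not the thesis's actual framework:

```python
# Minimal sketch of tracing creation time as a version graph of
# transactions (hypothetical data model for illustration).
class VersionGraph:
    def __init__(self):
        self.nodes = {}      # version id -> payload (state of a structure)
        self.parents = {}    # version id -> tuple of parent version ids
        self._next = 0

    def commit(self, payload, *parents):
        """Record one transaction; multiple parents model a merge."""
        vid = self._next
        self._next += 1
        self.nodes[vid] = payload
        self.parents[vid] = parents
        return vid

    def history(self, vid):
        """All ancestor versions of vid -- queryable ex post by a
        researcher, or fed back into the ongoing composition."""
        seen, stack = set(), [vid]
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(self.parents[v])
        return seen

g = VersionGraph()
a = g.commit("motif")
b = g.commit("motif transposed", a)
c = g.commit("motif reversed", a)     # concurrent vertex, branching from a
d = g.commit("merged variant", b, c)  # merge of the two branches
```

Because every commit keeps its parents, the graph never rewrites history destructively; "rewriting the past" happens by adding new vertices that reinterpret old ones.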
The second major contribution is a critical epistemological discourse on the question of
observability and the function of observation. Our goal is to explore a new direction of artistic
research which is characterised by a mixed methodology of theoretical writing, technological
development and artistic practice. The form of the thesis is an exercise in becoming process-like
itself, wherein the epistemic thing is generated by translating the gaps between these three levels.
This is my idea of the new aesthetics: that through the operation of a re-entry one may establish
a sort of process "form", yielding works which go beyond a categorical either "sound-in-itself"
or "conceptualism".
Exemplary processes are revealed by deconstructing a series of existing pieces, as well as
through the successful application of the new framework in the creation of new pieces.
Efficient processing of large-scale spatio-temporal data
Millions of location-aware devices, such as mobile phones, cars, and environmental sensors, constantly report their positions, often in combination with a timestamp, to a server for different kinds of analyses. While the location information of the devices and reported events is represented as points and polygons, raster data is another type of spatial data, produced for example by cameras and sensors. These big spatio-temporal data sets need to be processed on scalable platforms, such as Hadoop and Apache Spark, which, however, are unaware of, e.g., spatial neighborhood, which makes them practically impossible to use for this kind of data.
The repeated executions of the programs during development and by different users result in long execution times and potentially high costs in rented clusters, which can be reduced by reusing commonly computed intermediate results.
Within this thesis, we tackle the two challenges described above. First, we present the STARK framework for processing spatio-temporal vector and raster data on the Apache Spark stack. For the operators, we identify several possible algorithms and study how they can benefit from the properties of the underlying platform. We further investigate how indexes can be realized in the distributed and parallel architecture of Big Data processing engines, and we compare data partitioning methods, which cope differently well with data skew and data set size. Furthermore, we present an approach to reduce the amount of data to process at the operator level. In order to reduce execution times, we introduce an approach to transparently recycle intermediate results of dataflow programs based on operator costs. To compute these costs, we instrument the programs with profiling code that gathers the execution time and result size of each operator.
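The cost-based materialization decision described above can be sketched as a simple comparison; the function and its cost parameters are hypothetical illustrations, not STARK's actual model, which works from profiled operator costs:

```python
# Hypothetical sketch of a cost-based materialization decision:
# materialize an intermediate result when writing it once and reading it
# per reuse is cheaper than recomputing it per reuse.
def should_materialize(compute_cost_s, write_cost_s, read_cost_s,
                       expected_reuses):
    """All costs in seconds (illustrative units); expected_reuses is the
    anticipated number of future executions that could reuse the result."""
    recompute_total = compute_cost_s * expected_reuses
    reuse_total = write_cost_s + read_cost_s * expected_reuses
    return reuse_total < recompute_total

should_materialize(compute_cost_s=120, write_cost_s=30, read_cost_s=5,
                   expected_reuses=2)  # 40s reused vs. 240s recomputed -> True
```

In practice the compute and I/O costs are not known a priori, which is why the thesis instruments programs with profiling code to measure execution time and result size per operator.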
In the evaluation, we first compare the various implementation and configuration possibilities in STARK and identify scenarios when and how partitioning and indexing should be applied. We further compare STARK to related systems and show that we can achieve significantly better execution times, not only when exploiting existing partitioning information. In the second part of the evaluation, we show that with the transparent cost-based materialization and recycling of intermediate results, the execution times of programs can be reduced significantly.