A Review of Subsequence Time Series Clustering
Clustering of subsequence time series remains an open issue in time series clustering. Subsequence time series clustering is used in different fields, such as e-commerce, outlier detection, speech recognition, biological systems, DNA recognition, and text mining. One of the useful fields in the domain of subsequence time series clustering is pattern recognition. To improve this field, a sequence of time series data is used. This paper reviews definitions and background related to subsequence time series clustering. The reviewed literature is divided into three periods: preproof, interproof, and postproof. Moreover, various state-of-the-art approaches to subsequence time series clustering are discussed under each of these categories. The strengths and weaknesses of the employed methods are evaluated as potential issues for future studies.
The EM Algorithm and the Rise of Computational Biology
In the past decade computational biology has grown from a cottage industry
with a handful of researchers to an attractive interdisciplinary field,
catching the attention and imagination of many quantitatively-minded
scientists. Of interest to us is the key role played by the EM algorithm during
this transformation. We survey the use of the EM algorithm in a few important
computational biology problems surrounding the "central dogma" of molecular
biology: from DNA to RNA and then to proteins. Topics of this article include
sequence motif discovery, protein sequence alignment, population genetics,
evolutionary models and mRNA expression microarray data analysis.
Comment: Published at http://dx.doi.org/10.1214/09-STS312 in Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org
Generation of Two-Voice Imitative Counterpoint from Statistical Models
Generating new music based on rules of counterpoint has been deeply studied in music informatics. In this article, we try to go further, exploring a method for generating new music in the style of Palestrina by combining statistical generation and pattern discovery. A template piece is used for pattern discovery, and the patterns are selected and organized according to a probabilistic distribution, using horizontal viewpoints to describe melodic properties of events. Once the template is covered with patterns, two-voice counterpoint in a florid style is generated into those patterns using a first-order Markov model. The template method addresses the problem of coherence and imitation, which had not been addressed in previous research on counterpoint music generation. For constructing the Markov model, vertical slices of pitch and rhythm are compiled over a large corpus of dyads from Palestrina masses. The template enforces different restrictions that filter the possible paths through the generation process. A double backtracking algorithm is implemented to handle cases where no solutions are found at some point within a generation path. Results are evaluated by both information content and listener evaluation, and the paper concludes with a proposed relationship between musical quality and information content. Part of this research has been presented at SMC 2016 in Hamburg, Germany.
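As a rough illustration of the generation step described above, the following sketch trains a first-order Markov model on a toy corpus and generates a sequence under a positional restriction, backtracking whenever a path dead-ends. The corpus, the `allowed` predicate, and all function names are hypothetical. The single-level backtracking shown here is a simplification and is not the paper's double backtracking algorithm or its viewpoint representation.

```python
import random
from collections import defaultdict


def train_first_order(sequences):
    """Collect the observed successors of each state (duplicates keep the counts)."""
    transitions = defaultdict(list)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            transitions[current].append(following)
    return dict(transitions)


def generate(transitions, start, length, allowed, rng):
    """Depth-first generation with backtracking.

    allowed(position, state) is a hypothetical stand-in for template
    restrictions: it returns True if `state` may occur at `position`.
    Returns a list of `length` states, or None if no valid path exists.
    """
    path = [start]

    def extend():
        if len(path) == length:
            return True
        # Shuffling the successor multiset tends to try frequent successors first.
        successors = list(transitions.get(path[-1], []))
        rng.shuffle(successors)
        tried = set()
        for state in successors:
            if state in tried:
                continue
            tried.add(state)
            if not allowed(len(path), state):
                continue
            path.append(state)
            if extend():
                return True
            path.pop()  # dead end further along: undo and try the next successor
        return False

    return path if extend() else None


# Toy corpus of pitch names (assumed data, not taken from the Palestrina corpus).
corpus = [
    ["C", "D", "E", "D", "C"],
    ["C", "E", "G", "E", "C"],
    ["D", "E", "F", "E", "D", "C"],
]
model = train_first_order(corpus)


def ends_on_c(position, state):
    """Toy restriction: the sixth note (position 5) must be C."""
    return position != 5 or state == "C"


melody = generate(model, "C", 6, ends_on_c, random.Random(0))
print(melody)
```

Every consecutive pair in the returned sequence is a transition seen in the corpus, and the restriction prunes paths that cannot end on C. A richer restriction function could play the role of the template in the same loop.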
Applications of high-frequency telematics for driving behavior analysis
A thesis submitted in partial fulfillment of the requirements for the degree of Doctor in Information Management, specialization in Statistics and Econometrics.
Processing driving data and investigating driving behavior have been receiving
increasing interest in recent decades, with applications ranging from car insurance
pricing to policy-making. A popular way of analyzing driving behavior is to move
the focus to the maneuvers as they give useful information about the driver who is
performing them.
Previous research on maneuver detection can be divided into two strategies, namely,
1) using fixed thresholds in inertial measurements to define the start and end of specific
maneuvers or 2) using features extracted from rolling windows of sensor data
in a supervised learning model to detect maneuvers. While the first strategy is not
adaptable and requires fine-tuning, the second needs a dataset with labels (which is
time-consuming) and cannot identify maneuvers with different lengths in time.
To tackle these shortcomings, we investigate a new way of identifying maneuvers
from vehicle telematics data, through motif detection in time-series. Using a publicly
available naturalistic driving dataset (the UAH-DriveSet), we conclude that motif
detection algorithms are not only capable of extracting simple maneuvers such as accelerations,
brakes, and turns, but also more complex maneuvers, such as lane changes
and overtaking maneuvers, thus validating motif discovery as a worthwhile line for
future research in driving behavior.
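To make the idea of motif detection concrete, here is a minimal sketch that finds the closest pair of non-overlapping subsequences under z-normalized Euclidean distance. It is a brute-force illustration only: the function names and the toy series are assumptions, and it does not reproduce the specific motif detection algorithms evaluated in the thesis.

```python
import math


def znorm(window):
    """Scale a window to zero mean and unit variance (all zeros if it is flat)."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0:
        return [0.0] * len(window)
    return [(x - mean) / std for x in window]


def find_motif(series, m):
    """Return (distance, i, j) for the closest pair of non-overlapping length-m windows."""
    windows = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    best = (math.inf, -1, -1)
    for i in range(len(windows)):
        # Start j at i + m so the two windows share no samples.
        for j in range(i + m, len(windows)):
            d = math.dist(windows[i], windows[j])
            if d < best[0]:
                best = (d, i, j)
    return best


# Toy signal (assumed data): the shape [2, 5, 9, 5, 2] is planted at indices 3 and 13.
series = [0, 1, 0, 2, 5, 9, 5, 2, 0, 3, 1, 4, 1, 2, 5, 9, 5, 2, 3, 0]
distance, first, second = find_motif(series, 5)
print(distance, first, second)
```

The quadratic scan over all window pairs is only practical for short recordings. Because the window length `m` is fixed here, this sketch does not handle maneuvers of different durations.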
We also propose TripMD, a system that extracts the most relevant driving patterns
from sensor recordings (such as acceleration) and provides a visualization that allows
for an easy investigation. We test TripMD on the same UAH-DriveSet dataset and show
that (1) our system can extract a rich set of driving patterns from a single driver
that are meaningful to understand driving behaviors and (2) our system can be used
to identify the driving behavior of an unknown driver from a set of drivers whose
behavior we know.
In recent decades, the processing and analysis of driving data have received
growing interest, with applications ranging from car insurance to regulation.
Typically, driving analysis involves extracting and studying maneuvers, since these
contain relevant information about the driver's performance.
Previous research on this topic can be divided into two types of strategies, namely,
1) using fixed acceleration values to define the start and end of each maneuver or
2) using supervised learning models on time windows. While the first type of strategy
is inflexible and requires parameter tuning, the second needs annotated driving data
(which is time-consuming to produce) and cannot identify maneuvers of different
durations.
To mitigate these gaps, in this work we apply methods developed in time-series
research to solve the maneuver detection problem. In particular, we explore the area
of motif detection in time series and test whether these generic methods succeed in
detecting maneuvers.
We also propose TripMD, a system that extracts the most relevant driving patterns
from a set of trips and provides a simple visualization. TripMD is tested on a public
dataset (the UAH-DriveSet), from which we conclude that (1) our system can extract
driving patterns/maneuvers from a single driver that are related to that driver's
driving profile and (2) our system can be used to identify the driving profile of an
unknown driver from a set of drivers whose behavior is known to us.
Mapping the proteome with data-driven methods: A cycle of measurement, modeling, hypothesis generation, and engineering
The living cell exhibits emergence of complex behavior and its modeling requires a systemic, integrative approach if we are to thoroughly understand and harness it. The work in this thesis has had the more narrow aim of quantitatively characterizing and mapping the proteome using data-driven methods, as proteins perform most functional and structural roles within the cell. Covered are the different parts of the cycle from improving quantification methods, to deriving protein features relying on their primary structure, predicting the protein content solely from sequence data, and, finally, to developing theoretical protein engineering tools, leading back to experiment.
High-throughput mass spectrometry platforms provide detailed snapshots of a cell's protein content, which can be mined towards understanding how the phenotype arises from genotype and the interplay between the various properties of the constituent proteins. However, these large and dense data present an increased analysis challenge and current methods capture only a small fraction of signal. The first part of my work has involved tackling these issues with the implementation of a GPU-accelerated and distributed signal decomposition pipeline, making factorization of large proteomics scans feasible and efficient. The pipeline yields individual analyte signals spanning the majority of acquired signal, enabling high precision quantification and further analytical tasks.
Having such detailed snapshots of the proteome enables a multitude of undertakings. One application has been to use a deep neural network model to learn the amino acid sequence determinants of temperature adaptation, in the form of reusable deep model features. More generally, systemic quantities may be predicted from the information encoded in sequence by evolutionary pressure.
Two studies taking inspiration from natural language processing have sought to learn the grammars behind the languages of expression, in one case predicting mRNA levels from DNA sequence, and in the other protein abundance from amino acid sequence. These two models helped build a quantitative understanding of the central dogma and, furthermore, in combination yielded an improved predictor of protein amount. Finally, a mathematical framework relying on the embedded space of a deep model has been constructed to assist guided mutation of proteins towards optimizing their abundance.