When fish are not Poisson: modelling the migration of Atlantic salmon (Salmo salar) and sea trout (Salmo trutta) at multiple time scales
PhD Thesis
Migratory species undertake prolonged seasonal journeys; monitoring these movements is challenging but can sometimes be achieved through observations taken locally and, ideally, using remote methods.
Amongst the best-known examples of migrating fish in Europe are Atlantic salmon (Salmo salar) and sea trout (Salmo trutta), which migrate between river and seawater. Characteristics of habitat suitability, feeding opportunities and predation, as well as salmonid sensitivity and needs, vary throughout successive stages of their anadromous life cycle.
Since the marine stage is the longest but is also challenging to monitor, in-river fish counters are of increasing importance in understanding patterns in salmonid abundance. The original contribution of this thesis lies in the use of modelling techniques to investigate salmonid migration, based on temporal observations produced by an electronic fish counter triggered by salmonid passage as the fish return to spawn in the River Tyne.
Small-scale observation revealed seasonal differences: aggregation behaviour intensified during the middle of the migration season, and explanatory covariates varied in both their effect size and relevance to salmonid abundance. At the population scale, migration was strongly driven by annual periodicity, abundance increased with river temperature, and there was an NAO effect with a four-year lag, underlining the importance of marine conditions to the parent population and/or post-smolts. Differences between distinct populations of S. salar and S. trutta appeared related to species-specific annual periodicity and to oceanic conditions as salmonids return (more so for S. salar). State-space models suggested a complex demographic structure for the two species. A species-identification learning curve had affected the data by 2007. A classification algorithm determined that observations are more likely to be S. salar for larger signal amplitude, higher river flow and earlier in the year; characteristics were too similar between the two species to reach a useful classification success rate (69%). Overall, the project suggests specificities relating to both species and age class that cannot be addressed in depth with the collected data; emerging limitations and recommendations are discussed.
Environment Agency
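The title's allusion to counts that are "not Poisson" can be illustrated with a quick overdispersion check; the counter data below are made up for illustration, not from the Tyne counter:

```python
# Hypothetical daily fish-counter data (counts of upstream passages).
# Under a Poisson model the variance equals the mean; migrating salmonids
# aggregate, so counts are typically overdispersed (variance >> mean).
counts = [0, 0, 1, 3, 12, 30, 25, 9, 2, 0, 1, 41, 18, 4, 0]

def dispersion_index(xs):
    """Variance-to-mean ratio: ~1 for Poisson data, >1 indicates overdispersion."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # sample variance
    return var / mean

d = dispersion_index(counts)
print(f"dispersion index = {d:.1f}")  # well above 1: a Poisson model is a poor fit
```

A dispersion index far above 1, as here, is the usual motivation for negative-binomial or other overdispersed count models.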
A Functional Data Perspective and Baseline On Multi-Layer Out-of-Distribution Detection
A key feature of out-of-distribution (OOD) detection is to exploit a trained
neural network by extracting statistical patterns and relationships through the
multi-layer classifier to detect shifts in the expected input data
distribution. Despite achieving solid results, several state-of-the-art methods
rely on the penultimate or last layer outputs only, leaving behind valuable
information for OOD detection. Methods that explore the multiple layers either
require a special architecture or a supervised objective to do so. This work
adopts an original approach based on a functional view of the network that
exploits the sample's trajectories through the various layers and their
statistical dependencies. It goes beyond multivariate feature aggregation and
introduces a baseline rooted in functional anomaly detection. In this new
framework, OOD detection translates into detecting samples whose trajectories
differ from the typical behavior characterized by the training set. We validate
our method and empirically demonstrate its effectiveness in OOD detection
compared to strong state-of-the-art baselines on computer vision benchmarks.
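As a rough illustration of the functional view (not the paper's exact method), a sample can be summarised by one statistic per layer, giving a trajectory across layers, and scored by its deviation from the training trajectories; the data here are synthetic stand-ins for layer statistics:

```python
import numpy as np

# Each sample is summarized by one scalar per layer (e.g. a mean activation
# norm), giving a "trajectory" across L layers. OOD samples are those whose
# trajectory deviates from the envelope of training trajectories.
rng = np.random.default_rng(0)
L = 6                                                       # number of layers (hypothetical)
train = rng.normal(0.0, 1.0, size=(500, L)).cumsum(axis=1)  # in-distribution curves

mu = train.mean(axis=0)   # pointwise mean trajectory
sd = train.std(axis=0)    # pointwise spread

def ood_score(traj):
    """Max pointwise z-deviation of a trajectory from the training envelope."""
    return float(np.max(np.abs((traj - mu) / sd)))

in_sample = rng.normal(0.0, 1.0, size=L).cumsum()
shifted = in_sample + 8.0   # crude stand-in for a distribution shift
print(ood_score(in_sample), ood_score(shifted))
```

A functional-depth notion (as in functional anomaly detection) would replace the simple max-z score, but the principle is the same: atypical whole trajectories, not just atypical last-layer outputs, are flagged.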
Wrapper algorithms and their performance assessment on high-dimensional molecular data
Prediction problems on high-dimensional molecular data, e.g. the classification of microarray samples into normal and cancer tissues, are complex and ill-posed since the number of variables usually exceeds the number of observations by orders of magnitude. Recent research in the area has propagated a variety of new statistical models in order to handle these new biological datasets. In practice, however, these models are always applied in combination with preprocessing and variable selection methods, as well as model selection, which is mostly performed by cross-validation. Varma and Simon (2006) have used the term 'wrapper algorithm' for this integration of preprocessing and model selection into the construction of statistical models. Additionally, they have proposed the method of nested cross-validation (NCV) as a way of estimating the prediction error of wrapper algorithms, which has by now become the gold standard.
In the first part, this thesis provides further theoretical and empirical justification for the use of NCV in the context of wrapper algorithms. Moreover, a computationally less intensive alternative to NCV is proposed, which can be motivated in a decision-theoretic framework. The new method can be interpreted as a smoothed variant of NCV and, in contrast to NCV, guarantees intuitive bounds for the estimation of the prediction error.
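A minimal sketch of the nested cross-validation structure described above, with a toy classifier and made-up data standing in for a real wrapper algorithm:

```python
import numpy as np

# Nested cross-validation (NCV): the inner loop performs model selection
# (here: choosing a hyperparameter k), the outer loop estimates the
# prediction error of the *whole* wrapper, selection included.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=60) > 0).astype(int)

def fit_predict(X_tr, y_tr, X_te, k):
    """Toy classifier (ignores labels): sign of the mean of the first k features."""
    return (X_te[:, :k].mean(axis=1) > 0).astype(int)

def error(y_true, y_pred):
    return float(np.mean(y_true != y_pred))

def folds(n, K):
    idx = np.arange(n)
    return [(np.setdiff1d(idx, te), te) for te in np.array_split(idx, K)]

def nested_cv(X, y, ks=(1, 2, 5), K=5):
    outer_errors = []
    for tr, te in folds(len(y), K):                  # outer loop: error estimation
        inner_err = {k: np.mean([error(y[tr][v], fit_predict(X[tr][t], y[tr][t], X[tr][v], k))
                                 for t, v in folds(len(tr), K)])
                     for k in ks}                    # inner loop: model selection
        best_k = min(inner_err, key=inner_err.get)
        outer_errors.append(error(y[te], fit_predict(X[tr], y[tr], X[te], best_k)))
    return float(np.mean(outer_errors))

print(nested_cv(X, y))
```

The key point is that the outer test fold never touches the inner selection, so the returned error estimates the whole wrapper, not just the final model.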
The second part focuses on the ranking of wrapper algorithms. Cross-study validation is proposed as an alternative concept to the repetition of separate within-study validations when several similar prediction problems are available. The concept is demonstrated using six different wrapper algorithms for survival prediction on censored data on a selection of eight breast cancer datasets. Additionally, a parametric bootstrap approach for simulating realistic data from such related prediction problems is described and subsequently applied to illustrate the concept of cross-study validation for the ranking of wrapper algorithms.
Finally, the last part addresses computational aspects of the analyses and simulations performed in the thesis. The preprocessing before the analysis, as well as the evaluation of the prediction models, requires the use of large computing resources. Parallel computing approaches are illustrated on cluster, cloud and high-performance computing resources using the R programming language. The use of heterogeneous hardware and the processing of large datasets are covered, as well as the implementation of the R package survHD for the analysis and evaluation of high-dimensional wrapper algorithms for survival prediction from censored data.
Multistep ahead time series prediction
Time series analysis has been the subject of extensive interest in many fields of study, ranging from weather forecasting to economic prediction, over the past two centuries. It has been fundamental to our understanding of previous patterns within data and has also been used to make predictions over both short and long term horizons. When approaching such problems, researchers would typically analyse the given series for a number of distinct characteristics and select the most appropriate technique. However, aligning a set of characteristics with a method has increased in complexity with the advent of Machine Learning and the introduction of Multi-Step Ahead Prediction (MSAP). We examine the model/strategy approaches currently applied to conduct multi-step ahead prediction in time series data and propose an alternative MSAP strategy known as Multi-Resolution Forecast Aggregation (MRFA). Typically, when researchers propose an alternative strategy or method, they demonstrate it on a relatively small set of time series, so the general breadth of use is unknown. We propose a process that generates a diverse set of synthetic time series, enabling a robust examination of MRFA and other methods/strategies. This dataset, in conjunction with a range of popular prediction methods and MSAP strategies, is then used to develop a meta-learner that estimates the normalized mean square error of the prediction approach for a given time series.
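For readers unfamiliar with MSAP strategies, a minimal sketch of the two classic approaches (recursive and direct) on a toy AR(1) series; this is illustrative only and is not the proposed MRFA strategy:

```python
import numpy as np

# Two standard multi-step-ahead prediction (MSAP) strategies with a plain
# one-lag least-squares model: "recursive" iterates a 1-step model, feeding
# its own predictions back in; "direct" fits one model per horizon h.
rng = np.random.default_rng(2)
y = [0.0]
for _ in range(300):
    y.append(0.9 * y[-1] + rng.normal(scale=0.1))  # toy AR(1) series
y = np.array(y)

def fit_lag(lag):
    """Least-squares slope for y[t] ~ y[t - lag]."""
    x, t = y[:-lag], y[lag:]
    return float(x @ t / (x @ x))

def recursive_forecast(last, h):
    a = fit_lag(1)
    for _ in range(h):          # iterate the 1-step model h times
        last = a * last
    return last

def direct_forecast(last, h):
    return fit_lag(h) * last    # model trained directly for horizon h

print(recursive_forecast(y[-1], 3), direct_forecast(y[-1], 3))
```

On this well-specified toy series the two strategies nearly agree; their trade-offs (error accumulation for recursive, one model per horizon for direct) are what motivate alternative MSAP strategies.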
Online Anomaly Detection for Time Series: Towards Incorporating Feature Extraction, Model Uncertainty and Concept Drift Adaptation for Improving Anomaly Detection
Time series anomaly detection receives increasing research interest given the growing number of data-rich application domains. Recent additions to anomaly detection methods in the research literature include deep learning algorithms. The nature and performance of these algorithms in sequence analysis enable them to learn hierarchical discriminative features and the temporal structure of time series. However, their performance is affected by the speed at which the time series arrives, the use of a fixed threshold, and the assumption of a Gaussian distribution on the prediction error to identify anomalous values. An exact parametric distribution is often not directly relevant in many applications, and it is often difficult to select an appropriate threshold that will differentiate anomalies from noise.
Thus, implementations need a Prediction Interval (PI) that quantifies the level of uncertainty associated with Deep Neural Network (DNN) point forecasts, which helps in making better-informed decisions and mitigates false anomaly alerts. To achieve this, a new anomaly detection method is proposed that computes the uncertainty in estimates using quantile regression and uses the quantile interval to identify anomalies. Similarly,
to handle the speed at which the data arrives, an online anomaly detection
method is proposed where a model is trained incrementally to adapt
to the concept drift that improves prediction. This is implemented using a
window-based strategy, in which a time series is broken into sliding windows
of sub-sequences as input to the model. To adapt to concept drift,
the model is updated when changes occur in the new arrival instances.
This is achieved by using anomaly likelihood which is computed using the
Q-function to define the abnormal degree of the current data point based
on the previous data points. Specifically, when concept drift occurs, the
proposed method will mark the current data point as anomalous. However,
when the abnormal behavior continues for a longer period of time,
the abnormal degree of the current data point will be low compared to the
previous data points using the likelihood. As such, the current data point is
added to the previous data to retrain the model which will allow the model
to learn the new characteristics of the data and hence adapt to the concept
changes thereby redefining the abnormal behavior. The proposed method
also incorporates feature extraction to capture structural patterns in the
time series. This is especially significant for multivariate time-series data,
for which there is a need to capture the complex temporal dependencies
that may exist between the variables. In summary, this thesis contributes
to the theory, design, and development of algorithms and models for the
detection of anomalies in both static and evolving time series data.
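A minimal sketch of the quantile-interval idea on made-up data; the thesis uses deep quantile regression to produce the interval, for which rolling empirical quantiles stand in here:

```python
import numpy as np

# Instead of assuming Gaussian errors, flag a point as anomalous when it
# falls outside a prediction interval given by low/high quantiles estimated
# from recent history (a stand-in for learned quantile forecasts).
rng = np.random.default_rng(3)
series = list(rng.normal(0.0, 1.0, 300))
series[250] = 9.0   # injected anomaly

def quantile_interval_flags(xs, window=50, lo=0.01, hi=0.99):
    flags = []
    for i in range(window, len(xs)):
        past = np.array(xs[i - window:i])            # sliding window of sub-sequence
        ql, qh = np.quantile(past, [lo, hi])         # quantile PI from recent history
        flags.append(not (ql <= xs[i] <= qh))
    return flags

flags = quantile_interval_flags(series)
print(sum(flags), flags[250 - 50])   # flag for the injected point at index 250
```

The window-based evaluation mirrors the online setting: each point is judged only against the sub-sequence preceding it, so the interval adapts as the series evolves.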
Several experiments were conducted, and the results obtained indicate the
significance of this research on offline and online anomaly detection in
both static and evolving time-series data. In chapter 3, the newly proposed
method (Deep Quantile Regression Anomaly Detection Method) is evaluated
and compared with six other prediction-based anomaly detection
methods that assume a normal distribution of prediction or reconstruction
error for the identification of anomalies. Results in the first part of the experiment indicate that DQR-AD obtained better precision than all other methods, which demonstrates the capability of the method to detect a higher number of anomalous points with low false-positive rates. The results also show that DQR-AD is approximately 2–3 times better than DeepAnT, which in turn performs better than all the remaining methods on all domains in the NAB dataset. In the second part of the experiment, the SMAP dataset is used with 4-dimensional features to demonstrate the method on multivariate time-series data. Experimental results show that DQR-AD has 10% better performance than AE on three datasets (SMAP1, SMAP3, and SMAP5) and equal performance on the remaining two datasets. In chapter 5, two levels of experiments were conducted on the basis of false-positive rate and concept drift adaptation. In the first level of the experiment, the results show that online DQR-AD is 18% better than both DQR-AD and VAE-LSTM on five NAB datasets. Similarly, results in the second level of the experiment show that the online DQR-AD method performs better than five counterpart methods by a margin of roughly 10% on six out of the seven NAB datasets. This result
demonstrates how concept drift adaptation strategies adopted in the proposed
online DQR-AD improve the performance of anomaly detection in
time series.
Petroleum Technology Development Fund (PTDF)
Bayesian Ensemble of Regression Trees for Multinomial Probit and Quantile Regression
This dissertation proposes multinomial probit Bayesian additive regression trees (MPBART), ordered multiclass Bayesian additive classification trees (O-MBACT) and Bayesian quantile additive regression trees (BayesQArt) as extensions of BART (Bayesian additive regression trees) for tackling multinomial choice, multiclass classification, ordinal regression and quantile regression problems. The proposed models exhibit very good predictive performance, in particular ranking among the top-performing procedures when non-linear relationships exist between the response and the predictors. The proposed procedures can readily be applied to datasets in which the number of predictors is larger than the number of observations.
MPBART is sufficiently flexible to allow inclusion of predictors that describe the observed units as well as the available choice alternatives, and it can also be used as a general multiclass classification procedure. Through two simulation studies and four real data examples, we show that MPBART exhibits very good out-of-sample predictive performance in comparison to other discrete choice and multiclass classification methods. To implement MPBART, the R package mpbart is freely available from CRAN repositories.
When ordered gradation is exhibited by a multinomial response, ordinal regression is an appealing framework. Ensemble of trees models, while widely used for binary classification, multiclass classification and continuous response regression, have not been extensively applied to solve ordinal regression problems. This work fills this void with Bayesian sum of regression trees. The predictive performance of our ordered Bayesian ensemble of trees model is illustrated through simulation studies and real data applications.
Ensembles of regression trees have become popular statistical tools for the estimation of the conditional mean given a set of predictors. However, quantile regression trees and their ensembles have not yet garnered much attention despite the increasing popularity of the linear quantile regression model. This work proposes a Bayesian quantile additive regression trees model that shows very good predictive performance, illustrated using simulation studies and real data applications. A further extension to tackle binary classification problems is also considered.
Digital phenotyping through multimodal, unobtrusive sensing
The growing adoption of multimodal wearable and mobile devices, such as smartphones and wrist-worn watches has generated an increase in the collection of physiological and behavioural data at scale. This digital phenotyping data enables researchers to make inferences regarding users’ physical and mental health at scale, for the first time. However, translating this data into actionable insights requires computational approaches that turn unlabelled, multimodal time-series sensor data into validated measures that can be interpreted at scale.
This thesis describes the derivation of novel computational methods that leverage digital phenotyping data from wearable devices in large-scale populations to infer physical behaviours. These methods combine insights from signal processing, data mining and machine learning alongside domain knowledge in physical activity and sleep epidemiology. First, the inference of sleeping windows in free-living conditions through a heart rate sensing approach is explored. This algorithm is particularly valuable in the absence of ground truth or sleep diaries given its simplicity, adaptability and capacity for personalization. I then explore multistage sleep classification through combined movement and cardiac wearable sensing and machine learning. Further, I demonstrate that postural changes detected through wrist accelerometers can inform habitual behaviours and are valuable complements to traditional, intensity-based physical activity metrics. I then leverage the concomitant responses of heart rate to physical activity, captured through multimodal wearable sensors, via a self-supervised training task. The resulting embeddings from this task are shown to be useful for the downstream classification of demographic factors, BMI, energy expenditure and cardiorespiratory fitness. Finally, I describe a deep learning model for the adaptive inference of cardiorespiratory fitness (VO2max) using wearable data in free-living conditions. I demonstrate the robustness of the model in a large UK population and show the model's adaptability by evaluating its performance in a subset of the population with repeated measures ~6 years after the original recordings.
Together, this work increases the potential of multimodal wearable and mobile sensors for physical activity and behavioural inferences in population studies. In particular, this thesis showcases the potential of using wearable devices to make valuable physical activity, sleep and fitness inferences in large cohort studies. Given the nature of the data collected, and the fact that most of this data is currently generated by commercial providers rather than research institutes, laying the foundations for responsible data governance and ethical use of these technologies will be critical to building trust and enabling the development of the field of digital phenotyping.
I was funded by GlaxoSmithKline and the Engineering and Physical Sciences Research Council. I was also supported by the Alan Turing Institute through their Enrichment Scheme.
Random projection ensemble classification
We introduce a very general method for high-dimensional classification, based on careful combination of the results of applying an arbitrary base classifier to random projections of the feature vectors into a lower-dimensional space. In one special case that we study in detail, the random projections are divided into disjoint groups, and within each group we select the projection yielding the smallest estimate of the test error. Our random projection ensemble classifier then aggregates the results of applying the base classifier on the selected projections, with a data-driven voting threshold to determine the final assignment. Our theoretical results elucidate the effect on performance of increasing the number of projections. Moreover, under a boundary condition implied by the sufficient dimension reduction assumption, we show that the test excess risk of the random projection ensemble classifier can be controlled by terms that do not depend on the original data dimension and a term that becomes negligible as the number of projections increases. The classifier is also compared empirically with several other popular high-dimensional classifiers via an extensive simulation study, which reveals its excellent finite-sample performance.
Both authors are supported by an Engineering and Physical Sciences Research Council Fellowship EP/J017213/1; the second author is also supported by a Philip Leverhulme prize.
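A toy sketch of the select-then-vote scheme described above, with a nearest-centroid base classifier on made-up data; note that training error and a fixed 0.5 threshold stand in for the paper's test-error estimate and data-driven voting threshold:

```python
import numpy as np

# Random-projection ensemble: project p-dimensional features to d=2, keep
# the best projection (by error of the base classifier) out of each group,
# then aggregate votes over the selected projections.
rng = np.random.default_rng(4)
p, d, n = 30, 2, 200
X = rng.normal(size=(n, p))
X[:, 0] *= 3.0                           # signal concentrated in one coordinate
y = (X[:, 0] > 0).astype(int)

def centroid_classify(Z_tr, y_tr, Z_te):
    """Nearest-centroid base classifier on projected features."""
    c0, c1 = Z_tr[y_tr == 0].mean(axis=0), Z_tr[y_tr == 1].mean(axis=0)
    return (((Z_te - c1) ** 2).sum(axis=1) < ((Z_te - c0) ** 2).sum(axis=1)).astype(int)

def rp_ensemble_predict(X_tr, y_tr, X_te, groups=10, per_group=20):
    votes = np.zeros(len(X_te))
    for _ in range(groups):
        best_A, best_err = None, np.inf
        for _ in range(per_group):       # keep the best projection in the group
            A = rng.normal(size=(p, d))
            err = np.mean(centroid_classify(X_tr @ A, y_tr, X_tr @ A) != y_tr)
            if err < best_err:
                best_A, best_err = A, err
        votes += centroid_classify(X_tr @ best_A, y_tr, X_te @ best_A)
    return (votes / groups > 0.5).astype(int)

pred = rp_ensemble_predict(X[:150], y[:150], X[150:])
acc = float(np.mean(pred == y[150:]))
print("test accuracy:", acc)
```

Even with a weak base classifier and purely random d=2 projections, selecting within groups and voting across groups recovers the low-dimensional signal buried in 30 dimensions.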
Data analysis and machine learning approaches for time series pre- and post- processing pipelines
In industrial settings, time series are typically generated continuously by sensors that constantly capture and monitor the operation of machines in real time. It is therefore important that data cleaning algorithms support near-real-time operation. Moreover, as the data evolve, the cleaning strategy must adapt incrementally, to avoid having to restart the cleaning process from scratch each time.
The aim of this thesis is to assess the feasibility of applying machine learning workflows to the data preprocessing stages. To this end, this work proposes methods capable of selecting optimal preprocessing strategies, trained on the available historical data by minimising empirical loss functions.
Specifically, this thesis studies the processes of time series compression, variable joining, imputation of observations and generation of surrogate models. In each case, the aim is the optimal selection and combination of multiple strategies. This approach is defined in terms of the characteristics of the data and of the system properties and constraints specified by the user.
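The idea of selecting a preprocessing strategy by minimising an empirical loss on historical data can be sketched as follows; the data and the three candidate imputation strategies are made up for illustration:

```python
import numpy as np

# Candidate imputation strategies are scored by masking known points in
# historical data and measuring how well each strategy reconstructs them;
# the strategy with the smallest empirical loss is selected.
rng = np.random.default_rng(5)
hist = np.sin(np.linspace(0, 20, 400)) + 0.05 * rng.normal(size=400)

strategies = {
    "zero": lambda xs, i: 0.0,
    "last_value": lambda xs, i: xs[i - 1],
    "linear": lambda xs, i: (xs[i - 1] + xs[i + 1]) / 2,
}

masked = rng.integers(1, len(hist) - 1, size=200)   # interior points to hide

def empirical_loss(strategy):
    """Mean squared error of the strategy on the masked historical points."""
    return float(np.mean([(strategy(hist, i) - hist[i]) ** 2 for i in masked]))

losses = {name: empirical_loss(s) for name, s in strategies.items()}
best = min(losses, key=losses.get)
print(best, losses)
```

In an adaptive, incremental setting the same scoring would be repeated on recent windows, so the selected strategy can change as the data evolve.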