When fish are not Poisson: modelling the migration of Atlantic salmon (Salmo salar) and sea trout (Salmo trutta) at multiple time scales
PhD Thesis
Migratory species undertake prolonged seasonal journeys; monitoring these movements is challenging but can sometimes be achieved through observations taken locally and, ideally, using remote methods.
Amongst the best-known examples of migrating fish in Europe are Atlantic salmon (Salmo salar) and sea trout (Salmo trutta), which migrate between river and seawater. Characteristics of habitat suitability, feeding opportunities and predation, as well as salmonid sensitivity and needs, vary throughout successive stages of their anadromous life cycle.
Since the marine stage is the longest but is also challenging to monitor, in-river fish counters are of increasing importance in understanding patterns in salmonid abundance. The original contribution of this thesis lies in the use of modelling techniques to investigate salmonid migration, based on temporal observations produced by an electronic fish counter triggered by salmonid passage as the fish return to spawn in the River Tyne.
Small-scale observation revealed seasonal differences: aggregation behaviour intensified during the middle of the migration season, and explanatory covariates varied in both their effect size and relevance to salmonid abundance. At the population scale, migration was strongly driven by annual periodicity, abundance increased with river temperature, and there was an NAO effect with a four-year lag, underlining the importance of marine conditions to the parent population and/or post-smolts. Differences between distinct populations of S. salar and S. trutta appeared related to species-specific annual periodicity and to oceanic conditions as salmonids return (more so for S. salar). State-space models suggested a complex demographic structure for the two species. A species-identification learning curve had affected the data by 2007. A classification algorithm determined that observations are more likely to be S. salar for larger signal amplitude, higher river flow and earlier in the year; characteristics were too similar between the two species to reach a useful classification success rate (69%). Overall, the project suggests specificities relating to both species and age class that cannot be addressed in depth with the collected data; emerging limitations and recommendations are discussed.
Environment Agency
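The title's allusion to counts that are "not Poisson" can be illustrated with a quick overdispersion check; the counter data below are made up for illustration, not from the Tyne counter:

```python
# Hypothetical daily fish-counter data (counts of upstream passages).
# Under a Poisson model the variance equals the mean; migrating salmonids
# aggregate, so counts are typically overdispersed (variance >> mean).
counts = [0, 0, 1, 3, 12, 30, 25, 9, 2, 0, 1, 41, 18, 4, 0]

def dispersion_index(xs):
    """Variance-to-mean ratio: ~1 for Poisson data, >1 indicates overdispersion."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # sample variance
    return var / mean

d = dispersion_index(counts)
print(f"dispersion index = {d:.1f}")  # well above 1: a Poisson model is a poor fit
```

A dispersion index far above 1, as here, is the usual motivation for negative-binomial or other overdispersed count models.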
A Functional Data Perspective and Baseline On Multi-Layer Out-of-Distribution Detection
A key feature of out-of-distribution (OOD) detection is to exploit a trained
neural network by extracting statistical patterns and relationships through the
multi-layer classifier to detect shifts in the expected input data
distribution. Despite achieving solid results, several state-of-the-art methods
rely on the penultimate or last layer outputs only, leaving behind valuable
information for OOD detection. Methods that explore the multiple layers either
require a special architecture or a supervised objective to do so. This work
adopts an original approach based on a functional view of the network that
exploits the sample's trajectories through the various layers and their
statistical dependencies. It goes beyond multivariate feature aggregation and
introduces a baseline rooted in functional anomaly detection. In this new
framework, OOD detection translates into detecting samples whose trajectories
differ from the typical behavior characterized by the training set. We validate
our method and empirically demonstrate its effectiveness in OOD detection
compared to strong state-of-the-art baselines on computer vision benchmarks.
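As a rough illustration of the functional view (not the paper's exact method), a sample can be summarised by one statistic per layer, giving a trajectory across layers, and scored by its deviation from the training trajectories; the data here are synthetic stand-ins for layer statistics:

```python
import numpy as np

# Each sample is summarized by one scalar per layer (e.g. a mean activation
# norm), giving a "trajectory" across L layers. OOD samples are those whose
# trajectory deviates from the envelope of training trajectories.
rng = np.random.default_rng(0)
L = 6                                                       # number of layers (hypothetical)
train = rng.normal(0.0, 1.0, size=(500, L)).cumsum(axis=1)  # in-distribution curves

mu = train.mean(axis=0)   # pointwise mean trajectory
sd = train.std(axis=0)    # pointwise spread

def ood_score(traj):
    """Max pointwise z-deviation of a trajectory from the training envelope."""
    return float(np.max(np.abs((traj - mu) / sd)))

in_sample = rng.normal(0.0, 1.0, size=L).cumsum()
shifted = in_sample + 8.0   # crude stand-in for a distribution shift
print(ood_score(in_sample), ood_score(shifted))
```

A functional-depth notion (as in functional anomaly detection) would replace the simple max-z score, but the principle is the same: atypical whole trajectories, not just atypical last-layer outputs, are flagged.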
Wrapper algorithms and their performance assessment on high-dimensional molecular data
Prediction problems on high-dimensional molecular data, e.g. the classification of microarray samples into normal and cancer tissues, are complex and ill-posed since the number of variables usually exceeds the number of observations by orders of magnitude. Recent research in the area has propagated a variety of new statistical models in order to handle these new biological datasets. In practice, however, these models are always applied in combination with preprocessing and variable selection methods, as well as model selection, which is mostly performed by cross-validation. Varma and Simon (2006) have used the term 'wrapper algorithm' for this integration of preprocessing and model selection into the construction of statistical models. Additionally, they have proposed the method of nested cross-validation (NCV) as a way of estimating the prediction error of wrapper algorithms, which has by now become the gold standard.
In the first part, this thesis provides further theoretical and empirical justification for the use of NCV in the context of wrapper algorithms. Moreover, a computationally less intensive alternative to NCV is proposed, which can be motivated in a decision-theoretic framework. The new method can be interpreted as a smoothed variant of NCV and, in contrast to NCV, guarantees intuitive bounds for the estimation of the prediction error.
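A minimal sketch of the nested cross-validation structure described above, with a toy classifier and made-up data standing in for a real wrapper algorithm:

```python
import numpy as np

# Nested cross-validation (NCV): the inner loop performs model selection
# (here: choosing a hyperparameter k), the outer loop estimates the
# prediction error of the *whole* wrapper, selection included.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=60) > 0).astype(int)

def fit_predict(X_tr, y_tr, X_te, k):
    """Toy classifier (ignores labels): sign of the mean of the first k features."""
    return (X_te[:, :k].mean(axis=1) > 0).astype(int)

def error(y_true, y_pred):
    return float(np.mean(y_true != y_pred))

def folds(n, K):
    idx = np.arange(n)
    return [(np.setdiff1d(idx, te), te) for te in np.array_split(idx, K)]

def nested_cv(X, y, ks=(1, 2, 5), K=5):
    outer_errors = []
    for tr, te in folds(len(y), K):                  # outer loop: error estimation
        inner_err = {k: np.mean([error(y[tr][v], fit_predict(X[tr][t], y[tr][t], X[tr][v], k))
                                 for t, v in folds(len(tr), K)])
                     for k in ks}                    # inner loop: model selection
        best_k = min(inner_err, key=inner_err.get)
        outer_errors.append(error(y[te], fit_predict(X[tr], y[tr], X[te], best_k)))
    return float(np.mean(outer_errors))

print(nested_cv(X, y))
```

The key point is that the outer test fold never touches the inner selection, so the returned error estimates the whole wrapper, not just the final model.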
The second part focuses on the ranking of wrapper algorithms. Cross-study validation is proposed as an alternative concept to the repetition of separate within-study validations when several similar prediction problems are available. The concept is demonstrated using six different wrapper algorithms for survival prediction on censored data on a selection of eight breast cancer datasets. Additionally, a parametric bootstrap approach for simulating realistic data from such related prediction problems is described and subsequently applied to illustrate the concept of cross-study validation for the ranking of wrapper algorithms.
Finally, the last part addresses computational aspects of the analyses and simulations performed in the thesis. The preprocessing before the analysis, as well as the evaluation of the prediction models, requires the use of large computing resources. Parallel computing approaches are illustrated on cluster, cloud and high-performance computing resources using the R programming language. The use of heterogeneous hardware and the processing of large datasets are covered, as well as the implementation of the R package survHD for the analysis and evaluation of high-dimensional wrapper algorithms for survival prediction from censored data.
Multistep ahead time series prediction
Time series analysis has been the subject of extensive interest in many fields of study, ranging from weather forecasting to economic prediction, over the past two centuries. It has been fundamental to our understanding of previous patterns within data and has also been used to make predictions over both short and long term horizons. When approaching such problems, researchers would typically analyse the given series for a number of distinct characteristics and select the most appropriate technique. However, aligning a set of characteristics with a method has increased in complexity with the advent of Machine Learning and the introduction of Multi-Step Ahead Prediction (MSAP). We examine the model/strategy approaches currently applied to conduct multi-step ahead prediction in time series data and propose an alternative MSAP strategy known as Multi-Resolution Forecast Aggregation (MRFA). Typically, when researchers propose an alternative strategy or method, they demonstrate it on a relatively small set of time series, so the general breadth of use is unknown. We propose a process that generates a diverse set of synthetic time series, enabling a robust examination of MRFA and other methods/strategies. This dataset, in conjunction with a range of popular prediction methods and MSAP strategies, is then used to develop a meta-learner that estimates the normalized mean square error of the prediction approach for a given time series.
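For readers unfamiliar with MSAP strategies, a minimal sketch of the two classic approaches (recursive and direct) on a toy AR(1) series; this is illustrative only and is not the proposed MRFA strategy:

```python
import numpy as np

# Two standard multi-step-ahead prediction (MSAP) strategies with a plain
# one-lag least-squares model: "recursive" iterates a 1-step model, feeding
# its own predictions back in; "direct" fits one model per horizon h.
rng = np.random.default_rng(2)
y = [0.0]
for _ in range(300):
    y.append(0.9 * y[-1] + rng.normal(scale=0.1))  # toy AR(1) series
y = np.array(y)

def fit_lag(lag):
    """Least-squares slope for y[t] ~ y[t - lag]."""
    x, t = y[:-lag], y[lag:]
    return float(x @ t / (x @ x))

def recursive_forecast(last, h):
    a = fit_lag(1)
    for _ in range(h):          # iterate the 1-step model h times
        last = a * last
    return last

def direct_forecast(last, h):
    return fit_lag(h) * last    # model trained directly for horizon h

print(recursive_forecast(y[-1], 3), direct_forecast(y[-1], 3))
```

On this well-specified toy series the two strategies nearly agree; their trade-offs (error accumulation for recursive, one model per horizon for direct) are what motivate alternative MSAP strategies.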
Online Anomaly Detection for Time Series: Towards Incorporating Feature Extraction, Model Uncertainty and Concept Drift Adaptation for Improving Anomaly Detection
Time series anomaly detection receives increasing research interest given the growing number of data-rich application domains. Recent additions to anomaly detection methods in the research literature include deep learning algorithms. The nature and performance of these algorithms in sequence analysis enable them to learn hierarchical discriminative features and the temporal structure of time series. However, their performance is affected by the speed at which the time series arrives, the use of a fixed threshold, and the assumption of a Gaussian distribution on the prediction error to identify anomalous values. An exact parametric distribution is often not directly relevant in many applications, and it is often difficult to select an appropriate threshold that will differentiate anomalies from noise.
Thus, implementations need a Prediction Interval (PI) that quantifies the level of uncertainty associated with Deep Neural Network (DNN) point forecasts, which helps in making better-informed decisions and mitigates false anomaly alerts. To achieve this, a new anomaly detection method is proposed that computes the uncertainty in estimates using quantile regression and uses the quantile interval to identify anomalies. Similarly,
to handle the speed at which the data arrives, an online anomaly detection
method is proposed where a model is trained incrementally to adapt
to the concept drift that improves prediction. This is implemented using a
window-based strategy, in which a time series is broken into sliding windows
of sub-sequences as input to the model. To adapt to concept drift,
the model is updated when changes occur in the new arrival instances.
This is achieved by using anomaly likelihood which is computed using the
Q-function to define the abnormal degree of the current data point based
on the previous data points. Specifically, when concept drift occurs, the
proposed method will mark the current data point as anomalous. However,
when the abnormal behavior continues for a longer period of time,
the abnormal degree of the current data point will be low compared to the
previous data points using the likelihood. As such, the current data point is
added to the previous data to retrain the model which will allow the model
to learn the new characteristics of the data and hence adapt to the concept
changes thereby redefining the abnormal behavior. The proposed method
also incorporates feature extraction to capture structural patterns in the
time series. This is especially significant for multivariate time-series data,
for which there is a need to capture the complex temporal dependencies
that may exist between the variables. In summary, this thesis contributes
to the theory, design, and development of algorithms and models for the
detection of anomalies in both static and evolving time series data.
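A minimal sketch of the quantile-interval idea on made-up data; the thesis uses deep quantile regression to produce the interval, for which rolling empirical quantiles stand in here:

```python
import numpy as np

# Instead of assuming Gaussian errors, flag a point as anomalous when it
# falls outside a prediction interval given by low/high quantiles estimated
# from recent history (a stand-in for learned quantile forecasts).
rng = np.random.default_rng(3)
series = list(rng.normal(0.0, 1.0, 300))
series[250] = 9.0   # injected anomaly

def quantile_interval_flags(xs, window=50, lo=0.01, hi=0.99):
    flags = []
    for i in range(window, len(xs)):
        past = np.array(xs[i - window:i])            # sliding window of sub-sequence
        ql, qh = np.quantile(past, [lo, hi])         # quantile PI from recent history
        flags.append(not (ql <= xs[i] <= qh))
    return flags

flags = quantile_interval_flags(series)
print(sum(flags), flags[250 - 50])   # flag for the injected point at index 250
```

The window-based evaluation mirrors the online setting: each point is judged only against the sub-sequence preceding it, so the interval adapts as the series evolves.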
Several experiments were conducted, and the results obtained indicate the
significance of this research on offline and online anomaly detection in
both static and evolving time-series data. In chapter 3, the newly proposed
method (Deep Quantile Regression Anomaly Detection Method) is evaluated
and compared with six other prediction-based anomaly detection
methods that assume a normal distribution of prediction or reconstruction
error for the identification of anomalies. Results in the first part of the experiment indicate that DQR-AD obtained better precision than all other methods, which demonstrates the capability of the method to detect a higher number of anomalous points with low false-positive rates. The results also show that DQR-AD is approximately 2–3 times better than DeepAnT, which in turn performs better than all the remaining methods on all domains in the NAB dataset. In the second part of the experiment, the SMAP dataset is used with 4-dimensional features to demonstrate the method on multivariate time-series data. Experimental results show that DQR-AD has 10% better performance than AE on three datasets (SMAP1, SMAP3, and SMAP5) and equal performance on the remaining two datasets. In chapter 5, two levels of experiments were conducted on the basis of false-positive rate and concept drift adaptation. In the first level of the experiment, the results show that online DQR-AD is 18% better than both DQR-AD and VAE-LSTM on five NAB datasets. Similarly, results in the second level of the experiment show that the online DQR-AD method performs better than five counterpart methods by a margin of roughly 10% on six out of the seven NAB datasets. This result
demonstrates how concept drift adaptation strategies adopted in the proposed
online DQR-AD improve the performance of anomaly detection in
time series.
Petroleum Technology Development Fund (PTDF)
Bayesian Ensemble of Regression Trees for Multinomial Probit and Quantile Regression
This dissertation proposes multinomial probit Bayesian additive regression trees (MPBART), ordered multiclass Bayesian additive classification trees (O-MBACT) and Bayesian quantile additive regression trees (BayesQArt) as extensions of BART (Bayesian additive regression trees) for tackling multinomial choice, multiclass classification, ordinal regression and quantile regression problems. The proposed models exhibit very good predictive performance, in particular ranking among the top-performing procedures when non-linear relationships exist between the response and the predictors. The proposed procedures can readily be applied to datasets in which the number of predictors is larger than the number of observations.
MPBART is sufficiently flexible to allow inclusion of predictors that describe the observed units as well as the available choice alternatives, and it can also be used as a general multiclass classification procedure. Through two simulation studies and four real data examples, we show that MPBART exhibits very good out-of-sample predictive performance in comparison to other discrete choice and multiclass classification methods. To implement MPBART, the R package mpbart is freely available from CRAN repositories.
When ordered gradation is exhibited by a multinomial response, ordinal regression is an appealing framework. Ensemble of trees models, while widely used for binary classification, multiclass classification and continuous response regression, have not been extensively applied to solve ordinal regression problems. This work fills this void with Bayesian sum of regression trees. The predictive performance of our ordered Bayesian ensemble of trees model is illustrated through simulation studies and real data applications.
Ensembles of regression trees have become popular statistical tools for the estimation of the conditional mean given a set of predictors. However, quantile regression trees and their ensembles have not yet garnered much attention despite the increasing popularity of the linear quantile regression model. This work proposes a Bayesian quantile additive regression trees model that shows very good predictive performance, illustrated using simulation studies and real data applications. A further extension to tackle binary classification problems is also considered.
Digital phenotyping through multimodal, unobtrusive sensing
The growing adoption of multimodal wearable and mobile devices, such as smartphones and wrist-worn watches has generated an increase in the collection of physiological and behavioural data at scale. This digital phenotyping data enables researchers to make inferences regarding users’ physical and mental health at scale, for the first time. However, translating this data into actionable insights requires computational approaches that turn unlabelled, multimodal time-series sensor data into validated measures that can be interpreted at scale.
This thesis describes the derivation of novel computational methods that leverage digital phenotyping data from wearable devices in large-scale populations to infer physical behaviours. These methods combine insights from signal processing, data mining and machine learning alongside domain knowledge in physical activity and sleep epidemiology. First, the inference of sleeping windows in free-living conditions through a heart rate sensing approach is explored. This algorithm is particularly valuable in the absence of ground truth or sleep diaries given its simplicity, adaptability and capacity for personalization. I then explore multistage sleep classification through combined movement and cardiac wearable sensing and machine learning. Further, I demonstrate that postural changes detected through wrist accelerometers can inform habitual behaviours and are valuable complements to traditional, intensity-based physical activity metrics. I then leverage the concomitant responses of heart rate to physical activity, captured through multimodal wearable sensors, via a self-supervised training task. The resulting embeddings from this task are shown to be useful for the downstream classification of demographic factors, BMI, energy expenditure and cardiorespiratory fitness. Finally, I describe a deep learning model for the adaptive inference of cardiorespiratory fitness (VO2max) using wearable data in free-living conditions. I demonstrate the robustness of the model in a large UK population and show the model's adaptability by evaluating its performance in a subset of the population with repeated measures ~6 years after the original recordings.
Together, this work increases the potential of multimodal wearable and mobile sensors for physical activity and behavioural inferences in population studies. In particular, this thesis showcases the potential of using wearable devices to make valuable physical activity, sleep and fitness inferences in large cohort studies. Given the nature of the data collected, and the fact that most of this data is currently generated by commercial providers rather than research institutes, laying the foundations for responsible data governance and ethical use of these technologies will be critical to building trust and enabling the development of the field of digital phenotyping.
I was funded by GlaxoSmithKline and the Engineering and Physical Sciences Research Council. I was also supported by the Alan Turing Institute through their Enrichment Scheme.
Random projection ensemble classification
We introduce a very general method for high-dimensional classification, based on careful combination of the results of applying an arbitrary base classifier to random projections of the feature vectors into a lower-dimensional space. In one special case that we study in detail, the random projections are divided into disjoint groups, and within each group we select the projection yielding the smallest estimate of the test error. Our random projection ensemble classifier then aggregates the results of applying the base classifier on the selected projections, with a data-driven voting threshold to determine the final assignment. Our theoretical results elucidate the effect on performance of increasing the number of projections. Moreover, under a boundary condition implied by the sufficient dimension reduction assumption, we show that the test excess risk of the random projection ensemble classifier can be controlled by terms that do not depend on the original data dimension and a term that becomes negligible as the number of projections increases. The classifier is also compared empirically with several other popular high-dimensional classifiers via an extensive simulation study, which reveals its excellent finite-sample performance.
Both authors are supported by an Engineering and Physical Sciences Research Council Fellowship EP/J017213/1; the second author is also supported by a Philip Leverhulme prize.
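A toy sketch of the select-then-vote scheme described above, with a nearest-centroid base classifier on made-up data; note that training error and a fixed 0.5 threshold stand in for the paper's test-error estimate and data-driven voting threshold:

```python
import numpy as np

# Random-projection ensemble: project p-dimensional features to d=2, keep
# the best projection (by error of the base classifier) out of each group,
# then aggregate votes over the selected projections.
rng = np.random.default_rng(4)
p, d, n = 30, 2, 200
X = rng.normal(size=(n, p))
X[:, 0] *= 3.0                           # signal concentrated in one coordinate
y = (X[:, 0] > 0).astype(int)

def centroid_classify(Z_tr, y_tr, Z_te):
    """Nearest-centroid base classifier on projected features."""
    c0, c1 = Z_tr[y_tr == 0].mean(axis=0), Z_tr[y_tr == 1].mean(axis=0)
    return (((Z_te - c1) ** 2).sum(axis=1) < ((Z_te - c0) ** 2).sum(axis=1)).astype(int)

def rp_ensemble_predict(X_tr, y_tr, X_te, groups=10, per_group=20):
    votes = np.zeros(len(X_te))
    for _ in range(groups):
        best_A, best_err = None, np.inf
        for _ in range(per_group):       # keep the best projection in the group
            A = rng.normal(size=(p, d))
            err = np.mean(centroid_classify(X_tr @ A, y_tr, X_tr @ A) != y_tr)
            if err < best_err:
                best_A, best_err = A, err
        votes += centroid_classify(X_tr @ best_A, y_tr, X_te @ best_A)
    return (votes / groups > 0.5).astype(int)

pred = rp_ensemble_predict(X[:150], y[:150], X[150:])
acc = float(np.mean(pred == y[150:]))
print("test accuracy:", acc)
```

Even with a weak base classifier and purely random d=2 projections, selecting within groups and voting across groups recovers the low-dimensional signal buried in 30 dimensions.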
Data analysis and machine learning approaches for time series pre- and post- processing pipelines
In industrial settings, time series are typically generated continuously by sensors that constantly capture and monitor the operation of machines in real time. It is therefore important that data cleaning algorithms support near-real-time operation. Moreover, as the data evolve, the cleaning strategy must adapt incrementally, to avoid having to restart the cleaning process from scratch each time.
The aim of this thesis is to assess the feasibility of applying machine learning workflows to the data preprocessing stages. To this end, this work proposes methods capable of selecting optimal preprocessing strategies, trained on the available historical data by minimising empirical loss functions.
Specifically, this thesis studies the processes of time series compression, variable joining, imputation of observations and generation of surrogate models. In each case, the aim is the optimal selection and combination of multiple strategies. This approach is defined in terms of the characteristics of the data and of the system properties and constraints specified by the user.
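The idea of selecting a preprocessing strategy by minimising an empirical loss on historical data can be sketched as follows; the data and the three candidate imputation strategies are made up for illustration:

```python
import numpy as np

# Candidate imputation strategies are scored by masking known points in
# historical data and measuring how well each strategy reconstructs them;
# the strategy with the smallest empirical loss is selected.
rng = np.random.default_rng(5)
hist = np.sin(np.linspace(0, 20, 400)) + 0.05 * rng.normal(size=400)

strategies = {
    "zero": lambda xs, i: 0.0,
    "last_value": lambda xs, i: xs[i - 1],
    "linear": lambda xs, i: (xs[i - 1] + xs[i + 1]) / 2,
}

masked = rng.integers(1, len(hist) - 1, size=200)   # interior points to hide

def empirical_loss(strategy):
    """Mean squared error of the strategy on the masked historical points."""
    return float(np.mean([(strategy(hist, i) - hist[i]) ** 2 for i in masked]))

losses = {name: empirical_loss(s) for name, s in strategies.items()}
best = min(losses, key=losses.get)
print(best, losses)
```

In an adaptive, incremental setting the same scoring would be repeated on recent windows, so the selected strategy can change as the data evolve.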