104 research outputs found

    When fish are not poisson: modelling the migration of Atlantic salmon (Salmo salar) and sea trout (Salmo trutta) at multiple time scales

    PhD Thesis. Migratory species undertake prolonged seasonal journeys; monitoring these movements is challenging but can sometimes be achieved by observations taken locally and, ideally, using remote methods. Amongst the best known examples of migrating fish in Europe are Atlantic salmon (Salmo salar) and sea trout (Salmo trutta), which migrate between river and seawater. Characteristics of habitat suitability, feeding opportunities, predation, as well as salmonid sensitivity and needs, vary throughout the successive stages of their anadromous life cycle. Since the marine stage is the longest but is also challenging to monitor, in-river fish counters are of increasing importance in understanding patterns in salmonid abundance. The original contribution of this thesis lies in the use of modelling techniques to investigate salmonid migration, based on temporal observations produced by an electronic fish counter triggered by salmonid passage as the fish return to spawn in the River Tyne. Small-scale observation revealed seasonal differences: aggregation behaviour intensified during the middle of the migration season, and explanatory covariates varied in both their effect size and relevance to salmonid abundance. At the population scale, migration was strongly driven by annual periodicity, abundance increased with river temperature, and there was an NAO effect with a four-year lag, underlining the importance of marine conditions to the parent population and/or post-smolts. Differences between distinct populations of S. salar and S. trutta appeared related to species-specific annual periodicity and to oceanic conditions as salmonids return (more so for S. salar). State-space models suggested a complex demographic structure for the two species. There was a species-identification learning curve that affected the data by 2007. A classification algorithm determined that observations are more likely to be S. salar for larger signal amplitude, within higher river flow and earlier in the year; characteristics were too similar between the two species to reach a useful classification success rate (69%). The project overall suggests specificities relating to both species and age class that cannot be addressed in depth with the collected data; emerging limitations and recommendations are discussed. Environment Agency
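The classification step lends itself to a small illustration. The sketch below is not the thesis's algorithm: it fits a plain logistic regression by gradient descent on synthetic counter records whose three covariates (signal amplitude, river flow, day of year) are simulated with the directions of effect reported above; all coefficients and data are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Plain logistic regression fit by full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

# Synthetic counter records: columns [1, amplitude, flow, day-of-year],
# label 1 = S. salar. Effect directions follow the abstract; the
# magnitudes are made up for the example.
n = 1000
amp = rng.normal(0, 1, n)
flow = rng.normal(0, 1, n)
day = rng.normal(0, 1, n)
logit = 0.8 * amp + 0.5 * flow - 0.5 * day
y = (rng.random(n) < sigmoid(logit)).astype(float)

X = np.column_stack([np.ones(n), amp, flow, day])
w = fit_logistic(X, y)
acc = np.mean((sigmoid(X @ w) > 0.5) == y)
```

Because the two simulated classes overlap heavily, the achievable accuracy stays modest, mirroring the limited classification success reported above.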

    A Functional Data Perspective and Baseline On Multi-Layer Out-of-Distribution Detection

    A key feature of out-of-distribution (OOD) detection is to exploit a trained neural network by extracting statistical patterns and relationships through the multi-layer classifier to detect shifts in the expected input data distribution. Despite achieving solid results, several state-of-the-art methods rely on the penultimate or last layer outputs only, leaving behind valuable information for OOD detection. Methods that explore the multiple layers either require a special architecture or a supervised objective to do so. This work adopts an original approach based on a functional view of the network that exploits the sample's trajectories through the various layers and their statistical dependencies. It goes beyond multivariate feature aggregation and introduces a baseline rooted in functional anomaly detection. In this new framework, OOD detection translates into detecting samples whose trajectories differ from the typical behavior characterized by the training set. We validate our method and empirically demonstrate its effectiveness in OOD detection compared to strong state-of-the-art baselines on computer vision benchmarks.
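The trajectory idea can be sketched in a few lines (a toy illustration, not the authors' method): each sample is summarized by one statistic per layer, typical behavior is characterized from the training set, and OOD-ness is scored as deviation from it. The "network" here is just simulated random activations.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_trajectory(activations):
    """Summarize a sample's pass through the network as a 'trajectory':
    one scalar statistic (here, the activation norm) per layer."""
    return np.array([np.linalg.norm(a) for a in activations])

def fit_typical_behavior(train_trajectories):
    """Characterize typical behavior by the per-layer mean and std
    of the training trajectories."""
    T = np.stack(train_trajectories)
    return T.mean(axis=0), T.std(axis=0) + 1e-8

def ood_score(trajectory, mean, std):
    """Maximum per-layer deviation from typical behavior (larger = more OOD)."""
    return float(np.abs((trajectory - mean) / std).max())

# Toy 'network': in-distribution activations are unit-scale, OOD are larger.
n_layers = 5
train = [[rng.normal(0, 1, 16) for _ in range(n_layers)] for _ in range(200)]
mean, std = fit_typical_behavior([layer_trajectory(s) for s in train])

in_dist = layer_trajectory([rng.normal(0, 1, 16) for _ in range(n_layers)])
ood = layer_trajectory([rng.normal(0, 3, 16) for _ in range(n_layers)])
```

A real functional-data treatment would model whole trajectories and their dependencies rather than per-layer z-scores; this sketch only conveys the framing.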

    Wrapper algorithms and their performance assessment on high-dimensional molecular data

    Prediction problems on high-dimensional molecular data, e.g. the classification of microarray samples into normal and cancer tissues, are complex and ill-posed, since the number of variables usually exceeds the number of observations by orders of magnitude. Recent research in the area has propagated a variety of new statistical models to handle these new biological datasets. In practice, however, these models are always applied in combination with preprocessing and variable selection methods, as well as model selection, which is mostly performed by cross-validation. Varma and Simon (2006) used the term 'wrapper algorithm' for this integration of preprocessing and model selection into the construction of statistical models. Additionally, they proposed the method of nested cross-validation (NCV) as a way of estimating the prediction error of such algorithms, which has since evolved into the gold standard. In the first part, this thesis provides further theoretical and empirical justification for the use of NCV in the context of wrapper algorithms. Moreover, a computationally less intensive alternative to NCV is proposed, which can be motivated in a decision-theoretic framework. The new method can be interpreted as a smoothed variant of NCV and, in contrast to NCV, guarantees intuitive bounds for the estimate of the prediction error. The second part focuses on the ranking of wrapper algorithms. Cross-study validation is proposed as an alternative to the repetition of separate within-study validations when several similar prediction problems are available. The concept is demonstrated using six different wrapper algorithms for survival prediction on censored data across a selection of eight breast cancer datasets. Additionally, a parametric bootstrap approach for simulating realistic data from such related prediction problems is described and subsequently applied to illustrate the concept of cross-study validation for the ranking of wrapper algorithms.
Eventually, the last part addresses computational aspects of the analyses and simulations performed in the thesis. The preprocessing before the analysis, as well as the evaluation of the prediction models, requires the use of large computing resources. Parallel computing approaches are illustrated on cluster, cloud and high-performance computing resources using the R programming language. The use of heterogeneous hardware and the processing of large datasets are covered, as well as the implementation of the R package survHD for the analysis and evaluation of high-dimensional wrapper algorithms for survival prediction from censored data.
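Nested cross-validation as used for wrapper algorithms can be sketched generically (this is the standard NCV scheme, not the thesis's smoothed alternative): an inner CV loop performs model selection on the training part only, and an outer loop estimates the error of the whole selection-plus-fitting procedure. Ridge regression with a tuned penalty stands in for a wrapper algorithm here; the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(1)

def kfold_indices(n, k):
    idx = rng.permutation(n)
    return np.array_split(idx, k)

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_error(X, y, lam, k=5):
    """Ordinary cross-validated MSE for a fixed penalty."""
    errs = []
    for f in kfold_indices(len(y), k):
        mask = np.ones(len(y), bool); mask[f] = False
        w = ridge_fit(X[mask], y[mask], lam)
        errs.append(np.mean((X[f] @ w - y[f]) ** 2))
    return np.mean(errs)

def nested_cv_error(X, y, lambdas, k_outer=5):
    """Outer loop: estimate the error of the *whole* procedure,
    including the inner-CV choice of the penalty."""
    errs = []
    for f in kfold_indices(len(y), k_outer):
        mask = np.ones(len(y), bool); mask[f] = False
        Xtr, ytr = X[mask], y[mask]
        # Inner loop: model selection on the training part only.
        best_lam = min(lambdas, key=lambda l: cv_error(Xtr, ytr, l))
        w = ridge_fit(Xtr, ytr, best_lam)
        errs.append(np.mean((X[f] @ w - y[f]) ** 2))
    return float(np.mean(errs))

X = rng.normal(size=(120, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=120)
err = nested_cv_error(X, y, lambdas=[0.01, 0.1, 1.0, 10.0])
```

The key point is that the held-out outer fold never influences the inner selection, so `err` is an (approximately) unbiased estimate of the wrapper's prediction error.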

    Multistep ahead time series prediction

    Time series analysis has been the subject of extensive interest in many fields of study, ranging from weather forecasting to economic predictions, over the past two centuries. It has been fundamental to our understanding of previous patterns within data and has also been used to make predictions over both short- and long-term horizons. When approaching such problems, researchers would typically analyze the given series for a number of distinct characteristics and select the most appropriate technique. However, the complexity of aligning a set of characteristics with a method has increased with the advent of Machine Learning and the introduction of Multi-Step Ahead Prediction (MSAP). We examine the model/strategy approaches which are currently applied to conduct multi-step ahead prediction in time series data and propose an alternative MSAP strategy known as Multi-Resolution Forecast Aggregation (MRFA). Typically, when researchers propose an alternative strategy or method, they demonstrate it on a relatively small set of time series; thus the general breadth of use is unknown. We propose a process that generates a diverse set of synthetic time series, which will enable a robust examination of MRFA and other methods/strategies. This dataset, in conjunction with a range of popular prediction methods and MSAP strategies, is then used to develop a meta-learner that estimates the normalized mean square error of the prediction approach for the given time series.
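The two classical MSAP strategies that such model/strategy comparisons build on can be sketched as follows (generic illustrations, not MRFA itself): the recursive strategy feeds a single one-step model its own predictions, while the direct strategy fits a separate model per horizon step. Simple lagged linear models stand in for the prediction method.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_lagged(series, p):
    """Design matrix of p lagged values (oldest first) -> next value."""
    X = np.column_stack([series[i:len(series) - p + i] for i in range(p)])
    return X, series[p:]

def fit_linear(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def recursive_forecast(series, p, h):
    """Recursive strategy: one one-step model, fed its own predictions."""
    X, y = make_lagged(series, p)
    w = fit_linear(X, y)
    hist = list(series[-p:])
    preds = []
    for _ in range(h):
        preds.append(float(np.dot(hist[-p:], w)))
        hist.append(preds[-1])
    return preds

def direct_forecast(series, p, h):
    """Direct strategy: a separate model for each horizon step."""
    n = len(series)
    preds = []
    for step in range(1, h + 1):
        # Inputs series[t..t+p-1] predict series[t+p-1+step].
        X = np.column_stack([series[i:n - p - step + 1 + i] for i in range(p)])
        y = series[p + step - 1:]
        w = fit_linear(X, y)
        preds.append(float(np.dot(series[-p:], w)))
    return preds

# Toy AR(1) series.
n = 300
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + rng.normal(scale=0.3)
rec = recursive_forecast(x, p=3, h=4)
direct = direct_forecast(x, p=3, h=4)
```

The recursive strategy compounds one-step errors over the horizon, while the direct strategy avoids that at the cost of fitting h models; this trade-off is exactly what strategy-comparison studies examine.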

    Bayesian Ensemble of Regression Trees for Multinomial Probit and Quantile Regression

    This dissertation proposes multinomial probit Bayesian additive regression trees (MPBART), ordered multiclass Bayesian additive classification trees (O-MBACT) and Bayesian quantile additive regression trees (BayesQArt) as extensions of BART (Bayesian additive regression trees) for tackling multinomial choice, multiclass classification, ordinal regression and quantile regression problems. The proposed models exhibit very good predictive performance, in particular ranking among the top-performing procedures when non-linear relationships exist between the response and the predictors. The proposed procedures can readily be applied to datasets with more predictors than observations. MPBART is sufficiently flexible to allow inclusion of predictors that describe the observed units as well as the available choice alternatives, and it can also be used as a general multiclass classification procedure. Through two simulation studies and four real data examples, we show that MPBART exhibits very good out-of-sample predictive performance in comparison to other discrete choice and multiclass classification methods. To implement MPBART, the R package mpbart is freely available from the CRAN repositories. When ordered gradation is exhibited by a multinomial response, ordinal regression is an appealing framework. Ensemble-of-trees models, while widely used for binary classification, multiclass classification and continuous response regression, have not been extensively applied to ordinal regression problems. This work fills this void with a Bayesian sum of regression trees. The predictive performance of our ordered Bayesian ensemble-of-trees model is illustrated through simulation studies and real data applications. Ensembles of regression trees have become popular statistical tools for the estimation of the conditional mean given a set of predictors.
However, quantile regression trees and their ensembles have not yet garnered much attention, despite the increasing popularity of the linear quantile regression model. This work proposes a Bayesian quantile additive regression trees model that shows very good predictive performance, illustrated using simulation studies and real data applications. A further extension to tackle binary classification problems is also considered.
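The objective underlying quantile regression, whether with linear models or tree ensembles, is the pinball (check) loss. A minimal numpy illustration (not the Bayesian sampler proposed here) shows that minimizing it over a constant recovers the empirical quantile.

```python
import numpy as np

rng = np.random.default_rng(3)

def pinball_loss(y, pred, tau):
    """Quantile (pinball) loss: asymmetric absolute error whose
    minimizer over a constant is the tau-quantile of y."""
    r = y - pred
    return np.mean(np.maximum(tau * r, (tau - 1) * r))

y = rng.exponential(scale=2.0, size=5000)
tau = 0.9

# Brute-force minimization over candidate constants.
grid = np.linspace(y.min(), y.max(), 2000)
losses = [pinball_loss(y, c, tau) for c in grid]
best = grid[int(np.argmin(losses))]
```

Quantile regression trees and their ensembles replace the single constant with a constant per leaf, but the loss being minimized (or, in the Bayesian case, the analogous asymmetric-Laplace likelihood) is the same.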

    Random projection ensemble classification

    We introduce a very general method for high-dimensional classification, based on careful combination of the results of applying an arbitrary base classifier to random projections of the feature vectors into a lower-dimensional space. In one special case that we study in detail, the random projections are divided into disjoint groups, and within each group we select the projection yielding the smallest estimate of the test error. Our random projection ensemble classifier then aggregates the results of applying the base classifier on the selected projections, with a data-driven voting threshold to determine the final assignment. Our theoretical results elucidate the effect on performance of increasing the number of projections. Moreover, under a boundary condition implied by the sufficient dimension reduction assumption, we show that the test excess risk of the random projection ensemble classifier can be controlled by terms that do not depend on the original data dimension and a term that becomes negligible as the number of projections increases. The classifier is also compared empirically with several other popular high-dimensional classifiers via an extensive simulation study, which reveals its excellent finite-sample performance. Both authors are supported by an Engineering and Physical Sciences Research Council Fellowship EP/J017213/1; the second author is also supported by a Philip Leverhulme prize.
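The group-wise selection and voting scheme can be sketched as follows. This is a simplified toy, with a nearest-centroid base classifier, training error standing in for the paper's test-error estimate, and a fixed 1/2 voting threshold in place of the data-driven one.

```python
import numpy as np

rng = np.random.default_rng(4)

def nearest_centroid(Xtr, ytr, Xte):
    """Base classifier: assign each point to the nearest class centroid."""
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    d0 = np.linalg.norm(Xte - c0, axis=1)
    d1 = np.linalg.norm(Xte - c1, axis=1)
    return (d1 < d0).astype(int)

def rp_ensemble_predict(Xtr, ytr, Xte, d_low=5, groups=10, per_group=5):
    """Within each group, keep the random projection with the smallest
    training error (the paper selects on a test-error estimate), then
    aggregate the selected projections' votes."""
    d = Xtr.shape[1]
    votes = np.zeros(len(Xte))
    for _ in range(groups):
        best_err, best_P = np.inf, None
        for _ in range(per_group):
            P = rng.normal(size=(d, d_low)) / np.sqrt(d_low)
            err = np.mean(nearest_centroid(Xtr @ P, ytr, Xtr @ P) != ytr)
            if err < best_err:
                best_err, best_P = err, P
        votes += nearest_centroid(Xtr @ best_P, ytr, Xte @ best_P)
    # Fixed 1/2 threshold; the paper chooses this in a data-driven way.
    return (votes / groups > 0.5).astype(int)

# Toy high-dimensional data: classes differ along only 3 of 50 coordinates.
d = 50
mu = np.zeros(d); mu[:3] = 8.0
X0 = rng.normal(size=(100, d))
X1 = rng.normal(size=(100, d)) + mu
Xtr = np.vstack([X0[:80], X1[:80]]); ytr = np.r_[np.zeros(80), np.ones(80)].astype(int)
Xte = np.vstack([X0[80:], X1[80:]]); yte = np.r_[np.zeros(20), np.ones(20)].astype(int)
acc = np.mean(rp_ensemble_predict(Xtr, ytr, Xte) == yte)
```

Because the useful directions are low-dimensional, selecting the best projection within each group recovers most of the signal even though each individual projection is random.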

    Data analysis and machine learning approaches for time series pre- and post- processing pipelines

    157 p. In industrial settings, time series are usually generated continuously by sensors that constantly capture and monitor the operation of machines in real time. It is therefore important that cleaning algorithms support near-real-time operation. Moreover, as the data evolve, the cleaning strategy must adapt incrementally, to avoid having to restart the cleaning process from scratch each time. The aim of this thesis is to assess the feasibility of applying machine learning pipelines to the data preprocessing stages. To this end, this work proposes methods capable of selecting optimal preprocessing strategies, trained on the available historical data by minimizing empirical loss functions. Specifically, this thesis studies the processes of time series compression, variable joining, observation imputation and surrogate model generation, pursuing in each case the optimal selection and combination of multiple strategies. This approach is defined in terms of the characteristics of the data and of the properties and constraints of the system specified by the user.
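The idea of selecting a preprocessing strategy by minimizing empirical loss on historical data can be sketched for the imputation step. This is a toy illustration: the three candidate strategies and the random-masking evaluation scheme are invented for the example, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(5)

def impute_mean(x, mask):
    out = x.copy()
    out[mask] = x[~mask].mean()
    return out

def impute_ffill(x, mask):
    out = x.copy()
    for i in range(len(out)):
        if mask[i]:
            out[i] = out[i - 1] if i > 0 else x[~mask].mean()
    return out

def impute_interp(x, mask):
    out = x.copy()
    idx = np.arange(len(x))
    out[mask] = np.interp(idx[mask], idx[~mask], x[~mask])
    return out

STRATEGIES = {"mean": impute_mean, "ffill": impute_ffill, "interp": impute_interp}

def select_strategy(history, n_trials=50, frac=0.1):
    """Pick the imputation strategy minimizing empirical loss on historical
    data: hide known values at random and measure reconstruction error."""
    losses = {name: 0.0 for name in STRATEGIES}
    for _ in range(n_trials):
        mask = rng.random(len(history)) < frac
        if not mask.any() or mask.all():
            continue
        for name, f in STRATEGIES.items():
            losses[name] += np.mean((f(history, mask)[mask] - history[mask]) ** 2)
    return min(losses, key=losses.get)

# Smooth sensor-like signal: interpolation should win over mean imputation.
t = np.linspace(0, 6 * np.pi, 500)
signal = np.sin(t) + 0.02 * rng.normal(size=t.size)
best = select_strategy(signal)
```

The same train-on-history, minimize-empirical-loss pattern extends to the other stages (compression, variable joining, surrogate models) by swapping in the appropriate candidate strategies and loss.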