
    Statistical learning methods for functional data with applications to prediction, classification and outlier detection

    In the era of big data, Functional Data Analysis has become increasingly important insofar as it constitutes a powerful tool for tackling inference problems in statistics. In this thesis we propose several methods aimed at solving problems of time series prediction, classification and outlier detection from a functional approach. The thesis is organized as follows. In Chapter 1 we introduce the concept of functional data and give an overview of the thesis. In Chapter 2 we present the theoretical framework used to develop the proposed methodologies. In Chapters 3 and 4 two new ordering mappings for functional data are proposed. The first is a kernel depth measure, which satisfies the corresponding theoretical properties, while the second is an entropy measure. In both cases we propose parametric and non-parametric estimation methods that allow us to define an order in the data set at hand. A natural application of these measures is the identification of atypical observations (functions). In Chapter 5 we study the Functional Autoregressive Hilbertian model. We also propose a new family of basis functions, belonging to a reproducing kernel Hilbert space, for the estimation and prediction of this model. The continuity properties obtained in this space allow us to construct confidence bands for the corresponding predictions over the forecast horizon. In order to boost different classification methods, in Chapter 6 we propose a divergence measure for functional data. This metric allows us to determine in which part of the domain two classes of functional data exhibit divergent behavior. This methodology is framed in the field of domain selection and aims to solve classification problems by eliminating redundant information.
    Finally, in Chapter 7 the general conclusions of this work and future research lines are presented. Financial support was received from the Spanish Ministry of Economy and Competitiveness (ECO2015-66593-P) and the UC3M PIF scholarship for doctoral studies. Programa de Doctorado en Economía de la Empresa y Métodos Cuantitativos, Universidad Carlos III de Madrid. Chair: Santiago Velilla Cerdán; Secretary: Kalliopi Mylona; Committee member: Luis Antonio Belanche Muño

    Entropy Measures for Stochastic Processes with Applications in Functional Anomaly Detection

    We propose a definition of entropy for stochastic processes. We provide a reproducing kernel Hilbert space model to estimate entropy from a random sample of realizations of a stochastic process, namely functional data, and introduce two approaches to estimate minimum entropy sets. These sets are relevant for detecting anomalous or outlier functional data. A numerical experiment illustrates the performance of the proposed method; in addition, we conduct an analysis of mortality rate curves as an interesting application in a real-data context to explore functional anomaly detection. The first and third authors acknowledge financial support from the Spanish Ministry of Economy and Competitiveness (ECO2015-66593-P). The second author acknowledges CONICET Argentina Project 20020150200110BA. The fourth author acknowledges the Spanish Ministry of Economy and Competitiveness projects GROMA (MTM2015-63710-P), PPI (RTC-2015-3580-7) and UNIKO (RTC-2015-3521-7), and the “methaodos.org” research group at URJC.

    Exploring Non-Linear Dependencies in Atmospheric Data with Mutual Information

    Relations between atmospheric variables are often non-linear, which complicates research efforts to explore and understand multivariable datasets. We describe a mutual information approach to screen for the most significant associations in this setting. This method robustly detects linear and non-linear dependencies after minor data quality checking. Confounding factors and seasonal cycles can be taken into account without predefined models. We present two case studies of this method. The first illustrates deseasonalization of a simple time series, with results identical to the classical method. The second explores associations in a larger dataset of many variables, some of them lognormal (trace gas concentrations) or circular (wind direction). The examples use our Python package ‘ennemi’.
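The core idea of the abstract above — that mutual information catches dependencies Pearson correlation misses — can be illustrated with a minimal sketch. This uses a crude histogram plug-in estimator rather than the nearest-neighbour estimator the ennemi package implements, and all variable names and parameters here are illustrative assumptions, not the package's API:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram plug-in estimate of I(X;Y) in nats.

    Fine for screening strong dependencies, though biased upward
    for small samples; dedicated estimators (e.g. nearest-neighbour
    methods) behave better in practice.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                      # joint distribution
    px = pxy.sum(axis=1, keepdims=True)            # marginal of X
    py = pxy.sum(axis=0, keepdims=True)            # marginal of Y
    nz = pxy > 0                                   # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 5000)
quadratic = x**2 + 0.1 * rng.standard_normal(5000)  # non-linear dependence
noise = rng.uniform(-1, 1, 5000)                    # no dependence

# Pearson correlation is near zero for the quadratic relation,
# but mutual information clearly separates it from pure noise.
corr_quad = abs(np.corrcoef(x, quadratic)[0, 1])
mi_quad = mutual_information(x, quadratic)
mi_noise = mutual_information(x, noise)
```

A symmetric non-linear relation like y = x² is the canonical case where correlation-based screening fails and an information-based screen does not.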

    Information-based Preprocessing of PLC Data for Automatic Behavior Modeling

    Cyber-physical systems (CPS) offer immense optimization potential for manufacturing processes through the availability of multivariate time series data from actors and sensors. Automated analysis software makes it possible to deploy adaptive and responsive measures for such time series data. Due to the complex and dynamic nature of modern manufacturing, analysis and modeling often cannot be entirely automated. Even machine learning or deep learning approaches often depend on a priori expert knowledge and labelling. In this paper, an information-based data preprocessing approach is proposed. By applying statistical methods including variance and correlation analysis, an approximation of the sampling rate in event-based systems and the utilization of spectral analysis, knowledge about the underlying manufacturing processes can be gained prior to modeling. The paper presents how statistical analysis enables pruning of a dataset's least important features and how the sampling rate approximation sets the basis for further data analysis and modeling. The data's underlying periodicity, originating from the cyclic nature of an automated manufacturing process, is detected by utilizing the fast Fourier transform. This information-based preprocessing method is then validated on process time series data from cyber-physical systems' programmable logic controllers (PLCs).
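The FFT-based periodicity detection described above can be sketched in a few lines: take the magnitude spectrum of a uniformly sampled signal and read off the dominant frequency. The synthetic 5-second machine cycle below is an assumed stand-in for real PLC data, not taken from the paper:

```python
import numpy as np

def dominant_period(signal, sample_rate_hz):
    """Return the dominant period (in seconds) of a uniformly
    sampled signal via the real FFT magnitude spectrum.

    The mean is subtracted so the DC component does not mask
    the cyclic component, and the DC bin is skipped.
    """
    spectrum = np.abs(np.fft.rfft(signal - np.mean(signal)))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate_hz)
    peak = np.argmax(spectrum[1:]) + 1  # index of strongest non-DC bin
    return 1.0 / freqs[peak]

# Synthetic PLC-like trace: a 5 s machine cycle sampled at 10 Hz
# for 60 s, with additive sensor noise.
rng = np.random.default_rng(0)
t = np.arange(0, 60, 0.1)
cycle = np.sign(np.sin(2 * np.pi * t / 5.0))   # square-ish 5 s cycle
noisy = cycle + 0.3 * rng.standard_normal(t.size)
period = dominant_period(noisy, sample_rate_hz=10)
```

For event-based systems, the sampling rate passed in would itself come from the paper's rate-approximation step; here it is simply assumed known.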

    Realtime market microstructure analysis: online Transaction Cost Analysis

    Motivated by the practical challenge of monitoring the performance of a large number of algorithmic trading orders, this paper provides a methodology that leads to automatic discovery of the causes behind poor trading performance. It also gives theoretical foundations to a generic framework for real-time trading analysis. The academic literature provides different ways to formalize these algorithms and shows how optimal they can be from a mean-variance, stochastic control, impulse control or statistical learning viewpoint. This paper is agnostic about the way the algorithm has been built and provides a theoretical formalism to identify in real time the market conditions that influenced its efficiency or inefficiency. For a given set of characteristics describing the market context, selected by a practitioner, we first show how a set of additional derived explanatory factors, called anomaly detectors, can be created for each market order. We then present an online methodology to quantify how this extended set of factors, at any given time, predicts which of the orders are underperforming, while calculating the predictive power of this explanatory factor set. Armed with this information, which we call influence analysis, we intend to empower the order monitoring user to take appropriate action on any affected orders by re-calibrating the trading algorithms working the order with new parameters, pausing their execution or taking over more direct trading control. We also intend for this method to be used in post-trade analysis of algorithms to automatically adjust their trading actions. Comment: 33 pages, 12 figures

    Automation of cleaning and ensembles for outliers detection in questionnaire data

    This article focuses on the automatic detection of corrupted or inappropriate responses in questionnaire data using unsupervised outlier detection. Questionnaire surveys are often used in psychology research to collect self-report data, and their preprocessing takes a lot of manual effort. Unlike numerical data, where distance-based outliers prevail, the records in questionnaires have to be assessed from various perspectives that are largely unrelated. We identify the most frequent types of errors in questionnaires. For each of them, we suggest different outlier detection methods ranking the records using normalized scores. Considering the similarity between pairs of outlier scores (some are highly uncorrelated), we propose an ensemble based on the union of outliers detected by the different methods. Our outlier detection framework consists of some well-known algorithms, but we also propose novel approaches addressing the typical issues of questionnaires. The selected methods are based on distance, entropy, and probability. The experimental section describes the process of assembling the methods and selecting their parameters for the final model detecting significant outliers in the real-world HBSC dataset.
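The union-of-detectors idea above can be sketched as follows: normalize each detector's scores so they are comparable, then flag any record that any detector scores highly. The two toy detectors (a distance score and a low-response-entropy score for straight-lining) and all thresholds are illustrative assumptions, not the article's actual methods or parameters:

```python
import numpy as np

def normalize(scores):
    """Min-max normalize outlier scores to [0, 1] so scores from
    different detectors become comparable before combining."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def distance_scores(X):
    """Distance-based score: mean distance to all other records."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return d.mean(axis=1)

def entropy_scores(X):
    """Low answer variability (e.g. straight-lining) is suspicious,
    so score each record by the negative entropy of its answers."""
    out = []
    for row in X:
        _, counts = np.unique(row, return_counts=True)
        p = counts / counts.sum()
        out.append(-np.sum(p * np.log(p)))  # Shannon entropy
    return -np.array(out)                   # low entropy -> high score

def union_ensemble(X, detectors, threshold=0.9):
    """Flag a record if ANY normalized detector score exceeds the
    threshold -- the union combination described in the abstract."""
    flagged = set()
    for det in detectors:
        s = normalize(det(X))
        flagged |= set(np.flatnonzero(s > threshold))
    return sorted(flagged)

# Toy Likert-scale data: 20 ordinary respondents (index 0-19), one
# straight-liner (index 20), one extreme alternator (index 21).
rng = np.random.default_rng(2)
X = rng.integers(2, 5, size=(20, 10)).astype(float)
X = np.vstack([X,
               np.full((1, 10), 3.0),                 # straight-liner
               np.array([[1, 5] * 5], dtype=float)])  # alternator
outliers = union_ensemble(X, [distance_scores, entropy_scores])
```

The point of the union is visible here: the straight-liner is invisible to the distance detector, and the extreme alternator is invisible to the entropy detector, but each is caught by the other.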

    Quality data assessment and improvement in pre-processing pipeline to minimize impact of spurious signals in functional magnetic resonance imaging (fMRI)

    In recent years, the field of quality data assessment and signal denoising in functional magnetic resonance imaging (fMRI) has been rapidly evolving, and the identification and reduction of spurious signals in the pre-processing pipeline is one of the most discussed topics. In particular, subject motion and physiological signals, such as respiratory and/or cardiac pulsatility, were shown to introduce false-positive activations in subsequent statistical analyses. Different measures for evaluating the impact of motion-related artefacts, such as frame-wise displacement and the root mean square of movement parameters, and different approaches for reducing these artefacts, such as linear regression of nuisance signals and scrubbing or censoring procedures, have been introduced. However, we identify two main drawbacks: i) the different measures used for the evaluation of motion artefacts were based on user-dependent thresholds, and ii) each study described and applied its own pre-processing pipeline. Few studies have analysed the effect of these different pipelines on subsequent analysis methods in task-based fMRI. The first aim of the study is to obtain a tool for fMRI motion data assessment, based on auto-calibrated procedures, to detect outlier subjects and outlier volumes, targeted on each investigated sample to ensure homogeneity of data with respect to motion. The second aim is to compare the impact of different pre-processing pipelines on task-based fMRI using the GLM, based on recent advances in resting-state fMRI preprocessing pipelines. Different output measures based on signal variability and task strength were used for the assessment.
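Frame-wise displacement, mentioned above, has a standard closed form: the sum of absolute frame-to-frame changes in the six rigid-body motion parameters, with rotations converted to arc length on a sphere of assumed head radius (50 mm is the conventional choice). The adaptive cutoff below (median + 1.5 × IQR) is only one possible reading of the abstract's "auto-calibrated", sample-targeted criterion, not the study's actual procedure:

```python
import numpy as np

def framewise_displacement(motion, head_radius_mm=50.0):
    """Frame-wise displacement from 6 rigid-body motion parameters
    per volume: 3 translations (mm) then 3 rotations (rad).

    Rotations are converted to millimetres of arc length on a
    sphere of the given head radius; the first volume has no
    predecessor, so its FD is set to 0.
    """
    d = np.abs(np.diff(motion, axis=0))
    fd = d[:, :3].sum(axis=1) + head_radius_mm * d[:, 3:].sum(axis=1)
    return np.concatenate([[0.0], fd])

def adaptive_outlier_volumes(fd):
    """Sample-targeted threshold (median + 1.5 * IQR) instead of a
    fixed user-chosen cutoff: an assumed illustration of an
    auto-calibrated criterion."""
    q1, q3 = np.percentile(fd, [25, 75])
    return np.flatnonzero(fd > np.median(fd) + 1.5 * (q3 - q1))

# Synthetic motion trace: slow drift plus one abrupt 2 mm head
# movement at volume 60.
rng = np.random.default_rng(3)
motion = np.cumsum(0.01 * rng.standard_normal((100, 6)), axis=0)
motion[60, :3] += 2.0
fd = framewise_displacement(motion)
bad = adaptive_outlier_volumes(fd)
```

Note that a single displaced volume produces two high-FD frames (into and out of the displaced position), which is why censoring procedures typically flag the neighbours of a spike as well.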

    On nonparametric estimation of a mixing density via the predictive recursion algorithm

    Nonparametric estimation of a mixing density based on observations from the corresponding mixture is a challenging statistical problem. This paper surveys the literature on a fast, recursive estimator based on the predictive recursion algorithm. After introducing the algorithm and giving a few examples, I summarize the available asymptotic convergence theory, describe an important semiparametric extension, and highlight two interesting applications. I conclude with a discussion of several recent developments in this area and some open problems.Comment: 22 pages, 5 figures. Comments welcome at https://www.researchers.one/article/2018-12-
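The predictive recursion algorithm surveyed above admits a compact implementation: a single pass over the data in which the current mixing-density estimate is blended, with a decreasing weight, with its Bayesian update against each new observation. The weight sequence w_i ∝ (i+2)^(-0.67) and the normal location-mixture example below are common illustrative choices, not prescriptions from the paper:

```python
import numpy as np

def predictive_recursion(x, theta_grid, kernel):
    """One-pass predictive recursion estimate of a mixing density
    f(theta) on a grid, for mixtures m(x) = ∫ k(x|θ) f(θ) dθ.

    Each step replaces f by (1 - w_i) f + w_i * posterior, where
    the posterior is k(x_i|θ) f(θ) renormalized on the grid and
    w_i ~ (i+2)^(-0.67) is a decreasing weight sequence.
    """
    dtheta = theta_grid[1] - theta_grid[0]
    f = np.ones(theta_grid.size)
    f /= f.sum() * dtheta                    # uniform initial guess
    for i, xi in enumerate(x):
        wi = (i + 2.0) ** (-0.67)
        post = kernel(xi, theta_grid) * f
        post /= post.sum() * dtheta          # normalized posterior
        f = (1 - wi) * f + wi * post
    return f

# Normal location mixture k(x|θ) = N(x; θ, 1) with the true mixing
# distribution concentrated at θ = ±2.
rng = np.random.default_rng(4)
theta_true = rng.choice([-2.0, 2.0], size=2000)
x = theta_true + rng.standard_normal(2000)
grid = np.linspace(-6.0, 6.0, 241)           # dθ = 0.05
normal_kernel = lambda xi, t: np.exp(-0.5 * (xi - t) ** 2)
f_hat = predictive_recursion(x, grid, normal_kernel)
```

Because each update is a convex combination of two densities, the estimate stays a proper density at every step; the order-dependence of the single pass is one of the issues the convergence theory surveyed in the paper addresses.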