Statistical learning methods for functional data with applications to prediction, classification and outlier detection
In the era of big data, Functional Data Analysis has become increasingly important, as it constitutes a powerful tool for tackling inference problems in statistics. In this thesis we propose several methods aimed at solving problems of time series prediction, classification and outlier detection from a functional approach.
The thesis is organized as follows. In Chapter 1 we introduce the concept of functional data and give an overview of the thesis. In Chapter 2 we present the theoretical framework used to develop the proposed methodologies.
In Chapters 3 and 4, two new ordering mappings for functional data are proposed. The first is a kernel depth measure, which satisfies the corresponding theoretical properties, while the second is an entropy measure. In both cases we propose parametric and non-parametric estimation methods that allow us to define an order in the data set at hand. A natural application of these measures is the identification of atypical observations (functions).
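The idea of ordering curves by depth can be loosely illustrated with a generic kernel (h-mode style) depth. This is a sketch under assumed settings, not the thesis's construction, and the bandwidth heuristic is arbitrary:

```python
import numpy as np

# Toy sample of curves on a common grid, with one atypical (shifted) curve.
rng = np.random.default_rng(9)
grid = np.linspace(0, 1, 50)
curves = np.sin(2 * np.pi * grid) + 0.15 * rng.normal(size=(40, 50))
curves[0] += 1.5                     # planted atypical curve

# Kernel depth: average Gaussian-kernel similarity of each curve to the sample.
d = np.linalg.norm(curves[:, None, :] - curves[None, :, :], axis=2)  # L2 distances
h = np.median(d)                     # bandwidth heuristic (assumed)
depth = np.exp(-0.5 * (d / h) ** 2).mean(axis=1)

order = np.argsort(depth)            # ascending: most atypical curves first
print(order[0])
```

The smallest depth values point at candidate atypical functions; the thesis's measures come with theoretical guarantees that this toy ranking does not.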
In Chapter 5 we study the functional autoregressive Hilbertian model. We also propose a new family of basis functions, belonging to a reproducing kernel Hilbert space, for the estimation and prediction of this model. The continuity properties obtained in this space allow us to construct confidence bands for the corresponding predictions over a given time horizon.
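The flavor of autoregressive functional prediction can be sketched with an ordinary Fourier basis in place of the thesis's reproducing-kernel basis; all simulation settings below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
grid = np.linspace(0, 1, 100)
basis = np.column_stack([np.ones_like(grid),
                         np.sin(2 * np.pi * grid),
                         np.cos(2 * np.pi * grid)])

# Simulate a functional time series whose basis coefficients follow a stable AR(1).
A = 0.6 * np.eye(3)
coefs = np.zeros((300, 3))
for t in range(1, 300):
    coefs[t] = coefs[t - 1] @ A + rng.normal(0, 0.2, 3)
curves = coefs @ basis.T + 0.01 * rng.normal(size=(300, 100))

# Project each observed curve onto the basis, then fit the lag-1 operator.
C = curves @ basis @ np.linalg.inv(basis.T @ basis)
A_hat, *_ = np.linalg.lstsq(C[:-1], C[1:], rcond=None)

pred_next = C[-1] @ A_hat @ basis.T   # one-step-ahead functional prediction
print(A_hat.round(2))
```

The estimated operator recovers the (assumed) diagonal 0.6 dynamics; in the thesis the basis lives in an RKHS, which is what enables the continuity properties behind the confidence bands.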
To boost different classification methods, in Chapter 6 we propose a divergence measure for functional data. This measure allows us to determine in which part of the domain two classes of functional data exhibit divergent behavior. The methodology is framed in the field of domain selection and aims to solve classification problems by eliminating redundant information.
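A minimal sketch of the domain-selection idea, using a pointwise standardized mean difference as a stand-in for the thesis's divergence measure (the simulation settings and the threshold are assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
grid = np.linspace(0, 1, 100)
class_a = np.sin(2 * np.pi * grid) + 0.2 * rng.normal(size=(50, 100))
class_b = np.sin(2 * np.pi * grid) + 0.2 * rng.normal(size=(50, 100))
class_b[:, 60:] += 0.8        # the classes differ only on the last part of the domain

# Pointwise standardized mean difference as a simple divergence proxy.
diff = np.abs(class_a.mean(axis=0) - class_b.mean(axis=0))
pooled = np.sqrt((class_a.var(axis=0) + class_b.var(axis=0)) / 2)
divergence = diff / pooled

# Keep only the sub-domain with divergent behavior; discard the redundant part.
selected = grid[divergence > 2.0]
print(f"selected sub-domain: [{selected.min():.2f}, {selected.max():.2f}]")
```

A classifier restricted to the selected sub-domain then works without the redundant portion of the curves.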
Finally, in Chapter 7 the general conclusions of this work and future research lines are presented.
Financial support received from the Spanish Ministry of Economy and Competitiveness ECO2015-66593-P and the UC3M PIF scholarship for doctoral studies.
Doctoral Programme in Business Economics and Quantitative Methods, Universidad Carlos III de Madrid.
Chair: Santiago Velilla Cerdán; Secretary: Kalliopi Mylona; Member: Luis Antonio Belanche Muño
Entropy Measures for Stochastic Processes with Applications in Functional Anomaly Detection
We propose a definition of entropy for stochastic processes. We provide a reproducing kernel Hilbert space model to estimate entropy from a random sample of realizations of a stochastic process, namely functional data, and introduce two approaches to estimate minimum entropy sets. These sets are relevant to detect anomalous or outlier functional data. A numerical experiment illustrates the performance of the proposed method; in addition, we conduct an analysis of mortality rate curves as an interesting application in a real-data context to explore functional anomaly detection.
The first and third authors acknowledge financial support from the Spanish Ministry of Economy and Competitiveness ECO2015-66593-P. The second author acknowledges CONICET Argentina Project 20020150200110BA. The fourth author acknowledges the Spanish Ministry of Economy and Competitiveness Projects GROMA (MTM2015-63710-P), PPI (RTC-2015-3580-7) and UNIKO (RTC-2015-3521-7), and the “methaodos.org” research group at URJC.
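As a rough one-dimensional analogue of minimum entropy sets (not the paper's RKHS construction), one can flag the observations that fall outside a highest-density region estimated with a leave-one-out kernel density; the bandwidth and the flagged fraction are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(0, 1, 197), [15.0, 20.0, 25.0]])  # 3 planted outliers

# Leave-one-out Gaussian kernel density score for each observation.
h = 0.5                                   # kernel bandwidth (assumed)
diffs = (x[:, None] - x[None, :]) / h
kmat = np.exp(-0.5 * diffs ** 2)
np.fill_diagonal(kmat, 0.0)               # drop each point's self-contribution
dens = kmat.sum(axis=1) / ((len(x) - 1) * h * np.sqrt(2 * np.pi))

# Observations outside the highest-density region are anomaly candidates.
alpha = 0.03                              # flag the lowest-density 3% (assumed)
cutoff = np.quantile(dens, alpha)
anomalies = np.where(dens <= cutoff)[0]
print(sorted(anomalies.tolist()))
```

For functional data the paper replaces this density score with an entropy estimated through the RKHS model; the flag-the-lowest-scores logic is the shared idea.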
Exploring Non-Linear Dependencies in Atmospheric Data with Mutual Information
Relations between atmospheric variables are often non-linear, which complicates research efforts to explore and understand multivariable datasets. We describe a mutual information approach to screen for the most significant associations in this setting. This method robustly detects linear and non-linear dependencies after minor data quality checking. Confounding factors and seasonal cycles can be taken into account without predefined models. We present two case studies of this method. The first illustrates deseasonalization of a simple time series, with results identical to the classical method. The second explores associations in a larger dataset of many variables, some of them lognormal (trace gas concentrations) or circular (wind direction). The examples use our Python package ‘ennemi’.
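The screening idea can be illustrated with a plug-in histogram estimator of mutual information (‘ennemi’ itself uses a nearest-neighbor estimator; this simplified sketch only shows why MI catches a dependence that correlation misses):

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Plug-in histogram estimate of mutual information (in nats)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1)
    py = pxy.sum(axis=0)
    outer = np.outer(px, py)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / outer[nz])))

# A symmetric non-linear dependence: y depends on x, but linearly they are uncorrelated.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 10_000)
y = x ** 2 + 0.05 * rng.normal(size=10_000)

r = np.corrcoef(x, y)[0, 1]      # near zero: correlation misses the dependence
mi = mutual_information(x, y)    # clearly positive: MI detects it
print(f"Pearson r = {r:.3f}, MI = {mi:.3f} nats")
```

Screening then amounts to ranking variable pairs by their estimated MI instead of by correlation.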
Information-based Preprocessing of PLC Data for Automatic Behavior Modeling
Cyber-physical systems (CPS) offer immense optimization potential for
manufacturing processes through the availability of multivariate time series
data of actors and sensors. Based on automated analysis software, the
deployment of adaptive and responsive measures is possible for time series
data. Due to the complex and dynamic nature of modern manufacturing, analysis
and modeling often cannot be entirely automated. Even machine learning or deep
learning approaches often depend on a priori expert knowledge and labelling. In this
paper, an information-based data preprocessing approach is proposed. By
applying statistical methods including variance and correlation analysis, an
approximation of the sampling rate in event-based systems and the utilization
of spectral analysis, knowledge about the underlying manufacturing processes
can be gained prior to modeling. The paper shows how statistical analysis
enables the pruning of a dataset's least important features and how the
sampling rate approximation approach sets the base for further data analysis
and modeling. The data's underlying periodicity, originating from the cyclic
nature of an automated manufacturing process, will be detected by utilizing the
fast Fourier transform. This information-based preprocessing method will then
be validated for process time series data of cyber-physical systems'
programmable logic controllers (PLCs).
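The FFT-based periodicity detection described above can be sketched as follows (the signal and sampling settings are synthetic stand-ins for PLC process data):

```python
import numpy as np

# Synthetic PLC-style signal: a cyclic process with a 50-sample period plus noise.
rng = np.random.default_rng(1)
n = 2000
period = 50
t = np.arange(n)
signal = np.sin(2 * np.pi * t / period) + 0.3 * rng.normal(size=n)

# Fast Fourier transform: the dominant non-DC frequency reveals the cycle.
spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(n, d=1.0)        # d = 1 sample per (assumed) scan cycle
dominant = freqs[1:][np.argmax(spectrum[1:])]
estimated_period = 1 / dominant
print(f"estimated period: {estimated_period:.1f} samples")
```

Knowing the cycle length lets later modeling steps segment the time series into process cycles before feature extraction.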
Realtime market microstructure analysis: online Transaction Cost Analysis
Motivated by the practical challenge in monitoring the performance of a large
number of algorithmic trading orders, this paper provides a methodology that
leads to automatic discovery of the causes that lie behind a poor trading
performance. It also gives theoretical foundations to a generic framework for
real-time trading analysis. The academic literature provides different ways to
formalize these algorithms and shows how optimal they can be from a
mean-variance, stochastic control, impulse control or statistical learning
viewpoint. This paper is agnostic about the way the algorithm has been
built and provides a theoretical formalism to identify in real-time the market
conditions that influenced its efficiency or inefficiency. For a given set of
characteristics describing the market context, selected by a practitioner, we
first show how a set of additional derived explanatory factors, called anomaly
detectors, can be created for each market order. We then present an online
methodology to quantify how this extended set of factors, at any given time,
predicts which of the orders are underperforming while calculating the
predictive power of this explanatory factor set. Armed with this information,
which we call influence analysis, we intend to empower the order monitoring
user to take appropriate action on any affected orders by re-calibrating the
trading algorithms working the order through new parameters, pausing their
execution or taking over more direct trading control. We also intend that this
method, used in the post-trade analysis of algorithms, can help adjust their
trading actions automatically.
Comment: 33 pages, 12 figures
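An illustrative sketch of the pipeline's shape, not the paper's formalism: derive anomaly-detector features as rolling z-scores of market context variables, then learn online which of them predict underperforming orders (all variable names, labels and thresholds below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5000
spread = rng.lognormal(0.0, 0.3, n)   # hypothetical context: bid-ask spread
vol = rng.lognormal(0.0, 0.3, n)      # hypothetical context: short-term volatility

def rolling_z(v, window=200):
    """Anomaly detector: how unusual is the current value vs recent history?"""
    z = np.zeros_like(v)
    for i in range(len(v)):
        seg = v[max(0, i - window):i + 1]
        z[i] = (v[i] - seg.mean()) / (seg.std() + 1e-9)
    return z

X = np.column_stack([rolling_z(spread), rolling_z(vol)])
# Ground truth for the sketch: orders underperform when the spread is anomalously wide.
y = (X[:, 0] > 1.0).astype(float)

# Online logistic model: one stochastic-gradient update per arriving order.
w, b, lr = np.zeros(2), 0.0, 0.1
for xi, yi in zip(X, y):
    p = 1.0 / (1.0 + np.exp(-(xi @ w + b)))
    w += lr * (yi - p) * xi
    b += lr * (yi - p)

p_all = 1.0 / (1.0 + np.exp(-(X @ w + b)))
acc = float(((p_all > 0.5) == (y > 0.5)).mean())
print(f"in-sample accuracy: {acc:.2f}")
```

The learned weights play the role of influence analysis in miniature: they quantify which anomaly detectors are currently predictive of poor performance.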
Automation of cleaning and ensembles for outliers detection in questionnaire data
This article focuses on the automatic detection of corrupted or inappropriate responses in questionnaire data using unsupervised outlier detection. Questionnaire surveys are often used in psychology research to collect self-report data, and their preprocessing takes a lot of manual effort. Unlike numerical data, where distance-based outliers prevail, the records in questionnaires have to be assessed from several perspectives that are only loosely related. We identify the most frequent types of errors in questionnaires. For each of them, we suggest different outlier detection methods that rank the records using normalized scores. Considering the similarity between pairs of outlier scores (some are highly uncorrelated), we propose an ensemble based on the union of outliers detected by the different methods. Our outlier detection framework consists of some well-known algorithms, but we also propose novel approaches addressing the typical issues of questionnaires. The selected methods are based on distance, entropy and probability. The experimental section describes the process of assembling the methods and selecting their parameters for the final model detecting significant outliers in the real-world HBSC dataset.
Web of Science, 206, art. no. 11780
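The union-ensemble idea can be sketched as follows; the two scorers are illustrative stand-ins, not the article's exact methods:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.integers(1, 6, size=(200, 20)).astype(float)  # Likert-style answers
data[0] = 5.0                                            # straight-liner
data[1] = np.tile([1.0, 5.0], 10)                        # mechanical zig-zag

def normalize(s):
    """Rescale an outlier score to [0, 1] so different methods are comparable."""
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

# Scorer 1 (distance-based): distance from the average response profile.
dist = normalize(np.linalg.norm(data - data.mean(axis=0), axis=1))

# Scorer 2 (entropy-based): low answer entropy signals invariant responding.
def row_entropy(row):
    _, counts = np.unique(row, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

ent = normalize(np.array([-row_entropy(r) for r in data]))

# Ensemble: union of the records flagged by each method.
flagged = set(np.argsort(dist)[-5:].tolist()) | set(np.argsort(ent)[-5:].tolist())
print(sorted(flagged))
```

Because the two scores capture different error types (and can be nearly uncorrelated), the union catches records that either method alone would miss.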
Quality data assessment and improvement in the pre-processing pipeline to minimize the impact of spurious signals in functional magnetic resonance imaging (fMRI)
In recent years, the field of data quality assessment and signal denoising in functional magnetic resonance imaging (fMRI) has been evolving rapidly, and the identification and reduction of spurious signals in the pre-processing pipeline is one of the most discussed topics. In particular, subject motion and physiological signals, such as respiratory and/or cardiac pulsatility, were shown to introduce false-positive activations in subsequent statistical analyses.
Different measures for evaluating the impact of motion-related artefacts, such as frame-wise displacement and the root mean square of the movement parameters, have been introduced, along with different approaches for reducing these artefacts, such as linear regression of nuisance signals and scrubbing or censoring procedures. However, we identify two main drawbacks: i) the measures used for the evaluation of motion artefacts were based on user-dependent thresholds, and ii) each study described and applied its own pre-processing pipeline. Few studies have analysed the effect of these different pipelines on subsequent analysis methods in task-based fMRI.
The first aim of the study is to obtain a tool for motion fMRI data assessment, based on auto-calibrated procedures, that detects outlier subjects and outlier volumes, calibrated on each investigated sample to ensure homogeneity of the data with respect to motion.
The second aim is to compare the impact of different pre-processing pipelines on task-based fMRI analysed with the GLM, building on recent advances in resting-state fMRI preprocessing. Different output measures based on signal variability and task strength were used for the assessment.
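A sketch of one such auto-calibrated check: compute frame-wise displacement (in the style of Power et al.) from the six rigid-body motion parameters and flag outlier volumes with a sample-derived threshold; the threshold rule here is an assumption, not the study's procedure:

```python
import numpy as np

def framewise_displacement(params, radius=50.0):
    """FD: sum of absolute frame-to-frame changes of the 6 motion parameters,
    with rotations (assumed radians, last 3 columns) converted to mm on a
    sphere of radius 50 mm."""
    d = np.abs(np.diff(params, axis=0))
    d[:, 3:] *= radius
    return np.concatenate([[0.0], d.sum(axis=1)])

# Simulated motion traces: slow drift plus one sudden head jerk at volume 100.
rng = np.random.default_rng(3)
motion = np.cumsum(0.01 * rng.normal(size=(200, 6)), axis=0)
motion[100] += 0.8

fd = framewise_displacement(motion)

# Auto-calibrated threshold from this sample's own FD distribution (assumed rule).
q1, med, q3 = np.percentile(fd, [25, 50, 75])
threshold = med + 1.5 * (q3 - q1)
outlier_volumes = np.where(fd > threshold)[0]
print(outlier_volumes)
```

Deriving the threshold from each sample's own distribution, rather than from a fixed user-chosen cutoff, is the auto-calibration idea the study pursues.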
On nonparametric estimation of a mixing density via the predictive recursion algorithm
Nonparametric estimation of a mixing density based on observations from the
corresponding mixture is a challenging statistical problem. This paper surveys
the literature on a fast, recursive estimator based on the predictive recursion
algorithm. After introducing the algorithm and giving a few examples, I
summarize the available asymptotic convergence theory, describe an important
semiparametric extension, and highlight two interesting applications. I
conclude with a discussion of several recent developments in this area and some
open problems.
Comment: 22 pages, 5 figures. Comments welcome at
https://www.researchers.one/article/2018-12-
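The predictive recursion update itself is compact enough to sketch on a grid; the kernel and weight choices below are standard but assumed:

```python
import numpy as np

# Predictive recursion for a mixing density f on a grid:
#   f_i(u) = (1 - w_i) f_{i-1}(u) + w_i * k(x_i | u) f_{i-1}(u) / m_{i-1}(x_i),
# with m_{i-1}(x) = ∫ k(x | u) f_{i-1}(u) du, kernel k a N(u, 1) density, and
# polynomially decaying weights w_i (a common choice).
rng = np.random.default_rng(2)
grid = np.linspace(-6, 6, 400)
du = grid[1] - grid[0]
f = np.full_like(grid, 1 / 12)           # uniform initial guess on [-6, 6]

# Observations from a two-component mixture of N(-2, 1) and N(2, 1).
x = np.where(rng.random(5000) < 0.5,
             rng.normal(-2, 1, 5000), rng.normal(2, 1, 5000))

for i, xi in enumerate(x):
    w = (i + 2.0) ** (-2 / 3)            # decaying weight sequence (assumed)
    k = np.exp(-0.5 * (xi - grid) ** 2) / np.sqrt(2 * np.pi)
    m = np.sum(k * f) * du               # predictive density m_{i-1}(x_i)
    f = (1 - w) * f + w * k * f / m

print(f"integral ≈ {np.sum(f) * du:.3f}")
```

One pass through the data suffices, which is what makes the estimator fast; the recovered mixing density concentrates near the true support points ±2.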