Scalable Outlier Detection Methods for Functional Data
International Mention in the doctoral degree.
Recent technological advances have led to an exponential growth in the volume of data
generated. The quest to make sense of these data, some of which are usually complex,
has led to recent interest in the development of statistical methods for analysing data with
complex structures. One such field of interest is functional data analysis (FDA), which
deals with the analysis of data that can be considered as functions, curves, or surfaces
observed over a domain set. Outlier detection is a challenging but important part of
the exploratory analysis process in FDA because functional observations can exhibit
outlyingness in various ways compared to the bulk of the data. This thesis addresses
the problem of detecting and classifying outliers in functional data with three main
contributions.
First, the fdaoutlier R package is presented in Chapter 2. The package contains
implementations of some of the state-of-the-art functional outlier detection methods
in the literature. Some of the methods implemented include directional outlyingness,
magnitude-shape plot, sequential transformations, total variation depth, and modified
shape similarity index. Detailed illustrations of the functions of the package are provided,
using various simulated and real functional datasets curated from the functional
outlier detection literature. Overviews of the functional outlier detection methods implemented
in the package are also presented in Chapter 2. This chapter therefore serves
as a review of some of the current literature in outlier detection for functional data.
Next, two new methods, named ‘Semifast-MUOD’ and ‘Fast-MUOD’, are presented
in Chapter 3. These methods work by computing for each curve three indices (magnitude,
amplitude and shape index) that measure the outlyingness of that curve in terms
of its magnitude, amplitude and shape. ‘Semifast-MUOD’ computes these indices with
respect to (w.r.t.) a random sample of the dataset, while ‘Fast-MUOD’ computes these
indices w.r.t. the point-wise or L1 median. The classical boxplot is then used as a
cutoff on the three indices to identify curves that are outliers of different types. A byproduct
of the methods is an unsupervised classification of the outliers into different
types, without the need for visualisation. Performance evaluation of the methods, using
various real and simulated datasets, shows that Fast-MUOD is the better of the two newly proposed methods for outlier detection, in addition to being very scalable. Comparisons
with the latest functional outlier detection methods in the literature also show
superior or comparable outlier detection performance.
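The index-plus-boxplot idea described above can be illustrated with a minimal Python sketch. The abstract does not give the index formulas, so the definitions used here are an assumption: each curve is regressed on the point-wise median curve, and the magnitude, amplitude and shape indices are taken as |intercept|, |slope - 1| and |correlation - 1| respectively. The function names `fast_muod_indices` and `boxplot_flags` are hypothetical and are not part of the fdaoutlier package; the upper boxplot fence Q3 + 1.5 IQR is the usual convention and is also assumed.

```python
import numpy as np

def fast_muod_indices(curves):
    """Assumed index definitions (not taken from the abstract).

    curves: array of shape (n_curves, n_points).
    Returns (magnitude, amplitude, shape) indices, one value per curve.
    Curves that are exactly constant would give a NaN shape index.
    """
    curves = np.asarray(curves, dtype=float)
    ref = np.median(curves, axis=0)                  # point-wise median curve
    ref_c = ref - ref.mean()
    cur_c = curves - curves.mean(axis=1, keepdims=True)
    cov = cur_c @ ref_c / ref.size                   # covariance with the reference
    var_ref = ref_c @ ref_c / ref.size
    sd_cur = np.sqrt((cur_c ** 2).mean(axis=1))
    rho = cov / (sd_cur * np.sqrt(var_ref))          # correlation with the reference
    beta = cov / var_ref                             # regression slope
    alpha = curves.mean(axis=1) - beta * ref.mean()  # regression intercept
    return np.abs(alpha), np.abs(beta - 1.0), np.abs(rho - 1.0)

def boxplot_flags(index, k=1.5):
    """Flag values above the upper boxplot fence Q3 + k * IQR."""
    q1, q3 = np.percentile(index, [25, 75])
    return index > q3 + k * (q3 - q1)

# Simulated example: 30 noisy sine curves plus three planted outliers.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
base = np.sin(2 * np.pi * t)
curves = base + 0.05 * rng.normal(size=(33, 50))
curves[0] += 5.0                                              # magnitude outlier
curves[1] = 3.0 * base + 0.05 * rng.normal(size=50)           # amplitude outlier
curves[2] = np.cos(2 * np.pi * t) + 0.05 * rng.normal(size=50)  # shape outlier

magnitude, amplitude, shape = fast_muod_indices(curves)
mag_flags = boxplot_flags(magnitude)
amp_flags = boxplot_flags(amplitude)
shape_flags = boxplot_flags(shape)
```

Because each index has its own cutoff, the index on which a curve is flagged also suggests the type of outlier, which is the unsupervised classification mentioned in the abstract. A curve can be flagged on more than one index; in this sketch the phase-shifted curve also has an unusual slope.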
In Chapter 4, some theoretical properties of the Fast-MUOD indices are presented.
These include some definitions of the indices, as well as convergence proofs of the sample
approximations. Some properties of the indices under simple transformations are
also presented in this chapter. Finally, three techniques are presented in Chapter 5 for
extending the Fast-MUOD indices to outlier detection in multivariate functional data
observed on the same domain. These techniques include the use of random projections
and identifying outliers on the marginal components of the multivariate functional data.
The use of random projections showed the best result in performance evaluations with
various real and simulated datasets.
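The random-projection technique can be sketched in Python as follows. The abstract only states that random projections are used, so the details here are assumptions: each multivariate curve is projected onto random unit directions to obtain univariate functional data, a univariate index is computed per projection, and the indices are averaged over projections. The helper `shape_index` is a simple stand-in (one minus the correlation with the point-wise median, in absolute value), and both function names are hypothetical.

```python
import numpy as np

def shape_index(curves):
    """Stand-in univariate index: |corr(curve, point-wise median) - 1|."""
    ref = np.median(curves, axis=0)
    ref_c = ref - ref.mean()
    cur_c = curves - curves.mean(axis=1, keepdims=True)
    rho = (cur_c @ ref_c) / np.sqrt((cur_c ** 2).sum(axis=1) * (ref_c @ ref_c))
    return np.abs(rho - 1.0)

def projected_index(data, n_proj=20, seed=0):
    """Average a univariate index over random projections.

    data: array of shape (n_curves, n_points, n_components).
    Averaging across projections is an assumed aggregation rule.
    """
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_proj, data.shape[2]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # random unit directions
    # data @ d collapses the component axis, giving (n_curves, n_points).
    scores = [shape_index(data @ d) for d in dirs]
    return np.mean(scores, axis=0)

# Simulated bivariate curves (sine, cosine); curve 0 has an altered second component.
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 40)
data = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=-1)
data = data + 0.05 * rng.normal(size=(30, 40, 2))
data[0, :, 1] = np.cos(4 * np.pi * t)

avg_index = projected_index(data)
most_outlying = int(np.argmax(avg_index))
```

The averaged index can then be passed to the same kind of boxplot cutoff used in the univariate case.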
Chapter 6 contains some concluding remarks and possible future research work.
This work has been supported by IMDEA Networks Institute.
Doctoral Programme in Mathematical Engineering, Universidad Carlos III de Madrid.
Chair: Francisco Javier Prieto Fernández. Secretary: Alba María Franco Pereira. Member: Fabian Scheip
Recent Advances in Anomaly Detection Methods Applied to Aviation
Anomaly detection is an active area of research with numerous methods and applications. This survey reviews the state of the art in data-driven anomaly detection techniques and their application to the aviation domain. After a brief introduction to the main traditional data-driven methods for anomaly detection, we review the recent advances in the area of neural networks, deep learning and temporal-logic based learning. In particular, we cover unsupervised techniques applicable to time series data because of their relevance to the aviation domain, where the lack of labeled data is the most usual case, and the nature of flight trajectories and sensor data is sequential, or temporal. The advantages and disadvantages of each method are presented in terms of computational efficiency and detection efficacy. The second part of the survey explores the application of anomaly detection techniques to aviation and their contributions to the improvement of the safety and performance of flight operations and aviation systems. As far as we know, some of the presented methods have not yet found an application in the aviation domain. We review applications ranging from the identification of significant operational events in air traffic operations to the prediction of potential aviation system failures for predictive maintenance.
Assessing the effects of Multivariate Functional outlier identification and sample robustification on identifying critical PM2.5 air pollution episodes in Medellín, Colombia
Identification of critical episodes of environmental pollution, both as an outlier identification problem and as a classification problem, is a common application of multivariate functional data analysis. This article addresses the effects of robustifying multivariate functional samples on the identification of critical pollution episodes in Medellín, Colombia. To do so, it compares 18 depth-based outlier identification methods and highlights the best options in terms of precision through simulation. It then applies the two methods with the best performance to robustify a real dataset of air pollution (PM2.5 concentration) in the Metropolitan Area of Medellín, Colombia and compares the effects of robustifying the samples on the accuracy of supervised classification through the multivariate functional DD-classifier. Our results show that 10 out of 20 methods reviewed perform better in at least one kind of outlier. Nevertheless, no clear positive effects of robustification were identified with the real dataset.
On the Nature and Types of Anomalies: A Review
Anomalies are occurrences in a dataset that are in some way unusual and do
not fit the general patterns. The concept of the anomaly is generally
ill-defined and perceived as vague and domain-dependent. Moreover, despite some
250 years of publications on the topic, no comprehensive and concrete overviews
of the different types of anomalies have hitherto been published. By means of
an extensive literature review this study therefore offers the first
theoretically principled and domain-independent typology of data anomalies, and
presents a full overview of anomaly types and subtypes. To concretely define
the concept of the anomaly and its different manifestations, the typology
employs five dimensions: data type, cardinality of relationship, anomaly level,
data structure and data distribution. These fundamental and data-centric
dimensions naturally yield 3 broad groups, 9 basic types and 61 subtypes of
anomalies. The typology facilitates the evaluation of the functional
capabilities of anomaly detection algorithms, contributes to explainable data
science, and provides insights into relevant topics such as local versus global
anomalies.
Comment: 38 pages (30 pages content), 10 figures, 3 tables. Preprint; review
comments will be appreciated. Improvements in version 2: Explicit mention of
fifth anomaly dimension; Added section on explainable anomaly detection;
Added section on variations on the anomaly concept; Various minor additions
and improvements