144 research outputs found
Efficient and Highly Robust Hotelling T² Control Charts Using Reweighted Minimum Vector Variance
The Hotelling T² control chart is an effective tool in statistical process control for multivariate environments. However, the performance of the traditional Hotelling T² chart built on classical location and scatter estimators is usually marred by masking and swamping effects. To alleviate this problem, robust estimators are recommended. The most popular and widely used robust estimator in the Hotelling T² chart is the minimum covariance determinant (MCD). Recently, a new robust estimator known as minimum vector variance (MVV) was introduced. It possesses a high breakdown point and affine equivariance, and is superior in terms of computational efficiency. Owing to these properties, this study proposed replacing the classical estimators with the MVV location and scatter estimators when constructing the Hotelling T² chart for individual observations in Phase II analysis. Nevertheless, some drawbacks were discovered: inconsistency under the normal distribution, bias for small sample sizes, and low efficiency at high breakdown points. To make the MVV scatter estimator consistent and unbiased, it was multiplied by consistency and correction factors, respectively. To maintain the high breakdown point while achieving high statistical efficiency, a reweighted version of the MVV estimator (RMVV) was proposed and applied in the construction of the Hotelling T² chart. The new robust chart detected outliers effectively while simultaneously controlling false alarm rates. Beyond the simulated data, analysis of real data also found that the new robust chart detected out-of-control observations better than the other charts investigated in this study. Given its good performance on both simulated and real data, the new robust Hotelling T² chart is a good alternative to existing Hotelling T² charts.
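The Phase II monitoring scheme described above can be sketched in a few lines. MVV/RMVV estimators are not, to my knowledge, available in common Python libraries, so this sketch substitutes scikit-learn's MCD (the abstract's main comparator) as the robust location/scatter pair, and uses a standard chi-square approximation for the control limit rather than the paper's exact calibration.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
p = 3
phase1 = rng.multivariate_normal(np.zeros(p), np.eye(p), size=200)
phase1[:5] += 6.0  # a few Phase I outliers that would distort classical estimates

# robust location/scatter from Phase I data (MCD as a stand-in for MVV/RMVV)
mcd = MinCovDet(random_state=0).fit(phase1)
mu, prec = mcd.location_, np.linalg.inv(mcd.covariance_)

def t2(x):
    """Hotelling T^2 statistic for one new (Phase II) observation."""
    d = np.asarray(x) - mu
    return float(d @ prec @ d)

ucl = chi2.ppf(0.995, df=p)  # approximate upper control limit
print(t2(np.zeros(p)) <= ucl, t2(np.full(p, 6.0)) > ucl)
```

Because the robust estimates ignore the contaminated Phase I points, an in-control observation stays below the limit while a shifted one signals.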
An Object-Oriented Framework for Robust Multivariate Analysis
Taking advantage of the S4 class system of the R programming environment, which facilitates the creation and maintenance of reusable and modular components, an object-oriented framework for robust multivariate analysis was developed. The framework resides in the packages robustbase and rrcov and includes an almost complete set of algorithms for computing robust multivariate location and scatter, various robust methods for principal component analysis, and robust linear and quadratic discriminant analysis. The design of these methods follows common patterns, which we call statistical design patterns in analogy to the design patterns widely used in software engineering. The application of the framework to data analysis, as well as possible extensions through the development of new methods, is demonstrated on examples which are themselves part of the package rrcov.
Statistical methods for detecting outlying observations in high-dimensional data
Unsupervised outlier detection is a crucial issue in statistical practice. In the industrial context of fault detection, this task is of great importance for ensuring high-quality production. With the exponential increase in the number of measurements made on electronic components, the problem of high dimensionality arises in the identification of outlying observations. The ippon innovation company, an expert in industrial statistics and anomaly detection, took up this challenge by collaborating with the TSE-R research laboratory and financing this thesis work. The first chapter presents the quality control context and the procedures already in place, mainly in companies producing semiconductors for the automotive industry. Because these practices do not meet the new expectations raised by high-dimensional data, other solutions must be considered. The remainder of the chapter summarizes existing unsupervised multivariate methods for outlier detection, with a particular emphasis on those that handle high-dimensional data. Chapter 2 shows theoretically that the well-known Mahalanobis distance is not suited to detecting outliers that lie in a low-dimensional subspace when the number of variables is large. In this context, the Invariant Coordinate Selection (ICS) method is introduced as an interesting alternative for highlighting the structure of outlyingness. A methodology for selecting only the relevant components is proposed, and its performance is compared with the usual benchmarks in simulations as well as on real industrial examples. This new procedure has been implemented in an R package, ICSOutlier, presented in Chapter 3, and in an R Shiny application (package ICSShiny) that makes it more user-friendly. One direct consequence of the growing number of dimensions is the singularity of multivariate scatter estimators as soon as some variables are collinear or their number exceeds the number of individuals. However, the definition of ICS by Tyler et al. (2009) relies on positive definite scatter estimators. Chapter 4 proposes three different ways of adapting the ICS method to singular scatter matrices and investigates their properties theoretically; the question of affine invariance is analyzed in particular. Finally, the last chapter is dedicated to the algorithm developed for the company. Although the algorithm is confidential, the chapter presents the main ideas and the challenges, mostly numerical, encountered during its development.
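The core ICS computation described above can be illustrated with a minimal numpy sketch using one common scatter pair (covariance and the one-step fourth-moment scatter); the thesis and the ICSOutlier package add the actual component-selection methodology and cut-offs, which this sketch omits.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
n, p = 500, 10
X = rng.normal(size=(n, p))
X[:10, 0] += 8.0  # outliers confined to a one-dimensional subspace

mu = X.mean(axis=0)
Xc = X - mu
V1 = np.cov(X, rowvar=False)
d2 = np.einsum('ij,jk,ik->i', Xc, np.linalg.inv(V1), Xc)  # squared Mahalanobis
V2 = (Xc * d2[:, None]).T @ Xc / (n * (p + 2))            # fourth-moment scatter

# ICS: generalized eigenproblem V2 b = rho V1 b; components with extreme
# eigenvalues (kurtosis) carry the outlying structure
rho, B = eigh(V2, V1)
scores = Xc @ B[:, -1]  # component with the largest eigenvalue
flagged = np.argsort(-np.abs(scores))[:10]
print(np.sort(flagged))
```

Even though the shift affects only one of ten coordinates, the top ICS component concentrates on that direction and the planted points dominate its scores.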
Robust methods based on shrinkage
In this thesis, robust methods based on the notion of shrinkage are proposed for outlier detection and robust regression. A collection of robust Mahalanobis distances is proposed for multivariate outlier detection. The robust intensity and scaling factors needed to define the shrinkage of the robust estimators used in the distances are optimally estimated. Properties such as affine equivariance and the breakdown value are investigated. The performance of the proposal is illustrated through comparison with other robust techniques from the literature, in a simulation study and on a real breast cancer data set. The robust alternatives are also reviewed, highlighting their advantages and disadvantages. The behavior under heavy-tailed or skewed underlying distributions shows that the proposed method remains appropriate when we deviate from the common assumption of normality. The high true positive rates and low false positive rates obtained in the vast majority of cases, together with the significantly smaller computational time, demonstrate the advantages of the proposal.
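The distance-plus-cutoff mechanics behind this proposal can be sketched with a familiar shrinkage covariance estimator. Note the hedge: Ledoit-Wolf shrinkage is not robust and is not the thesis's estimator; it only stands in to show how shrunk location/covariance estimates feed a Mahalanobis distance compared against a chi-square quantile.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.normal(size=(n, p))
X[:5] += 6.0  # planted multivariate outliers

# shrinkage covariance estimate (Ledoit-Wolf, a stand-in for the thesis's
# robust shrinkage estimators)
lw = LedoitWolf().fit(X)
diff = X - lw.location_
d2 = np.einsum('ij,jk,ik->i', diff, lw.get_precision(), diff)

cutoff = chi2.ppf(0.975, df=p)  # the classical chi-square cut-off
flagged = np.where(d2 > cutoff)[0]
print(flagged[:5])
```

With this mild contamination level the planted points stand out clearly; heavier contamination is exactly where a robust (rather than plain) shrinkage estimator becomes necessary.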
On the other hand, a robust estimator is proposed for the parameters that characterize the linear regression problem; it is likewise based on the notion of shrinkage. A thorough simulation study investigates its efficiency under Normal and heavy-tailed errors, its robustness under contamination, its computational time, and the affine equivariance and breakdown value of the regression estimator. It is compared with the classical Ordinary Least Squares (OLS) approach and with robust alternatives from the literature, which are also briefly reviewed in the thesis. Two classical data sets often used in the literature and a real socio-economic data set on the Living Environment Deprivation (LED) of areas in Liverpool (UK) are studied. The results from the simulations and the real data examples show the advantages of the proposed robust regression estimator. With the LED data set, it is also shown that the proposed robust regression method outperforms machine learning techniques previously applied to these data, with the added advantage of interpretability.
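The kind of contamination experiment described here is easy to reproduce. As a hedge: the sketch below uses scikit-learn's Huber M-estimator as a generic robust stand-in, not the thesis's shrinkage-based estimator; it only illustrates how OLS breaks under a contaminated sample while a robust fit does not.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 10, size=(n, 1))
y = 2.0 * x.ravel() + 1.0 + rng.normal(scale=0.5, size=n)

# contaminate: a group of bad points at high x pulls the OLS slope down
x[:15] = 10.0
y[:15] = 0.0

ols = LinearRegression().fit(x, y)        # classical OLS, not robust
huber = HuberRegressor().fit(x, y)        # robust M-estimator (stand-in)
print(round(ols.coef_[0], 2), round(huber.coef_[0], 2))
```

The OLS slope is dragged well below the true value of 2, while the robust fit downweights the large residuals and stays close to it.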
Furthermore, an adaptive threshold that depends on the sample size and the dimension of the data is introduced for the proposed robust Mahalanobis distance based on shrinkage estimators. The cut-off differs from the classical choice of the 0.975 chi-square quantile and provides a more accurate method for detecting multivariate outliers. A simulation study checks the performance improvement of the new cut-off over the classical one. The adjusted quantile shows improved performance even when the underlying distribution is heavy-tailed or skewed. The method is illustrated using the LED data set, and the results demonstrate the additional advantages of the adaptive threshold for the regression problem.
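The general idea of a cut-off that adapts to the sample size n and dimension p, instead of the fixed 0.975 chi-square quantile, can be conveyed by Monte Carlo calibration; the thesis derives its own threshold, so this is only an illustrative sketch of why the fixed quantile misbehaves in finite samples.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
n, p, reps = 50, 10, 500

def max_sq_dist():
    """Largest squared Mahalanobis distance in one clean Gaussian sample."""
    X = rng.normal(size=(n, p))
    Xc = X - X.mean(axis=0)
    prec = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum('ij,jk,ik->i', Xc, prec, Xc).max()

# calibrate so that, on clean data of the same size and dimension, only
# 5% of samples would flag any observation at all
adaptive = float(np.quantile([max_sq_dist() for _ in range(reps)], 0.95))
classical = chi2.ppf(0.975, df=p)
print(round(classical, 2), round(adaptive, 2))
```

For this (n, p) the calibrated threshold sits well above the 0.975 chi-square quantile, showing how the fixed quantile over-flags clean observations in moderate samples.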
I want to acknowledge the financial support received from the Spanish
Ministry of Economy and Competitiveness ECO2015-66593-P and the UC3M PIF
pre-doctoral scholarship. Doctoral Programme in Mathematical Engineering, Universidad Carlos III de Madrid. Committee: Chair: Fco. Javier Nogales Martín; Secretary: Julio Rodríguez Puerta; Member: José Manuel Mira Mcwilliam
An alternative Hotelling T² control chart based on Minimum Vector Variance (MVV)
The performance of the traditional Hotelling T² control chart using classical estimators in Phase I suffers from masking and swamping effects. To alleviate the problem, robust location and scale estimators are recommended. This paper proposes a robust Hotelling T² control chart for individual observations based on minimum vector variance (MVV) estimators as an alternative to the traditional multivariate T² control chart for Phase II data. MVV is a new robust estimator that possesses the good properties of the minimum covariance determinant (MCD) with better computational efficiency. Through a simulation study, we evaluate the performance of the proposed chart in terms of probability of detection and false alarm rate, and compare it with the traditional chart and the chart based on MCD estimators. The results show that the MVV control chart performs competitively relative to the MCD and traditional control charts, even under certain location parameter shifts in the Phase I data.
Improving standards in brain-behavior correlation analyses
Associations between two variables, for instance between brain and behavioral measurements, are often studied using correlations, in particular the Pearson correlation. However, the Pearson correlation is not robust: outliers can introduce spurious correlations or mask existing ones. These problems are exacerbated in brain imaging by a widespread lack of control for multiple comparisons and by several issues with data interpretation. We illustrate these important problems associated with brain-behavior correlations, drawing examples from published articles, and make several recommendations to alleviate them.
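The fragility of Pearson correlation described above is easy to demonstrate with a single extreme data point; the rank-based Spearman correlation is shown here only as one simple robust alternative among those the literature discusses.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(5)
n = 30
x = rng.normal(size=n)
y = rng.normal(size=n)   # no true association between the two measures
x[0], y[0] = 10.0, 10.0  # a single extreme observation

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)
print(round(r_pearson, 2), round(r_spearman, 2))
```

One outlying point manufactures a strong Pearson correlation out of independent data, while the rank-based coefficient barely moves.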
Robustness and Outliers
Unexpected deviations from assumed models, as well as the presence of certain amounts of outlying data, are common in most practical statistical applications. This can lead to undesirable solutions when non-robust statistical techniques are applied, and cluster analysis is no exception. The search for homogeneous groups with large heterogeneity between them can be spoiled by the lack of robustness of standard clustering methods. For instance, the presence of even a few outlying observations may result in heterogeneous clusters being artificially joined together, or in the detection of spurious clusters made up solely of outlying observations. In this chapter we analyze the effects of different kinds of outlying data in cluster analysis and explore several alternative methodologies designed to avoid or minimize their undesirable effects. Funding: Ministerio de Economía, Industria y Competitividad (MTM2014-56235-C2-1-P); Junta de Castilla y León (research project support programme, Ref. VA212U13).
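Both failure modes named in this chapter can be reproduced in a few lines. A hedge: this is not the chapter's methodology (robust clustering approaches such as trimming-based methods); the crude median/MAD pre-screen below is only an illustrative repair under an assumed far-outlier scenario.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
c1 = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2))
c2 = rng.normal(loc=(4.0, 0.0), scale=0.5, size=(100, 2))
out = rng.normal(loc=(50.0, 50.0), scale=0.5, size=(10, 2))
X = np.vstack([c1, c2, out])

# plain k-means: the outliers capture one of the two centers, so the two
# genuine clusters are artificially joined and a spurious cluster of
# outliers appears
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# crude robust pre-screen: coordinate-wise median/MAD z-scores
med = np.median(X, axis=0)
mad = 1.4826 * np.median(np.abs(X - med), axis=0)
keep = (np.abs((X - med) / mad) < 5).all(axis=1)

km_clean = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[keep])
print(np.round(km.cluster_centers_, 1))
print(np.round(km_clean.cluster_centers_, 1))
```

After screening out the ten far points, the refit recovers centers near (0, 0) and (4, 0) instead of wasting one on the outlier group.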
Multivariate outlier detection based on a robust Mahalanobis distance with shrinkage estimators
A collection of methods for multivariate outlier detection based on a robust Mahalanobis distance is proposed. The procedure consists of different combinations of robust estimates of location and the covariance matrix based on shrinkage. The performance of our proposal is illustrated, through comparison with other techniques from the literature, in a simulation study. The high correct classification rates and low false classification rates obtained in the vast majority of cases, as well as the good computational times, show the merits of our proposal. The performance is also illustrated with a real data set example, and some conclusions are drawn. This research was partially supported by Spanish Ministry grant ECO2015-66593-P.