    Efficient and Highly Robust Hotelling T² Control Charts Using Reweighted Minimum Vector Variance

    The Hotelling T² control chart is an effective tool in statistical process control for multivariate environments. However, the performance of the traditional Hotelling T² control chart based on the classical location and scatter estimators is usually marred by masking and swamping effects. To alleviate this problem, robust estimators are recommended. The most popular and widely used robust estimator in the Hotelling T² control chart is the minimum covariance determinant (MCD). Recently, a new robust estimator known as minimum vector variance (MVV) was introduced. This estimator possesses a high breakdown point and affine equivariance, and is superior in terms of computational efficiency. Owing to these properties, this study proposed replacing the classical estimators with the MVV location and scatter estimators in the construction of the Hotelling T² control chart for individual observations in Phase II analysis. Nevertheless, some drawbacks were discovered: inconsistency under the normal distribution, bias for small sample sizes, and low efficiency at a high breakdown point. To improve the MVV estimators in terms of consistency and unbiasedness, the MVV scatter estimator was multiplied by consistency and correction factors, respectively. To maintain the high breakdown point while achieving high statistical efficiency, a reweighted version of the MVV estimator (RMVV) was proposed. The RMVV estimators were then applied in the construction of the Hotelling T² control chart. The new robust Hotelling T² chart detected outliers effectively while simultaneously controlling the false alarm rate. In addition to the analysis of simulated data, analysis of real data also found that the new robust Hotelling T² chart detected out-of-control observations better than the other charts investigated in this study. Based on its good performance on both simulated and real data, the new robust Hotelling T² chart is a good alternative to existing Hotelling T² charts.
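
    Neither MVV nor RMVV is available in standard software, so the sketch below uses the MCD estimate from the robustbase package as a stand-in robust location and scatter, purely to show how a Phase II T² statistic for individual observations is formed and compared against a control limit; the chi-square limit is an assumed, commonly used choice, not necessarily the one used in the study.

        # Robust Hotelling T^2 chart for individual observations (illustrative sketch).
        # MVV/RMVV are not in standard packages; covMcd() is used here as a stand-in
        # robust location/scatter estimator.
        library(robustbase)

        set.seed(1)
        phase1 <- matrix(rnorm(100 * 3), ncol = 3)                 # in-control reference data
        phase2 <- rbind(matrix(rnorm(20 * 3), ncol = 3),
                        matrix(rnorm(5 * 3, mean = 3), ncol = 3))  # last 5 rows shifted

        est <- covMcd(phase1)                                      # robust center and scatter
        t2  <- mahalanobis(phase2, center = est$center, cov = est$cov)  # T^2 statistics
        ucl <- qchisq(0.99, df = ncol(phase1))                     # assumed chi-square control limit

        which(t2 > ucl)                                            # flagged out-of-control points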

    An Object-Oriented Framework for Robust Multivariate Analysis

    Taking advantage of the S4 class system of the programming environment R, which facilitates the creation and maintenance of reusable and modular components, an object-oriented framework for robust multivariate analysis was developed. The framework resides in the packages robustbase and rrcov and includes an almost complete set of algorithms for computing robust multivariate location and scatter, various robust methods for principal component analysis, and robust linear and quadratic discriminant analysis. The design of these methods follows common patterns, which we call statistical design patterns in analogy to the design patterns widely used in software engineering. The application of the framework to data analysis, as well as its possible extension through the development of new methods, is demonstrated on examples which are themselves part of the package rrcov.
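
    A minimal usage sketch of the framework is given below, assuming the rrcov functions CovMcd, PcaHubert, getCenter and getCov and the hbk data from robustbase; consult the package documentation for the definitive interface.

        # Minimal sketch of the rrcov workflow: robust scatter, then robust PCA,
        # using the hbk data shipped with robustbase (assumed available).
        library(rrcov)

        data(hbk, package = "robustbase")
        x <- hbk[, 1:3]

        cv <- CovMcd(x)            # robust location/scatter (MCD)
        getCenter(cv)              # accessor methods defined by the S4 framework
        getCov(cv)

        pc <- PcaHubert(x, k = 2)  # robust principal component analysis
        summary(pc)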

    Statistical methods for the detection of outlying observations in high-dimensional data

    Unsupervised outlier detection is a crucial issue in statistics. More specifically, in the industrial context of fault detection, this task is of great importance for ensuring high-quality production. With the exponential increase in the number of measurements on electronic components, the concern of high-dimensional data arises in the identification of outlying observations. The ippon innovation company, an expert in industrial statistics and anomaly detection, wanted to address this new situation and therefore collaborated with the TSE-R research laboratory by financing this thesis work. The first chapter presents the quality control context and the procedures mainly used in the automotive semiconductor industry. However, these practices do not meet the new expectations for dealing with high-dimensional data, so other solutions need to be considered. The remainder of the chapter summarizes unsupervised multivariate methods for outlier detection, with a particular emphasis on those dealing with high-dimensional data. Chapter 2 demonstrates theoretically that the well-known Mahalanobis distance has difficulty detecting outlying observations that lie in a low-dimensional subspace when the number of variables is large. In this context, the Invariant Coordinate Selection (ICS) method is introduced as an interesting alternative for highlighting the structure of the outlying data. A methodology for selecting only the relevant components is proposed; a simulation study provides a comparison with benchmark methods, and the performance of the proposal is also evaluated on real industrial data sets. This new procedure has been implemented in an R package, ICSOutlier, presented in Chapter 3, and in an R Shiny application (package ICSShiny) that makes it more user-friendly. When the number of dimensions increases, multivariate scatter matrices become singular as soon as some variables are collinear or their number exceeds the number of observations. However, in the presentation of ICS by Tyler et al. (2009), the scatter estimators are defined as positive definite matrices. Chapter 4 proposes three different ways of adapting the ICS method to singular scatter matrices and theoretically investigates their properties; the question of affine invariance is analyzed in particular. Finally, the last chapter is dedicated to the algorithm developed for the company. Although the algorithm is confidential, the chapter presents the main ideas and the challenges, mostly numerical, encountered during its development.
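
    The core ICS step can be written down in a few lines. The sketch below recomputes it from its definition, using the ordinary covariance and a fourth-moment scatter matrix as the (assumed) scatter pair, rather than calling the ICSOutlier package; the data, the component choice, and all names are illustrative only.

        # Bare-bones ICS for outlier detection: compare the regular covariance (S1)
        # with a fourth-moment scatter matrix (S2), project onto the eigenvectors
        # of solve(S1) %*% S2, and inspect the extreme components.
        set.seed(42)
        n <- 200; p <- 10
        x <- matrix(rnorm(n * p), n, p)
        x[1:5, 1] <- x[1:5, 1] + 6           # a few outliers hidden in one direction

        m  <- colMeans(x)
        S1 <- cov(x)
        d2 <- mahalanobis(x, m, S1)
        xc <- sweep(x, 2, m)
        S2 <- crossprod(xc * sqrt(d2)) / (n * (p + 2))   # Cov4-type scatter

        E <- eigen(solve(S1) %*% S2)
        B <- Re(E$vectors)                   # eigenvalues are real in theory; Re() drops
                                             # any numerical imaginary residue
        z <- xc %*% B                        # invariant coordinates
        head(order(z[, 1]^2, decreasing = TRUE))  # extreme scores on the leading
                                                  # component point to the planted outliers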

    Robust methods based on shrinkage

    In this thesis, robust methods based on the notion of shrinkage are proposed for outlier detection and robust regression. A collection of robust Mahalanobis distances is proposed for multivariate outlier detection. The robust intensity and scaling factors, needed to define the shrinkage of the robust estimators used in the distances, are optimally estimated. Some properties are investigated, such as affine equivariance and the breakdown value. The performance of the proposal is illustrated through comparison with other robust techniques from the literature, in a simulation study and with a real example of breast cancer data. The robust alternatives are also reviewed, highlighting their advantages and disadvantages. The behavior when the underlying distribution is heavy-tailed or skewed shows the appropriateness of the proposed method when we deviate from the common assumption of normality. The resulting high true positive rates and low false positive rates in the vast majority of cases, as well as the significantly smaller computational time, show the advantages of the proposal. On the other hand, a robust estimator, also based on the notion of shrinkage, is proposed for the parameters that characterize the linear regression problem. A thorough simulation study is conducted to investigate the efficiency with normal and heavy-tailed errors, the robustness under contamination, the computational times, and the affine equivariance and breakdown value of the regression estimator. It is compared with the classical Ordinary Least Squares (OLS) approach and with robust alternatives from the literature, which are also briefly reviewed in the thesis. Two classical data sets often used in the literature and a real socio-economic data set about the Living Environment Deprivation (LED) of areas in Liverpool (UK) are studied. The results from the simulations and the real data examples show the advantages of the proposed robust regression estimator. With the LED data set it is also shown that the proposed robust regression method performs better than the machine learning techniques previously used for these data, with the advantage of interpretability. Furthermore, an adaptive threshold, which depends on the sample size and the dimension of the data, is introduced for the proposed robust Mahalanobis distance based on shrinkage estimators. The cut-off differs from the classical choice of the 0.975 chi-square quantile and provides a more accurate method for detecting multivariate outliers. A simulation study checks the performance improvement of the new cut-off against the classical one; the adjusted quantile shows improved performance even when the underlying distribution is heavy-tailed or skewed. The method is illustrated using the LED data set, and the results demonstrate the additional advantages of the adaptive threshold for the regression problem. The author acknowledges financial support from the Spanish Ministry of Economy and Competitiveness (ECO2015-66593-P) and the UC3M PIF pre-doctoral scholarship.
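
    The abstract does not spell out the optimal shrinkage estimates, so the sketch below only illustrates the general recipe: shrink the sample covariance toward a scaled identity target with a fixed (assumed) intensity, compute Mahalanobis distances from a simple robust center, and flag observations beyond the classical 0.975 chi-square quantile that the thesis's adaptive threshold is meant to replace.

        # Sketch of outlier flagging with a Mahalanobis distance built on a shrinkage
        # covariance estimate. The shrinkage target and intensity here (scaled identity,
        # fixed lambda) are simplifying assumptions, not the thesis's optimal estimates.
        set.seed(7)
        n <- 100; p <- 20
        x <- rbind(matrix(rnorm((n - 10) * p), ncol = p),
                   matrix(rnorm(10 * p, mean = 2), ncol = p))   # 10 contaminated rows

        m      <- apply(x, 2, median)                  # simple robust location (illustrative)
        S      <- cov(x)
        lambda <- 0.2                                  # assumed shrinkage intensity
        target <- diag(mean(diag(S)), p)               # scaled identity target
        S_shr  <- (1 - lambda) * S + lambda * target   # shrinkage covariance

        d2  <- mahalanobis(x, m, S_shr)
        cut <- qchisq(0.975, df = p)                   # classical cut-off; the thesis
        which(d2 > cut)                                # proposes an adaptive alternative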

    An alternative Hotelling T² control chart based on Minimum Vector Variance (MVV)

    The performance of the traditional Hotelling T² control chart using classical estimators in Phase I suffers from masking and swamping effects. To alleviate the problem, robust location and scale estimators are recommended. This paper proposes a robust Hotelling T² control chart for individual observations based on minimum vector variance (MVV) estimators as an alternative to the traditional multivariate T² control chart for Phase II data. MVV is a new robust estimator that possesses the good properties of the minimum covariance determinant (MCD) estimator with better computational efficiency. Through a simulation study, we evaluate the performance of the proposed chart in terms of probability of detection and false alarm rate, and compare it with the traditional chart and the chart based on the MCD estimators. The results show that the MVV control chart has competitive performance relative to the MCD and traditional control charts, even under certain location parameter shifts in the Phase I data.
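
    MVV replaces the MCD's determinant criterion with the vector variance, tr(S²), of the subset covariance, and the search proceeds through concentration steps similar to FAST-MCD. The following is a simplified, single-start sketch of that idea, not the published algorithm.

        # Simplified, single-start concentration search for an MVV-style subset:
        # iterate C-steps, keeping the h points with smallest Mahalanobis distances,
        # then report the vector variance tr(S %*% S) of the final subset covariance.
        set.seed(3)
        n <- 60; p <- 3; h <- 45
        x <- rbind(matrix(rnorm((n - 6) * p), ncol = p),
                   matrix(rnorm(6 * p, mean = 4), ncol = p))   # 6 planted outliers

        subset <- sample(n, h)                 # naive initial subset (one start only)
        for (iter in 1:20) {
          m <- colMeans(x[subset, ])
          S <- cov(x[subset, ])
          new_subset <- order(mahalanobis(x, m, S))[1:h]   # concentration step
          if (setequal(new_subset, subset)) break
          subset <- new_subset
        }
        vv <- sum(cov(x[subset, ])^2)          # vector variance (Frobenius norm squared)
        setdiff(1:n, subset)                   # points excluded from the clean subset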

    Improving standards in brain-behavior correlation analyses

    Associations between two variables, for instance between brain and behavioral measurements, are often studied using correlations, and in particular the Pearson correlation. However, the Pearson correlation is not robust: outliers can introduce false correlations or mask existing ones. These problems are exacerbated in brain imaging by a widespread lack of control for multiple comparisons and by several issues with data interpretation. We illustrate these important problems associated with brain-behavior correlations, drawing examples from published articles, and we make several propositions to alleviate these problems.
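
    The article advocates robust correlation techniques; as a minimal illustration of the underlying problem, the sketch below shows how a single extreme observation inflates the Pearson correlation between two otherwise unrelated simulated variables, while a rank-based Spearman correlation, used here only as a simple stand-in for the robust methods discussed, is far less affected.

        # One extreme observation can create a spurious Pearson correlation; a
        # rank-based (Spearman) correlation is one simple robust check. Illustrative only.
        set.seed(11)
        n <- 30
        brain    <- rnorm(n)          # hypothetical brain measurement
        behavior <- rnorm(n)          # hypothetical behavioral score, unrelated

        cor(brain, behavior)                            # near zero
        cor(brain, behavior, method = "spearman")       # near zero

        brain[n]    <- 6                                # add a single bivariate outlier
        behavior[n] <- 6
        cor(brain, behavior)                            # inflated by the outlier
        cor(brain, behavior, method = "spearman")       # much less affected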

    Robustness and Outliers

    Unexpected deviations from assumed models, as well as the presence of certain amounts of outlying data, are common in most practical statistical applications. This fact can lead to undesirable solutions when non-robust statistical techniques are applied. This is often the case in cluster analysis, too. The search for homogeneous groups with large heterogeneity between them can be spoiled by the lack of robustness of standard clustering methods. For instance, the presence of even a few outlying observations may result in heterogeneous clusters being artificially joined together or in the detection of spurious clusters merely made up of outlying observations. In this chapter we will analyze the effects of different kinds of outlying data in cluster analysis and explore several alternative methodologies designed to avoid or minimize their undesirable effects. Funding: Ministerio de Economía, Industria y Competitividad (MTM2014-56235-C2-1-P); Junta de Castilla y León (research project support programme, Ref. VA212U13).
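
    One family of alternatives analyzed in this literature is trimmed clustering, in which a fixed fraction of the most outlying observations is discarded while the clusters are estimated (implemented, for instance, in the R package tclust). The sketch below is a bare-bones, single-start trimmed k-means written from scratch for illustration; it is not the tclust algorithm.

        # Bare-bones trimmed k-means: at each step the fraction alpha of points
        # farthest from their nearest center is discarded before updating the centers.
        # Single random start, fixed number of iterations; illustrative only.
        set.seed(5)
        x <- rbind(matrix(rnorm(80, mean = 0), ncol = 2),
                   matrix(rnorm(80, mean = 5), ncol = 2),
                   matrix(runif(10, min = -10, max = 15), ncol = 2))  # background noise/outliers
        k <- 2; alpha <- 0.1
        n <- nrow(x); keep_n <- floor((1 - alpha) * n)

        centers <- x[sample(n, k), , drop = FALSE]
        for (iter in 1:25) {
          d2 <- sapply(1:k, function(j) rowSums(sweep(x, 2, centers[j, ])^2))
          nearest <- max.col(-d2)                      # index of the closest center
          dmin    <- d2[cbind(1:n, nearest)]
          kept    <- order(dmin)[1:keep_n]             # trim the alpha fraction
          for (j in 1:k) {
            pts <- x[kept[nearest[kept] == j], , drop = FALSE]
            if (nrow(pts) > 0) centers[j, ] <- colMeans(pts)
          }
        }
        trimmed <- setdiff(1:n, kept)                  # observations flagged as outliers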

    Multivariate outlier detection based on a robust Mahalanobis distance with shrinkage estimators

    A collection of methods for multivariate outlier detection based on a robust Mahalanobis distance is proposed. The procedure consists of different combinations of robust estimates of location and covariance matrix based on shrinkage. The performance of our proposal is illustrated, through comparison with other techniques from the literature, in a simulation study. The resulting high correct classification rates and low false classification rates in the vast majority of cases, as well as the good computational times, show the strength of our proposal. The performance is also illustrated with a real data set example and some conclusions are drawn. This research was partially supported by Spanish Ministry grant ECO2015-66593-P.
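
    The comparison criteria mentioned above (correct and false classification rates) are straightforward to compute once an outlier-flagging rule is fixed. The sketch below shows the evaluation loop on one simulated contaminated sample, using the classical sample-covariance Mahalanobis distance with the 0.975 chi-square cut-off purely as a baseline rule; the proposed shrinkage-based distances would be plugged in the same way.

        # Sketch of the evaluation criterion used in such comparison studies: simulate
        # contaminated data, flag outliers with a distance rule, and report the correct
        # (true positive) and false (false positive) classification rates.
        set.seed(9)
        n <- 200; p <- 5; n_out <- 20
        x <- rbind(matrix(rnorm((n - n_out) * p), ncol = p),
                   matrix(rnorm(n_out * p, mean = 3), ncol = p))
        truth <- c(rep(FALSE, n - n_out), rep(TRUE, n_out))

        flag <- mahalanobis(x, colMeans(x), cov(x)) > qchisq(0.975, p)  # baseline rule

        tpr <- mean(flag[truth])        # correct classification rate on true outliers
        fpr <- mean(flag[!truth])       # false classification rate on clean observations
        c(TPR = tpr, FPR = fpr)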