
    Robust Clustering Method for the Detection of Outliers: Using AIC to Select the Number of Clusters

    In [14] we proposed a method to detect outliers in multivariate data based on clustering and robust estimators. To implement this method in practice it is necessary to choose a clustering method, a pair of location and scatter estimators, and the number of clusters, k. After several simulation experiments it was possible to give a number of guidelines regarding the first two choices. However, the choice of the number of clusters depends entirely on the structure of the particular data set under study. Our suggestion is to try several values of k (e.g. from 1 to a maximum reasonable k, which depends on the number of observations and on the number of variables) and select the k minimizing an adapted AIC. In this paper we analyze this AIC-based criterion for choosing the number of clusters k (and also the clustering method and the location and scatter estimators) by applying it to several simulated data sets with and without outliers.
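    The k-selection loop the abstract describes can be sketched in Python. This is a minimal illustration only: it assumes plain k-means and a simplified spherical-Gaussian AIC, n·d·log(RSS/(n·d)) + 2·k·d, whereas the paper's adapted AIC and its robust location/scatter estimators differ; all function names here are illustrative.

    ```python
    import numpy as np

    def kmeans(X, k, iters=50, seed=0):
        """Plain Lloyd-style k-means (stand-in for a robust clustering method)."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), k, replace=False)]
        for _ in range(iters):
            labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
            for j in range(k):
                if (labels == j).any():
                    centers[j] = X[labels == j].mean(axis=0)
        return labels, centers

    def aic_for_k(X, k):
        """Simplified spherical-Gaussian AIC; the paper's adapted AIC differs."""
        labels, centers = kmeans(X, k)
        n, d = X.shape
        rss = ((X - centers[labels]) ** 2).sum()
        return n * d * np.log(rss / (n * d) + 1e-12) + 2 * k * d

    # Try k = 1 .. 5 on synthetic two-cluster data and keep the AIC minimizer.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2))])
    best_k = min(range(1, 6), key=lambda k: aic_for_k(X, k))
    ```

    Note that with hard cluster assignments this naive AIC tends to favor larger k; the adapted AIC of the paper addresses the selection more carefully.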

    Improved robust estimator and clustering procedures for multivariate outliers detection

    Outlier detection for multivariate data has attracted considerable attention because of the difficulty that arises as the number of variables, p, increases. Unlike in the univariate case, visual inspection is insufficient to detect outliers in multivariate data. One class of methods for detecting outliers in multivariate data is distance-based, notably the Mahalanobis distance (MD). However, the sample mean and covariance matrix used in MD are prone to masking and swamping problems. Therefore, many studies replace the sample mean and covariance matrix with robust estimators, and such estimators continue to be developed. Although robust estimators can overcome this weakness of MD, they are still limited to detecting single-point outliers. Cluster-based methods have therefore been proposed and developed in previous studies to overcome this limitation. Hence, the main objective of this study is to propose a robust estimator and to develop an improved procedure for detecting outliers in multivariate data using robust clustering-based methods. Firstly, an improved robust estimator based on the equality of covariance matrices, less sensitive to the presence of outliers, is proposed and named Test on Covariance (TOC). TOC is developed by modifying the Concentration Step (C-Step) of the Fast Minimum Covariance Determinant (FMCD) algorithm: a test of the equality of covariance matrices is performed in this step, yielding TOC. Secondly, an improved single linkage robust clustering procedure is developed, robustified by using as its similarity measure the robust distance based on TOC, named RDT. Then, the performance of the proposed robust estimator and clustering procedure in detecting outliers in multivariate data is investigated using simulation studies and historical datasets.
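    The C-Step that TOC modifies can be illustrated with a plain FMCD-style concentration step. The Python sketch below is not the TOC procedure itself (which adds a covariance-equality test); it is a minimal baseline, and the 75% subset size and chi-square cutoff are illustrative assumptions.

    ```python
    import numpy as np

    def c_step(X, h, iters=20, seed=0):
        """Minimal FMCD-style concentration steps: repeatedly refit the mean and
        covariance on the h points with smallest squared Mahalanobis distance."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        subset = rng.choice(n, h, replace=False)
        for _ in range(iters):
            mu = X[subset].mean(axis=0)
            cov = np.cov(X[subset].T)
            inv = np.linalg.inv(cov)
            diff = X - mu
            d2 = np.einsum('ij,jk,ik->i', diff, inv, diff)  # squared MD of all points
            new_subset = np.argsort(d2)[:h]
            if set(new_subset) == set(subset):   # converged
                break
            subset = new_subset
        return mu, cov, d2

    # Synthetic data: 95 inliers plus 5 planted outliers shifted by 10 per coordinate.
    rng = np.random.default_rng(2)
    X = rng.normal(0, 1, (100, 3))
    X[:5] += 10
    mu, cov, d2 = c_step(X, h=75)
    cutoff = 16.27  # approx. 0.999 quantile of chi-square with 3 degrees of freedom
    flagged = np.where(d2 > cutoff)[0]
    ```

    A production implementation would add the raw-covariance consistency correction and many random restarts, as FMCD does.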
A data generation procedure is formulated in the simulation study to create synthetic data with three Outlier Scenarios using the R language. The three Outlier Scenarios used in this study are mean-shift (Outlier Scenario 1), variance-inflation (Outlier Scenario 2), and mean-shift with variance-inflation (Outlier Scenario 3). Three measurements are used to assess the effectiveness of the proposed robust estimator and clustering procedure: the probability that all the outliers are successfully detected (pout), the probability that outliers are falsely detected as inliers (pmask), and the probability that inliers are detected as outliers (pswamp). Five historical datasets are used: Stackloss, Brain and Weight, Bushfire, Hawkins-Bradu-Kass, and Milk. In this study, the performance of TOC in detecting outliers is compared with other existing robust estimators, namely Fast Minimum Covariance Determinant (FMCD), Minimum Vector Variance (MVV), Covariance Matrix Equality (CME), and Index Set Equality (ISE). Based on the simulation study, TOC shows good results in pswamp for all Outlier Scenarios, indicating that TOC has the lowest probability of misclassifying inliers as outliers among the robust estimators compared. TOC also performs similarly to the other robust estimators in most conditions. If the three measurements are considered simultaneously, TOC is the better estimator for sample sizes n = 30, 50, 100, 200, numbers of variables p = 3, 5, 10, and all percentages of outliers 1% ≤ ε ≤ 25%. TOC has also proven able to detect outliers without a masking effect, and performs similarly to the other robust estimators on the historical datasets. Meanwhile, the performance of the improved single linkage robust clustering procedure is compared with single linkage using the Euclidean distance (ED), Mahalanobis distance (MD), and TOC.
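The three performance measures can be written down directly. A small sketch, assuming each simulation replication yields a set of flagged indices and that the planted outlier indices are known (the function and variable names are illustrative; the thesis defines pout, pmask, and pswamp as probabilities across replications, so the per-replication values below would be averaged over runs):

```python
def outlier_metrics(flagged, true_outliers, n):
    """Per-replication versions of the three measures (averaged over runs
    in a full simulation study, an assumption of this sketch)."""
    flagged, truth = set(flagged), set(true_outliers)
    pout = truth <= flagged                     # every planted outlier detected?
    pmask = len(truth - flagged) / len(truth)   # fraction of outliers masked as inliers
    pswamp = len(flagged - truth) / (n - len(truth))  # fraction of inliers swamped
    return pout, pmask, pswamp

# Example: 3 planted outliers in n = 10 points, all caught, one inlier swamped.
pout, pmask, pswamp = outlier_metrics([0, 1, 2, 5], [0, 1, 2], 10)
```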
Based on the simulation study, RDT is the better similarity measure only in a few conditions for pout, pmask, and pswamp, and performs similarly to the other similarity measures in most conditions for all Outlier Scenarios. If pout, pmask, and pswamp are considered simultaneously for all Outlier Scenarios, RDT is the better similarity measure when n = 50, 100, p = 3, 5, and ε = 5%, 10%, 15%. Moreover, RDT is the better similarity measure when the historical dataset contains 19% outliers, p = 3, and n < 100. From the findings of the simulation study and the historical datasets, both TOC and RDT did not perform well for large sample sizes. It is also found that TOC outperforms RDT in detecting outliers in multivariate data. Therefore, this study concludes that TOC is a promising robust estimator and can be an alternative to other robust estimators for detecting outliers in multivariate data. RDT can also be used as an alternative similarity measure in clustering procedures, including other clustering methods. TOC can be further applied in other multivariate methods such as Principal Component Analysis, Factor Analysis, and Discriminant Analysis. Furthermore, the improved single linkage robust clustering procedure in this study can be incorporated with the Minimum Spanning Tree (MST).
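The connection to the Minimum Spanning Tree mentioned above can be made concrete: cutting the MST edges longer than a height threshold yields exactly the single-linkage clusters at that height, and singleton components are natural outlier candidates. A minimal Python sketch using a Euclidean distance matrix (a robust distance such as RDT would simply replace D; the names, data, and cut height are illustrative assumptions):

```python
import numpy as np

def single_linkage_clusters(D, cut):
    """Single linkage via a Prim-style MST: build the tree from the distance
    matrix D, drop edges longer than `cut`, and label connected components."""
    n = len(D)
    in_tree = np.zeros(n, bool)
    in_tree[0] = True
    best = D[0].copy()            # shortest known distance from each node to the tree
    parent = np.zeros(n, int)
    edges = []
    for _ in range(n - 1):        # grow the MST one node at a time
        j = np.argmin(np.where(in_tree, np.inf, best))
        edges.append((parent[j], j, best[j]))
        in_tree[j] = True
        closer = D[j] < best
        parent[closer] = j
        best = np.minimum(best, D[j])
    comp = np.arange(n)           # union-find over edges kept below the cut
    def find(i):
        while comp[i] != i:
            comp[i] = comp[comp[i]]
            i = comp[i]
        return i
    for a, b, w in edges:
        if w <= cut:
            comp[find(a)] = find(b)
    return np.array([find(i) for i in range(n)])

# 30 inliers around the origin plus two far-away points.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), np.array([[8.0, 8.0], [9.0, -7.0]])])
D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))  # swap in a robust distance here
labels = single_linkage_clusters(D, cut=2.0)
sizes = np.bincount(labels, minlength=len(X))
outliers = np.where(sizes[labels] == 1)[0]          # singleton components
```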