slides

Impact of contamination on empirical and theoretical error

Abstract

Classification analysis allows to group similar objects into a given number of groups by means of a classification rule. Many classification procedures are available : linear discrimination, logistic discrimination, etc. Focus in this poster will be on classification resulting from a clustering analysis. Indeed, among the outputs of classical clustering techniques, a classification rule is provided in order to classify the objects into one of the clusters. More precisely, let F denote the underlying distribution and assume that the generalized kmeans algorithm with penalty function is used to construct the k clusters C1(F), . . . ,Ck(F) with centers T1(F), . . . , Tk(F). When one feels that k true groups are existing among the data, classification might be the main objective of the statistical analysis. Performance of a particular classification technique can be measured by means of an error rate. Depending on the availability of data, two types of error rates may be computed: a theoretical one and a more empirical one. In the first case, the rule is estimated on a training sample with distribution F while the evaluation of the classification performance may be done through a test sample distributed according to a model distribution of interest, Fm say. In the second case, the same data are used to set up the rule and to evaluate the performance. Under contamination, one has to replace the distribution F of the training sample by a contaminated one, F(eps) say (where eps corresponds to the fraction of contamination). In that case, thetheoretical error rate will be corrupted since it relies on a contaminated rule but it may still consider a test sample distributed according to the model distribution. The empirical error rate will be affected twice: via the rule and also via the sample used for the evaluation of the classification performance. To measure the robustness of classification based on clustering, influence functions of the error rate may be computed. The idea has already been exploited by Croux et al (2008) and Croux et al (2008) in the context of linear and logistic discrimination. In the computation of influence functions, the contaminated distribution takes the form F(eps) = (1 − eps)*Fm + eps* Dx, where Dx is the Dirac distribution putting all its mass at x. It is interesting to note that the impact of the point mass x may be positive, i.e. may decrease the error rate, when the data at hand is used to evaluate the error

    Similar works

    Full text

    thumbnail-image