10 research outputs found

    Robust mixture regression using mean-shift penalisation

    Get PDF
    Mini Dissertation (MSc (Advanced Data Analytics))--University of Pretoria, 2021. The purpose of finite mixture regression (FMR) is to model the relationship between a response and feature variables in the presence of latent groups in the population. The different regression structures are quantified by the unique parameters of each latent group. The Gaussian mixture regression model is commonly used in FMR since it simplifies the estimation and interpretation of the model output. However, it is highly affected by outliers in the data, and failing to account for them may distort the results and lead to inappropriate conclusions. To address this, we consider a mean-shift robust mixture regression approach. This method uses a component-specific mean-shift parameterisation which contributes both to the successful identification of outliers and to robust parameter estimation. The technique is demonstrated by a simulation study and a real-world application. The mean-shift regression method proves to be highly robust against outliers. Statistics. MSc (Advanced Data Analytics). Unrestricted.
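    The core of the approach can be illustrated in a few lines. The sketch below is a single-component simplification, not the dissertation's component-specific formulation: one shift parameter per observation is added to the regression, and an assumed lasso-type penalty on the shifts, handled by soft-thresholding, drives most of them to zero, so that nonzero shifts flag outliers while the coefficient estimates remain robust.

```python
import numpy as np

def mean_shift_regression(X, y, lam=2.0, n_iter=50):
    """Single-component mean-shift sketch: model y = X*beta + gamma + noise
    with one shift gamma_i per observation; a soft-threshold step (from an
    assumed lasso-type penalty on gamma) zeroes most shifts, and the
    nonzero ones flag outliers."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])          # add intercept column
    gamma = np.zeros(n)
    for _ in range(n_iter):
        beta, *_ = np.linalg.lstsq(Xd, y - gamma, rcond=None)   # fit on shifted response
        r = y - Xd @ beta
        gamma = np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)   # soft-threshold residuals
    return beta, gamma

# toy data: straight line with two gross outliers
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)
y[:2] += 10.0
beta, gamma = mean_shift_regression(X, y)
```

    With the two contaminated cases shifted by +10, the recovered intercept and slope stay near (1, 2) and exactly those cases receive nonzero shifts.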

    Advances in robust clustering methods with applications

    Get PDF
    Robust methods in statistics are mainly concerned with deviations from model assumptions. As already pointed out in Huber (1981) and in Huber & Ronchetti (2009), "these assumptions are not exactly true since they are just a mathematically convenient rationalization of an often fuzzy knowledge or belief". For that reason "a minor error in the mathematical model should cause only a small error in the final conclusions". Nevertheless, it is well known that many classical statistical procedures are "excessively sensitive to seemingly minor deviations from the assumptions". All statistical methods based on the minimization of the average square loss may suffer from a lack of robustness. Illustrative examples of how outliers' influence may completely alter the final results in the regression analysis and linear model context are provided in Atkinson & Riani (2012). A presentation of robust counterparts of classical multivariate tools is provided in Farcomeni & Greco (2015). The whole dissertation is focused on robust clustering models and the outline of the thesis is as follows. Chapter 1 is focused on robust methods. Robust methods are aimed at increasing the efficiency when contamination appears in the sample, so a general definition of this (quite general) concept is required. To do so, we give a brief account of some kinds of contamination encountered in real data applications. Secondly, we introduce the "spurious outliers model" (Gallegos & Ritter 2009a), which is the cornerstone of robust model-based clustering. Such a model is aimed at formalizing clustering problems when one has to deal with contaminated samples. The assumption standing behind the "spurious outliers model" is that two different random mechanisms generate the data: one is assumed to generate the "clean" part, while the other generates the contamination. This idea is actually very common within robust models like the "Tukey-Huber model", which is introduced in Subsection 1.2.2. 
    Outlier recognition, especially in the multivariate case, plays a key role and is not straightforward as the dimensionality of the data increases. An overview of the most widely used (robust) methods for outlier detection is provided in Section 1.3. Finally, in Section 1.4, we provide a non-technical review of the classical tools introduced in the robust statistics literature for evaluating the robustness properties of a methodology. Chapter 2 is focused on model-based clustering methods and their robustness properties. Cluster analysis, "the art of finding groups in the data" (Kaufman & Rousseeuw 1990), is one of the most widely used tools within the unsupervised learning context. A very popular method is the k-means algorithm (MacQueen et al. 1967), which is based on minimizing the Euclidean distance of each observation from the estimated cluster centroids and is therefore affected by a lack of robustness. Indeed, even a single outlying observation may completely alter the centroid estimates and simultaneously bias the standard error estimates. Cluster contours may be inflated and the "real" underlying clusterwise structure may be completely hidden. A first attempt at robustifying the k-means algorithm appeared in Cuesta-Albertos et al. (1997), where a trimming step is inserted in the algorithm in order to avoid the exceeding influence of outliers. It should be noted that the k-means algorithm is efficient for detecting spherical homoscedastic clusters; whenever more flexible shapes are desired, the procedure becomes inefficient. In order to overcome this problem, Gaussian model-based clustering methods should be adopted instead of the k-means algorithm. An example, among the other proposals described in Chapter 2, is the TCLUST methodology (García-Escudero et al. 2008), which is the cornerstone of the thesis. 
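    The trimming idea of Cuesta-Albertos et al. (1997) can be sketched briefly: at every k-means iteration, set aside a fixed proportion of the points farthest from their nearest centroid before re-estimating the centroids. The sketch below is a minimal illustration only — random restarts, concentration steps and the TCLUST scatter constraints are all omitted, and the deterministic initialisation from the first k points is purely for the example.

```python
import numpy as np

def trimmed_kmeans(X, k=2, alpha=0.06, n_iter=30):
    """Trimmed k-means sketch: in every iteration the ceil(alpha*n) points
    farthest from their nearest centroid are discarded before the
    centroids are re-estimated from the remaining points."""
    n = len(X)
    centers = X[:k].copy()               # naive deterministic start (illustration only)
    n_trim = int(np.ceil(alpha * n))
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        lab = d.argmin(axis=1)                           # nearest centroid
        keep = np.argsort(d.min(axis=1))[: n - n_trim]   # trim worst-fitting points
        for j in range(k):
            members = X[keep][lab[keep] == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, lab, np.setdiff1d(np.arange(n), keep)

# two clean clusters plus five gross outliers
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.5, size=(50, 2)),
               rng.normal([6, 6], 0.5, size=(50, 2)),
               np.tile([20.0, -20.0], (5, 1))])
centers, lab, trimmed = trimmed_kmeans(X, k=2, alpha=0.06)
```

    On this toy data the five gross outliers end up in the trimmed set rather than dragging a centroid toward them.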
    Such methodology is based on two main characteristics: trimming a fixed proportion of observations and imposing a constraint on the estimates of the scatter matrices. As will be explained in Chapter 2, trimming is used to protect the results from the influence of outliers, while the constraint is needed because spurious maximizers may completely spoil the solution. Chapters 3 and 4 are mainly focused on extending the TCLUST methodology. In particular, in Chapter 3, we introduce a new contribution (compare Dotto et al. 2015 and Dotto et al. 2016b), based on the TCLUST approach, called reweighted TCLUST, or RTCLUST for brevity. The idea standing behind this method is to reweight the observations initially flagged as outlying. This is helpful both to gain efficiency in the parameter estimation process and to provide a reliable estimate of the true contamination level. Indeed, as TCLUST is based on trimming a fixed proportion of observations, a proper choice of the trimming level is required, and such a choice, especially in applications, can be cumbersome. As will be clarified later on, the RTCLUST methodology allows the user to overcome this problem: the user is only required to impose a high preventive trimming level. The procedure, by iterating through a sequence of decreasing trimming levels, reinserts the discarded observations at each step and provides more precise estimates of the parameters and a final estimate of the true contamination level. The theoretical properties of the methodology are studied in Section 3.6 and proved in Appendix A.1, while Section 3.7 contains a simulation study aimed at evaluating the properties of the methodology and its advantages with respect to some other robust (reweighted and single-step) procedures. Chapter 4 contains an extension of the TCLUST method to fuzzy linear clustering (Dotto et al. 2016a). Such a contribution can be viewed as the extension of Fritz et al. 
    (2013a) to linear clustering problems, or, equivalently, as the extension of García-Escudero, Gordaliza, Mayo-Íscar & San Martín (2010) to the fuzzy clustering framework. Fuzzy clustering is also useful to deal with contamination. Fuzziness is introduced to deal with overlap between clusters and the presence of bridge points, to be defined in Section 1.1. Indeed, bridge points may arise in case of overlap between clusters and may completely alter the estimated cluster parameters (i.e. the coefficients of a linear model in each cluster). By introducing fuzziness, such observations are suitably down-weighted and the clusterwise structure can be correctly detected. On the other hand, robustness against gross outliers, as in the TCLUST methodology, is guaranteed by trimming a fixed proportion of observations. Additionally, a simulation study aimed at comparing the proposed methodology with other proposals (both robust and non-robust) is provided in Section 4.4. Chapter 5 is entirely dedicated to real-data applications of the proposed contributions. In particular, the RTCLUST method is applied to two different datasets: the "Swiss Bank Note" dataset, a well-known benchmark dataset for clustering models, and a dataset collected by the Gallup Organization, which is, to our knowledge, an original dataset on which no other existing proposals have been applied yet. Section 5.3 contains an application of our fuzzy linear clustering proposal to allometry data. In our opinion, this dataset, already considered in the robust linear clustering proposal of García-Escudero, Gordaliza, Mayo-Íscar & San Martín (2010), is particularly useful to show the advantages of our proposed methodology. Indeed, allometric quantities are often linked by a linear relationship but, at the same time, there may be overlap between different groups, and outliers may often appear due to errors in data registration. 
    Finally, Chapter 6 contains the concluding remarks and further directions of research. In particular, we wish to mention an ongoing work (Dotto & Farcomeni, in preparation) in which we consider the possibility of implementing robust parsimonious Gaussian clustering models. Within the chapter, the algorithm is briefly described and some illustrative examples are provided. The potential advantages of such proposals are the following. First of all, by considering the parsimonious models introduced in Celeux & Govaert (1995), the user is able to impose the shape of the detected clusters, which often plays a key role in applications. Secondly, by constraining the shape of the detected clusters, the constraint on the eigenvalue ratio can be avoided. This leads to the removal of a tuning parameter and, at the same time, allows the user to obtain affine equivariant estimators. Finally, since trimming a fixed proportion of observations is still allowed, the procedure is also formally robust.

    Approaches for Outlier Detection in Sparse High-Dimensional Regression Models

    Get PDF
    Modern regression studies often encompass a very large number of potential predictors, possibly larger than the sample size, and sometimes growing with the sample size itself. This increases the chances that a substantial portion of the predictors is redundant, as well as the risk of data contamination. Tackling these problems is of utmost importance to facilitate scientific discoveries, since model estimates are highly sensitive both to the choice of predictors and to the presence of outliers. In this thesis, we contribute to this area considering the problem of robust model selection in a variety of settings, where outliers may arise both in the response and the predictors. Our proposals simplify model interpretation, guarantee predictive performance, and allow us to study and control the influence of outlying cases on the fit. First, we consider the co-occurrence of multiple mean-shift and variance-inflation outliers in low-dimensional linear models. We rely on robust estimation techniques to identify outliers of each type, exclude mean-shift outliers, and use restricted maximum likelihood estimation to down-weight and accommodate variance-inflation outliers into the model fit. Second, we extend our setting to high-dimensional linear models. We show that mean-shift and variance-inflation outliers can be modeled as additional fixed and random components, respectively, and evaluated independently. Specifically, we perform feature selection and mean-shift outlier detection through a robust class of nonconcave penalization methods, and variance-inflation outlier detection through the penalization of the restricted posterior mode. The resulting approach satisfies a robust oracle property for feature selection in the presence of data contamination – which allows the number of features to exponentially increase with the sample size – and detects truly outlying cases of each type with asymptotic probability one. 
    This provides an optimal trade-off between a high breakdown point and efficiency. Third, focusing on high-dimensional linear models affected by mean-shift outliers, we develop a general framework in which L0-constraints coupled with mixed-integer programming techniques are used to perform simultaneous feature selection and outlier detection with provably optimal guarantees. In particular, we provide necessary and sufficient conditions for a robustly strong oracle property, where again the number of features can increase exponentially with the sample size, and prove optimality for parameter estimation and the resulting breakdown point. Finally, we consider generalized linear models and rely on logistic slippage to perform outlier detection and removal in binary classification. Here we use L0-constraints and mixed-integer conic programming techniques to solve the underlying double combinatorial problem of feature selection and outlier detection, and the framework again allows us to pursue optimality guarantees. For all the proposed approaches, we also provide computationally lean heuristic algorithms, tuning procedures, and diagnostic tools which help to guide the analysis. We consider several real-world applications, including the study of the relationships between childhood obesity and the human microbiome, and of the main drivers of honey bee loss. All methods developed and data used, as well as the source code to replicate our analyses, are publicly available.
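    On a tiny problem, the flavour of the L0-constrained formulation can be imitated by brute force: enumerate every feature subset of a given size together with every candidate outlier subset, fit least squares on the kept cases, and return the minimiser. The sketch below uses exhaustive enumeration as a stand-in for the mixed-integer programs (which are what make the approach scale); it only illustrates the double combinatorial structure of joint feature selection and outlier detection.

```python
import numpy as np
from itertools import combinations

def best_subset_with_outliers(X, y, k_feat, k_out):
    """Exhaustively search all feature subsets of size k_feat and all
    outlier subsets of size k_out; fit OLS on the kept cases and return
    the (rss, features, outliers) triple with the smallest residual sum
    of squares. Feasible only for tiny problems."""
    n, p = X.shape
    best = (np.inf, None, None)
    for S in combinations(range(p), k_feat):
        for O in combinations(range(n), k_out):
            keep = [i for i in range(n) if i not in O]
            Xs = X[np.ix_(keep, S)]
            beta, *_ = np.linalg.lstsq(Xs, y[keep], rcond=None)
            r = y[keep] - Xs @ beta
            rss = float(r @ r)
            if rss < best[0]:
                best = (rss, S, O)
    return best

# toy data: true features 0 and 2, cases 0 and 1 shifted upward
rng = np.random.default_rng(2)
X = rng.normal(size=(16, 4))
y = X @ np.array([2.0, 0.0, -3.0, 0.0]) + 0.05 * rng.normal(size=16)
y[:2] += 10.0
rss, feats, outs = best_subset_with_outliers(X, y, k_feat=2, k_out=2)
```

    On this example the search recovers exactly the true feature pair and the two contaminated cases; the mixed-integer formulations reach the same optimum without enumerating all subsets.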

    Robust clusterwise linear regression through trimming

    No full text
    The presence of clusters in a data set is sometimes due to the existence of certain relations among the measured variables which vary depending on some hidden factors. In these cases, observations can be grouped in a natural way around linear and nonlinear structures, and the problem of robust clustering around linear affine subspaces has recently been tackled through the minimization of a trimmed sum of orthogonal residuals. This "orthogonal approach" implies that no privileged variable plays the role of response or output. However, there are problems where one variable is clearly meant to be explained in terms of the others, and the use of vertical residuals from classical linear regression seems more advisable. The so-called TCLUST methodology is extended to perform robust clusterwise linear regression, and a feasible algorithm for its practical implementation is proposed. The algorithm includes a "second trimming" step aimed at diminishing the effect of leverage points.
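    A minimal sketch of this vertical-residual approach: alternately assign each point to the line with the smallest squared vertical residual, trim a fixed proportion of the worst-fitting points, and refit each line by least squares on its untrimmed members. The second trimming against leverage points and the other safeguards of the actual algorithm are omitted here, and the starting lines are user-supplied.

```python
import numpy as np

def trimmed_clusterwise_regression(x, y, init_coefs, alpha=0.05, n_iter=20):
    """Alternate three steps: assign each point to the line with the
    smallest squared vertical residual, trim the alpha proportion of
    worst-fitting points, and refit each line by OLS on its untrimmed
    members. (The 'second trimming' against leverage points is omitted
    in this sketch.)"""
    coefs = np.asarray(init_coefs, dtype=float).copy()   # rows: (intercept, slope)
    n, k = len(x), len(coefs)
    X = np.column_stack([np.ones(n), x])
    n_keep = n - int(np.ceil(alpha * n))
    for _ in range(n_iter):
        R = (y[:, None] - X @ coefs.T) ** 2              # squared vertical residuals
        lab = R.argmin(axis=1)
        keep = np.argsort(R.min(axis=1))[:n_keep]        # first trimming
        for j in range(k):
            m = keep[lab[keep] == j]
            if len(m) > 1:
                coefs[j] = np.linalg.lstsq(X[m], y[m], rcond=None)[0]
    return coefs, lab, np.setdiff1d(np.arange(n), keep)

# two parallel lines y = 2x and y = 2x + 8, plus four gross outliers
rng = np.random.default_rng(4)
xs = np.concatenate([np.linspace(-3, 3, 40)] * 2)
ys = np.concatenate([2 * xs[:40], 2 * xs[40:] + 8]) + 0.1 * rng.normal(size=80)
xs = np.concatenate([xs, [-2.0, -1.0, 1.0, 2.0]])
ys = np.concatenate([ys, np.full(4, 30.0)])
coefs, lab, trimmed = trimmed_clusterwise_regression(xs, ys, [[0.0, 1.0], [8.0, 1.0]])
```

    The four gross outliers are trimmed rather than absorbed into either regression line.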

    A new method for the robust design of regression models under limited raw-data quality

    Get PDF

    Rainfall prediction in Australia : Clusterwise linear regression approach

    Get PDF
    Accurate rainfall prediction is a challenging task because of the complex physical processes involved. This complexity is compounded in Australia, where the climate can be highly variable. Accurate rainfall prediction is immensely beneficial for making informed policy, planning and management decisions, and can assist with the most sustainable operation of water resource systems. Short-term prediction of rainfall is provided by meteorological services; however, the intermediate- to long-term prediction of rainfall remains challenging and contains much uncertainty. Many prediction approaches have been proposed in the literature, including statistical and computational intelligence approaches. However, finding a method to model the complex physical process of rainfall, especially in Australia where the climate is highly variable, is still a major challenge. The aims of this study are to: (a) develop an optimization based clusterwise linear regression method, (b) develop new prediction methods based on clusterwise linear regression, (c) assess the influence of geographic regions on the performance of prediction models in predicting monthly and weekly rainfall in Australia, (d) determine the combined influence of meteorological variables on rainfall prediction in Australia, and (e) carry out a comparative analysis of new and existing prediction techniques using Australian rainfall data. In this study, rainfall data with five input meteorological variables from 24 geographically diverse weather stations in Australia, over the period January 1970 to December 2014, have been taken from the Scientific Information for Land Owners (SILO) database. We also consider the climate zones when selecting weather stations, because Australia experiences a variety of climates due to its size. The data was divided into training and testing periods for evaluation purposes. 
    In this study, optimization based clusterwise linear regression is modified and new prediction methods are developed for rainfall prediction. The proposed method is applied to predict monthly and weekly rainfall. The prediction performance of the clusterwise linear regression method was evaluated by comparing observed and predicted rainfall values using the performance measures: root mean squared error, mean absolute error, mean absolute scaled error and the Nash-Sutcliffe coefficient of efficiency. The proposed method is also compared with clusterwise linear regression based on maximum likelihood estimation, linear support vector machines for regression, support vector machines for regression with a radial basis kernel function, multiple linear regression, artificial neural networks with and without a hidden layer, and k-nearest neighbours methods using computational results. Initially, to determine the appropriate input variables to be used in the investigation, we assessed all combinations of meteorological variables. The results confirm that single meteorological variables alone are unable to predict rainfall accurately. The prediction performance of all selected models was improved by adding the input variables in most locations. To assess the influence of geographic regions on the performance of prediction models and to compare the prediction performance of models, we trained models with the best combination of input variables and predicted monthly and weekly rainfall over the test periods. The results of this analysis confirm that the prediction performance of all selected models varied considerably with geographic regions for both weekly and monthly rainfall predictions. It is found that models have the lowest prediction error in the desert climate zone and the highest in the subtropical and tropical zones. 
    The results also demonstrate that the proposed algorithm is capable of finding the patterns and trends of the observations for monthly and weekly rainfall predictions in all geographic regions. In the desert, tropical and subtropical climate zones, the proposed method outperforms other methods in most locations for both monthly and weekly rainfall predictions. In the temperate and grassland zones, the prediction performance of the proposed model is better in some locations, while in the remaining locations it is slightly lower than that of the other models. Doctor of Philosophy.
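    For reference, the four evaluation measures named above can be written compactly. The sketch below uses their common textbook definitions; in particular, MASE is scaled here by the naive one-step forecast error of the observed series itself, whereas a careful implementation would take the scaling from the training period.

```python
import numpy as np

def prediction_scores(obs, pred):
    """RMSE, MAE, MASE and the Nash-Sutcliffe coefficient of efficiency,
    in their common textbook forms. Note: MASE is scaled by the naive
    one-step forecast error of `obs` itself; in practice the scaling
    should come from the training series."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    err = obs - pred
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    mase = mae / float(np.mean(np.abs(np.diff(obs))))             # naive-forecast scale
    nse = 1.0 - float(np.sum(err ** 2) / np.sum((obs - np.mean(obs)) ** 2))
    return {"rmse": rmse, "mae": mae, "mase": mase, "nse": nse}

obs = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
print(prediction_scores(obs, obs))   # perfect prediction: rmse = mae = mase = 0, nse = 1
```

    A perfect forecast scores zero error and a Nash-Sutcliffe efficiency of 1; a forecast that always predicts the observed mean scores an efficiency of exactly 0.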