
    Advances in robust clustering methods with applications

    Robust methods in statistics are mainly concerned with deviations from model assumptions. As already pointed out in Huber (1981) and in Huber & Ronchetti (2009), "these assumptions are not exactly true since they are just a mathematically convenient rationalization of an often fuzzy knowledge or belief". For that reason "a minor error in the mathematical model should cause only a small error in the final conclusions". Nevertheless, it is well known that many classical statistical procedures are "excessively sensitive to seemingly minor deviations from the assumptions". All statistical methods based on minimizing the average squared loss may suffer from a lack of robustness. Illustrative examples of how outliers' influence may completely alter the final results in the regression analysis and linear model context are provided in Atkinson & Riani (2012). A presentation of robust counterparts of classical multivariate tools is provided in Farcomeni & Greco (2015).

The whole dissertation is focused on robust clustering models, and the outline of the thesis is as follows. Chapter 1 is focused on robust methods. Robust methods are aimed at increasing efficiency when contamination appears in the sample, so a general definition of this (quite general) concept is required. To this end we give a brief account of some kinds of contamination that can be encountered in real data applications. Secondly, we introduce the "Spurious outliers model" (Gallegos & Ritter 2009a), which is the cornerstone of robust model-based clustering. This model is aimed at formalizing clustering problems when one has to deal with contaminated samples. The assumption behind the "Spurious outliers model" is that two different random mechanisms generate the data: one is assumed to generate the "clean" part, while the other generates the contamination. This idea is actually very common within robust models such as the "Tukey-Huber model", which is introduced in Subsection 1.2.2. Outlier recognition, especially in the multivariate case, plays a key role and is not straightforward as the dimensionality of the data increases. An overview of the most widely used (robust) methods for outlier detection is provided in Section 1.3. Finally, in Section 1.4, we provide a non-technical review of the classical tools introduced in the robust statistics literature for evaluating the robustness properties of a methodology.

Chapter 2 is focused on model-based clustering methods and their robustness properties. Cluster analysis, "the art of finding groups in the data" (Kaufman & Rousseeuw 1990), is one of the most widely used tools in the unsupervised learning context. A very popular method is the k-means algorithm (MacQueen et al. 1967), which is based on minimizing the Euclidean distance of each observation from the estimated cluster centroids and is therefore affected by a lack of robustness: even a single outlying observation may completely alter the centroid estimates and simultaneously bias the estimated standard errors. Cluster contours may be inflated and the "real" underlying clusterwise structure may be completely hidden. A first attempt at robustifying the k-means algorithm appeared in Cuesta-Albertos et al. (1997), where a trimming step is inserted into the algorithm in order to avoid the outliers' excessive influence. It should be noted that the k-means algorithm is efficient for detecting spherical homoscedastic clusters; whenever more flexible shapes are desired, the procedure becomes inefficient.
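To fix ideas, here is a minimal sketch of the trimming idea used in trimmed k-means (in the spirit of Cuesta-Albertos et al. 1997): at each iteration, the fraction alpha of observations farthest from their nearest centroid is set aside before the centroids are updated, so that gross outliers cannot drag the centroids away. The function, its name, and the fixed iteration count are our own illustrative choices, not code from the thesis.

```python
import numpy as np

def trimmed_kmeans(X, k, alpha=0.1, n_iter=50, seed=0):
    """Illustrative trimmed k-means: discard the alpha fraction of points
    farthest from their nearest centroid before each centroid update."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_keep = int(np.ceil((1 - alpha) * n))                 # observations kept after trimming
    centroids = X[rng.choice(n, k, replace=False)].astype(float)
    for _ in range(n_iter):
        # squared Euclidean distances of every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        dist = d2[np.arange(n), nearest]
        keep = np.argsort(dist)[:n_keep]                   # trim the most remote points
        for j in range(k):
            members = keep[nearest[keep] == j]
            if members.size:                               # centroids updated on kept points only
                centroids[j] = X[members].mean(axis=0)
    labels = np.full(n, -1)                                # -1 marks trimmed observations
    labels[keep] = nearest[keep]
    return centroids, labels
```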
In order to overcome this lack of flexibility, Gaussian model-based clustering methods should be adopted instead of the k-means algorithm. An example, among the other proposals described in Chapter 2, is the TCLUST methodology (García-Escudero et al. 2008), which is the cornerstone of the thesis. This methodology is based on two main ingredients: trimming a fixed proportion of observations and imposing a constraint on the estimates of the scatter matrices. As will be explained in Chapter 2, trimming is used to protect the results from the outliers' influence, while the constraint is needed because spurious maximizers may completely spoil the solution.

Chapters 3 and 4 are mainly focused on extending the TCLUST methodology. In particular, in Chapter 3, we introduce a new contribution (compare Dotto et al. 2015 and Dotto et al. 2016b), based on the TCLUST approach, called reweighted TCLUST, or RTCLUST for the sake of brevity. The idea behind this method is to reweight the observations initially flagged as outlying. This helps both to gain efficiency in the parameter estimation process and to provide a reliable estimate of the true contamination level. Indeed, since TCLUST trims a fixed proportion of observations, a proper choice of the trimming level is required, and this choice, especially in applications, can be cumbersome. As will be clarified later on, the RTCLUST methodology allows the user to overcome this problem: the user is only required to impose a high preventive trimming level. The procedure then iterates through a sequence of decreasing trimming levels, reinserting discarded observations at each step, and provides a more precise estimate of the parameters together with a final estimate of the true contamination level (a rough sketch of this reinsertion scheme is given at the end of this summary). The theoretical properties of the methodology are studied in Section 3.6 and proved in Appendix A.1, while Section 3.7 contains a simulation study aimed at evaluating the properties of the methodology and its advantages with respect to some other robust (reweighted and single-step) procedures.

Chapter 4 contains an extension of the TCLUST method to fuzzy linear clustering (Dotto et al. 2016a). This contribution can be viewed as the extension of Fritz et al. (2013a) to linear clustering problems or, equivalently, as the extension of García-Escudero, Gordaliza, Mayo-Iscar & San Martín (2010) to the fuzzy clustering framework. Fuzzy clustering is also useful to deal with contamination. Fuzziness is introduced to handle overlap between clusters and the presence of bridge points, to be defined in Section 1.1. Indeed, bridge points may arise in case of overlap between clusters and may completely alter the estimated cluster parameters (i.e. the coefficients of the linear model in each cluster). By introducing fuzziness, such observations are suitably downweighted and the clusterwise structure can be correctly detected. On the other hand, robustness against gross outliers, as in the TCLUST methodology, is guaranteed by trimming a fixed proportion of observations. Additionally, a simulation study aimed at comparing the proposed methodology with other proposals (both robust and non-robust) is provided in Section 4.4.

Chapter 5 is entirely dedicated to real data applications of the proposed contributions. In particular, the RTCLUST method is applied to two different datasets.
The first is the "Swiss Bank Note" dataset, a well-known benchmark dataset for clustering models; the second is a dataset collected by the Gallup Organization which, to our knowledge, is an original dataset on which no other existing proposals have been applied yet. Section 5.3 contains an application of our fuzzy linear clustering proposal to allometry data. In our opinion this dataset, already considered in the robust linear clustering proposal of García-Escudero, Gordaliza, Mayo-Iscar & San Martín (2010), is particularly useful to show the advantages of the proposed methodology. Indeed, allometric quantities are often linked by a linear relationship but, at the same time, there may be overlap between different groups, and outliers may often appear due to errors in data registration.

Finally, Chapter 6 contains the concluding remarks and further directions of research. In particular, we wish to mention an ongoing work (Dotto & Farcomeni, in preparation) in which we consider the possibility of implementing robust parsimonious Gaussian clustering models. Within the chapter, the algorithm is briefly described and some illustrative examples are also provided. The potential advantages of such a proposal are the following. First of all, by considering the parsimonious models introduced in Celeux & Govaert (1995), the user is able to impose the shape of the detected clusters, which often plays a key role in applications. Secondly, by constraining the shape of the detected clusters, the constraint on the eigenvalue ratio can be avoided. This leads to the removal of a tuning parameter of the procedure and, at the same time, allows the user to obtain affine equivariant estimators. Finally, since trimming a fixed proportion of observations is still allowed, the procedure is also formally robust.
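As promised in the outline above, here is a rough sketch of the reinsertion idea behind the reweighting scheme of Chapter 3. It is only an illustration under our own simplifying assumptions (Gaussian components already fitted elsewhere, a Mahalanobis-distance rule, a fixed grid of trimming levels), not the actual RTCLUST algorithm: starting from a high preventive trimming level, the level is decreased step by step, previously discarded observations close enough to the fitted components are kept again, and the parameters are re-estimated; the share of points still excluded at the end estimates the contamination level.

```python
import numpy as np

def mahalanobis2(X, mean, cov):
    """Squared Mahalanobis distances of the rows of X from a fitted component."""
    diff = X - mean
    return np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)

def reweighted_trimming(X, means, covs, alphas=(0.3, 0.25, 0.2, 0.15, 0.1, 0.05)):
    """Illustrative reweighting scheme (not the exact RTCLUST algorithm):
    sweep a decreasing sequence of trimming levels, keep the observations
    closest to the fitted components, and re-estimate means/covariances."""
    n = X.shape[0]
    for alpha in alphas:                                    # decreasing trimming levels
        d2 = np.column_stack([mahalanobis2(X, m, S) for m, S in zip(means, covs)])
        closest = d2.argmin(axis=1)
        n_keep = int(np.ceil((1 - alpha) * n))
        keep = np.argsort(d2.min(axis=1))[:n_keep]          # reinsert all but the alpha most remote points
        for j in range(len(means)):
            members = keep[closest[keep] == j]
            if members.size > X.shape[1]:                   # refit component on its kept members
                means[j] = X[members].mean(axis=0)
                covs[j] = np.cov(X[members], rowvar=False)
    trimmed = np.setdiff1d(np.arange(n), keep)
    return means, covs, trimmed, trimmed.size / n           # last value: estimated contamination
```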

    A Business Intelligence Framework for Network-level Traffic Safety Analyses

    Currently, there are both methodological and practical barriers that together preclude substantial use of theoretically sound approaches, such as the ones recommended by the Highway Safety Manual (HSM), for traffic safety management. Although the state of the art provides theoretically sound approaches such as the Empirical Bayes method, various important capabilities are still missing. Methodological barriers include, among others, (i) the lack of a theoretically sound approach for corridor-level network screening, (ii) the lack of a comprehensive approach for the estimation of Safety Performance Functions (SPFs) based on a simultaneous consideration of both crash patterns and associated explanatory variables, and (iii) the lack of theoretically sound methods to forecast crash patterns at the regional level. In addition, the use of existing theoretically sound approaches such as the ones recommended by the HSM is associated with important practical barriers, including 1) significant data integration requirements, 2) the need for a special schema to enable analysis using specialized software, 3) time-consuming and labor-intensive processes, 4) substantial technical knowledge requirements, 5) limited visualization capabilities, and 6) the need for coordination across various data owners. Because of these barriers, most practitioners use theoretically unsound methodologies to perform traffic safety analyses for highway safety improvement programs.

This research proposes a single comprehensive framework to address all of the above barriers and enable the use of theoretically sound methodologies for network-wide traffic safety analyses. The proposed framework provides access, through a single Business Intelligence (BI) platform, to theoretically sound methods and associated algorithms, data management and integration tools, and visualization capabilities. That is, the proposed BI framework provides methods and mechanisms to integrate and process data, generate advanced and theoretically sound analytics, and visualize results through intuitive and interactive web-based dashboards and maps. The framework integrates data using an Extract-Load-Transform process and creates a traffic safety data warehouse. Algorithms are implemented to use the data warehouse for network screening analysis of roadway segments, intersections, ramps, and corridors.

The methodology proposed and implemented here for corridor-level network screening represents an important expansion of the existing methods recommended by the HSM. Corridor-level network screening is important for decision makers because it enables ranking corridors rather than individual sites, so as to provide homogeneous infrastructure and minimize changes within relatively short distances. Improvements are recommended for long sections of roadway that may include multiple sites with the potential for safety improvement. Existing corridor screening methodologies use observed crash frequency as a performance measure, which does not account for regression-to-the-mean bias. The proposed methodology instead uses expected crash frequency as a performance measure and searches corridors using a sliding-window mechanism, which addresses crash location reporting errors by considering the same section of roadway multiple times through overlapping windows.
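As a toy illustration of the corridor screening step just described (our own simplification, not the implemented framework): each site's expected crash frequency is obtained as an HSM-style Empirical Bayes combination of its SPF-predicted and observed frequencies, and overlapping windows slid along the corridor are ranked by the total expected frequency of the sites they contain. The window length, step size, field names, and weight formula below are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Site:
    milepost: float        # location along the corridor (miles)
    predicted: float       # SPF-predicted crash frequency (crashes/year)
    observed: float        # observed crash frequency (crashes/year)
    overdispersion: float  # overdispersion parameter of the SPF

def eb_expected(site: Site) -> float:
    """HSM-style Empirical Bayes estimate: a weighted combination of the
    SPF prediction and the observed frequency at the site."""
    w = 1.0 / (1.0 + site.overdispersion * site.predicted)
    return w * site.predicted + (1.0 - w) * site.observed

def screen_corridor(sites, window_len=1.0, step=0.1):
    """Rank overlapping windows along a corridor by the total EB-expected
    crash frequency of the sites they contain (sliding-window screening)."""
    sites = sorted(sites, key=lambda s: s.milepost)
    start, end = sites[0].milepost, sites[-1].milepost
    windows, pos = [], start
    while pos <= end:
        score = sum(eb_expected(s) for s in sites
                    if pos <= s.milepost < pos + window_len)
        windows.append((round(pos, 3), round(pos + window_len, 3), score))
        pos += step
    return sorted(windows, key=lambda w: w[2], reverse=True)

# e.g. three hypothetical sites along a roughly 2-mile corridor
corridor = [Site(0.2, 3.1, 5, 0.4), Site(0.9, 2.4, 2, 0.4), Site(1.6, 4.0, 9, 0.4)]
print(screen_corridor(corridor)[:3])   # top three overlapping windows
```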
The proposed BI framework also includes a comprehensive methodology for the estimation of SPFs that considers local crash patterns and site characteristics simultaneously. The current state of the art uses predefined crash site types to create single clusters of data from which regression models, the SPFs, are estimated to predict crash frequency. It is highly unlikely that all crash sites within a single predefined cluster/type have similar crash patterns and associated explanatory characteristics; that is, there may be sites within a cluster/type with different crash patterns and explanatory characteristics. Hence, assigning a single predefined SPF to all sites within a type is not necessarily the best approach to minimize the estimation error. To address this issue, a mathematical program was formulated to determine simultaneously the cluster memberships of crash sites and the corresponding SPFs. Cluster memberships are determined using both crash patterns and associated explanatory variables. A solution algorithm coupling simulated annealing with maximum log-likelihood estimation was implemented and tested. Results indicated that allowing multiple SPFs for a crash and/or facility type can increase the probability of observing the available data and thereby improve accuracy and reliability. The SPFs estimated with the proposed approach were implemented within the BI framework for network screening, and the results illustrate that the improved crash predictions provided by these SPFs translate into superior rankings for sites and corridors with the potential for safety improvement.

A performance-based safety program requires forecasting safety performance measures at the regional level and establishing targets to reduce fatalities and serious injuries. This is in contrast to the analysis required for traffic safety management, where forecasts are required at the site or corridor level. For regional-level forecasting, theoretically unsound methods such as extrapolation or simple moving-average models have historically been used. To address this issue, this study proposed deterministic and stochastic time series models to forecast performance measures for performance-based safety programs. Results indicated that a stochastic time series model, namely a seasonal autoregressive integrated moving average (SARIMA) model, provides the required statistically sound forecasts (a toy example is sketched at the end of this summary).

In summary, the fundamental contributions of this research include: (i) a theoretically sound methodology for corridor-level network screening, (ii) a comprehensive methodology for the estimation of local SPFs considering simultaneously crash patterns and associated explanatory variables, and (iii) a theoretically sound methodology to forecast performance measures in order to set realistic targets for performance-based safety programs. In addition, this study implemented and tested the above contributions, along with existing algorithms for traffic safety network screening, within a single BI platform. The result is a single web-based BI framework that enables integration and management of source data, generation of theoretically sound analyses, and visualization through intuitive dashboards, drill-down menus, and interactive maps.
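As a toy example of the regional-level forecasting step mentioned above, the snippet below fits a seasonal ARIMA model to a monthly series of a safety performance measure and produces two years of forecasts with intervals. The file name, the series, and the model order are placeholders, not the specification selected in the study.

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# monthly fatality counts indexed by date (placeholder data source)
y = pd.read_csv("monthly_fatalities.csv", index_col=0, parse_dates=True).squeeze("columns")

# seasonal ARIMA with yearly seasonality; the (1,1,1)x(0,1,1,12) order is illustrative only
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(0, 1, 1, 12))
fit = model.fit(disp=False)

# forecast the next two years of the performance measure, with confidence intervals
forecast = fit.get_forecast(steps=24)
print(forecast.predicted_mean)
print(forecast.conf_int())
```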

    Fuzzy nominal classification using bipolar analysis

    The process of assigning objects (candidates, projects, decisions, options, etc.) characterized by multiple attributes or criteria to predefined classes characterized by entrance conditions or constraints constitutes a subclass of multi-criteria decision-making problems known as nominal, or non-ordered, classification problems, as opposed to ordinal classification. In practice, class entrance conditions are not perfectly defined; rather, they are fuzzily defined, so that classification procedures must be designed to accommodate some degree of uncertainty (doubt, indecision, imprecision, etc.). The purpose of this chapter is to present recent advances related to this issue, with particular emphasis on bipolar analysis, which consists in considering, for each pair of object and class, two measures: a classifiability measure, which quantifies the extent to which the object can be considered for inclusion in the class, and a rejectability measure, which quantifies the extent to which one should avoid including the object in that class. This renders the final choice flexible and robust, as many classes may qualify for the inclusion of an object. This apparently theoretical subject finds applications in almost any socio-economic domain, and particularly in digital marketing. An application to supply chain management, where a number of potential suppliers of a company are to be classified into a number of classes in order to apply the appropriate strategic treatment to them, is considered for illustration purposes.
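A toy sketch of the bipolar idea described above, under our own simplifying assumptions (min/max aggregation, made-up thresholds, a fabricated supplier example), not the chapter's formal measures: for each object/class pair a classifiability degree aggregates how well the object's attributes satisfy the class entrance conditions, a rejectability degree aggregates how strongly they violate them, and a class is retained when the former is high and the latter low.

```python
def bipolar_classification(attributes, classes, c_threshold=0.6, r_threshold=0.3):
    """Toy bipolar nominal classification.
    `attributes`: dict attribute -> value in [0, 1] describing the object.
    `classes`: dict class_name -> dict attribute -> (satisfaction, violation)
    functions, each mapping an attribute value to a degree in [0, 1]."""
    selected = {}
    for name, conditions in classes.items():
        sat = [f_sat(attributes[a]) for a, (f_sat, _) in conditions.items()]
        vio = [f_vio(attributes[a]) for a, (_, f_vio) in conditions.items()]
        classifiability = min(sat)          # weakest satisfied entrance condition
        rejectability = max(vio)            # strongest violated condition
        if classifiability >= c_threshold and rejectability <= r_threshold:
            selected[name] = (classifiability, rejectability)
    return selected

# e.g. a supplier described by normalized quality and delivery-delay scores
supplier = {"quality": 0.8, "delay": 0.2}
classes = {"strategic": {"quality": (lambda v: v, lambda v: 1 - v),
                         "delay":   (lambda v: 1 - v, lambda v: v)}}
print(bipolar_classification(supplier, classes))
```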