
    Generalized Clusterwise Regression for Simultaneous Estimation of Optimal Pavement Clusters and Performance Models

    The existing state-of-the-art approach of Clusterwise Regression (CR) to estimate pavement performance models (PPMs) pre-specifies explanatory variables without testing their significance, and it requires the number of clusters for a given data set as an input. Time-consuming ‘trial and error’ methods are required to determine the optimal number of clusters. A common objective function is the minimization of the total sum of squared errors (SSE). Given that SSE decreases monotonically as a function of the number of clusters, the number of clusters that minimizes SSE is always the total number of data points. Hence, minimizing SSE is not a suitable objective for seeking the optimal number of clusters. In previous studies, the PPMs were restricted to be either linear or nonlinear, irrespective of which functional form provided the best results. The existing mathematical programming formulations did not include constraints ensuring the minimum number of observations required in each cluster to achieve statistical significance. In addition, a pavement sample could be associated with multiple performance models; hence, additional modeling was required to combine the results from multiple models. To address all these limitations, this research proposes a generalized CR that simultaneously 1) finds the optimal number of pavement clusters, 2) assigns pavement samples to clusters, 3) estimates the coefficients of cluster-specific explanatory variables, and 4) determines the best functional form between linear and nonlinear models. Linear and nonlinear functional forms were investigated to select the best model specification. A mixed-integer nonlinear mathematical program was formulated with the Bayesian Information Criterion (BIC) as the objective function. The advantage of using BIC is that it penalizes the inclusion of additional parameters (i.e., number of clusters and/or explanatory variables).
Hence, the optimal CR models balanced goodness of fit against model complexity. In addition, the search process for the best model specification using BIC has the property of consistency: asymptotically, it selects the true model with probability one. Comprehensive solution algorithms (Simulated Annealing coupled with Ordinary Least Squares for linear models, and All Subsets Regression for nonlinear models) were implemented to solve the proposed mathematical program. The algorithms selected the best model specification for each cluster after exploring all possible combinations of potentially significant explanatory variables. Potential multicollinearity issues were investigated and addressed as required. The variables identified as significant were average daily traffic, pavement age, rut depth along the pavement, annual average precipitation and minimum temperature, road functional class, prioritization category, and the number of lanes. All of these variables have been considered in the literature among the most critical factors for pavement deterioration. In addition, the predictive capability of the estimated models was investigated. The results showed that the models were robust, exhibited no overfitting, and provided small prediction errors. The models developed using the proposed approach provided superior explanatory power compared to those developed using the existing state-of-the-art approach to clusterwise regression. In particular, for the data set used in this research, nonlinear models provided better explanatory power than the linear models. As expected, the results illustrated that different clusters might require different explanatory variables and associated coefficients. Similarly, determining the optimal number of clusters while estimating the corresponding PPMs contributed significantly to reducing the estimation error.
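The BIC-guided search described above can be illustrated with a toy alternating scheme. The synthetic two-regime data, the simple per-cluster OLS refits, and the parameter count below are illustrative assumptions, not the paper's actual mixed-integer formulation or its Simulated Annealing solver.

```python
import numpy as np

def bic(y, y_hat, n_params):
    """BIC for a least-squares fit: n*log(SSE/n) + n_params*log(n)."""
    n = len(y)
    sse = float(np.sum((y - y_hat) ** 2))
    return n * np.log(sse / n) + n_params * np.log(n)

def fit_k_clusters(x, y, k, iters=20):
    """Toy clusterwise regression: alternate between assigning each point
    to its best-fitting line and refitting per-cluster OLS lines."""
    # Initialize clusters by splitting the x-ranks into k contiguous groups.
    labels = np.minimum(np.argsort(np.argsort(x)) * k // len(x), k - 1)
    for _ in range(iters):
        coefs = []
        for j in range(k):
            mask = labels == j
            if mask.sum() >= 2:
                coefs.append(np.polyfit(x[mask], y[mask], 1))
            else:  # guard against empty clusters
                coefs.append(np.array([0.0, float(y.mean())]))
        preds = np.stack([np.polyval(c, x) for c in coefs])
        labels = np.argmin((preds - y) ** 2, axis=0)
    y_hat = preds[labels, np.arange(len(y))]
    return y_hat, 2 * k  # two coefficients per cluster (assignment cost ignored)

# Synthetic pavement-like data: two regimes with different linear trends.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x < 5, 2.0 * x, -1.5 * x + 20) + rng.normal(0, 0.5, 200)

scores = {k: bic(y, *fit_k_clusters(x, y, k)) for k in (1, 2, 3)}
```

Because SSE alone always favours more clusters, the log(n) penalty in BIC is what lets a two-cluster fit beat the single global model on data like this without rewarding endless splitting.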

    Cadre Modeling: Simultaneously Discovering Subpopulations and Predictive Models

    We consider the problem in regression analysis of identifying subpopulations that exhibit different patterns of response, where each subpopulation requires a different underlying model. Unlike statistical cohorts, these subpopulations are not known a priori; thus, we refer to them as cadres. When the cadres and their associated models are interpretable, modeling leads to insights about the subpopulations and their associations with the regression target. We introduce a discriminative model that simultaneously learns cadre assignment and target-prediction rules. Sparsity-inducing priors are placed on the model parameters, under which independent feature selection is performed for both the cadre assignment and target-prediction processes. We learn models using adaptive step size stochastic gradient descent, and we assess cadre quality with bootstrapped sample analysis. We present simulated results showing that, when the true clustering rule does not depend on the entire set of features, our method significantly outperforms methods that learn subpopulation-discovery and target-prediction rules separately. In a materials-by-design case study, our model provides state-of-the-art prediction of polymer glass transition temperature. Importantly, the method identifies cadres of polymers that respond differently to structural perturbations, thus providing design insight for targeting or avoiding specific transition temperature ranges. It identifies chemically meaningful cadres, each with interpretable models. Further experimental results show that cadre methods have generalization that is competitive with linear and nonlinear regression models and can identify robust subpopulations.
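A minimal forward pass conveys the model's structure: a softmax over feature-weighted squared distances to cadre centers gates cadre-specific linear regressors. The shapes, toy parameter values, and gate sharpness `gamma` below are illustrative assumptions, loosely following the cadre formulation rather than reproducing the paper's trained model.

```python
import numpy as np

def cadre_predict(X, C, D, W, b, gamma=1.0):
    """Cadre-style forward pass: soft cadre memberships G gate per-cadre
    linear predictions, yielding one blended prediction per point."""
    diff = X[:, None, :] - C[None, :, :]            # (n, M, p)
    dist = np.einsum('nmp,p,nmp->nm', diff, D ** 2, diff)  # weighted sq. dist.
    logits = -gamma * dist
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    G = np.exp(logits)
    G /= G.sum(axis=1, keepdims=True)               # softmax cadre memberships
    preds = X @ W.T + b                             # (n, M) per-cadre predictions
    return (G * preds).sum(axis=1), G

# Toy setup: two cadres in 2-D with different linear response rules.
C = np.array([[0.0, 0.0], [5.0, 5.0]])  # cadre centers (hypothetical values)
D = np.ones(2)                          # cadre-assignment feature weights
W = np.array([[1.0, 0.0], [0.0, 1.0]])  # per-cadre regression weights
b = np.zeros(2)
X = np.array([[0.1, -0.2], [5.2, 4.9]])
y_hat, G = cadre_predict(X, C, D, W, b)
```

In the paper, sparsity-inducing priors on `D` and `W` would drive independent feature selection for the assignment and prediction roles; here both are dense for clarity.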

    Advances in robust clustering methods with applications

    Robust methods in statistics are mainly concerned with deviations from model assumptions. As already pointed out in Huber (1981) and in Huber & Ronchetti (2009), "these assumptions are not exactly true since they are just a mathematically convenient rationalization of an often fuzzy knowledge or belief". For that reason, "a minor error in the mathematical model should cause only a small error in the final conclusions". Nevertheless, it is well known that many classical statistical procedures are "excessively sensitive to seemingly minor deviations from the assumptions". All statistical methods based on the minimization of the average square loss may suffer from a lack of robustness. Illustrative examples of how outliers' influence may completely alter the final results in the regression analysis and linear model context are provided in Atkinson & Riani (2012). A presentation of robust counterparts of classical multivariate tools is provided in Farcomeni & Greco (2015). The whole dissertation is focused on robust clustering models, and the outline of the thesis is as follows. Chapter 1 is focused on robust methods. Robust methods are aimed at increasing efficiency when contamination appears in the sample. Thus a general definition of such a (quite general) concept is required. To do so, we give a brief account of some kinds of contamination that can be encountered in real data applications. Secondly, we introduce the "spurious outliers model" (Gallegos & Ritter 2009a), which is the cornerstone of robust model-based clustering. This model is aimed at formalizing clustering problems when one has to deal with contaminated samples. The assumption behind the "spurious outliers model" is that two different random mechanisms generate the data: one is assumed to generate the "clean" part, while the other generates the contamination. This idea is very common within robust models such as the "Tukey-Huber model", which is introduced in Subsection 1.2.2.
Outlier recognition, especially in the multivariate case, plays a key role and is not straightforward as the dimensionality of the data increases. An overview of the most widely used (robust) methods for outlier detection is provided in Section 1.3. Finally, in Section 1.4, we provide a non-technical review of the classical tools introduced in the robust statistics literature for evaluating the robustness properties of a methodology. Chapter 2 is focused on model-based clustering methods and their robustness properties. Cluster analysis, "the art of finding groups in the data" (Kaufman & Rousseeuw 1990), is one of the most widely used tools within the unsupervised learning context. A very popular method is the k-means algorithm (MacQueen et al. 1967), which is based on minimizing the Euclidean distance of each observation from the estimated cluster centroids and is therefore affected by a lack of robustness. Indeed, even a single outlying observation may completely alter the centroid estimates and simultaneously bias the standard error estimates. Cluster contours may be inflated and the "real" underlying clusterwise structure might be completely hidden. A first attempt at robustifying the k-means algorithm appeared in Cuesta-Albertos et al. (1997), where a trimming step is inserted into the algorithm in order to avoid the outliers' excessive influence. It should be noted that the k-means algorithm is efficient for detecting spherical homoscedastic clusters; whenever more flexible shapes are desired, the procedure becomes inefficient. In order to overcome this problem, Gaussian model-based clustering methods should be adopted instead of the k-means algorithm. An example, among the other proposals described in Chapter 2, is the TCLUST methodology (García-Escudero et al. 2008), which is the cornerstone of the thesis.
This methodology is based on two main ingredients: trimming a fixed proportion of observations and imposing a constraint on the estimates of the scatter matrices. As explained in Chapter 2, trimming is used to protect the results from the outliers' influence, while the constraint is needed because spurious maximizers may completely spoil the solution. Chapters 3 and 4 are mainly focused on extending the TCLUST methodology. In particular, in Chapter 3 we introduce a new contribution (compare Dotto et al. 2015 and Dotto et al. 2016b), based on the TCLUST approach, called reweighted TCLUST, or RTCLUST for brevity. The idea behind this method is to reweight the observations initially flagged as outlying. This is helpful both to gain efficiency in the parameter estimation process and to provide a reliable estimate of the true contamination level. Indeed, since TCLUST is based on trimming a fixed proportion of observations, a proper choice of the trimming level is required, and this choice, especially in applications, can be cumbersome. As clarified later on, the RTCLUST methodology allows the user to overcome this problem: the user is only required to impose a high preventive trimming level. The procedure, by iterating through a sequence of decreasing trimming levels, reinserts the discarded observations at each step and provides more precise parameter estimates together with a final estimate of the true contamination level. The theoretical properties of the methodology are studied in Section 3.6 and proved in Appendix A.1, while Section 3.7 contains a simulation study aimed at evaluating the properties of the methodology and its advantages with respect to some other robust (reweighted and single-step) procedures. Chapter 4 contains an extension of the TCLUST method to fuzzy linear clustering (Dotto et al. 2016a). This contribution can be viewed as the extension of Fritz et al.
(2013a) to linear clustering problems or, equivalently, as the extension of García-Escudero, Gordaliza, Mayo-Iscar & San Martín (2010) to the fuzzy clustering framework. Fuzzy clustering is also useful for dealing with contamination. Fuzziness is introduced to deal with overlap between clusters and with the presence of bridge points, to be defined in Section 1.1. Indeed, bridge points may arise in case of overlap between clusters and may completely alter the estimated cluster parameters (i.e., the coefficients of the linear model in each cluster). By introducing fuzziness, such observations are suitably down-weighted and the clusterwise structure can be correctly detected. On the other hand, robustness against gross outliers, as in the TCLUST methodology, is guaranteed by trimming a fixed proportion of observations. Additionally, a simulation study aimed at comparing the proposed methodology with other proposals (both robust and non-robust) is provided in Section 4.4. Chapter 5 is entirely dedicated to real data applications of the proposed contributions. In particular, the RTCLUST method is applied to two different datasets: the "Swiss Bank Note" dataset, a well-known benchmark for clustering models, and a dataset collected by the Gallup Organization, which is, to our knowledge, an original dataset to which no other existing proposals have yet been applied. Section 5.3 contains an application of our fuzzy linear clustering proposal to allometry data. In our opinion, this dataset, already considered in the robust linear clustering proposal of García-Escudero, Gordaliza, Mayo-Iscar & San Martín (2010), is particularly useful for showing the advantages of the proposed methodology: allometric quantities are often linked by a linear relationship but, at the same time, there may be overlap between different groups, and outliers may often appear due to errors in data registration.
Finally, Chapter 6 contains the concluding remarks and further directions for research. In particular, we wish to mention an ongoing work (Dotto & Farcomeni, in preparation) in which we consider the possibility of implementing robust parsimonious Gaussian clustering models. Within the chapter, the algorithm is briefly described and some illustrative examples are provided. The potential advantages of such proposals are the following. First of all, by considering the parsimonious models introduced in Celeux & Govaert (1995), the user is able to impose the shape of the detected clusters, which often plays a key role in applications. Secondly, by constraining the shape of the detected clusters, the constraint on the eigenvalue ratio can be avoided. This removes a tuning parameter of the procedure and, at the same time, allows the user to obtain affine equivariant estimators. Finally, since a fixed proportion of observations can still be trimmed, the procedure is also formally robust.
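The trimming idea behind Cuesta-Albertos et al. (1997), to which TCLUST adds scatter-matrix constraints, can be sketched as follows. The deterministic initialization, toy data, and fixed trimming level `alpha` are assumptions made for illustration.

```python
import numpy as np

def trimmed_kmeans(X, k, init_idx, alpha=0.1, iters=25):
    """Toy trimmed k-means: each iteration discards the alpha fraction of
    points farthest from their nearest centroid before updating centroids."""
    centers = X[init_idx].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        nearest = d.min(axis=1)
        keep = nearest <= np.quantile(nearest, 1 - alpha)  # trim the worst alpha
        for j in range(k):
            mask = keep & (labels == j)
            if mask.any():
                centers[j] = X[mask].mean(axis=0)
    return centers, labels, keep

# Two clean Gaussian clusters plus a handful of gross outliers.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal([0, 0], 0.5, (50, 2)),
    rng.normal([5, 5], 0.5, (50, 2)),
    np.full((5, 2), 20.0),          # outliers far from both clusters
])
centers, labels, keep = trimmed_kmeans(X, k=2, init_idx=[0, 50])
```

Without the trimming step, the outliers at (20, 20) would drag one centroid away from its cluster; with it, they fall beyond the quantile cutoff and are excluded from the centroid updates.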

    Market segmentation and ideal point identification for new product design using fuzzy data compression and fuzzy clustering methods

    In product design, various methodologies have been proposed for market segmentation, which group consumers with similar customer requirements into clusters. The central points of market segments are commonly used as ideal points of customer requirements for product design, reflecting particular competitive strategies to effectively reach all consumers’ interests. However, existing methodologies ignore the fuzziness in consumers’ customer requirements. In this paper, a new methodology is proposed to perform market segmentation based on consumers’ customer requirements, which exhibit fuzziness. The methodology integrates a fuzzy compression technique for dimensionality reduction with a fuzzy clustering technique. It first compresses the fuzzy data regarding customer requirements from high dimensions into two dimensions. After the fuzzy data is clustered into market segments, the centre points of the market segments are used as ideal points for new product development. The effectiveness of the proposed methodology in market segmentation and in identifying the ideal points for new product design is demonstrated using a case study of new digital camera design.
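The fuzzy clustering step can be illustrated with standard fuzzy c-means updates; the one-dimensional toy data, fuzzifier m = 2, and mean-split initialization are illustrative assumptions, and the paper's fuzzy compression step is not reproduced here.

```python
import numpy as np

def fuzzy_cmeans(X, U, m=2.0, iters=50):
    """Fuzzy c-means (Bezdek-style updates): alternate weighted-centroid and
    membership updates; the fuzzifier m > 1 controls membership softness."""
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)   # rows sum to one
    return U, centers

# Two well-separated 1-D groups of customer-requirement scores (toy data).
rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(0, 0.5, 30), rng.normal(8, 0.5, 30)])[:, None]

# Symmetry-breaking initial memberships from a rough mean split of the data.
s = (X.ravel() < X.mean()).astype(float) * 0.8 + 0.1
U0 = np.column_stack([s, 1.0 - s])
U, centers = fuzzy_cmeans(X, U0)
```

The membership matrix `U` plays the role of the fuzzy segment assignments, and the resulting centroids are the candidate ideal points described in the abstract.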

    Clusterwise regression and market segmentation: developments and applications

    The present work consists of two major parts. In the first part the literature on market segmentation is reviewed; in the second part a set of new methods for market segmentation is developed and applied. Part 1 starts with a discussion of the segmentation concept, and proceeds with a discussion of marketing strategies for segmented markets. A number of criteria for effective segmentation are summarized. Next, two major streams of segmentation research are identified on the basis of their theoretical foundation, which is either of a microeconomic or of a behavioral science nature. These two streams differ according to both the bases and the methods used for segmenting markets. After a discussion of the segmentation bases that have been put forward as the normative ideal but have seen very little application in practice, the different bases are classified into four categories, according to whether they are observable or unobservable, and general or product-specific. The bases in each of the four categories are reviewed and discussed in terms of the criteria for effective segmentation. Product benefits are identified as one of the most effective bases by these criteria. Subsequently, the statistical methods available for segmentation are discussed, according to a classification into four categories: a priori or post hoc, and descriptive or predictive. Post hoc (clustering) methods are appealing because they deal adequately with the complexity of markets, while the predictive methods within this class (AID, clusterwise regression) combine this advantage with prediction of purchase (predisposition). Within the two major segmentation streams, segmentation methods have been developed that are specifically tailored to the segmentation problems at hand. These are discussed.
For the microeconomic school, the focus is on recently developed latent class approaches that simultaneously estimate consumer segments and market characteristics (market shares, switching, elasticities) within these segments. For the behavioral science school, the focus is on benefit segmentation. Disadvantages of the traditional two-stage approach, in which consumers are clustered into segments on the basis of benefit importances estimated at the individual level, are revealed, and procedures that address one or more of these problems are reviewed. In Part 2, three new methods for benefit segmentation are developed: clusterwise regression, fuzzy clusterwise regression and generalized fuzzy clusterwise regression. The first method is a clustering method that simultaneously groups consumers into a number of nonoverlapping segments and estimates the benefit importances within segments. The performance of the algorithm on synthetic data is investigated in a Monte Carlo study. Empirically, the method is shown to outperform the two-stage procedure. Special attention is paid to significance testing with Monte Carlo test procedures and to convergence to local optima. An application to segmentation of the meat market in the Netherlands on the basis of data on elderly people's preferences for meat products is given. Three segments are identified. The first segment weighs sensory quality against exclusiveness (price); in the second segment, quality is traded off against fatness. This segment, comprised predominantly of females, had the best knowledge of nutrition. In the third segment, preference is based on quality only. Regional differences were identified among segments. Fuzzy clusterwise regression extends clusterwise regression in that it allows consumers to be members of more than one segment. It simultaneously estimates the preference functions within segments as well as the degree of membership of consumers in those segments.
Using synthetic data, the performance of the method is evaluated. Empirical comparisons with two other methods are provided, and the cross-validity of the method with respect to classification and prediction is assessed. Attention is given in particular to the selection of the appropriate number of segments, the setting of the user-defined fuzzy weight parameter, and Monte Carlo significance test procedures. An application to data on preferences for meat products used on bread in the Netherlands revealed three segments. In the first segment, taste and fitness for common use are important. In the second segment, taste overridingly determines preference, but products considered more exclusive and natural, and less fat and salty, are also preferred. In segment three, the health-related product benefits are even more important. The importance of taste decreases from segment one to three, while the importance of health-related aspects increases in that direction. The health-oriented segments comprised more females, older people and people who attributed causality of their behavior more to themselves. The method was also applied to data on consumers' images of stores that sell meat. Again three segments were revealed. The value shoppers trade off quality and price; they come from smaller families and spend less on meat. In the largest segment, store image is based upon product quality. Females have higher membership in this segment, which is more involved with the store where they buy meat. For service shoppers, both service and atmosphere are important. This segment tends to be more store-loyal. Next, a generalization of fuzzy clusterwise regression is proposed, which incorporates both benefit segmentation and market structuring within the framework of preference analysis. The method simultaneously estimates the preference functions within each of a number of clusters and the parameters indicating the degree of membership of both subjects and products in these clusters.
The performance of this method is assessed in a Monte Carlo study on synthetic data. The method is compared empirically with clusterwise regression and fuzzy clusterwise regression. Significance testing with Monte Carlo test procedures and the selection of the fuzzy weight parameters are treated in detail. Two segments were revealed in an analysis of consumer preferences for butter and margarine brands. The segments differed mainly in the importance attached to exclusiveness and fitness for multiple purposes. The brands competing within these segments were revealed. Females and consumers with a higher socioeconomic status had higher memberships in the segments in which exclusiveness was important. Finally, the clusterwise regression methods developed in this work are compared with other recently developed procedures in terms of the assumptions involved. The substantive results obtained in the empirical studies concerning foods are summarized, and their implications for future research are given. The implications and the contribution of the methods to the development of marketing strategies for segmented markets are discussed.
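The core of fuzzy clusterwise regression can be sketched as alternating weighted least squares per segment with residual-driven membership updates. The residual-sign initialization, fuzzifier m = 2, small ridge term, and synthetic two-segment preference data are illustrative assumptions, not the exact algorithm developed in the thesis.

```python
import numpy as np

def fuzzy_clusterwise_regression(Xd, y, U, m=2.0, iters=40):
    """Alternate (1) weighted least squares per segment, with weights U**m,
    and (2) membership updates driven by each segment's squared residuals."""
    for _ in range(iters):
        betas = []
        for j in range(U.shape[1]):
            w = U[:, j] ** m
            WX = Xd * w[:, None]
            # A tiny ridge keeps the normal equations well conditioned.
            betas.append(np.linalg.solve(Xd.T @ WX + 1e-9 * np.eye(Xd.shape[1]),
                                         WX.T @ y))
        resid2 = np.stack([(y - Xd @ b) ** 2 for b in betas], axis=1) + 1e-12
        inv = resid2 ** (-1.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)
    return np.array(betas), U

# Two consumer segments whose preferences follow different linear rules.
rng = np.random.default_rng(3)
x = rng.uniform(0, 2, 200)
seg = rng.random(200) < 0.5
y = np.where(seg, 3.0 * x, -2.0 * x + 1.0) + rng.normal(0, 0.3, 200)
Xd = np.column_stack([np.ones_like(x), x])  # design matrix with intercept

# Break symmetry: split by the sign of residuals about one global OLS fit.
beta_all = np.linalg.lstsq(Xd, y, rcond=None)[0]
above = (y - Xd @ beta_all > 0).astype(float)
U0 = np.column_stack([0.1 + 0.8 * above, 0.9 - 0.8 * above])
betas, U = fuzzy_clusterwise_regression(Xd, y, U0)
```

Unlike hard clusterwise regression, points near the crossing of the two preference lines keep intermediate memberships in `U` rather than being forced into a single segment.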

    Modelling distributor firm and manufacturer firm working relationships

    Past research has characterized business relationships using perceptual and sentiment constructs. This research has relied on an underlying assumption that ‘structure’, in the form of channel role, is more likely to explain firm behaviour in relationships than ‘strategy’ or any other formulation. A re-examination of the literature throws doubt on this assumption. An empirical study is used to explore an alternative hypothesis, with differentiated local models of distributor and manufacturing firms’ working relationships being found using a clusterwise regression technique. The results suggest that inter-firm cooperation is more effective than a self-centred approach in achieving relationship performance. In addition, the results suggest that relationship ‘strategy’ is more important than ‘structure’ in modelling working relationships. Finally, directions for further research and management implications are considered.
    http://pandora.nla.gov.au/pan/25410/20060410-0000/smib.vuw.ac.nz_8081/WWW/ANZMAC2005/index.htm

    Onset of an outline map to get a hold on the wildwood of clustering methods

    The domain of cluster analysis is a meeting point for a very rich multidisciplinary encounter, with cluster-analytic methods being studied and developed in discrete mathematics, numerical analysis, statistics, data analysis and data science, and computer science (including machine learning, data mining, and knowledge discovery), to name but a few. The other side of the coin, however, is that the domain suffers from a major accessibility problem, as well as from being fragmented across many fairly isolated islands. As a way out, the present paper offers an outline map for the clustering domain as a whole, which takes the form of an overarching conceptual framework and a common language. With this framework we wish to contribute to structuring the domain, to characterizing methods that have often been developed and studied in quite different contexts, to identifying links between them, and to introducing a frame of reference for optimally setting up cluster analyses in data-analytic practice.

    Mine evaluation optimisation

    The definition of a mineral resource during exploration is a fundamental part of lease evaluation, which establishes the fair market value of the entire asset being explored in the open market. Since exact prediction of grades between sampled points is not currently possible with conventional methods, predicted grades will nearly always deviate from actual grades. These errors affect the evaluation of resources, impacting the characterisation of risks, financial projections and decisions about whether or not to proceed with further phases. Knowledge about minerals below the surface, even when based upon extensive geophysical analysis and drilling, is often too fragmentary to indicate with assurance where to drill, how deep to drill and what can be expected; the exploration team knows only the density of the rock and the grade along the core. The purpose of this study is to improve the process of resource evaluation in the exploration stage by increasing prediction accuracy and providing an alternative assessment of the spatial characteristics of gold mineralisation. There is significant industrial interest in finding alternatives which may speed up the drilling phase, identify anomalies and worthwhile targets, and help in establishing fair market value. Recent developments in nonconvex optimisation and high-dimensional statistics have led to the idea that some engineering problems, such as predicting gold variability at the exploration stage, can be solved with the application of clusterwise linear regression and penalised maximum likelihood regression techniques. This thesis attempts to model the distribution of the mineralisation in the underlying geology using clusterwise linear regression and convex Least Absolute Shrinkage and Selection Operator (LASSO) techniques. The two presented optimisation techniques compute predictive solutions within a domain using physical data provided directly from drillholes.
The decision-support techniques attempt a useful compromise between traditional and recently introduced methods in optimisation and regression analysis, developed to improve exploration targeting and to predict gold occurrences at previously unsampled locations.
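The convex LASSO component can be sketched with cyclic coordinate descent and soft-thresholding. The synthetic drillhole-style features, the penalty level `lam`, and the per-sample penalty scaling are illustrative assumptions, not the thesis's actual formulation.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, setting it exactly to zero inside [-t, t]."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, iters=100):
    """LASSO by cyclic coordinate descent: each coefficient is the
    soft-thresholded univariate fit to its partial residual."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # residual excluding feature j
            # Penalty scaled by n so lam is comparable across sample sizes.
            beta[j] = soft_threshold(X[:, j] @ r, lam * n) / col_sq[j]
    return beta

# Synthetic "drillhole" features: only two of ten actually drive the grade.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(0, 0.1, 200)
beta = lasso_cd(X, y, lam=0.1)
```

The eight inactive features carry no signal, so the soft-threshold shrinks their coefficients to zero, while the two active coefficients are recovered with a small amount of shrinkage; this sparsity is what makes the penalised fit useful for selecting which logged quantities matter.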