3,924 research outputs found

    kamila: Clustering Mixed-Type Data in R and Hadoop

    In this paper we discuss the challenge of equitably combining continuous (quantitative) and categorical (qualitative) variables for the purpose of cluster analysis. Existing techniques require strong parametric assumptions or difficult-to-specify tuning parameters. We describe the kamila package, which includes a weighted k-means approach to clustering mixed-type data, a method for estimating weights for mixed-type data (Modha-Spangler weighting), and an additional semiparametric method recently proposed in the literature (KAMILA). We include a discussion of strategies for estimating the number of clusters in the data, and describe the implementation of one such method in the current R package. Background and usage of these clustering methods are presented. We then show how the KAMILA algorithm can be adapted to a map-reduce framework, and implement the resulting algorithm using Hadoop for clustering very large mixed-type data sets.

    Disease Mapping via Negative Binomial Regression M-quantiles

    We introduce a semi-parametric approach to ecological regression for disease mapping, based on modelling the regression M-quantiles of a Negative Binomial variable. The proposed method is robust to outliers in the model covariates, including those due to measurement error, and can account for both spatial heterogeneity and spatial clustering. A simulation experiment based on the well-known Scottish lip cancer data set is used to compare the M-quantile modelling approach and a random effects modelling approach for disease mapping. This suggests that the M-quantile approach leads to predicted relative risks with smaller root mean square error than standard disease mapping methods. The paper concludes with an illustrative application of the M-quantile approach, mapping low birth weight incidence data for English Local Authority Districts for the years 2005-2010. Comment: 23 pages, 7 figures

    Spatial clustering and nonlinearities in the location of multinational firms

    We propose a semiparametric geoadditive negative binomial model of industrial location which allows us to simultaneously address some important methodological issues, such as spatial clustering and nonlinearities, which have been only partly addressed in previous studies. We apply this model to analyze the location determinants of inward greenfield investments that occurred over the 2003-2007 period in 249 European regions. The inclusion of a geoadditive component (a smooth spatial trend surface) allows us to control for omitted variables which induce spatial clustering, and suggests that such unobserved factors may be related to regional policies towards foreign investors. Allowing for nonlinearities reveals, in line with theoretical predictions, that the positive effect of agglomeration economies fades as the density of economic activities reaches some limit value. Keywords: industrial location, negative binomial models, geoadditive models, European Union.

    Multivariate Bayesian semiparametric models for authentication of food and beverages

    Food and beverage authentication is the process by which foods or beverages are verified as complying with their label description, for example, verifying whether the denomination of origin of an olive oil bottle is correct or whether the variety of a certain bottle of wine matches its label description. The common way to deal with an authentication process is to measure a number of attributes on samples of food and then use these as input for a classification problem. Our motivation stems from data consisting of measurements of nine chemical compounds called anthocyanins, obtained from samples of Chilean red wines of the grape varieties Cabernet Sauvignon, Merlot and Carménère. We consider a model-based approach to authentication through a semiparametric multivariate hierarchical linear mixed model for the mean responses, and covariance matrices that are specific to the classification categories. Specifically, we propose a model of the ANOVA-DDP type, which takes advantage of the fact that the available covariates are discrete in nature. The results suggest that the model performs well compared to other parametric alternatives. This is also corroborated by application to simulated data. Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) at http://dx.doi.org/10.1214/11-AOAS492 by the Institute of Mathematical Statistics (http://www.imstat.org)