3,924 research outputs found
kamila: Clustering Mixed-Type Data in R and Hadoop
In this paper we discuss the challenge of equitably combining continuous (quantitative) and categorical (qualitative) variables for the purpose of cluster analysis. Existing techniques require strong parametric assumptions or difficult-to-specify tuning parameters. We describe the kamila package, which includes a weighted k-means approach to clustering mixed-type data, a method for estimating weights for mixed-type data (Modha-Spangler weighting), and an additional semiparametric method recently proposed in the literature (KAMILA). We include a discussion of strategies for estimating the number of clusters in the data, and describe the implementation of one such method in the current R package. Background and usage of these clustering methods are presented. We then show how the KAMILA algorithm can be adapted to a map-reduce framework, and implement the resulting algorithm using Hadoop for clustering very large mixed-type data sets.
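As a rough illustration of the weighted k-means idea mentioned in this abstract, the sketch below z-scores the continuous columns, dummy-codes the categorical columns, scales the two blocks by a fixed weight, and runs plain Lloyd's k-means on the result. It is not the kamila package's API and is written in Python rather than R; the function name `weighted_mixed_kmeans`, the `cat_weight` parameter, and the toy data are hypothetical, and the weight is fixed by hand rather than estimated by Modha-Spangler weighting.

```python
import numpy as np

def weighted_mixed_kmeans(cont, cat, k, cat_weight=0.5, n_iter=50, seed=0):
    """Lloyd's k-means on z-scored continuous columns and weighted dummy-coded
    categorical columns. Illustrative only; assumes no constant continuous column."""
    rng = np.random.default_rng(seed)
    # Put continuous variables on a common scale.
    z = (cont - cont.mean(axis=0)) / cont.std(axis=0)
    # Dummy-code (one-hot) each categorical column.
    dummies = []
    for j in range(cat.shape[1]):
        levels = np.unique(cat[:, j])
        dummies.append((cat[:, j][:, None] == levels[None, :]).astype(float))
    d = np.hstack(dummies)
    # Hand-picked weight balancing the two variable types (an assumption here).
    x = np.hstack([(1.0 - cat_weight) * z, cat_weight * d])
    # Initialise centers at k distinct observations.
    centers = x[rng.choice(len(x), size=k, replace=False)]
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            x[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Toy data: two well-separated groups (made up for illustration).
cont = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
cat = np.array([["a"], ["a"], ["a"], ["b"], ["b"], ["b"]])
labels = weighted_mixed_kmeans(cont, cat, k=2)
```

The choice of `cat_weight` is exactly the difficult-to-specify tuning parameter the abstract refers to: a value near 0 lets the continuous variables dominate, a value near 1 lets the categorical variables dominate.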
Disease Mapping via Negative Binomial Regression M-quantiles
We introduce a semi-parametric approach to ecological regression for disease
mapping, based on modelling the regression M-quantiles of a Negative Binomial
variable. The proposed method is robust to outliers in the model covariates,
including those due to measurement error, and can account for both spatial
heterogeneity and spatial clustering. A simulation experiment based on the
well-known Scottish lip cancer data set is used to compare the M-quantile
modelling approach and a random effects modelling approach for disease mapping.
This suggests that the M-quantile approach leads to predicted relative risks
with smaller root mean square error than standard disease mapping methods. The
paper concludes with an illustrative application of the M-quantile approach,
mapping low birth weight incidence data for English Local Authority Districts
for the years 2005-2010. Comment: 23 pages, 7 figures
Spatial clustering and nonlinearities in the location of multinational firms
We propose a semiparametric geoadditive negative binomial model of industrial location that simultaneously addresses important methodological issues, such as spatial clustering and nonlinearities, which have been only partly addressed in previous studies. We apply this model to analyze the location determinants of inward greenfield investments that occurred over the 2003-2007 period in 249 European regions. The inclusion of a geoadditive component (a smooth spatial trend surface) makes it possible to control for omitted variables which induce spatial clustering, and suggests that such unobserved factors may be related to regional policies towards foreign investors. Allowing for nonlinearities reveals, in line with theoretical predictions, that the positive effect of agglomeration economies fades as the density of economic activities reaches some limit value. Keywords: industrial location, negative binomial models, geoadditive models, European Union.
Multivariate Bayesian semiparametric models for authentication of food and beverages
Food and beverage authentication is the process by which foods or beverages
are verified as complying with their label descriptions, for example, verifying whether
the denomination of origin of an olive oil bottle is correct or if the variety
of a certain bottle of wine matches its label description. The common way to
deal with an authentication process is to measure a number of attributes on
samples of food and then use these as input for a classification problem. Our
motivation stems from data consisting of measurements of nine chemical
compounds known as anthocyanins, obtained from samples of Chilean red wines
of grape varieties Cabernet Sauvignon, Merlot and Carménère. We
consider a model-based approach to authentication through a semiparametric
multivariate hierarchical linear mixed model for the mean responses, and
covariance matrices that are specific to the classification categories.
Specifically, we propose a model of the ANOVA-DDP type, which takes advantage
of the fact that the available covariates are discrete in nature. The results
suggest that the model performs well compared to other parametric alternatives.
This is also corroborated by application to simulated data. Comment: Published at http://dx.doi.org/10.1214/11-AOAS492 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org