Clustering South African households based on their asset status using latent variable models
The Agincourt Health and Demographic Surveillance System has since 2001
conducted a biannual household asset survey in order to quantify household
socio-economic status (SES) in a rural population living in northeast South
Africa. The survey contains binary, ordinal and nominal items. In the absence
of income or expenditure data, the SES landscape in the study population is
explored and described by clustering the households into homogeneous groups
based on their asset status. A model-based approach to clustering the Agincourt
households, based on latent variable models, is proposed. In the case of
modeling binary or ordinal items, item response theory models are employed. For
nominal survey items, a factor analysis model, similar in nature to a
multinomial probit model, is used. Both model types have an underlying latent
variable structure - this similarity is exploited and the models are combined
to produce a hybrid model capable of handling mixed data types. Further, a
mixture of the hybrid models is considered to provide clustering capabilities
within the context of mixed binary, ordinal and nominal response data. The
proposed model is termed a mixture of factor analyzers for mixed data (MFA-MD).
The MFA-MD model is applied to the survey data to cluster the Agincourt
households into homogeneous groups. The model is estimated within the Bayesian
paradigm, using a Markov chain Monte Carlo algorithm. Intuitive groupings
result, providing insight into the different socio-economic strata within the
Agincourt region. Comment: Published at http://dx.doi.org/10.1214/14-AOAS726 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
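The shared latent variable structure mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual threshold formulation for binary and ordinal items and a probit-style "largest positive utility wins" rule for nominal items; the abstract does not spell out these details, so the function names, the reference-level convention, and the unit-variance noise are all hypothetical choices made for illustration only.

```python
import numpy as np

def latent_scores(rng, n, loadings, intercepts):
    """Draw z = intercepts + theta @ loadings.T + noise for n households.

    theta (the latent factors) and the noise are standard normal; this is an
    illustrative assumption, not a detail taken from the abstract.
    """
    loadings = np.asarray(loadings, dtype=float)      # (n_latent_dims, n_factors)
    theta = rng.standard_normal((n, loadings.shape[1]))
    noise = rng.standard_normal((n, loadings.shape[0]))
    return np.asarray(intercepts, dtype=float) + theta @ loadings.T + noise

def ordinal_from_latent(z, thresholds):
    """Map latent values to levels 0..len(thresholds) by counting thresholds below z.

    A binary item is the special case of a single threshold.
    """
    return np.searchsorted(np.asarray(thresholds), np.asarray(z))

def nominal_from_latent(utilities):
    """Map latent utilities of shape (n, n_levels - 1) to levels 0..n_levels - 1.

    Level 0 is a reference level, chosen when no utility is positive;
    otherwise the level with the largest utility is chosen.
    """
    utilities = np.asarray(utilities, dtype=float)
    best = utilities.argmax(axis=1) + 1
    return np.where(utilities.max(axis=1) > 0, best, 0)
```

Because both item types are driven by continuous latent values, a single factor model over the stacked latent dimensions can, under these assumptions, cover binary, ordinal and nominal items at once.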
Change detection in categorical evolving data streams
Detecting change in evolving data streams is a central issue for accurate adaptive learning. In real-world applications, data streams often have categorical features, yet changes in the distribution of these features have received little attention so far. Previous work on change detection focused on detecting changes in the accuracy of the learners, without considering changes in the data distribution.
To cope with these issues, we propose a new unsupervised change detection method, called CDCStream (Change Detection in Categorical Data Streams), well suited for categorical data streams. The proposed method is able to detect changes in a batch-incremental scenario. It rests on two components: (i) a summarization strategy that compresses the current batch by extracting a descriptive summary and (ii) a new segmentation algorithm that highlights changes and issues warnings for a data stream. To evaluate our proposal, we employ it in a learning task over real-world data and compare its results with state-of-the-art methods. We also report a qualitative evaluation in order to show the behavior of CDCStream.
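The batch-incremental idea described above (summarize each batch, then compare the summary with what has been seen so far) can be sketched as follows. This is not CDCStream itself: the abstract does not specify the summary or the segmentation rule, so the mean feature entropy used as the summary, the k-standard-deviation thresholds, and all names (`batch_summary`, `detect_changes`, `k_warn`, `k_change`, `min_history`) are hypothetical stand-ins.

```python
from collections import Counter
from math import log, sqrt

def batch_summary(batch):
    """Mean Shannon entropy (nats) over the categorical features of one batch.

    `batch` is a non-empty list of equal-length tuples of category labels.
    This scalar is an illustrative stand-in for a descriptive batch summary.
    """
    n_rows = len(batch)
    n_features = len(batch[0])
    total = 0.0
    for j in range(n_features):
        counts = Counter(row[j] for row in batch)
        total -= sum((c / n_rows) * log(c / n_rows) for c in counts.values())
    return total / n_features

def detect_changes(batches, k_warn=2.0, k_change=3.0, min_history=3):
    """Label each batch 'ok', 'warning' or 'change' (unsupervised, no class labels)."""
    history = []      # summaries of batches since the last detected change
    statuses = []
    for batch in batches:
        s = batch_summary(batch)
        status = "ok"
        if len(history) >= min_history:
            mean = sum(history) / len(history)
            var = sum((h - mean) ** 2 for h in history) / len(history)
            std = max(sqrt(var), 1e-9)   # floor avoids a zero-width band
            dev = abs(s - mean)
            if dev > k_change * std:
                status = "change"
            elif dev > k_warn * std:
                status = "warning"
        if status == "change":
            history = [s]                # start a new segment at the change
        else:
            history.append(s)
        statuses.append(status)
    return statuses
```

In this sketch a detected change resets the history, so later batches are compared only with the new segment; a downstream learner could be retrained at that point.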
Clustering and variable selection for categorical multivariate data
This article investigates unsupervised classification techniques for
categorical multivariate data. The study employs multivariate multinomial
mixture modeling, which is a type of model particularly applicable to
multilocus genotypic data. A model selection procedure is used to
simultaneously select the number of components and the relevant variables. A
non-asymptotic oracle inequality is obtained, leading to the proposal of a new
penalized maximum likelihood criterion. The selected model proves to be
asymptotically consistent under weak assumptions on the true probability
underlying the observations. The main theoretical result obtained in this study
suggests a penalty function defined to within a multiplicative parameter. In
practice, the data-driven calibration of the penalty function is made possible
by slope heuristics. Based on simulated data, this procedure is found to
improve the performance of the selection procedure with respect to classical
criteria such as BIC and AIC. The new criterion provides an answer to the
question "Which criterion for which sample size?" Applications to real
datasets are also provided.
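To make the model-selection setting concrete, here is a minimal sketch of fitting a multivariate multinomial mixture by EM and scoring it with a penalized likelihood criterion. It assumes conditionally independent categorical variables within each component and a penalty proportional to the number of free parameters; the multiplier `lam` is left as an input, since the slope-heuristics calibration and the variable-selection step of the paper are not implemented here. All function names are hypothetical.

```python
import numpy as np

def fit_multinomial_mixture(X, n_levels, K, n_iter=200, seed=0):
    """EM for a K-component mixture of independent categorical variables.

    X: (n, d) integer array with X[i, j] in {0, ..., n_levels[j] - 1}.
    Returns (log_likelihood, n_free_parameters, responsibilities).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # One-hot encode each variable once: list of (n, n_levels[j]) arrays.
    onehots = [np.eye(m)[X[:, j]] for j, m in enumerate(n_levels)]
    resp = rng.dirichlet(np.ones(K), size=n)        # random soft initialization
    for _ in range(n_iter):
        # M-step: mixing weights and per-component category probabilities.
        weights = resp.mean(axis=0) + 1e-12
        weights /= weights.sum()
        thetas = []
        for oh in onehots:
            counts = resp.T @ oh + 1e-6             # light smoothing avoids log(0)
            thetas.append(counts / counts.sum(axis=1, keepdims=True))
        # E-step: posterior component probabilities, computed in log space.
        log_joint = np.log(weights)[None, :] + sum(
            oh @ np.log(th).T for oh, th in zip(onehots, thetas)
        )
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        resp = np.exp(log_joint - log_norm[:, None])
    n_params = (K - 1) + K * sum(m - 1 for m in n_levels)
    return float(log_norm.sum()), n_params, resp

def penalized_criterion(loglik, n_params, lam):
    """Negative log-likelihood plus a penalty proportional to model dimension."""
    return -loglik + lam * n_params

def bic(loglik, n_params, n):
    """Classical BIC, one of the baseline criteria named in the abstract."""
    return -2.0 * loglik + n_params * np.log(n)
```

Under these assumptions, one would fit the mixture for each candidate number of components and keep the candidate with the smallest criterion value; setting `lam = log(n) / 2` makes `penalized_criterion` equal to half of BIC.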
Enhancing the selection of a model-based clustering with external qualitative variables
In cluster analysis, it can be useful to interpret the partition built from
the data in the light of external categorical variables that were not directly
involved in clustering the data. An approach is proposed in the model-based
clustering context to select a model and a number of clusters which both fit
the data well and take advantage of the potential illustrative ability of the
external variables. This approach makes use of the integrated joint likelihood
of the data and the partitions at hand, namely the model-based partition and
the partitions associated with the external variables. It is noteworthy that each
mixture model is fitted by the maximum likelihood methodology to the data,
excluding the external variables which are used to select a relevant mixture
model only. Numerical experiments illustrate the promising behaviour of the
derived criterion.