Bayesian Dependence Tests for Continuous, Binary and Mixed Continuous-Binary Variables
Tests for dependence of continuous, discrete and mixed continuous-discrete variables are ubiquitous in science. The goal of this paper is to derive Bayesian alternatives to frequentist null hypothesis significance tests for dependence. In particular, we will present three Bayesian tests for dependence of binary, continuous and mixed variables. These tests are nonparametric and based on the Dirichlet Process, which allows us to use the same prior model for all of them. Therefore, the tests are “consistent” with each other, in the sense that the probabilities that variables are dependent computed with these tests are commensurable across the different types of variables being tested. By means of simulations with artificial data, we show the effectiveness of the new tests.
Penalized EM algorithm and copula skeptic graphical models for inferring networks for mixed variables
In this article, we consider the problem of reconstructing networks for
continuous, binary, count and discrete ordinal variables by estimating a sparse
precision matrix in Gaussian copula graphical models. We propose two
approaches: penalized extended rank likelihood with Monte Carlo
Expectation-Maximization algorithm (copula EM glasso) and copula skeptic with
pair-wise copula estimation for copula Gaussian graphical models. The proposed
approaches help to infer networks arising from nonnormal and mixed variables.
We demonstrate the performance of our methods through simulation studies and
analysis of breast cancer genomic and clinical data and maize genetics data.
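As a rough illustration of the rank-based idea behind a copula skeptic estimator, the sketch below computes pairwise Kendall's tau and maps it to a latent Gaussian correlation with sin(pi/2 * tau). This is an assumption about the general approach, not the authors' implementation: the abstract does not spell out the pairwise estimator, the function names are hypothetical, and the sparse precision matrix step (for example a graphical lasso applied to the resulting matrix) is omitted.

```python
import math
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i, j in combinations(range(n), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        prod = dx * dy
        if prod > 0:
            score += 1      # concordant pair
        elif prod < 0:
            score -= 1      # discordant pair (ties contribute 0)
    return score / (n * (n - 1) / 2)

def skeptic_correlation(columns):
    """Rank-based correlation matrix: S[j][k] = sin(pi/2 * tau_jk).

    `columns` is a list of equal-length numeric lists, one per variable.
    """
    d = len(columns)
    S = [[1.0] * d for _ in range(d)]
    for j, k in combinations(range(d), 2):
        tau = kendall_tau(columns[j], columns[k])
        S[j][k] = S[k][j] = math.sin(math.pi / 2 * tau)
    return S

# Hypothetical toy data: the second variable is a monotone transform of
# the first, the third is reversed.
data = [[1, 2, 3, 4], [1, 4, 9, 16], [4, 3, 2, 1]]
S = skeptic_correlation(data)
```

Because the estimate depends only on ranks, it is unchanged by monotone transformations of each variable, which is what makes it usable for nonnormal margins.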
Clustering South African households based on their asset status using latent variable models
The Agincourt Health and Demographic Surveillance System has since 2001
conducted a biannual household asset survey in order to quantify household
socio-economic status (SES) in a rural population living in northeast South
Africa. The survey contains binary, ordinal and nominal items. In the absence
of income or expenditure data, the SES landscape in the study population is
explored and described by clustering the households into homogeneous groups
based on their asset status. A model-based approach to clustering the Agincourt
households, based on latent variable models, is proposed. In the case of
modeling binary or ordinal items, item response theory models are employed. For
nominal survey items, a factor analysis model, similar in nature to a
multinomial probit model, is used. Both model types have an underlying latent
variable structure - this similarity is exploited and the models are combined
to produce a hybrid model capable of handling mixed data types. Further, a
mixture of the hybrid models is considered to provide clustering capabilities
within the context of mixed binary, ordinal and nominal response data. The
proposed model is termed a mixture of factor analyzers for mixed data (MFA-MD).
The MFA-MD model is applied to the survey data to cluster the Agincourt
households into homogeneous groups. The model is estimated within the Bayesian
paradigm, using a Markov chain Monte Carlo algorithm. Intuitive groupings
result, providing insight into the different socio-economic strata within the
Agincourt region.
Comment: Published at http://dx.doi.org/10.1214/14-AOAS726 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
Massively-Parallel Feature Selection for Big Data
We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for
feature selection (FS) in Big Data settings (high dimensionality and/or sample
size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix
both in terms of rows (samples, training examples) as well as columns
(features). By employing the concepts of p-values of conditional independence
tests and meta-analysis techniques, PFBP manages to rely only on computations
local to a partition while minimizing communication costs. Then, it employs
powerful and safe (asymptotically sound) heuristics to make early, approximate
decisions, such as Early Dropping of features from consideration in subsequent
iterations, Early Stopping of consideration of features within the same
iteration, or Early Return of the winner in each iteration. PFBP provides
asymptotic guarantees of optimality for data distributions faithfully
representable by a causal network (Bayesian network or maximal ancestral
graph). Our empirical analysis confirms a super-linear speedup of the algorithm
with increasing sample size, linear scalability with respect to the number of
features and processing cores, while dominating other competitive algorithms in
its class.
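To make the idea of combining partition-local p-values concrete, here is a minimal sketch using Fisher's combined probability test. The abstract mentions meta-analysis techniques without naming a combining rule, so the choice of Fisher's method, the function name and the example values are all assumptions for illustration.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of
    freedom the survival function has the closed form
        P(chi2_{2k} > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    so no external library is needed.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)   # equals X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i                          # (x/2)^i / i!
        total += term
    return math.exp(-half) * total

# Hypothetical p-values for one feature, each computed on a different
# row partition of the data.
local_p_values = [0.04, 0.10, 0.07]
combined = fisher_combine(local_p_values)
```

Each worker only needs to report one p-value per feature, which is one way a scheme of this kind can keep communication costs low.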
Mixed Cumulative Distribution Networks
Directed acyclic graphs (DAGs) are a popular framework to express
multivariate probability distributions. Acyclic directed mixed graphs (ADMGs)
are generalizations of DAGs that can succinctly capture much richer sets of
conditional independencies, and are especially useful in modeling the effects
of latent variables implicitly. Unfortunately there are currently no good
parameterizations of general ADMGs. In this paper, we apply recent work on
cumulative distribution networks and copulas to propose one general
construction for ADMG models. We consider a simple parameter estimation
approach, and report some encouraging experimental results.
Comment: 11 pages, 4 figures