Exploring dependence between categorical variables: benefits and limitations of using variable selection within Bayesian clustering in relation to log-linear modelling with interaction terms
This manuscript is concerned with relating two approaches that can be used to
explore complex dependence structures between categorical variables, namely
Bayesian partitioning of the covariate space incorporating a variable selection
procedure that highlights the covariates that drive the clustering, and
log-linear modelling with interaction terms. We derive theoretical results on
this relation and discuss whether they can be employed to assist log-linear model
determination, demonstrating advantages and limitations with simulated and real
data sets. The main advantage concerns sparse contingency tables. Inferences
from clustering can potentially reduce the number of covariates considered and,
subsequently, the number of competing log-linear models, making the exploration
of the model space feasible. Variable selection within clustering can inform on
marginal independence in general, thus allowing for a more efficient
exploration of the log-linear model space. However, we show that the clustering
structure is not informative on the existence of interactions in a consistent
manner. This work is of interest to those who utilize log-linear models, as
well as practitioners, such as epidemiologists, who use clustering models to
reduce the dimensionality of their data and to reveal interesting patterns in how
covariates combine.
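The combinatorial pressure behind the sparse-table argument can be made concrete: among p categorical covariates there are already 2^p - 1 candidate terms (main effects and interactions of every order), before one even counts the log-linear models built from subsets of those terms. A minimal illustration (illustrative code only, not from the manuscript):

```python
# Illustrative only: counts the candidate terms (main effects and
# interactions of every order) among p categorical covariates,
# showing why pruning covariates shrinks the log-linear model
# space so dramatically.
from math import comb

def n_candidate_terms(p):
    """Number of non-empty subsets of p covariates: 2**p - 1."""
    return sum(comb(p, k) for k in range(1, p + 1))

for p in (5, 10, 15):
    print(p, n_candidate_terms(p))
```

Dropping even a handful of covariates via the clustering-based variable selection removes an exponential slice of this space.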
BClass: A Bayesian Approach Based on Mixture Models for Clustering and Classification of Heterogeneous Biological Data
Based on mixture models, we present a Bayesian method (called BClass) to classify biological entities (e.g. genes) when variables of quite heterogeneous nature are analyzed. Various statistical distributions are used to model the continuous/categorical data commonly produced by genetic experiments and large-scale genomic projects. We calculate the posterior probability that each entry belongs to each component (group) of the mixture. In this way, an original set of heterogeneous variables is transformed into a set of purely homogeneous characteristics, represented by the probabilities that each entry belongs to each group. The number of groups in the analysis is controlled dynamically by rendering groups 'alive' or 'dormant' depending upon the number of entities classified within them. Using standard Metropolis-Hastings and Gibbs sampling algorithms, we constructed a sampler to approximate posterior moments and grouping probabilities. Since this method does not require the definition of similarity measures, it is especially suitable for data mining and knowledge discovery in biological databases. We applied BClass to classify genes in RegulonDB, a database specialized in information about the transcriptional regulation of gene expression in the bacterium Escherichia coli. The classification obtained is consistent with current knowledge and allowed prediction of missing values for a number of genes. BClass is object-oriented and fully programmed in Lisp-Stat. The output grouping probabilities are analyzed and interpreted using graphical (dynamically linked plots) and query-based approaches. We discuss the advantages of using Lisp-Stat as a programming language, as well as the problems we faced when the data volume increased exponentially due to the ever-growing number of genomic projects.
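The core mixture-model idea can be sketched as follows (this is not the BClass implementation; all parameters are hypothetical): the posterior group-membership probabilities for an entity described by one continuous and one categorical variable are the normalized products of the component weight and each variable's component-specific density.

```python
# A minimal sketch, NOT BClass: posterior membership probabilities
# under a two-component mixture in which each entity carries one
# Gaussian (continuous) and one categorical variable. All parameters
# below are hypothetical, hand-picked values.
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def membership_probs(x_cont, x_cat, weights, mus, sigmas, cat_probs):
    """P(group k | entity): weight times per-variable likelihoods, normalised."""
    unnorm = [
        w * gaussian_pdf(x_cont, mu, sd) * theta[x_cat]
        for w, mu, sd, theta in zip(weights, mus, sigmas, cat_probs)
    ]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical two-group model: group 0 has low expression and mostly
# category 0; group 1 has high expression and mostly category 1.
probs = membership_probs(
    x_cont=2.8, x_cat=1,
    weights=[0.5, 0.5],
    mus=[0.0, 3.0], sigmas=[1.0, 1.0],
    cat_probs=[[0.9, 0.1], [0.2, 0.8]],
)
```

The resulting probability vector is exactly the "purely homogeneous" representation the abstract describes: whatever the original variable types, every entity is reduced to its group-membership probabilities.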
Bayesian Conditional Tensor Factorizations for High-Dimensional Classification
In many application areas, data are collected on a categorical response and
high-dimensional categorical predictors, with the goals being to build a
parsimonious model for classification while doing inferences on the important
predictors. In settings such as genomics, there can be complex interactions
among the predictors. By using a carefully-structured Tucker factorization, we
define a model that can characterize any conditional probability, while
facilitating variable selection and modeling of higher-order interactions.
Following a Bayesian approach, we propose a Markov chain Monte Carlo algorithm
for posterior computation accommodating uncertainty in the predictors to be
included. Under near sparsity assumptions, the posterior distribution for the
conditional probability is shown to achieve close to the parametric rate of
contraction even in ultra high-dimensional settings. The methods are
illustrated using simulation examples and biomedical applications.
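The flavour of such a factorization can be conveyed with a toy example (hypothetical numbers; this is neither the paper's model nor its sampler): each level of each predictor is softly allocated to a small number of latent classes, and a low-dimensional core array maps latent-class combinations to response probabilities, so the full conditional-probability tensor never has to be stored.

```python
# Toy sketch in the Tucker spirit, NOT the paper's method: the level
# x_j of predictor j is softly allocated to a latent class h_j, and a
# core array gives P(y | h_1, h_2). All numbers are hypothetical.
from itertools import product

# pi[j][x][h]: P(h_j = h | x_j = x); each row sums to 1.
pi = [
    [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],   # predictor 1: 3 levels, 2 classes
    [[1.0, 0.0], [0.3, 0.7]],               # predictor 2: 2 levels, 2 classes
]
# core[h1][h2][y]: P(y | h1, h2); each inner list sums to 1.
core = [
    [[0.8, 0.2], [0.5, 0.5]],
    [[0.4, 0.6], [0.1, 0.9]],
]

def cond_prob(y, x1, x2):
    """P(y | x1, x2) under the factorised model."""
    return sum(
        core[h1][h2][y] * pi[0][x1][h1] * pi[1][x2][h2]
        for h1, h2 in product(range(2), range(2))
    )
```

Because the allocation rows and the core slices are probability vectors, the mixture is automatically a valid conditional distribution, and a predictor whose levels all share the same allocation row drops out of the model, which is how variable selection enters.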
A Monte Carlo test of linkage disequilibrium for single nucleotide polymorphisms
Background: Genetic association studies, especially genome-wide studies, make use of linkage disequilibrium (LD) information between single nucleotide polymorphisms (SNPs). LD is also used for studying genome structure and has been valuable for evolutionary studies. The strength of LD is commonly measured by r², a statistic closely related to Pearson's χ² statistic. However, computing and testing linkage disequilibrium using r² requires known haplotype counts for the SNP pair, which is a problem for most population-based studies, where the haplotype phase is unknown. Most statistical genetics packages use likelihood-based methods to infer haplotypes, but the variability of haplotype estimation needs to be accounted for in the test for linkage disequilibrium.
Findings: We develop a Monte Carlo based test for LD based on the null distribution of the r² statistic. Our test is based on r² and can be reported together with r². Simulation studies show that it offers slightly better power than existing methods.
Conclusions: Our approach provides an alternative test for LD and has been implemented as an R program for ease of use. It also provides a general framework for accounting for other haplotype inference methods in LD testing.
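To make the quantity under test concrete, here is a sketch of r² computed from known haplotype counts for two biallelic loci, together with a naive Monte Carlo p-value that simulates haplotypes under independence of the loci. This illustrates the statistic only; the paper's actual contribution, accounting for haplotype-phase uncertainty, is not reproduced here.

```python
# Illustrative only: r^2 from the four known haplotype counts of two
# biallelic loci, plus a naive Monte Carlo p-value under independence.
# The paper's handling of unknown haplotype phase is NOT reproduced.
import random

def r_squared(n_ab, n_aB, n_Ab, n_AB):
    """r^2 = D^2 / (p_a (1 - p_a) p_b (1 - p_b)) from haplotype counts."""
    n = n_ab + n_aB + n_Ab + n_AB
    p_a = (n_ab + n_aB) / n      # frequency of allele a at locus 1
    p_b = (n_ab + n_Ab) / n      # frequency of allele b at locus 2
    d = n_ab / n - p_a * p_b     # LD coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

def mc_pvalue(counts, n_sims=2000, seed=0):
    """Estimate P(r^2 >= observed) when the two loci are independent."""
    rng = random.Random(seed)
    n = sum(counts)
    observed = r_squared(*counts)
    p_a = (counts[0] + counts[1]) / n
    p_b = (counts[0] + counts[2]) / n
    hits = 0
    for _ in range(n_sims):
        sim = [0, 0, 0, 0]
        for _ in range(n):
            locus1_a = rng.random() < p_a
            locus2_b = rng.random() < p_b
            sim[(0 if locus1_a else 2) + (0 if locus2_b else 1)] += 1
        if 0 in (sim[0] + sim[1], sim[2] + sim[3], sim[0] + sim[2], sim[1] + sim[3]):
            continue  # a simulated allele is monomorphic; r^2 undefined
        if r_squared(*sim) >= observed:
            hits += 1
    return hits / n_sims
```

For strongly associated counts such as (40, 10, 10, 40), r² = 0.36 and the simulated p-value is essentially zero, while balanced counts give r² = 0.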
Global permutation tests for multivariate ordinal data: alternatives, test statistics, and the null dilemma
We discuss two-sample global permutation tests for sets of multivariate ordinal data in possibly high-dimensional setups, motivated by the analysis of data collected by means of the World Health Organisation's International Classification of Functioning,
Disability and Health. The tests do not require any modelling of the multivariate dependence structure. Specifically, we consider testing for marginal inhomogeneity and
direction-independent marginal order. Max-T test statistics are known to lead to good
power against alternatives with few strong individual effects. We propose test statistics that can be seen as their counterparts for alternatives with many weak individual effects. Permutation tests are valid only if the two multivariate distributions are identical under the null hypothesis. By means of simulations, we examine the practical impact of violations of this exchangeability condition. Our simulations suggest that theoretically invalid permutation tests can still be 'practically valid'. In particular, they suggest that the degree of the permutation procedure's failure may be considered as a function of the difference in group-specific covariance matrices, the ratio between group sizes, the number of variables in the set, the test statistic used, and the number of levels per variable.
Mini-Workshop: Recent Developments in Statistical Methods with Applications to Genetics and Genomics
Recent progress in high-throughput genomic technologies has revolutionized the field of human genetics and promises to lead to important scientific advances. With new improvements in massively parallel biotechnologies, it is becoming increasingly efficient to generate vast amounts of information at the genomics, transcriptomics, proteomics, and metabolomics levels, opening up as yet unexplored opportunities in the search for the genetic causes of complex traits. Despite this tremendous progress in data generation, it remains very challenging to analyze, integrate, and interpret these data. The resulting data are high-dimensional and very sparse, and efficient statistical methods are critical in order to extract the rich information contained in them. The major focus of the mini-workshop, entitled “Recent Developments in Statistical Methods with Applications to Genetics and Genomics”, has been on integrative methods. Relevant research questions included the optimal study design for integrative genomic analyses; appropriate handling and pre-processing of different types of omics data; statistical methods for the integration of multiple types of omics data; adjustment for confounding due to latent factors such as cell or tissue heterogeneity; the optimal use of omics data to enhance or make sense of results identified through genetic studies; and statistical and computational strategies for the analysis of multiple types of high-dimensional data.