Exploring dependence between categorical variables: benefits and limitations of using variable selection within Bayesian clustering in relation to log-linear modelling with interaction terms
This manuscript is concerned with relating two approaches that can be used to
explore complex dependence structures between categorical variables, namely
Bayesian partitioning of the covariate space incorporating a variable selection
procedure that highlights the covariates that drive the clustering, and
log-linear modelling with interaction terms. We derive theoretical results on
this relation and discuss whether they can be employed to assist log-linear model
determination, demonstrating advantages and limitations with simulated and real
data sets. The main advantage concerns sparse contingency tables. Inferences
from clustering can potentially reduce the number of covariates considered and,
subsequently, the number of competing log-linear models, making the exploration
of the model space feasible. Variable selection within clustering can inform on
marginal independence in general, thus allowing for a more efficient
exploration of the log-linear model space. However, we show that the clustering
structure is not informative on the existence of interactions in a consistent
manner. This work is of interest to those who utilize log-linear models, as
well as practitioners, such as epidemiologists, who use clustering models to
reduce the dimensionality of their data and to reveal interesting patterns in how
covariates combine.
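The combinatorial pressure behind the sparse-table argument can be made concrete: among p categorical covariates there are already 2^p - 1 candidate terms (main effects and interactions of every order), before one even counts the log-linear models built from subsets of those terms. A minimal illustration (illustrative code only, not from the manuscript):

```python
# Illustrative only: counts the candidate terms (main effects and
# interactions of every order) among p categorical covariates,
# showing why pruning covariates shrinks the log-linear model
# space so dramatically.
from math import comb

def n_candidate_terms(p):
    """Number of non-empty subsets of p covariates: 2**p - 1."""
    return sum(comb(p, k) for k in range(1, p + 1))

for p in (5, 10, 15):
    print(p, n_candidate_terms(p))
```

Dropping even a handful of covariates via the clustering-based variable selection removes an exponential slice of this space.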
BClass: A Bayesian Approach Based on Mixture Models for Clustering and Classification of Heterogeneous Biological Data
Based on mixture models, we present a Bayesian method (called BClass) to classify biological entities (e.g. genes) when variables of quite heterogeneous nature are analyzed. Various statistical distributions are used to model the continuous/categorical data commonly produced by genetic experiments and large-scale genomic projects. We calculate the posterior probability that each entry belongs to each component (group) of the mixture. In this way, an original set of heterogeneous variables is transformed into a set of purely homogeneous characteristics, represented by the probabilities that each entry belongs to each group. The number of groups in the analysis is controlled dynamically by rendering groups 'alive' or 'dormant' depending upon the number of entities classified within them. Using standard Metropolis-Hastings and Gibbs sampling algorithms, we constructed a sampler to approximate posterior moments and grouping probabilities. Since this method does not require the definition of similarity measures, it is especially suitable for data mining and knowledge discovery in biological databases. We applied BClass to classify genes in RegulonDB, a database specialized in information about the transcriptional regulation of gene expression in the bacterium Escherichia coli. The classification obtained is consistent with current knowledge and allowed prediction of missing values for a number of genes. BClass is object-oriented and fully programmed in Lisp-Stat. The output grouping probabilities are analyzed and interpreted using graphical (dynamically linked plots) and query-based approaches. We discuss the advantages of using Lisp-Stat as a programming language, as well as the problems we faced when the data volume increased exponentially due to the ever-growing number of genomic projects.
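The core mixture-model idea can be sketched as follows (this is not the BClass implementation; all parameters are hypothetical): the posterior group-membership probabilities for an entity described by one continuous and one categorical variable are the normalized products of the component weight and each variable's component-specific density.

```python
# A minimal sketch, NOT BClass: posterior membership probabilities
# under a two-component mixture in which each entity carries one
# Gaussian (continuous) and one categorical variable. All parameters
# below are hypothetical, hand-picked values.
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def membership_probs(x_cont, x_cat, weights, mus, sigmas, cat_probs):
    """P(group k | entity): weight times per-variable likelihoods, normalised."""
    unnorm = [
        w * gaussian_pdf(x_cont, mu, sd) * theta[x_cat]
        for w, mu, sd, theta in zip(weights, mus, sigmas, cat_probs)
    ]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical two-group model: group 0 has low expression and mostly
# category 0; group 1 has high expression and mostly category 1.
probs = membership_probs(
    x_cont=2.8, x_cat=1,
    weights=[0.5, 0.5],
    mus=[0.0, 3.0], sigmas=[1.0, 1.0],
    cat_probs=[[0.9, 0.1], [0.2, 0.8]],
)
```

The resulting probability vector is exactly the "purely homogeneous" representation the abstract describes: whatever the original variable types, every entity is reduced to its group-membership probabilities.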
Bayesian Conditional Tensor Factorizations for High-Dimensional Classification
In many application areas, data are collected on a categorical response and
high-dimensional categorical predictors, with the goals being to build a
parsimonious model for classification while doing inferences on the important
predictors. In settings such as genomics, there can be complex interactions
among the predictors. By using a carefully-structured Tucker factorization, we
define a model that can characterize any conditional probability, while
facilitating variable selection and modeling of higher-order interactions.
Following a Bayesian approach, we propose a Markov chain Monte Carlo algorithm
for posterior computation accommodating uncertainty in the predictors to be
included. Under near sparsity assumptions, the posterior distribution for the
conditional probability is shown to achieve close to the parametric rate of
contraction even in ultra high-dimensional settings. The methods are
illustrated using simulation examples and biomedical applications.
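The flavour of such a factorization can be conveyed with a toy example (hypothetical numbers; this is neither the paper's model nor its sampler): each level of each predictor is softly allocated to a small number of latent classes, and a low-dimensional core array maps latent-class combinations to response probabilities, so the full conditional-probability tensor never has to be stored.

```python
# Toy sketch in the Tucker spirit, NOT the paper's method: the level
# x_j of predictor j is softly allocated to a latent class h_j, and a
# core array gives P(y | h_1, h_2). All numbers are hypothetical.
from itertools import product

# pi[j][x][h]: P(h_j = h | x_j = x); each row sums to 1.
pi = [
    [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],   # predictor 1: 3 levels, 2 classes
    [[1.0, 0.0], [0.3, 0.7]],               # predictor 2: 2 levels, 2 classes
]
# core[h1][h2][y]: P(y | h1, h2); each inner list sums to 1.
core = [
    [[0.8, 0.2], [0.5, 0.5]],
    [[0.4, 0.6], [0.1, 0.9]],
]

def cond_prob(y, x1, x2):
    """P(y | x1, x2) under the factorised model."""
    return sum(
        core[h1][h2][y] * pi[0][x1][h1] * pi[1][x2][h2]
        for h1, h2 in product(range(2), range(2))
    )
```

Because the allocation rows and the core slices are probability vectors, the mixture is automatically a valid conditional distribution, and a predictor whose levels all share the same allocation row drops out of the model, which is how variable selection enters.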
A Monte Carlo test of linkage disequilibrium for single nucleotide polymorphisms
Background: Genetic association studies, especially genome-wide studies, make use of linkage disequilibrium (LD) information between single nucleotide polymorphisms (SNPs). LD is also used for studying genome structure and has been valuable for evolutionary studies. The strength of LD is commonly measured by r², a statistic closely related to Pearson's χ² statistic. However, computing and testing linkage disequilibrium using r² requires known haplotype counts for the SNP pair, which is a problem for most population-based studies, where the haplotype phase is unknown. Most statistical genetics packages use likelihood-based methods to infer haplotypes, but the variability of haplotype estimation needs to be accounted for in the test for linkage disequilibrium.
Findings: We develop a Monte Carlo based test for LD based on the null distribution of the r² statistic. Our test is based on r² and can be reported together with r². Simulation studies show that it offers slightly better power than existing methods.
Conclusions: Our approach provides an alternative test for LD and has been implemented as an R program for ease of use. It also provides a general framework for accounting for other haplotype inference methods in LD testing.
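To make the quantity under test concrete, here is a sketch of r² computed from known haplotype counts for two biallelic loci, together with a naive Monte Carlo p-value that simulates haplotypes under independence of the loci. This illustrates the statistic only; the paper's actual contribution, accounting for haplotype-phase uncertainty, is not reproduced here.

```python
# Illustrative only: r^2 from the four known haplotype counts of two
# biallelic loci, plus a naive Monte Carlo p-value under independence.
# The paper's handling of unknown haplotype phase is NOT reproduced.
import random

def r_squared(n_ab, n_aB, n_Ab, n_AB):
    """r^2 = D^2 / (p_a (1 - p_a) p_b (1 - p_b)) from haplotype counts."""
    n = n_ab + n_aB + n_Ab + n_AB
    p_a = (n_ab + n_aB) / n      # frequency of allele a at locus 1
    p_b = (n_ab + n_Ab) / n      # frequency of allele b at locus 2
    d = n_ab / n - p_a * p_b     # LD coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

def mc_pvalue(counts, n_sims=2000, seed=0):
    """Estimate P(r^2 >= observed) when the two loci are independent."""
    rng = random.Random(seed)
    n = sum(counts)
    observed = r_squared(*counts)
    p_a = (counts[0] + counts[1]) / n
    p_b = (counts[0] + counts[2]) / n
    hits = 0
    for _ in range(n_sims):
        sim = [0, 0, 0, 0]
        for _ in range(n):
            locus1_a = rng.random() < p_a
            locus2_b = rng.random() < p_b
            sim[(0 if locus1_a else 2) + (0 if locus2_b else 1)] += 1
        if 0 in (sim[0] + sim[1], sim[2] + sim[3], sim[0] + sim[2], sim[1] + sim[3]):
            continue  # a simulated allele is monomorphic; r^2 undefined
        if r_squared(*sim) >= observed:
            hits += 1
    return hits / n_sims
```

For strongly associated counts such as (40, 10, 10, 40), r² = 0.36 and the simulated p-value is essentially zero, while balanced counts give r² = 0.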
Global permutation tests for multivariate ordinal data: alternatives, test statistics, and the null dilemma
We discuss two-sample global permutation tests for sets of multivariate ordinal data in possibly high-dimensional setups, motivated by the analysis of data collected by means of the World Health Organisation's International Classification of Functioning,
Disability and Health. The tests do not require any modelling of the multivariate dependence structure. Specifically, we consider testing for marginal inhomogeneity and
direction-independent marginal order. Max-T test statistics are known to lead to good
power against alternatives with few strong individual effects. We propose test statistics that can be seen as their counterparts for alternatives with many weak individual effects. Permutation tests are valid only if the two multivariate distributions are identical under the null hypothesis. By means of simulations, we examine the practical impact of violations of this exchangeability condition. Our simulations suggest that theoretically invalid permutation tests can still be 'practically valid'. In particular, they suggest that the degree of the permutation procedure's failure may be considered as a function of the difference in group-specific covariance matrices, the ratio between group sizes, the number of variables in the set, the test statistic used, and the number of levels per variable.
Mini-Workshop: Recent Developments in Statistical Methods with Applications to Genetics and Genomics
Recent progress in high-throughput genomic technologies has revolutionized the field of human genetics and promises to lead to important scientific advances. With new improvements in massively parallel biotechnologies, it is becoming increasingly efficient to generate vast amounts of information at the genomics, transcriptomics, proteomics, and metabolomics levels, opening up as yet unexplored opportunities in the search for the genetic causes of complex traits. Despite this tremendous progress in data generation, it remains very challenging to analyze, integrate, and interpret these data. The resulting data are high-dimensional and very sparse, and efficient statistical methods are critical in order to extract the rich information contained in them. The major focus of the mini-workshop, entitled “Recent Developments in Statistical Methods with Applications to Genetics and Genomics”, has been on integrative methods. Relevant research questions included the optimal study design for integrative genomic analyses; appropriate handling and pre-processing of different types of omics data; statistical methods for the integration of multiple types of omics data; adjustment for confounding due to latent factors such as cell or tissue heterogeneity; the optimal use of omics data to enhance or make sense of results identified through genetic studies; and statistical and computational strategies for the analysis of multiple types of high-dimensional data.