Gene expression in large pedigrees: analytic approaches.
Background: We currently have the ability to quantify transcript abundance of messenger RNA (mRNA), genome-wide, using microarray technologies. Analyzing genotype, phenotype and expression data from 20 pedigrees, the members of our Genetic Analysis Workshop (GAW) 19 gene expression group published 9 papers, tackling some timely and important problems and questions. To study the complexity and interrelationships of genetics and gene expression, we used established statistical tools, developed newer statistical tools, and developed and applied extensions to these tools.
Methods: To study gene expression correlations in the pedigree members (without incorporating genotype or trait data into the analysis), 2 papers used principal components analysis, weighted gene coexpression network analysis, meta-analyses, gene enrichment analyses, and linear mixed models. To explore the relationship between genetics and gene expression, 2 papers studied expression quantitative trait locus allelic heterogeneity through conditional association analyses, and epistasis through interaction analyses. A third paper assessed the feasibility of applying allele-specific binding to filter potential regulatory single-nucleotide polymorphisms (SNPs). Analytic approaches included linear mixed models based on measured genotypes in pedigrees, permutation tests, and covariance kernels. To incorporate both genotype and phenotype data with gene expression, 4 groups employed linear mixed models, nonparametric weighted U statistics, structural equation modeling, Bayesian unified frameworks, and multiple regression.
Results and discussion: Regarding the analysis of pedigree data, we found that gene expression is familial, indicating that at least 1 factor for pedigree membership or multiple factors for the degree of relationship should be included in analyses, and we developed a method to adjust for familiality prior to conducting weighted gene coexpression network analysis. For SNP association and conditional analyses, we found that FaST-LMM (Factored Spectrally Transformed Linear Mixed Model) and SOLAR-MGA (Sequential Oligogenic Linkage Analysis Routines-Major Gene Analysis) have similar type 1 and type 2 error rates and can be used almost interchangeably. To improve the power and precision of association tests, prior knowledge of DNase-I hypersensitivity sites or other relevant biological annotations can be incorporated into the analyses. On a biological level, eQTL (expression quantitative trait loci) are genetically complex, exhibiting both allelic heterogeneity and epistasis. Including both genotype and phenotype data together with measurements of gene expression was found to be generally advantageous, both in improving levels of significance and in providing more interpretable biological models.
Conclusions: Pedigrees can be used to conduct analyses of and enhance gene expression studies.
Generalized Species Sampling Priors with Latent Beta reinforcements
Many popular Bayesian nonparametric priors can be characterized in terms of
exchangeable species sampling sequences. However, in some applications,
exchangeability may not be appropriate. We introduce a novel and
probabilistically coherent family of non-exchangeable species sampling
sequences characterized by a tractable predictive probability function with
weights driven by a sequence of independent Beta random variables. We compare
their theoretical clustering properties with those of the Dirichlet Process and
the two-parameter Poisson-Dirichlet process. Unlike existing work, the proposed
construction provides a complete characterization of the joint process. We then
propose the use of such a process as a prior distribution in
a hierarchical Bayes modeling framework, and we describe a Markov Chain Monte
Carlo sampler for posterior inference. We evaluate the performance of the prior
and the robustness of the resulting inference in a simulation study, providing
a comparison with popular Dirichlet process mixtures and hidden Markov
models. Finally, we develop an application to the detection of chromosomal
aberrations in breast cancer by leveraging array CGH data.
Comment: Authors for correspondence: Edoardo M. Airoldi, Federico Bassetti,
Michele Guindani, and Fabrizio Leisen. To appear in the Journal of the American
Statistical Association.
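For orientation, the exchangeable baseline that the abstract compares against can be sketched in a few lines. The code below samples cluster labels from the Chinese restaurant process, the predictive rule of the Dirichlet process. It is not the authors' Beta-driven construction; the function name `crp_assignments` and the concentration parameter `alpha` are assumptions made for this illustration.

```python
import random


def crp_assignments(n, alpha, seed=0):
    """Sample cluster labels for n items from the Chinese restaurant process.

    Hypothetical helper: illustrates the standard Dirichlet process
    predictive rule, not the non-exchangeable prior from the abstract.
    alpha must be positive.
    """
    rng = random.Random(seed)
    labels = []   # labels[i] = cluster index of item i
    counts = []   # counts[k] = number of items currently in cluster k
    for i in range(n):
        # Item i joins existing cluster k with probability counts[k] / (i + alpha)
        # and opens a new cluster with probability alpha / (i + alpha).
        u = rng.random() * (i + alpha)
        acc = 0.0
        chosen = len(counts)  # default: open a new cluster
        for k, c in enumerate(counts):
            acc += c
            if u < acc:
                chosen = k
                break
        if chosen == len(counts):
            counts.append(0)
        counts[chosen] += 1
        labels.append(chosen)
    return labels, counts
```

In this rule the weight of each cluster depends only on its current size, which is what makes the sequence exchangeable. A non-exchangeable variant would let those weights depend on the order of arrival.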
A Comparison of Univariate Stochastic Volatility Models for U.S. Short Rates Using EMM Estimation
In this paper, efficient method of moments (EMM) estimation with a seminonparametric (SNP) auxiliary model is used to determine the best-fitting model for the volatility dynamics of the U.S. weekly three-month interest rate. A variety of volatility models are considered, including one-factor diffusion models, two-factor and three-factor stochastic volatility (SV) models, non-Gaussian diffusion models with stable-distributed errors, and a variety of Markov regime-switching (RS) models. The advantage of EMM estimation is that all of the proposed structural models can be evaluated with respect to a common auxiliary model. We find that a continuous-time two-factor SV model, a continuous-time three-factor SV model, and a discrete-time RS-in-volatility model with a level effect explain well the salient features of the short rate as summarized by the auxiliary model. We also show that either an SV model with a level effect or an RS model with a level effect, but not both, is needed to explain the data. Our EMM estimates of the level effect are much lower than unity, at around 1/2 once the SV effect or the RS effect is incorporated.
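To make the "level effect" concrete, here is a minimal sketch of a one-factor diffusion whose volatility scales with the level of the rate, simulated with an Euler-Maruyama scheme at a weekly step. It assumes the common specification dr = kappa (mu - r) dt + sigma r^gamma dW, with gamma playing the role of the level effect. The abstract does not state the exact model equations, so the function name, the parameter names, and this functional form are all assumptions, and the sketch covers simulation only, not EMM estimation.

```python
import math
import random


def simulate_short_rate(r0, kappa, mu, sigma, gamma, n_steps, dt=1.0 / 52, seed=0):
    """Euler-Maruyama path of dr = kappa*(mu - r)*dt + sigma*r**gamma*dW.

    Hypothetical illustration: gamma is the assumed level-effect exponent,
    and dt = 1/52 corresponds to weekly observations.
    """
    rng = random.Random(seed)
    r = r0
    path = [r0]
    for _ in range(n_steps):
        # Volatility scales with the (non-negative part of the) rate level.
        level = max(r, 0.0) ** gamma
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over one step
        r = r + kappa * (mu - r) * dt + sigma * level * dw
        path.append(r)
    return path
```

With gamma = 1 the volatility is proportional to the rate, and with gamma = 1/2 it grows with the square root of the rate, so a call such as `simulate_short_rate(0.05, 0.5, 0.05, 0.1, 0.5, 520)` produces ten years of weekly rates under a square-root level effect.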
Bayesian Conditional Tensor Factorizations for High-Dimensional Classification
In many application areas, data are collected on a categorical response and
high-dimensional categorical predictors, with the goals being to build a
parsimonious model for classification while performing inference on the important
predictors. In settings such as genomics, there can be complex interactions
among the predictors. By using a carefully structured Tucker factorization, we
define a model that can characterize any conditional probability, while
facilitating variable selection and modeling of higher-order interactions.
Following a Bayesian approach, we propose a Markov chain Monte Carlo algorithm
for posterior computation accommodating uncertainty in the predictors to be
included. Under near-sparsity assumptions, the posterior distribution for the
conditional probability is shown to achieve close to the parametric rate of
contraction even in ultra-high-dimensional settings. The methods are
illustrated using simulation examples and biomedical applications.
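The idea of a Tucker-style factorization of a conditional probability can be sketched as follows. Each predictor level is softly mapped to a small number of latent classes, and a core tensor holds a response distribution for every combination of latent classes. The layout of `core` and `weights` below, and the function name `conditional_prob`, are assumptions made for this example and are not taken from the paper. The sketch only evaluates the factorized probability; it does not perform posterior computation.

```python
from itertools import product


def conditional_prob(x, core, weights):
    """Evaluate P(y | x) under a Tucker-style factorization (hypothetical layout).

    x          : tuple of predictor levels, one per predictor
    core[h]    : list of response probabilities for latent index tuple h
    weights[j] : weights[j][level] is a probability vector over the latent
                 classes of predictor j
    """
    n_classes = len(next(iter(core.values())))
    probs = [0.0] * n_classes
    latent_sizes = [len(weights[j][x[j]]) for j in range(len(x))]
    # Sum over every combination of latent classes, weighting the core
    # response distribution by the product of per-predictor weights.
    for h in product(*(range(k) for k in latent_sizes)):
        w = 1.0
        for j, hj in enumerate(h):
            w *= weights[j][x[j]][hj]
        for y in range(n_classes):
            probs[y] += w * core[h][y]
    return probs
```

In this sketch, a predictor with a single latent class contributes a weight of 1 for every level and therefore has no effect on the response, which is one way a factorization of this kind can express variable selection.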
A hierarchical Dirichlet process mixture model for haplotype reconstruction from multi-population data
The perennial problem of "how many clusters?" remains an issue of substantial
interest in the data mining and machine learning communities, and becomes
particularly salient in large data sets such as population genomic data where
the number of clusters needs to be relatively large and open-ended. This
problem gets further complicated in a co-clustering scenario in which one needs
to solve multiple clustering problems simultaneously because of the presence of
common centroids (e.g., ancestors) shared by clusters (e.g., possible descendants
of a certain ancestor) from different multiple-cluster samples (e.g.,
different human subpopulations). In this paper we present a hierarchical
nonparametric Bayesian model to address this problem in the context of
multi-population haplotype inference. Uncovering the haplotypes of single
nucleotide polymorphisms is essential for many biological and medical
applications. While it is not uncommon for the genotype data to be pooled from
multiple ethnically distinct populations, few existing programs have explicitly
leveraged the individual ethnic information for haplotype inference. In this
paper we present a new haplotype inference program, Haploi, which makes use of
such information and is readily applicable to genotype sequences with thousands
of SNPs from heterogeneous populations, with competitive and sometimes superior
speed and accuracy compared with state-of-the-art programs. Underlying
Haploi is a new haplotype distribution model based on a nonparametric Bayesian
formalism known as the hierarchical Dirichlet process, which represents a
tractable surrogate to the coalescent process. The proposed model is
exchangeable, unbounded, and capable of coupling demographic information of
different populations.
Comment: Published at http://dx.doi.org/10.1214/08-AOAS225 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).