High-Dimensional Joint Estimation of Multiple Directed Gaussian Graphical Models
We consider the problem of jointly estimating multiple related directed
acyclic graph (DAG) models based on high-dimensional data from each graph. This
problem is motivated by the task of learning gene regulatory networks based on
gene expression data from different tissues, developmental stages or disease
states. We prove that under certain regularity conditions, the proposed
$\ell_0$-penalized maximum likelihood estimator converges in Frobenius norm to
the adjacency matrices consistent with the data-generating distributions and
has the correct sparsity. In particular, we show that this joint estimation
procedure leads to a faster convergence rate than estimating each DAG model
separately. As a corollary, we also obtain high-dimensional consistency results
for causal inference from a mix of observational and interventional data. For
practical purposes, we propose jointGES, consisting of Greedy Equivalence
Search (GES) to estimate the union of all DAG models followed by variable
selection using lasso to obtain the different DAGs, and we analyze its
consistency guarantees. The proposed method is illustrated through an analysis
of simulated data as well as epithelial ovarian cancer gene expression data.
A multivariate approach to the integration of multi-omics datasets
Background: To leverage the potential of multi-omics studies, exploratory data analysis methods that provide systematic integration and comparison of multiple layers of omics information are required. We describe multiple co-inertia analysis (MCIA), an exploratory data analysis method that identifies co-relationships between multiple high-dimensional datasets. Based on a covariance optimization criterion, MCIA simultaneously projects several datasets into the same dimensional space, transforming diverse sets of features onto the same scale, to extract the most variant features from each dataset and facilitate biological interpretation and pathway analysis.
Results: We demonstrate integration of multiple layers of information using MCIA, applied to two typical “omics” research scenarios. The integration of transcriptome and proteome profiles of cells in the NCI-60 cancer cell line panel revealed distinct, complementary features, which together increased the coverage and power of pathway analysis. Our analysis highlighted the importance of the leukemia extravasation signaling pathway in leukemia, which was not highly ranked in the analysis of any individual dataset. Secondly, we compared transcriptome profiles of high-grade serous ovarian tumors that were obtained on two different microarray platforms and by next-generation RNA-sequencing, to identify the most informative platform and extract robust biomarkers of molecular subtypes. We discovered that RNA-sequencing data processed using RPKM had greater variance than data processed with MapSplice and RSEM. We provided novel markers highly associated with tumor molecular subtype, combined from four data platforms. MCIA is implemented and available in the R/Bioconductor “omicade4” package.
Conclusion: We believe MCIA is an attractive method for data integration and visualization of several datasets of multi-omics features observed on the same set of individuals. The method is not dependent on feature annotation, and thus it can extract important features even when they are not present across all datasets. MCIA provides simple graphical representations for the identification of relationships between large datasets.
Posterior Contraction Rates of the Phylogenetic Indian Buffet Processes
By expressing prior distributions as general stochastic processes,
nonparametric Bayesian methods provide a flexible way to incorporate prior
knowledge and constrain the latent structure in statistical inference. The
Indian buffet process (IBP) is such an example that can be used to define a
prior distribution on infinite binary features, where the exchangeability among
subjects is assumed. The phylogenetic Indian buffet process (pIBP), a
derivative of IBP, enables the modeling of non-exchangeability among subjects
through a stochastic process on a rooted tree, which is similar to that used in
phylogenetics, to describe relationships among the subjects. In this paper, we
study the theoretical properties of IBP and pIBP under a binary factor model.
We establish the posterior contraction rates for both IBP and pIBP and
substantiate the theoretical results through simulation studies. This is the
first work addressing the frequentist property of the posterior behaviors of
IBP and pIBP. We also demonstrate its practical usefulness by applying the pIBP
prior to a real data example arising in the field of cancer genomics where the
exchangeability among subjects is violated.
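To make the notion of a prior on binary features concrete, here is a minimal sketch of drawing a feature matrix from the standard (exchangeable) IBP using its usual sequential "buffet" construction. It is an illustration only and does not implement the pIBP or the binary factor model studied in the paper; the function name `sample_ibp` and the concentration parameter name `alpha` are assumptions made for this example.

```python
import numpy as np

def sample_ibp(num_subjects, alpha, rng=None):
    """Draw a binary subject-by-feature matrix from the standard IBP prior."""
    rng = np.random.default_rng() if rng is None else rng
    columns = []  # one 0/1 list per feature, indexed by subject
    for i in range(1, num_subjects + 1):
        # Existing features: subject i takes feature k with probability m_k / i,
        # where m_k counts the earlier subjects that already have feature k.
        for col in columns:
            m_k = sum(col)
            col.append(int(rng.random() < m_k / i))
        # New features: subject i introduces Poisson(alpha / i) of them.
        for _ in range(rng.poisson(alpha / i)):
            columns.append([0] * (i - 1) + [1])
    if not columns:
        return np.zeros((num_subjects, 0), dtype=int)
    return np.array(columns, dtype=int).T
```

Each row of the returned matrix is a subject and each column a latent feature; the number of columns is random. Because every subject is treated symmetrically up to ordering, this sampler reflects the exchangeability assumption that the pIBP relaxes.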
Direct Estimation of Differences in Causal Graphs
We consider the problem of estimating the differences between two causal
directed acyclic graph (DAG) models with a shared topological order given
i.i.d. samples from each model. This is of interest, for example, in genomics,
where changes in the structure or edge weights of the underlying causal graphs
reflect alterations in the gene regulatory networks. We here provide the first
provably consistent method for directly estimating the differences in a pair of
causal DAGs without separately learning two possibly large and dense DAG models
and computing their difference. Our two-step algorithm first uses invariance
tests between regression coefficients of the two data sets to estimate the
skeleton of the difference graph and then orients some of the edges using
invariance tests between regression residual variances. We demonstrate the
properties of our method through a simulation study and apply it to the
analysis of gene expression data from ovarian cancer and during T-cell
activation.
A Distance-Based Test of Association Between Paired Heterogeneous Genomic Data
Due to rapid technological advances, a wide range of different measurements
can be obtained from a given biological sample including single nucleotide
polymorphisms, copy number variation, gene expression levels, DNA methylation
and proteomic profiles. Each of these distinct measurements provides the means
to characterize a certain aspect of biological diversity, and a fundamental
problem of broad interest concerns the discovery of shared patterns of
variation across different data types. Such data types are heterogeneous in the
sense that they represent measurements taken at very different scales or
described by very different data structures. We propose a distance-based
statistical test, the generalized RV (GRV) test, to assess whether there is a
common and non-random pattern of variability between paired biological
measurements obtained from the same random sample. The measurements enter the
test through distance measures which can be chosen to capture particular
aspects of the data. An approximate null distribution is proposed to compute
p-values in closed-form and without the need to perform costly Monte Carlo
permutation procedures. Compared to the classical Mantel test for association
between distance matrices, the GRV test has been found to be more powerful in a
number of simulation settings. We also report on an application of the GRV test
to detect biological pathways in which genetic variability is associated with
variation in gene expression levels in ovarian cancer samples, and present
results obtained from two independent cohorts.
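For orientation, the classical Mantel test that the abstract uses as a baseline can be sketched as follows. This is the permutation-based comparison, not the GRV test or its closed-form null approximation; the function name `mantel_test`, the one-sided alternative, and the default number of permutations are assumptions made for this example.

```python
import numpy as np

def mantel_test(dist_a, dist_b, num_permutations=999, rng=None):
    """One-sided Mantel permutation test between two square distance matrices."""
    rng = np.random.default_rng() if rng is None else rng
    dist_a = np.asarray(dist_a, dtype=float)
    dist_b = np.asarray(dist_b, dtype=float)
    n = dist_a.shape[0]
    upper = np.triu_indices(n, k=1)  # each pair of samples counted once
    observed = np.corrcoef(dist_a[upper], dist_b[upper])[0, 1]
    exceed = 0
    for _ in range(num_permutations):
        # Relabel the samples of one matrix: rows and columns move together.
        perm = rng.permutation(n)
        permuted = dist_a[np.ix_(perm, perm)]
        stat = np.corrcoef(permuted[upper], dist_b[upper])[0, 1]
        if stat >= observed:
            exceed += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    p_value = (exceed + 1) / (num_permutations + 1)
    return observed, p_value
```

The loop over permutations is the Monte Carlo cost that a closed-form null distribution, as described in the abstract, is meant to avoid.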
Sparse integrative clustering of multiple omics data sets
High resolution microarrays and second-generation sequencing platforms are
powerful tools to investigate genome-wide alterations in DNA copy number,
methylation and gene expression associated with a disease. An integrated
genomic profiling approach measures multiple omics data types simultaneously in
the same set of biological samples. Such an approach renders an integrated data
resolution that would not be available with any single data type. In this
study, we use penalized latent variable regression methods for joint modeling
of multiple omics data types to identify common latent variables that can be
used to cluster patient samples into biologically and clinically relevant
disease subtypes. We consider lasso [J. Roy. Statist. Soc. Ser. B 58 (1996)
267-288], elastic net [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005)
301-320] and fused lasso [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005)
91-108] methods to induce sparsity in the coefficient vectors, revealing
important genomic features that have significant contributions to the latent
variables. An iterative ridge regression is used to compute the sparse
coefficient vectors. In model selection, a uniform design [Monographs on
Statistics and Applied Probability (1994) Chapman & Hall] is used to seek
"experimental" points that are scattered uniformly across the search domain for
efficient sampling of tuning parameter combinations. We compared our method to
sparse singular value decomposition (SVD) and penalized Gaussian mixture model
(GMM) using both real and simulated data sets. The proposed method is applied
to integrate genomic, epigenomic and transcriptomic data for subtype analysis
in breast and lung cancer data sets. Comment: Published at http://dx.doi.org/10.1214/12-AOAS578 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
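As an illustration of how a lasso-type coefficient vector can be computed by iterative ridge regression, the sketch below uses a local quadratic approximation of the L1 penalty in a plain linear regression. It is an assumption about the general technique, not the paper's latent variable model or its exact algorithm; the function name `lasso_by_iterative_ridge` and the parameters `lam`, `num_iters`, and `eps` are hypothetical.

```python
import numpy as np

def lasso_by_iterative_ridge(X, y, lam, num_iters=200, eps=1e-8):
    """Approximately minimize 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    gram = X.T @ X
    xty = X.T @ y
    # Start from an ordinary ridge fit.
    beta = np.linalg.solve(gram + lam * np.eye(X.shape[1]), xty)
    for _ in range(num_iters):
        # Local quadratic approximation: |b_j| ~ b_j^2 / (2 |b_j_old|) + const,
        # so the L1 penalty becomes a ridge penalty with per-coefficient weights.
        weights = lam / (np.abs(beta) + eps)
        beta = np.linalg.solve(gram + np.diag(weights), xty)
    return beta
```

Coefficients that the lasso would set to zero are driven toward zero by ever larger ridge weights but are not exactly zero in floating point, so in practice a small threshold is typically applied afterwards.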