1,440 research outputs found
An integrative analysis of cancer gene expression studies using Bayesian latent factor modeling
We present an applied study in cancer genomics for integrating data and
inferences from laboratory experiments on cancer cell lines with observational
data obtained from human breast cancer studies. The biological focus is on
improving understanding of transcriptional responses of tumors to changes in
the pH level of the cellular microenvironment. The statistical focus is on
connecting experimentally defined biomarkers of such responses to clinical
outcome in observational studies of breast cancer patients. Our analysis
exemplifies a general strategy for accomplishing this kind of integration
across contexts. The statistical methodologies employed here draw heavily on
Bayesian sparse factor models for identifying, modularizing and correlating
with clinical outcome these signatures of aggregate changes in gene expression.
By projecting patterns of biological response linked to specific experimental
interventions into observational studies where such responses may be evidenced
via variation in gene expression across samples, we are able to define
biomarkers of clinically relevant physiological states and outcomes that are
rooted in the biology of the original experiment. Through this approach we
identify microenvironment-related prognostic factors capable of predicting long
term survival in two independent breast cancer datasets. These results suggest
possible directions for future laboratory studies, as well as indicate the
potential for therapeutic advances through targeted disruption of specific
pathway components. Comment: Published at http://dx.doi.org/10.1214/09-AOAS261 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
Quantitating the epigenetic transformation contributing to cholesterol homeostasis using Gaussian process.
To understand the impact of epigenetics on human misfolding disease, we apply Gaussian-process regression (GPR) based machine learning (ML) (GPR-ML) through variation spatial profiling (VSP). VSP generates population-based matrices describing the spatial covariance (SCV) relationships that link genetic diversity to fitness of the individual in response to histone deacetylase inhibitors (HDACi). Niemann-Pick C1 (NPC1) is a Mendelian disorder caused by >300 variants in the NPC1 gene that disrupt cholesterol homeostasis leading to the rapid onset and progression of neurodegenerative disease. We determine the sequence-to-function-to-structure relationships of the NPC1 polypeptide fold required for membrane trafficking and generation of a tunnel that mediates cholesterol flux in late endosomal/lysosomal (LE/Ly) compartments. HDACi treatment reveals unanticipated epigenomic plasticity in SCV relationships that restore NPC1 functionality. GPR-ML based matrices capture the epigenetic processes impacting information flow through the central dogma, providing a framework for quantifying the effect of the environment on the healthspan of the individual.
Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology.
High costs and technical limitations of cell sorting and single-cell techniques currently restrict the collection of large-scale, cell-type-specific DNA methylation data. This, in turn, impedes our ability to tackle key biological questions that pertain to variation within a population, such as identification of disease-associated genes at a cell-type-specific resolution. Here, we show mathematically and empirically that cell-type-specific methylation levels of an individual can be learned from its tissue-level bulk data, conceptually emulating the case where the individual has been profiled with a single-cell resolution and then signals were aggregated in each cell population separately. Provided with this unprecedented way to perform powerful large-scale epigenetic studies with cell-type-specific resolution, we revisit previous studies with tissue-level bulk methylation and reveal novel associations with leukocyte composition in blood and with rheumatoid arthritis. For the latter, we further show consistency with validation data collected from sorted leukocyte sub-types
A Selective Review of Group Selection in High-Dimensional Models
Grouping structures arise naturally in many statistical modeling problems.
Several methods have been proposed for variable selection that respect grouping
structure in variables. Examples include the group LASSO and several concave
group selection methods. In this article, we give a selective review of group
selection concerning methodological developments, theoretical properties and
computational algorithms. We pay particular attention to group selection
methods involving concave penalties. We address both group selection and
bi-level selection methods. We describe several applications of these methods
in nonparametric additive models, semiparametric regression, seemingly
unrelated regressions, genomic data analysis and genome wide association
studies. We also highlight some issues that require further study. Comment: Published at http://dx.doi.org/10.1214/12-STS392 in
Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
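As a rough illustration of how a group LASSO penalty selects whole groups of variables, the sketch below implements group-wise soft-thresholding, the proximal step commonly used inside group LASSO solvers. It is a minimal sketch rather than code from the review: the function name `group_soft_threshold`, the square-root group-size weights, and the toy coefficients are all assumptions made for illustration.

```python
import math

def group_soft_threshold(beta, groups, lam):
    """Shrink each group of coefficients toward zero as a block.

    beta   : list of floats (coefficient vector)
    groups : list of index lists, one per non-overlapping group
    lam    : non-negative penalty level
    Assumed penalty: lam * sum_g sqrt(size_g) * ||beta_g||_2.
    """
    out = list(beta)
    for idx in groups:
        norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        thresh = lam * math.sqrt(len(idx))
        # A group whose norm is below the threshold is removed entirely;
        # otherwise every member is scaled by the same factor.
        scale = 0.0 if norm <= thresh else 1.0 - thresh / norm
        for j in idx:
            out[j] = scale * beta[j]
    return out

# Toy example (hypothetical values): one strong group, one weak group.
beta = [3.0, 4.0, 0.1, -0.1]
groups = [[0, 1], [2, 3]]
shrunk = group_soft_threshold(beta, groups, lam=1.0)
print(shrunk)
```

In this toy run the weak group is set to zero as a unit while the strong group is shrunk but kept, which is the group-level selection behavior the abstract refers to. Bi-level and concave-penalty variants would replace this thresholding rule with a different one.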
Identification of genes associated with multiple cancers via integrative analysis
Background: Advancement in gene profiling techniques makes it possible to measure expressions of thousands of genes and identify genes associated with development and progression of cancer. The identified cancer-associated genes can be used for diagnosis, prognosis prediction, and treatment selection. Most existing cancer microarray studies have focused on the identification of genes associated with a specific type of cancer. Recent biomedical studies suggest that different cancers may share common susceptibility genes. A comprehensive description of the associations between genes and cancers requires identification of not only multiple genes associated with a specific type of cancer but also genes associated with multiple cancers.
Results: In this article, we propose the Mc.TGD (Multi-cancer Threshold Gradient Descent), an integrative analysis approach capable of analyzing multiple microarray studies on different cancers. The Mc.TGD is the first regularized approach to conduct "two-dimensional" selection of genes with joint effects on cancer development. Simulation studies show that the Mc.TGD can more accurately identify genes associated with multiple cancers than meta-analysis based on "one-dimensional" methods. As a byproduct, identification accuracy of genes associated with only one type of cancer may also be improved. We use the Mc.TGD to analyze seven microarray studies investigating development of seven different types of cancers. We identify one gene associated with six types of cancers and four genes associated with five types of cancers. In addition, we also identify 11, 9, 18, and 17 genes associated with 4 to 1 types of cancers, respectively. We evaluate prediction performance using a Leave-One-Out cross validation approach and find that only 4 (out of 570) subjects cannot be properly predicted.
Conclusion: The Mc.TGD can identify a short list of genes associated with one or multiple types of cancers. The identified genes are considerably different from those identified using meta-analysis or analysis of marginal effects.
Testing significance of features by lassoed principal components
We consider the problem of testing the significance of features in
high-dimensional settings. In particular, we test for differentially-expressed
genes in a microarray experiment. We wish to identify genes that are associated
with some type of outcome, such as survival time or cancer type. We propose a
new procedure, called Lassoed Principal Components (LPC), that builds upon
existing methods and can provide a sizable improvement. For instance, in the
case of two-class data, a standard (albeit simple) approach might be to compute
a two-sample t-statistic for each gene. The LPC method involves projecting
these conventional gene scores onto the eigenvectors of the gene expression
data covariance matrix and then applying an L1 penalty in order to de-noise
the resulting projections. We present a theoretical framework under which LPC
is the logical choice for identifying significant genes, and we show that LPC
can provide a marked reduction in false discovery rates over the conventional
methods on both real and simulated data. Moreover, this flexible procedure can
be applied to a variety of types of data and can be used to improve many
existing methods for the identification of significant features. Comment: Published at http://dx.doi.org/10.1214/08-AOAS182 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
DLMM as a lossless one-shot algorithm for collaborative multi-site distributed linear mixed models
Linear mixed models are commonly used in healthcare-based association analyses for analyzing multi-site data with heterogeneous site-specific random effects. Due to regulations for protecting patients' privacy, sensitive individual patient data (IPD) typically cannot be shared across sites. We propose an algorithm for fitting distributed linear mixed models (DLMMs) without sharing IPD across sites. This algorithm achieves results identical to those achieved using pooled IPD from multiple sites (i.e., the same effect size and standard error estimates), hence demonstrating the lossless property. The algorithm requires each site to contribute minimal aggregated data in only one round of communication. We demonstrate the lossless property of the proposed DLMM algorithm by investigating the associations between demographic and clinical characteristics and length of hospital stay in COVID-19 patients using administrative claims from the UnitedHealth Group Clinical Discovery Database. We extend this association study by incorporating 120,609 COVID-19 patients from 11 collaborative data sources worldwide.
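The idea that one round of aggregated summaries can reproduce a pooled fit exactly can be seen in a much simpler setting than the one in the abstract. The sketch below is not the DLMM algorithm: it uses ordinary least squares with a single covariate and no random effects, and covers point estimates only. The helper names `site_summary` and `fit_from_summaries` and the toy data are hypothetical.

```python
def site_summary(x, y):
    """Aggregates a site would share; no individual rows leave the site."""
    return {
        "n": len(x),
        "sx": sum(x),
        "sy": sum(y),
        "sxx": sum(a * a for a in x),
        "sxy": sum(a * b for a, b in zip(x, y)),
    }

def fit_from_summaries(summaries):
    """Fit y = b0 + b1 * x from summed aggregates via the 2x2 normal equations."""
    n = sum(s["n"] for s in summaries)
    sx = sum(s["sx"] for s in summaries)
    sy = sum(s["sy"] for s in summaries)
    sxx = sum(s["sxx"] for s in summaries)
    sxy = sum(s["sxy"] for s in summaries)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b0 = (sy - b1 * sx) / n
    return b0, b1

# Toy data for three hypothetical sites (made-up values).
sites = [
    ([1.0, 2.0, 3.0], [2.1, 3.9, 6.2]),
    ([4.0, 5.0], [8.1, 9.8]),
    ([6.0, 7.0, 8.0, 9.0], [12.2, 13.9, 16.1, 18.0]),
]

# One round of communication: each site sends its summary once.
distributed = fit_from_summaries([site_summary(x, y) for x, y in sites])

# Reference fit on pooled individual-level data, for comparison only.
pooled_x = [v for x, _ in sites for v in x]
pooled_y = [v for _, y in sites for v in y]
pooled = fit_from_summaries([site_summary(pooled_x, pooled_y)])

print(distributed, pooled)
```

The two fits agree up to floating-point rounding because the normal equations depend on the data only through sums, and sums over sites add up to sums over the pooled data. The mixed-model case described in the abstract needs additional per-site aggregates to handle site-specific random effects.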
Graph Kernels
We present a unified framework to study graph kernels, special cases of which include the random
walk (Gärtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004;
Mahé et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time
complexity of kernel computation between unlabeled graphs with n vertices from O(n^6) to O(n^3).
We find a spectral decomposition approach even more efficient when computing entire kernel matrices.
For labeled graphs we develop conjugate gradient and fixed-point methods that take O(dn^3)
time per iteration, where d is the size of the label set. By extending the necessary linear algebra to
Reproducing Kernel Hilbert Spaces (RKHS) we obtain the same result for d-dimensional edge kernels,
and O(n^4) in the infinite-dimensional case; on sparse graphs these algorithms only take O(n^2)
time per iteration in all cases. Experiments on graphs from bioinformatics and other application
domains show that these techniques can speed up computation of the kernel by an order of magnitude
or more. We also show that certain rational kernels (Cortes et al., 2002, 2003, 2004) when
specialized to graphs reduce to our random walk graph kernel. Finally, we relate our framework to
R-convolution kernels (Haussler, 1999) and provide a kernel that is close to the optimal assignment
kernel of Fröhlich et al. (2006) yet provably positive semi-definite.
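To make the cost argument concrete, the sketch below computes a geometric random walk kernel between two unlabeled graphs by fixed-point iteration without ever building the product graph. Each iteration uses two small matrix products (cubic in the number of vertices) in place of a product with the Kronecker matrix. This is a minimal sketch under stated assumptions, not the paper's implementation: uniform start and stop weights, unnormalized adjacency matrices, and the function name `random_walk_kernel` are all choices made here, and the iteration only converges when `lam` is smaller than the reciprocal of the product graph's spectral radius.

```python
def matmul(a, b):
    """Plain-Python dense matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def random_walk_kernel(a1, a2, lam, tol=1e-12, max_iter=1000):
    """Geometric random walk kernel via the fixed point X = P + lam * A1 X A2^T.

    X[i][j] is the entry for the vertex pair (i in graph 1, j in graph 2),
    so A1 X A2^T applies the product-graph adjacency without forming it.
    """
    n1, n2 = len(a1), len(a2)
    p = 1.0 / (n1 * n2)  # assumed uniform start and stop weights
    start = [[p] * n2 for _ in range(n1)]
    x = [row[:] for row in start]
    a2_t = transpose(a2)
    for _ in range(max_iter):
        walk = matmul(matmul(a1, x), a2_t)
        new_x = [[start[i][j] + lam * walk[i][j] for j in range(n2)]
                 for i in range(n1)]
        delta = max(abs(new_x[i][j] - x[i][j])
                    for i in range(n1) for j in range(n2))
        x = new_x
        if delta < tol:
            break
    return sum(p * v for row in x for v in row)

edge = [[0, 1], [1, 0]]                    # two vertices joined by an edge
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # three-vertex path

k_edge_edge = random_walk_kernel(edge, edge, lam=0.1)
k_edge_path = random_walk_kernel(edge, path, lam=0.1)
k_path_edge = random_walk_kernel(path, edge, lam=0.1)
print(k_edge_edge, k_edge_path, k_path_edge)
```

For two single-edge graphs the fixed point can be worked out by hand: every entry of X equals 0.25 / (1 - lam), so the kernel value is also 0.25 / (1 - lam), which gives a simple check on the iteration.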
Simulating multiple faceted variability in single cell RNA sequencing.
The abundance of new computational methods for processing and interpreting transcriptomes at a single cell level raises the need for in silico platforms for evaluation and validation. Here, we present SymSim, a simulator that explicitly models the processes that give rise to data observed in single cell RNA-Seq experiments. The components of the SymSim pipeline pertain to the three primary sources of variation in single cell RNA-Seq data: noise intrinsic to the process of transcription, extrinsic variation indicative of different cell states (both discrete and continuous), and technical variation due to low sensitivity and measurement noise and bias. We demonstrate how SymSim can be used for benchmarking methods for clustering, differential expression and trajectory inference, and for examining the effects of various parameters on their performance. We also show how SymSim can be used to evaluate the number of cells required to detect a rare population under various scenarios