241 research outputs found
Data and Statistical Methods To Analyze the Human Microbiome
The Waldron lab for computational biostatistics bridges the areas of cancer genomics and microbiome studies for public health, developing methods to exploit publicly available data resources and to integrate -omics studies
BAYESIAN NONPARAMETRIC CROSS-STUDY VALIDATION OF PREDICTION METHODS
We consider comparisons of statistical learning algorithms using multiple data sets, via leave-one-in cross-study validation: each of the algorithms is trained on one data set; the resulting model is then validated on each remaining data set. This poses two statistical challenges that need to be addressed simultaneously. The first is the assessment of study heterogeneity, with the aim of identifying a subset of studies within which algorithm comparisons can be reliably carried out. The second is the comparison of algorithms using the ensemble of data sets. We address both problems by integrating clustering and model comparison. We formulate a Bayesian model for the array of cross-study validation statistics, which defines clusters of studies with similar properties and provides the basis for meaningful algorithm comparison in the presence of study heterogeneity. We illustrate our approach through simulations involving studies with varying severity of systematic errors, and in the context of medical prognosis for patients diagnosed with cancer, using high-throughput measurements of the transcriptional activity of the tumor's genes
Report on emerging technologies for translational bioinformatics: a symposium on gene expression profiling for archival tissues
Background: With over 20 million formalin-fixed, paraffin-embedded (FFPE) tissue samples archived each year in the United States alone, archival tissues remain a vast and under-utilized resource in the genomic study of cancer. Technologies have recently been introduced for whole-transcriptome amplification and microarray analysis of degraded mRNA fragments from FFPE samples, and studies of these platforms have only recently begun to enter the published literature
Lineage-specific interface proteins match up the cell cycle and differentiation in embryo stem cells.
The shortage of molecular information on cell cycle changes along embryonic stem cell (ESC) differentiation prompts an in silico approach, which may provide a novel way to identify candidate genes or mechanisms acting in coordinating the two programs. We analyzed germ layer specific gene expression changes during the cell cycle and ESC differentiation by combining four human cell cycle transcriptome profiles with thirteen in vitro human ESC differentiation studies. To detect cross-talk mechanisms we then integrated the transcriptome data that displayed differential regulation with protein interaction data. A new class of non-transcriptionally regulated genes was identified, encoding proteins which interact systematically with proteins corresponding to genes regulated during the cell cycle or cell differentiation, and which therefore can be seen as interface proteins coordinating the two programs. Functional analysis gathered insights in fate-specific candidates of interface functionalities. The non-transcriptionally regulated interface proteins were found to be highly regulated by post-translational ubiquitylation modification, which may synchronize the transition between cell proliferation and differentiation in ESCs
Assessment of statistical methods from single cell, bulk RNA-seq, and metagenomics applied to microbiome data
Background: The correct identification of differentially abundant microbial taxa between experimental conditions is a methodological and computational challenge. Recent work has produced methods to deal with the high sparsity and compositionality characteristic of microbiome data, but independent benchmarks comparing these to alternatives developed for RNA-seq data analysis are lacking. Results: We compare methods developed for single-cell and bulk RNA-seq, and specifically for microbiome data, in terms of suitability of distributional assumptions, ability to control false discoveries, concordance, power, and correct identification of differentially abundant genera. We benchmark these methods using 100 manually curated datasets from 16S and whole metagenome shotgun sequencing. Conclusions: The multivariate and compositional methods developed specifically for microbiome analysis did not outperform univariate methods developed for differential expression analysis of RNA-seq data. We recommend a careful exploratory data analysis prior to application of any inferential model and we present a framework to help scientists make an informed choice of analysis methods in a dataset-specific manner
Inferring random change point from left-censored longitudinal data by segmented mechanistic nonlinear models, with application in HIV surveillance study
The primary goal of public health efforts to control HIV epidemics is to diagnose and treat people with HIV infection as soon as possible after seroconversion. The timing of initiation of antiretroviral therapy (ART) treatment after HIV diagnosis is, therefore, a critical population-level indicator that can be used to measure the effectiveness of public health programs and policies at local and national levels. However, population-based data on ART initiation are unavailable because ART initiation and prescription are typically measured indirectly by public health departments (e.g., with viral suppression as a proxy). In this paper, we present a random change-point model to infer the time of ART initiation utilizing routinely reported individual-level HIV viral load from an HIV surveillance system. To deal with the left-censoring and the nonlinear trajectory of viral load data, we formulate a flexible segmented nonlinear mixed effects model and propose a Stochastic version of EM (StEM) algorithm, coupled with a Gibbs sampler for the inference. We apply the method to a random subset of HIV surveillance data to infer the timing of ART initiation since diagnosis and to gain additional insights into the viral load dynamics. Simulation studies are also performed to evaluate the properties of the proposed method
Metagenomic microbial community profiling using unique clade-specific marker genes
Metagenomic shotgun sequencing data can identify microbes populating a microbial community and their proportions, but existing taxonomic profiling methods are inefficient for increasingly large datasets. We present an approach that uses clade-specific marker genes to unambiguously assign reads to microbial clades more accurately and >50× faster than current approaches. We validated MetaPhlAn on terabases of short reads and provide the largest metagenomic profiling to date of the human gu
Cross-study validation for the assessment of prediction algorithms
Motivation: Numerous competing algorithms for prediction in high-dimensional settings have been developed in the statistical and machine-learning literature. Learning algorithms and the prediction models they generate are typically evaluated on the basis of cross-validation error estimates in a few exemplary datasets. However, in most applications, the ultimate goal of prediction modeling is to provide accurate predictions for independent samples obtained in different settings. Cross-validation within exemplary datasets may not adequately reflect performance in the broader application context. Methods: We develop and implement a systematic approach to "cross-study validation", to replace or supplement conventional cross-validation when evaluating high-dimensional prediction models in independent datasets. We illustrate it via simulations and in a collection of eight estrogen-receptor positive breast cancer microarray gene-expression datasets, where the objective is predicting distant metastasis-free survival (DMFS). We computed the C-index for all pairwise combinations of training and validation datasets. We evaluate several alternatives for summarizing the pairwise validation statistics, and compare these to conventional cross-validation. Results: Our data-driven simulations and our application to survival prediction with eight breast cancer microarray datasets suggest that standard cross-validation produces inflated discrimination accuracy for all algorithms considered, when compared to cross-study validation. Furthermore, the ranking of learning algorithms differs, suggesting that algorithms performing best in cross-validation may be suboptimal when evaluated through independent validation. Availability: The survHD: Survival in High Dimensions package (http://www.bitbucket.org/lwaldron/survhd) will be made available through Bioconductor. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online
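The pairwise design described in this abstract (train on one study, validate on every other study, record the C-index) can be sketched as below. This is only an illustration under stated assumptions, not the survHD package or the authors' implementation: the `fit_direction` learner, the helper names, and the three toy studies are hypothetical, and the C-index is a plain Harrell-style estimate that counts tied scores as one half.

```python
def c_index(times, events, scores):
    """Harrell-style C-index: share of comparable pairs ranked correctly.

    A pair (i, j) is comparable when subject i has an observed event
    (events[i] is truthy) and times[i] < times[j]. Higher score = higher risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


def fit_direction(study):
    """Hypothetical toy learner: keep or flip the sign of a single covariate,
    depending on which direction ranks the training study better."""
    c = c_index(study["time"], study["event"], study["x"])
    sign = 1.0 if c >= 0.5 else -1.0
    return lambda x: sign * x


def cross_study_matrix(studies, fit):
    """Return {(train, test): C-index} for every ordered pair of distinct studies."""
    z = {}
    for train_name, train in studies.items():
        model = fit(train)
        for test_name, test in studies.items():
            if train_name == test_name:
                continue  # the diagonal would need within-study cross-validation
            scores = [model(x) for x in test["x"]]
            z[(train_name, test_name)] = c_index(test["time"], test["event"], scores)
    return z


# Toy data (made up for illustration): one covariate, survival time, event flag.
studies = {
    "A": {"x": [3.0, 2.0, 1.0], "time": [1, 2, 3], "event": [1, 1, 1]},
    "B": {"x": [0.5, 1.5, 2.5, 3.5], "time": [8, 6, 4, 2], "event": [1, 0, 1, 1]},
    "C": {"x": [1.0, 2.0, 3.0, 4.0], "time": [1, 4, 2, 3], "event": [1, 1, 1, 1]},
}

z = cross_study_matrix(studies, fit_direction)
for (train_name, test_name), value in sorted(z.items()):
    print(f"train={train_name} test={test_name} C={value:.3f}")
```

The off-diagonal entries are the cross-study validation statistics; how best to summarize them (for example by row means or medians) is the question the abstract says the authors evaluate, so no particular summary is assumed here.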
Metagenomic biomarker discovery and explanation
This study describes and validates a new method for metagenomic biomarker discovery by way of class comparison, tests of biological consistency and effect size estimation. This addresses the challenge of finding organisms, genes, or pathways that consistently explain the differences between two or more microbial communities, which is a central problem to the study of metagenomics. We extensively validate our method on several microbiomes and a convenient online interface for the method is provided at http://huttenhower.sph.harvard.edu/lefse/. National Institute of Dental and Craniofacial Research (U.S.) (grant DE017106); National Institutes of Health (U.S.) (NIH grant AI078942); Burroughs Wellcome Fund; National Institutes of Health (U.S.) (NIH 1R01HG005969)