The tspair package for finding top scoring pair classifiers in R
Summary: Top scoring pairs (TSPs) are pairs of genes whose relative rankings can be used to accurately classify individuals into one of two classes. TSPs have two main advantages over many standard classifiers used in gene expression studies: (i) a TSP is based on only two genes, which leads to easily interpretable and inexpensive diagnostic tests, and (ii) TSP classifiers are based on gene rankings, so they are more robust to variation in technical factors or normalization than classifiers based on expression levels of individual genes. Here I describe the R package, tspair, which can be used to quickly identify and assess TSP classifiers for gene expression data.
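The ranking idea behind a TSP can be illustrated with a minimal sketch. It is written in plain Python rather than R and does not use or reproduce the tspair API. The function names `fit_tsp` and `predict_tsp`, the toy data, and the exhaustive pair search are assumptions made for illustration. The score used here is the absolute difference, between the two classes, in the fraction of samples where one gene is expressed below the other.

```python
from itertools import combinations


def fit_tsp(expr, labels):
    """Pick the gene pair whose relative ordering best separates two classes.

    expr   : dict mapping gene name -> list of expression values (one per sample)
    labels : list of 0/1 class labels, one per sample
    (Hypothetical helper for illustration; not the tspair interface.)
    """
    idx0 = [k for k, y in enumerate(labels) if y == 0]
    idx1 = [k for k, y in enumerate(labels) if y == 1]
    best = None
    for a, b in combinations(sorted(expr), 2):
        # Fraction of samples in each class where gene a is below gene b.
        p0 = sum(expr[a][k] < expr[b][k] for k in idx0) / len(idx0)
        p1 = sum(expr[a][k] < expr[b][k] for k in idx1) / len(idx1)
        score = abs(p0 - p1)
        if best is None or score > best["score"]:
            best = {
                "genes": (a, b),
                "score": score,
                # Class predicted when gene a is observed below gene b.
                "class_if_a_below_b": 0 if p0 > p1 else 1,
            }
    return best


def predict_tsp(model, sample):
    """Classify one sample (dict gene -> value) using only the chosen pair."""
    a, b = model["genes"]
    below = model["class_if_a_below_b"]
    return below if sample[a] < sample[b] else 1 - below


# Toy data: three genes, six samples, first three in class 0.
toy_expr = {
    "g1": [1, 2, 1, 5, 6, 7],
    "g2": [3, 4, 5, 2, 1, 3],
    "g3": [2, 2, 2, 2, 2, 2],
}
toy_labels = [0, 0, 0, 1, 1, 1]
toy_model = fit_tsp(toy_expr, toy_labels)
```

Because prediction compares only the two values within a single sample, any monotone rescaling of that sample leaves the call unchanged, which is the robustness property the summary refers to.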
Removing batch effects for prediction problems with frozen surrogate variable analysis
Batch effects are responsible for the failure of promising genomic prognostic
signatures, major ambiguities in published genomic results, and retractions
of widely-publicized findings. Batch effect corrections have been developed to
remove these artifacts, but they are designed to be used in population
studies. Genomic technologies, however, are beginning to be used in clinical
applications where samples are analyzed one at a time for diagnostic,
prognostic, and predictive applications. There are currently no batch
correction methods that have been developed specifically for prediction. In
this paper, we propose a new method called frozen surrogate variable analysis
(fSVA) that borrows strength from a training set for individual sample batch
correction. We show that fSVA improves prediction accuracy in simulations and
in public genomic studies. fSVA is available as part of the sva Bioconductor
package.
Gene set bagging for estimating replicability of gene set analyses
Background: Significance analysis plays a major role in identifying and
ranking genes, transcription factor binding sites, DNA methylation regions, and
other high-throughput features for association with disease. We propose a new
approach, called gene set bagging, for measuring the stability of ranking
procedures using predefined gene sets. Gene set bagging involves resampling the
original high-throughput data, performing gene-set analysis on the resampled
data, and confirming that biological categories replicate. This procedure can
be thought of as bootstrapping gene-set analysis and can be used to determine
which are the most reproducible gene sets. Results: Here we apply this approach
to two common genomics applications: gene expression and DNA methylation. Even
with state-of-the-art statistical ranking procedures, significant categories in
a gene set enrichment analysis may be unstable when subjected to resampling.
Conclusions: We demonstrate that gene lists are not necessarily stable, and
therefore additional steps like gene set bagging can improve biological
inference of gene set analysis. Comment: 3 figures.
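The resampling loop described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. The stand-in gene-set test in `significant_sets` (a thresholded mean difference between groups), the within-group resampling, and all names and parameters are assumptions chosen to keep the example self-contained.

```python
import random
from statistics import mean


def significant_sets(data, groups, gene_sets, threshold):
    """Stand-in gene-set analysis (an assumption, not the published method):
    a set is called significant when the between-group difference in mean
    expression, averaged over the set's genes, exceeds `threshold`."""
    idx0 = [k for k, g in enumerate(groups) if g == 0]
    idx1 = [k for k, g in enumerate(groups) if g == 1]
    called = set()
    for name, genes in gene_sets.items():
        diffs = [
            mean(data[g][k] for k in idx1) - mean(data[g][k] for k in idx0)
            for g in genes
        ]
        if abs(mean(diffs)) > threshold:
            called.add(name)
    return called


def gene_set_bagging(data, groups, gene_sets, threshold, n_boot=200, seed=0):
    """Return, for each gene set, the fraction of bootstrap resamples in
    which the set is called significant (a replication proportion)."""
    rng = random.Random(seed)
    idx0 = [k for k, g in enumerate(groups) if g == 0]
    idx1 = [k for k, g in enumerate(groups) if g == 1]
    counts = {name: 0 for name in gene_sets}
    for _ in range(n_boot):
        # Resample samples with replacement within each group (an assumption
        # made here so that neither group can end up empty).
        boot = [rng.choice(idx0) for _ in idx0] + [rng.choice(idx1) for _ in idx1]
        boot_data = {g: [vals[k] for k in boot] for g, vals in data.items()}
        boot_groups = [groups[k] for k in boot]
        for name in significant_sets(boot_data, boot_groups, gene_sets, threshold):
            counts[name] += 1
    return {name: c / n_boot for name, c in counts.items()}
```

Gene sets with a replication proportion near 1 are called significant in almost every resample, while sets that were significant in the original data but have a low proportion are the unstable ones the abstract warns about.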
Cloud-scale RNA-sequencing differential expression analysis with Myrna
As sequencing throughput approaches dozens of gigabases per day, there is a growing need for efficient software for analysis of transcriptome sequencing (RNA-Seq) data. Myrna is a cloud-computing pipeline for calculating differential gene expression in large RNA-Seq datasets. We apply Myrna to the analysis of publicly available data sets and assess the goodness of fit of standard statistical models. Myrna is available from http://bowtie-bio.sf.net/myrna
ReCount: A multi-experiment resource of analysis-ready RNA-seq gene count datasets
Background: RNA sequencing is a flexible and powerful new approach for measuring gene, exon, or isoform expression. To maximize the utility of RNA sequencing data, new statistical methods are needed for clustering, differential expression, and other analyses. A major barrier to the development of new statistical methods is the lack of RNA sequencing datasets that can be easily obtained and analyzed in common statistical software packages such as R. To speed up the development process, we have created a resource of analysis-ready RNA-sequencing datasets. Description: ReCount is an online resource of RNA-seq gene count tables and auxiliary data. Tables were built from raw RNA sequencing data from 18 different published studies comprising 475 samples and over 8 billion reads. Using the Myrna package, reads were aligned, overlapped with gene models, and tabulated into gene-by-sample count tables that are ready for statistical analysis. Count tables and phenotype data were combined into Bioconductor ExpressionSet objects for ease of analysis. ReCount also contains the Myrna manifest files and R source code used to process the samples, allowing statistical and computational scientists to consider alternative parameter values. Conclusions: By combining datasets from many studies and providing data that have already been processed from .fastq format into ready-to-use .RData and .txt files, ReCount facilitates analysis and methods development for RNA-seq count data. We anticipate that ReCount will also be useful for investigators who wish to consider cross-study comparisons and alternative normalization strategies for RNA-seq.
- …