35,016 research outputs found
OMICfpp: a fuzzy approach for paired RNA-Seq counts
BACKGROUND: RNA sequencing is a widely used technology for differential expression analysis. However, the RNA-Seq do not provide accurate absolute measurements and the results can be different for each pipeline used. The major problem in statistical analysis of RNA-Seq and in the omics data in general, is the small sample size with respect to the large number of variables. In addition, experimental design must be taken into account and few tools consider it. RESULTS: We propose OMICfpp, a method for the statistical analysis of RNA-Seq paired design data. First, we obtain a p-value for each case-control pair using a binomial test. These p-values are aggregated using an ordered weighted average (OWA) with a given orness previously chosen. The aggregated p-value from the original data is compared with the aggregated p-value obtained using the same method applied to random pairs. These new pairs are generated using between-pairs and complete randomization distributions. This randomization p-value is used as a raw p-value to test the differential expression of each gene. The OMICfpp method is evaluated using public data sets of 68 sample pairs from patients with colorectal cancer. We validate our results through bibliographic search of the reported genes and using simulated data set. Furthermore, we compared our results with those obtained by the methods edgeR and DESeq2 for paired samples. Finally, we propose new target genes to validate these as gene expression signatures in colorectal cancer. OMICfpp is available at http://www.uv.es/ayala/software/OMICfpp_0.2.tar.gz . CONCLUSIONS: Our study shows that OMICfpp is an accurate method for differential expression analysis in RNA-Seq data with paired design. In addition, we propose the use of randomized p-values pattern graphic as a powerful and robust method to select the target genes for experimental validation
On de novo Bridging Paired-end RNA-seq Data
The high-throughput short-reads RNA-seq protocols often produce paired-end
reads, with the middle portion of the fragments being unsequenced. We explore
if the full-length fragments can be computationally reconstructed from the
sequenced two ends in the absence of the reference genome - a problem here we
refer to as de novo bridging. Solving this problem provides longer, more
informative RNA-seq reads, and benefits downstream RNA-seq analysis such as
transcript assembly, expression quantification, and splicing differential
analysis. However, de novo bridging is a challenging and complicated task owing
to alternative splicing, transcript noises, and sequencing errors. It remains
unclear if the data provides sufficient information for accurate bridging, let
alone efficient algorithms that determine the true bridges. Methods have been
proposed to bridge paired-end reads in the presence of reference genome (called
reference-based bridging), but the algorithms are far away from scaling for de
novo bridging as the underlying compacted de Bruijn graph(cdBG) used in the
latter task often contains millions of vertices and edges. We designed a new
truncated Dijkstra's algorithm for this problem, and proposed a novel algorithm
that reuses the shortest path tree to avoid running the truncated Dijkstra's
algorithm from scratch for all vertices for further speeding up. These
innovative techniques result in scalable algorithms that can bridge all
paired-end reads in a cdBG with millions of vertices. Our experiments showed
that paired-end RNA-seq reads can be accurately bridged to a large extent. The
resulting tool is freely available at
https://github.com/Shao-Group/rnabridge-denovo.Comment: 10 pages, 4 figure
Identification of differentially distributed gene expression and distinct sets of cancer-related genes identified by changes in mean and variability.
There is increasing evidence that changes in the variability or overall distribution of gene expression are important both in normal biology and in diseases, particularly cancer. Genes whose expression differs in variability or distribution without a difference in mean are ignored by traditional differential expression-based analyses. Using a Bayesian hierarchical model that provides tests for both differential variability and differential distribution for bulk RNA-seq data, we report here an investigation into differential variability and distribution in cancer. Analysis of eight paired tumour-normal datasets from The Cancer Genome Atlas confirms that differential variability and distribution analyses are able to identify cancer-related genes. We further demonstrate that differential variability identifies cancer-related genes that are missed by differential expression analysis, and that differential expression and differential variability identify functionally distinct sets of potentially cancer-related genes. These results suggest that differential variability analysis may provide insights into genetic aspects of cancer that would not be revealed by differential expression, and that differential distribution analysis may allow for more comprehensive identification of cancer-related genes than analyses based on changes in mean or variability alone
Effect of method of deduplication on estimation of differential gene expression using RNA-seq
© 2017 Klepikova et al.Background. RNA-seq is a useful tool for analysis of gene expression. However, its robustness is greatly affected by a number of artifacts. One of them is the presence of duplicated reads. Results. To infer the influence of different methods of removal of duplicated reads on estimation of gene expression in cancer genomics, we analyzed paired samples of hepatocellular carcinoma (HCC) and non-tumor liver tissue. Four protocols of data analysis were applied to each sample: processing without deduplication, deduplication using a method implemented in samtools, and deduplication based on one or two molecular indices (MI).Wealso analyzed the influence of sequencing layout (single read or paired end) and read length. We found that deduplication without MI greatly affects estimated expression values; this effect is the most pronounced for highly expressed genes. Conclusion. The use of unique molecular identifiers greatly improves accuracy of RNA-seq analysis, especially for highly expressed genes. We developed a set of scripts that enable handling of MI and their incorporation into RNA-seq analysis pipelines. Deduplication without MI affects results of differential gene expression analysis, producing a high proportion of false negative results. The absence of duplicate read removal is biased towards false positives. In those cases where using MI is not possible, we recommend using paired-end sequencing layout
Evaluating the Impacts of Sequencing Depth on Transcriptome Profiling in Human Adipose
Recent advances in RNA sequencing (RNA-Seq) have enabled the discovery of novel transcriptomic variations that are not possible with traditional microarray-based methods. Tissue and cell specific transcriptome changes during pathophysiological stress in disease cases versus controls and in response to therapies are of particular interest to investigators studying cardiometabolic diseases. Thus, knowledge on the relationships between sequencing depth and detection of transcriptomic variation is needed for designing RNA-Seq experiments and for interpreting results of analyses. Using deeply sequenced Illumina HiSeq 2000 101 bp paired-end RNA-Seq data derived from adipose of a healthy individual before and after systemic administration of endotoxin (LPS), we investigated the sequencing depths needed for studies of gene expression and alternative splicing (AS). In order to detect expressed genes and AS events, we found that ∼100 to 150 million (M) filtered reads were needed. However, the requirement on sequencing depth for the detection of LPS modulated differential expression (DE) and differential alternative splicing (DAS) was much higher. To detect 80% of events, ∼300 M filtered reads were needed for DE analysis whereas at least 400 M filtered reads were necessary for detecting DAS. Although the majority of expressed genes and AS events can be detected with modest sequencing depths (∼100 M filtered reads), the estimated gene expression levels and exon/intron inclusion levels were less accurate. We report the first study that evaluates the relationship between RNA-Seq depth and the ability to detect DE and DAS in human adipose. Our results suggest that a much higher sequencing depth is needed to reliably identify DAS events than for DE genes
Methods to study splicing from high-throughput RNA Sequencing data
The development of novel high-throughput sequencing (HTS) methods for RNA
(RNA-Seq) has provided a very powerful mean to study splicing under multiple
conditions at unprecedented depth. However, the complexity of the information
to be analyzed has turned this into a challenging task. In the last few years,
a plethora of tools have been developed, allowing researchers to process
RNA-Seq data to study the expression of isoforms and splicing events, and their
relative changes under different conditions. We provide an overview of the
methods available to study splicing from short RNA-Seq data. We group the
methods according to the different questions they address: 1) Assignment of the
sequencing reads to their likely gene of origin. This is addressed by methods
that map reads to the genome and/or to the available gene annotations. 2)
Recovering the sequence of splicing events and isoforms. This is addressed by
transcript reconstruction and de novo assembly methods. 3) Quantification of
events and isoforms. Either after reconstructing transcripts or using an
annotation, many methods estimate the expression level or the relative usage of
isoforms and/or events. 4) Providing an isoform or event view of differential
splicing or expression. These include methods that compare relative
event/isoform abundance or isoform expression across two or more conditions. 5)
Visualizing splicing regulation. Various tools facilitate the visualization of
the RNA-Seq data in the context of alternative splicing. In this review, we do
not describe the specific mathematical models behind each method. Our aim is
rather to provide an overview that could serve as an entry point for users who
need to decide on a suitable tool for a specific analysis. We also attempt to
propose a classification of the tools according to the operations they do, to
facilitate the comparison and choice of methods.Comment: 31 pages, 1 figure, 9 tables. Small corrections adde
Recommended from our members
Cost effective, experimentally robust differential-expression analysis for human/mammalian, pathogen and dual-species transcriptomics.
As sequencing read length has increased, researchers have quickly adopted longer reads for their experiments. Here, we examine 14 pathogen or host-pathogen differential gene expression data sets to assess whether using longer reads is warranted. A variety of data sets was used to assess what genomic attributes might affect the outcome of differential gene expression analysis including: gene density, operons, gene length, number of introns/exons and intron length. No genome attribute was found to influence the data in principal components analysis, hierarchical clustering with bootstrap support, or regression analyses of pairwise comparisons that were undertaken on the same reads, looking at all combinations of paired and unpaired reads trimmed to 36, 54, 72 and 101 bp. Read pairing had the greatest effect when there was little variation in the samples from different conditions or in their replicates (e.g. little differential gene expression). But overall, 54 and 72 bp reads were typically most similar. Given differences in costs and mapping percentages, we recommend 54 bp reads for organisms with no or few introns and 72 bp reads for all others. In a third of the data sets, read pairing had absolutely no effect, despite paired reads having twice as much data. Therefore, single-end reads seem robust for differential-expression analyses, but in eukaryotes paired-end reads are likely desired to analyse splice variants and should be preferred for data sets that are acquired with the intent to be community resources that might be used in secondary data analyses
Models for transcript quantification from RNA-Seq
RNA-Seq is rapidly becoming the standard technology for transcriptome
analysis. Fundamental to many of the applications of RNA-Seq is the
quantification problem, which is the accurate measurement of relative
transcript abundances from the sequenced reads. We focus on this problem, and
review many recently published models that are used to estimate the relative
abundances. In addition to describing the models and the different approaches
to inference, we also explain how methods are related to each other. A key
result is that we show how inference with many of the models results in
identical estimates of relative abundances, even though model formulations can
be very different. In fact, we are able to show how a single general model
captures many of the elements of previously published methods. We also review
the applications of RNA-Seq models to differential analysis, and explain why
accurate relative transcript abundance estimates are crucial for downstream
analyses
- …