1,175 research outputs found
Transcriptome annotation using tandem SAGE tags
Analysis of several million expressed gene signatures (tags) revealed an increasing number of different sequences, largely exceeding that of annotated genes in mammalian genomes. Serial analysis of gene expression (SAGE) can reveal new Poly(A) RNAs transcribed from previously unrecognized chromosomal regions. However, conventional SAGE tags are too short to identify unambiguously unique sites in large genomes. Here, we design a novel strategy with tags anchored on two different restrictions sites of cDNAs. New transcripts are then tentatively defined by the two SAGE tags in tandem and by the spanning sequence read on the genome between these tagged sites. Having developed a new algorithm to locate these tag-delimited genomic sequences (TDGS), we first validated its capacity to recognize known genes and its ability to reveal new transcripts with two SAGE libraries built in parallel from a single RNA sample. Our algorithm proves fast enough to experiment this strategy at a large scale. We then collected and processed the complete sets of human SAGE tags to predict yet unknown transcripts. A cross-validation with tiling arrays data shows that 47% of these TDGS overlap transcriptional active regions. Our method provides a new and complementary approach for complex transcriptome annotation
An improved zebrafish transcriptome annotation for sensitive and comprehensive detection of cell type-specific genes
The zebrafish is ideal for studying embryogenesis and is increasingly applied to model human disease. In these contexts, RNA-sequencing (RNA-seq) provides mechanistic insights by identifying transcriptome changes between experimental conditions. Application of RNA-seq relies on accurate transcript annotation for a genome of interest. Here, we find discrepancies in analysis from RNA-seq datasets quantified using Ensembl and RefSeq zebrafish annotations. These issues were due, in part, to variably annotated 3\u27 untranslated regions and thousands of gene models missing from each annotation. Since these discrepancies could compromise downstream analyses and biological reproducibility, we built a more comprehensive zebrafish transcriptome annotation that addresses these deficiencies. Our annotation improves detection of cell type-specific genes in both bulk and single cell RNA-seq datasets, where it also improves resolution of cell clustering. Thus, we demonstrate that our new transcriptome annotation can outperform existing annotations, providing an important resource for zebrafish researchers
Methods to study splicing from high-throughput RNA Sequencing data
The development of novel high-throughput sequencing (HTS) methods for RNA
(RNA-Seq) has provided a very powerful mean to study splicing under multiple
conditions at unprecedented depth. However, the complexity of the information
to be analyzed has turned this into a challenging task. In the last few years,
a plethora of tools have been developed, allowing researchers to process
RNA-Seq data to study the expression of isoforms and splicing events, and their
relative changes under different conditions. We provide an overview of the
methods available to study splicing from short RNA-Seq data. We group the
methods according to the different questions they address: 1) Assignment of the
sequencing reads to their likely gene of origin. This is addressed by methods
that map reads to the genome and/or to the available gene annotations. 2)
Recovering the sequence of splicing events and isoforms. This is addressed by
transcript reconstruction and de novo assembly methods. 3) Quantification of
events and isoforms. Either after reconstructing transcripts or using an
annotation, many methods estimate the expression level or the relative usage of
isoforms and/or events. 4) Providing an isoform or event view of differential
splicing or expression. These include methods that compare relative
event/isoform abundance or isoform expression across two or more conditions. 5)
Visualizing splicing regulation. Various tools facilitate the visualization of
the RNA-Seq data in the context of alternative splicing. In this review, we do
not describe the specific mathematical models behind each method. Our aim is
rather to provide an overview that could serve as an entry point for users who
need to decide on a suitable tool for a specific analysis. We also attempt to
propose a classification of the tools according to the operations they do, to
facilitate the comparison and choice of methods.Comment: 31 pages, 1 figure, 9 tables. Small corrections adde
Comparative validation of the D. melanogaster modENCODE transcriptome annotation
Accurate gene model annotation of reference genomes is critical for making them useful. The modENCODE project has improved the D. melanogaster genome annotation by using deep and diverse high-throughput data. Since transcriptional activity that has been evolutionarily conserved is likely to have an advantageous function, we have performed large-scale interspecific comparisons to increase confidence in predicted annotations. To support comparative genomics, we filled in divergence gaps in the Drosophila phylogeny by generating draft genomes for eight new species. For comparative transcriptome analysis, we generated mRNA expression profiles on 81 samples from multiple tissues and developmental stages of 15 Drosophila species, and we performed cap analysis of gene expression in D. melanogaster and D. pseudoobscura. We also describe conservation of four distinct core promoter structures composed of combinations of elements at three positions. Overall, each type of genomic feature shows a characteristic divergence rate relative to neutral models, highlighting the value of multispecies alignment in annotating a target genome that should prove useful in the annotation of other high priority genomes, especially human and other mammalian genomes that are rich in noncoding sequences. We report that the vast majority of elements in the annotation are evolutionarily conserved, indicating that the annotation will be an important springboard for functional genetic testing by the Drosophila community
Recommended from our members
Improving nanopore read accuracy with the R2C2 method enables the sequencing of highly multiplexed full-length single-cell cDNA.
High-throughput short-read sequencing has revolutionized how transcriptomes are quantified and annotated. However, while Illumina short-read sequencers can be used to analyze entire transcriptomes down to the level of individual splicing events with great accuracy, they fall short of analyzing how these individual events are combined into complete RNA transcript isoforms. Because of this shortfall, long-distance information is required to complement short-read sequencing to analyze transcriptomes on the level of full-length RNA transcript isoforms. While long-read sequencing technology can provide this long-distance information, there are issues with both Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) long-read sequencing technologies that prevent their widespread adoption. Briefly, PacBio sequencers produce low numbers of reads with high accuracy, while ONT sequencers produce higher numbers of reads with lower accuracy. Here, we introduce and validate a long-read ONT-based sequencing method. At the same cost, our Rolling Circle Amplification to Concatemeric Consensus (R2C2) method generates more accurate reads of full-length RNA transcript isoforms than any other available long-read sequencing method. These reads can then be used to generate isoform-level transcriptomes for both genome annotation and differential expression analysis in bulk or single-cell samples
IsoDOT Detects Differential RNA-isoform Expression/Usage with respect to a Categorical or Continuous Covariate with High Sensitivity and Specificity
We have developed a statistical method named IsoDOT to assess differential
isoform expression (DIE) and differential isoform usage (DIU) using RNA-seq
data. Here isoform usage refers to relative isoform expression given the total
expression of the corresponding gene. IsoDOT performs two tasks that cannot be
accomplished by existing methods: to test DIE/DIU with respect to a continuous
covariate, and to test DIE/DIU for one case versus one control. The latter task
is not an uncommon situation in practice, e.g., comparing paternal and maternal
allele of one individual or comparing tumor and normal sample of one cancer
patient. Simulation studies demonstrate the high sensitivity and specificity of
IsoDOT. We apply IsoDOT to study the effects of haloperidol treatment on mouse
transcriptome and identify a group of genes whose isoform usages respond to
haloperidol treatment
Evaluating Characteristics of De Novo Assembly Software on 454 Transcriptome Data: A Simulation Approach
Background: The quantity of transcriptome data is rapidly increasing for non-model organisms. As sequencing technology advances, focus shifts towards solving bioinformatic challenges, of which sequence read assembly is the first task. Recent studies have compared the performance of different software to establish a best practice for transcriptome assembly. Here, we adapted a simulation approach to evaluate specific features of assembly programs on 454 data. The novelty of our study is that the simulation allows us to calculate a model assembly as reference point for comparison. Findings: The simulation approach allows us to compare basic metrics of assemblies computed by different software applications (CAP3, MIRA, Newbler, and Oases) to a known optimal solution. We found MIRA and CAP3 are conservative in merging reads. This resulted in comparably high number of short contigs. In contrast, Newbler more readily merged reads into longer contigs, while Oases produced the overall shortest assembly. Due to the simulation approach, reads could be traced back to their correct placement within the transcriptome. Together with mapping reads onto the assembled contigs, we were able to evaluate ambiguity in the assemblies. This analysis further supported the conservative nature of MIRA and CAP3, which resulted in low proportions of chimeric contigs, but high redundancy. Newbler produced less redundancy, but the proportion of chimeric contigs was higher. Conclusion: Our evaluation of four assemblers suggested that MIRA and Newbler slightly outperformed the othe
Keep Me Around: Intron Retention Detection and Analysis
We present a tool, keep me around (kma), a suite of python scripts and an R
package that finds retained introns in RNA-Seq experiments and incorporates
biological replicates to reduce the number of false positives when detecting
retention events. kma uses the results of existing quantification tools that
probabilistically assign multi-mapping reads, thus interfacing easily with
transcript quantification pipelines. The data is represented in a convenient,
database style format that allows for easy aggregation across introns, genes,
samples, and conditions to allow for further exploratory analysis
Computational methods for transcriptome annotation and quantification using RNA-seq
High-throughput RNA sequencing (RNA-seq) promises a comprehensive picture of the transcriptome, allowing for the complete annotation and quantification of all genes and their isoforms across samples. Realizing this promise requires increasingly complex computational methods. These computational challenges fall into three main categories: (i) read mapping, (ii) transcriptome reconstruction and (iii) expression quantification. Here we explain the major conceptual and practical challenges, and the general classes of solutions for each category. Finally, we highlight the interdependence between these categories and discuss the benefits for different biological applications
- …