19 research outputs found
A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification
Alternative splicing is widely acknowledged to be a crucial regulator of gene expression and is a key contributor to both normal developmental processes and disease states. While cost-effective and accurate for quantification, short-read RNA-seq lacks the ability to resolve full-length transcript isoforms despite increasingly sophisticated computational methods. Long-read sequencing platforms such as Pacific Biosciences (PacBio) and Oxford Nanopore (ONT) bypass the transcript reconstruction challenges of short-reads. Here we describe TALON, the ENCODE4 pipeline for analyzing PacBio cDNA and ONT direct-RNA transcriptomes. We apply TALON to three human ENCODE Tier 1 cell lines and show that while both technologies perform well at full-transcript discovery and quantification, each technology has its distinct artifacts. We further apply TALON to mouse cortical and hippocampal transcriptomes and find that a substantial proportion of neuronal genes have more reads associated with novel isoforms than annotated ones. The TALON pipeline for technology-agnostic, long-read transcriptome discovery and quantification tracks both known and novel transcript models as well as expression levels across datasets for both simple studies and larger projects such as ENCODE that seek to decode transcriptional regulation in the human and mouse genomes to predict more accurate expression levels of genes and transcripts than possible with short-reads alone
Facilitation through altered resource availability in a mixed-species rodent malaria infection
A major challenge in disease ecology is to understand how co‐infecting parasite species interact. We manipulate in vivo resources and immunity to explain interactions between two rodent malaria parasites, Plasmodium chabaudi and P. yoelii. These species have analogous resource‐use strategies to the human parasites Plasmodium falciparum and P. vivax: P. chabaudi and P. falciparum infect red blood cells (RBC) of all ages (RBC generalist); P. yoelii and P. vivax preferentially infect young RBCs (RBC specialist). We find that: (1) recent infection with the RBC generalist facilitates the RBC specialist (P. yoelii density is enhanced ~10 fold). This occurs because the RBC generalist increases availability of the RBC specialist's preferred resource; (2) co‐infections with the RBC generalist and RBC specialist are highly virulent; (3) and the presence of an RBC generalist in a host population can increase the prevalence of an RBC specialist. Thus, we show that resources shape how parasite species interact and have epidemiological consequences
Systematic assessment of long-read RNA-seq methods for transcript identification and quantification
The Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. The consortium generated over 427 million long-read sequences from cDNA and direct RNA datasets, encompassing human, mouse, and manatee species, using different protocols and sequencing platforms. These data were utilized by developers to address challenges in transcript isoform detection and quantification, as well as de novo transcript isoform identification. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. When aiming to detect rare and novel transcripts or when using reference-free approaches, incorporating additional orthogonal data and replicate samples is advised. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis
Characterizing transcript diversity using long-read RNA sequencing
Alternative transcripts arise from the same gene via alternative TSS usage, splicing, and polyA site choice. Such transcripts can give rise to functional disparities in protein structure, post-transcriptional regulation, and translational efficiency. Moreover, their expression in appropriate spatiotemporal contexts is a key feature of eukaryotic genomes. However, detecting and quantifying these transcript isoforms across tissues, cell types, and species has been challenging due to their longer lengths compared to the short reads typical of standard RNA-seq. In contrast, long-read RNA-seq (LR-RNA-seq) provides complete transcript structures, enabling investigation of transcript features and usage with greater fidelity. Here, I describe my work on application of LR-RNA-seq to characterizing and comparing full-length transcriptomes. First, I describe Swan, a software library I developed to facilitate visualization of full-length transcripts and to compare transcript usage between biological conditions. Next, I describe the ENCODE4 human and mouse LR-RNA-seq datasets, where I applied a novel triplet-based framework to harmonize and classify transcripts that share transcript start sites, exon junction chains, and transcript end sites. Lastly, I discuss the application of our single-nucleus LR-RNA-seq technique (LR-Split-seq) on two genetically distinct mouse strains to uncover cell type and genotype-specific transcript usage patterns. Collectively, these projects form a solid foundation for future analyses of long read transcriptomes to quantify changes in transcript diversity and transcript usage between samples, cell types, and genotypes within and between species
Exon size and sequence conservation improves identification of splice-altering nucleotides.
Pre-mRNA splicing is regulated through multiple trans-acting splicing factors. These regulators interact with the pre-mRNA at intronic and exonic positions. Given that most exons are protein coding, the evolution of exons must be modulated by a combination of selective coding and splicing pressures. It has previously been demonstrated that selective splicing pressures are more easily deconvoluted when phylogenetic comparisons are made for exons of identical size, suggesting that exon size-filtered sequence alignments may improve identification of nucleotides evolved to mediate efficient exon ligation. To test this hypothesis, an exon size database was created, filtering 76 vertebrate sequence alignments based on exon size conservation. In addition to other genomic parameters, such as splice-site strength, gene position, or flanking intron length, this database permits the identification of exons that are size- and/or sequence-conserved. Highly size-conserved exons are always sequence-conserved. However, sequence conservation does not necessitate exon size conservation. Our analysis identified evolutionarily young exons and demonstrated that length conservation is a strong predictor of alternative splicing. A published data set of approximately 5000 exonic SNPs associated with disease was analyzed to test the hypothesis that exon size-filtered sequence comparisons increase detection of splice-altering nucleotides. Improved splice predictions could be achieved when mutations occur at the third codon position, especially when a mutation decreases exon inclusion efficiency. The results demonstrate that coding pressures dominate nucleotide composition at invariable codon positions and that exon size-filtered sequence alignments permit identification of splice-altering nucleotides at wobble positions
Systematic assessment of long-read RNA-seq methods for transcript identification and quantification
Francisco Pardo-Palacios, Fairlie Reese, Silvia Carbonell-Sala, et al. With increased usage of long-read sequencing technologies to perform transcriptome analyses, there is a greater need to evaluate different methodologies including library preparation, sequencing platform, and computational analysis tools. Here, we report the study design of a community effort called the Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) Consortium, whose goals are characterizing the strengths and remaining challenges in using long-read approaches to identify and quantify the transcriptomes of both model and non-model organisms. The LRGASP organizers have generated cDNA and direct RNA datasets in human, mouse, and manatee samples using different protocols followed by sequencing on Illumina, Pacific Biosciences, and Oxford Nanopore Technologies platforms. Participants will use the provided data to submit predictions for three challenges: transcript isoform detection with a high-quality genome, transcript isoform quantification, and de novo transcript isoform identification. Evaluators from different institutions will determine which pipelines have the highest accuracy for a variety of metrics using benchmarks that include spike-in synthetic transcripts, simulated data, and a set of undisclosed, manually curated transcripts by GENCODE. We also describe plans for experimental validation of predictions that are platform-specific and computational tool-specific. We believe that a community effort to evaluate long-read RNA-seq methods will help move the field toward a better consensus on the best approaches to use for transcriptome analyses.