24 research outputs found
A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification
Alternative splicing is widely acknowledged to be a crucial regulator of gene expression and is a key contributor to both normal developmental processes and disease states. While cost-effective and accurate for quantification, short-read RNA-seq lacks the ability to resolve full-length transcript isoforms despite increasingly sophisticated computational methods. Long-read sequencing platforms such as Pacific Biosciences (PacBio) and Oxford Nanopore (ONT) bypass the transcript reconstruction challenges of short reads. Here we describe TALON, the ENCODE4 pipeline for analyzing PacBio cDNA and ONT direct-RNA transcriptomes. We apply TALON to three human ENCODE Tier 1 cell lines and show that while both technologies perform well at full-transcript discovery and quantification, each technology has its distinct artifacts. We further apply TALON to mouse cortical and hippocampal transcriptomes and find that a substantial proportion of neuronal genes have more reads associated with novel isoforms than annotated ones. The TALON pipeline for technology-agnostic, long-read transcriptome discovery and quantification tracks both known and novel transcript models, as well as expression levels, across datasets. It serves both simple studies and larger projects such as ENCODE that seek to decode transcriptional regulation in the human and mouse genomes and to predict more accurate expression levels of genes and transcripts than is possible with short reads alone.
Generation of a humanized Aβ expressing mouse demonstrating aspects of Alzheimer's disease-like pathology.
The majority of Alzheimer’s disease (AD) cases are late-onset and occur sporadically; however, most mouse models of the disease harbor pathogenic mutations, rendering them better representations of familial autosomal-dominant forms of the disease. Here, we generated knock-in mice that express wild-type human Aβ under control of the mouse App locus. Remarkably, changing 3 amino acids in the mouse Aβ sequence to its wild-type human counterpart leads to age-dependent impairments in cognition and synaptic plasticity, brain volumetric changes, inflammatory alterations, the appearance of Periodic Acid-Schiff (PAS) granules and changes in gene expression. In addition, with exon 14, which encodes the Aβ sequence, flanked by loxP sites, we show that Cre-mediated excision of exon 14 ablates hAβ expression, rescues cognition and reduces the formation of PAS granules.
Systematic assessment of long-read RNA-seq methods for transcript identification and quantification
The Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. The consortium generated over 427 million long-read sequences from cDNA and direct RNA datasets, encompassing human, mouse, and manatee species, using different protocols and sequencing platforms. These data were utilized by developers to address challenges in transcript isoform detection and quantification, as well as de novo transcript isoform identification. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. When aiming to detect rare and novel transcripts or when using reference-free approaches, incorporating additional orthogonal data and replicate samples is advised. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.
Systematic assessment of long-read RNA-seq methods for transcript identification and quantification
The Long-read RNA-Seq Genome Annotation Assessment Project Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. Using different protocols and sequencing platforms, the consortium generated over 427 million long-read sequences from complementary DNA and direct RNA datasets, encompassing human, mouse and manatee species. Developers utilized these data to address challenges in transcript isoform detection, quantification and de novo transcript detection. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. Incorporating additional orthogonal data and replicate samples is advised when aiming to detect rare and novel transcripts or using reference-free approaches. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.
Multi-tissue integrative analysis of personal epigenomes
Evaluating the impact of genetic variants on transcriptional regulation is a central goal in biological science that has been constrained by reliance on a single reference genome. To address this, we constructed phased, diploid genomes for four cadaveric donors (using long-read sequencing) and systematically charted noncoding regulatory elements and transcriptional activity across more than 25 tissues from these donors. Integrative analysis revealed over a million variants with allele-specific activity, coordinated, locus-scale allelic imbalances, and structural variants impacting proximal chromatin structure. We relate the personal genome analysis to the ENCODE encyclopedia, annotating allele- and tissue-specific elements that are strongly enriched for variants impacting expression and disease phenotypes. These experimental and statistical approaches, and the corresponding EN-TEx resource, provide a framework for personalized functional genomics.
Transcriptome dynamics of neurodegeneration using single-cell and long-read approaches
Alzheimer’s disease (AD) is characterized by plaques and tangles that lead to neurodegeneration and dementia. Clinical trials for AD drugs have a high failure rate and could benefit from better mouse models of late-onset AD. Changes in gene expression, alternative splicing and chromatin profiles have been described as indicators of the pathology. The focus of this thesis is to characterize already available models of AD using single-cell and long-read transcriptomics. Chapter 2 is a time course of neurodegeneration in the 3xTg-AD mouse, which is the only mouse AD model that has plaques and tangles, similar to human AD. We use bulk RNA-seq in the hippocampus of 3xTg mice to identify distinct gene modules associated with microglia and oligodendrocytes that increase with aging and pathology. We further investigate the changes in cell populations using single-nucleus RNA-seq of the hippocampus and cortex of 3xTg and 5xFAD mice to detect major changes in astrocyte and oligodendrocyte groups. We recover a common path of astrocyte activation with the 5xFAD mouse and find that 3xTg-derived astrocytes seem to be at an earlier stage of activation. To investigate the activation of microglia in 3xTg, we also generated a single-cell RNA-seq dataset of microglial cells and found multiple subtypes, including a set of microglia with a distinct transcription factor expression profile that is associated with an early increase in Csf1 expression before the full onset of disease-associated microglia (DAM) gene expression. Finally, scATAC-seq reveals a set of accessible chromatin regions shared across multiple activation states found in the scRNA-seq that matches glial activation processes. Overall, differences between the main glial groups point to a slower activation process in the 3xTg model than in the 5xFAD model. Our study contributes to the identification of progressive transcriptional changes of glial cells in a model that has plaques and tangles.
Single-cell microfluidic systems are optimized for smaller cell types than most cells in the brain, which are also difficult to dissociate. The Split-seq barcode strategy without any microfluidics and fixation steps before cell labeling allows for multiplexed cells and nuclei to be sequenced at the same time. We use Split-seq in Chapter 3 to sequence the transcriptome of 24,270 nuclei as well as single-cell microglia from the cortex and hippocampus of one 24-month-old female 3xTg-AD mouse. Comparing Split-seq cell clusters against clusters from our existing time course study of 3xTg-AD (Chapter 2), we recover all of the main cell types and detect genes that were problematic, such as Gfap in astrocytes. However, nuclei derived from microglia lack the major identifiers of DAM, which were detectable at low levels in single cells. Sub-clustering of astrocytes recovers 11 distinct clusters, including an activation cluster that overlaps not only with previously identified markers such as Gfap but also with novel markers such as Thy1. The Split-seq protocol shows promise for scaling up future single-cell transcriptomics studies of AD.
AD has been extensively characterized using short-read sequencing. However, most studies focus on gene expression changes and rarely analyze isoform changes. Full-length, high-throughput mRNA sequencing using long-read technologies is the best way to explore transcript isoform diversity, as regular short reads do not provide enough information about the connectivity between distant exons. In Chapter 4, we explore the transcriptome of the mouse C57BL/6J and 5xFAD cortex and hippocampus at 8 months of age. We recover >90% of genes previously associated with the 5xFAD genotype. We further detect 244 and 471 isoform switches in cortex and hippocampus, respectively. We also found 194 genes with TSS switches and 714 with TES switches relevant to the 5xFAD genotype. Genes presenting isoform changes include Csf2ra, Csf1 and Lamp2. Long-read transcriptome analysis of mouse models of disease can provide additional insights into how isoform switches can alter gene activity during disease progression.