490 research outputs found

    Functional RNA-RNA interactions in the context of circular RNAs

    The purpose of my project is to identify novel functions of circRNAs, with a particular focus on the effects of RNA-RNA interactions (RRIs) on RNA processing. Computational prediction of RRIs has revealed the biological function and mechanism of action of multiple genes. However, computational RRI prediction is limited by two major challenges: the need to know the full sequence of the transcript, and a high false-positive rate. Discovering the full sequence identity of circRNAs has been a challenging task for bioinformatics over the last decade. In addition, incomplete knowledge of the full sequences of the transcripts in a sample leads to skewed quantification from RNA-seq data, as well as incorrect results from analyses of NGS-derived techniques (e.g., CLIP-seq, SPLASH). The problem of false discovery of new RRIs can be mitigated by dedicated experimental datasets. To overcome the first hurdle, I developed CYCLeR, a computational tool that compares ribo-depleted and circRNA-enriched RNA-seq libraries and outputs a high-confidence set of circRNA transcripts. The true strength of CYCLeR is its quantification module, which can robustly estimate the abundances of both circular and linear transcripts. I have shown the advantage of CYCLeR over alternative tools in terms of transcript assembly and quantification. I have also shown that CYCLeR is the only tool suitable for searching for functional associations of circRNA transcripts. The second part of my work focuses on predicting functional RRIs that influence pluripotency. A co-expression network based on the output of CYCLeR can show the association of circRNAs with known biological pathways and significantly facilitate the discovery of circRNA function. In vivo RNA proximity ligation experiments provide information on the dynamics of RNA-RNA interactions inside the cell. The combination of RNA-seq and RNA interactome data allows me to significantly enhance the strength of computational predictions.
I build a co-expression network from a time-series experiment in which H1ESC cells were treated with retinoic acid. I combine the co-expression information with results from the analysis of RNA-RNA proximity ligation data (SPLASH). The analysis is supplemented with localisation information from RNA-seq libraries specific for nuclear localisation. The results point to two circRNAs that participate in functional RRIs. circFIRRE is significantly enriched in SPLASH data, indicating a high probability of interaction with other RNAs. Interestingly, circFIRRE is one of the few circRNAs specifically enriched in the nucleus. The enrichment can be explained by a binding site for the hnRNPU protein, which keeps the circRNA in the nucleus. Knockout of the human circFIRRE locus leads to a viral response. Multiple interaction sites of circFIRRE with ALU-specific sequences indicate that the viral response is triggered by disruption of A-to-I editing in cells. circLARP7 is another nuclear-specific circRNA. It is co-expressed with all major markers of pluripotency. It is also expressed in close proximity to MIR302CHG, a microRNA host gene related to maintaining the pluripotent state. High complementarity and conservation of a duplex between circLARP7 and the nascent MIR302CHG indicate that circLARP7 might be related to the processing of the microRNAs from the miR-302/367 cluster.

    The mapping task and its various applications in next-generation sequencing

    The aim of this thesis is the development and benchmarking of computational methods for the analysis of high-throughput data from tiling arrays and next-generation sequencing (NGS). Tiling arrays have been a mainstay of genome-wide transcriptomics, e.g., in the identification of functional elements in the human genome. Because of the limitations of existing methods for analysing these data, a novel statistical approach is presented that identifies expressed segments as significant differences from the background distribution and thus avoids dataset-specific parameters. This method detects differentially expressed segments in biological data with significantly lower false discovery rates and equivalent sensitivities compared to commonly used methods. It is also clearly superior in the recovery of exon-intron structures. Moreover, the search for local accumulations of expressed segments in tiling array data has led to the identification of very large expressed regions that may constitute a new class of macroRNAs. The thesis then turns to next-generation sequencing, for which various protocols have been devised to study genomic, transcriptomic, and epigenomic features. One of the first crucial steps in most NGS data analyses is the mapping of sequencing reads to a reference genome. This work introduces algorithmic methods to solve the mapping tasks for three major NGS protocols: DNA-seq, RNA-seq, and MethylC-seq. All methods have been thoroughly benchmarked and integrated into the segemehl mapping suite. First, mapping of DNA-seq data is facilitated by the core mapping algorithm of segemehl. Since the initial publication, it has been continuously updated and expanded. Here, extensive and reproducible benchmarks are presented that compare segemehl to state-of-the-art read aligners on various data sets.
The results indicate that segemehl is not only more sensitive in finding the optimal alignment with respect to the unit edit distance but also very specific compared to the most commonly used alternative read mappers. These advantages are observable for both real and simulated reads and are largely independent of read length and sequencing technology, but they come at the cost of higher running time and memory consumption. Second, the split-read extension of segemehl, presented by Hoffmann, enables the mapping of RNA-seq data, a computationally more difficult form of the mapping task due to the occurrence of splicing. Here, the novel tool lack is presented, which aims to recover missed RNA-seq read alignments using de novo splice junction information. It performs very well in benchmarks and may thus be a beneficial extension to RNA-seq analysis pipelines. Third, a novel method is introduced that facilitates the mapping of bisulfite-treated sequencing data. This protocol is considered the gold standard in genome-wide studies of DNA methylation, one of the major epigenetic modifications in animals and plants. The treatment of DNA with sodium bisulfite selectively converts unmethylated cytosines to uracils, while methylated ones remain unchanged. The bisulfite extension developed here performs seed searches on a collapsed alphabet followed by bisulfite-sensitive dynamic programming alignments. Thus, it is insensitive to bisulfite-related mismatches and does not rely on post-processing, in contrast to other methods. In comparison to state-of-the-art tools, this method achieves significantly higher sensitivities and has a competitive running time when mapping millions of sequencing reads to vertebrate genomes. Remarkably, the increase in sensitivity does not come at the cost of decreased specificity and thus may ultimately result in better performance in calling the methylation rate.
Lastly, the potential of mapping strategies for de novo genome assemblies is demonstrated with the introduction of a new guided assembly procedure, called crystallization. It incorporates mapping as a major component and uses additional information (e.g., annotation) as a guide. With this method, the complete mitochondrial genome of Eulimnogammarus verrucosus has been successfully assembled even though the sequencing library was heavily dominated by nuclear DNA. In summary, this thesis introduces algorithmic methods that significantly improve the analysis of tiling array, DNA-seq, RNA-seq, and MethylC-seq data, and proposes standards for benchmarking NGS read aligners. Moreover, it presents a new guided assembly procedure that has been successfully applied in the de novo assembly of a crustacean mitogenome.
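The collapsed-alphabet idea mentioned in the abstract above can be illustrated with a small sketch. This is not segemehl's implementation: it assumes forward-strand reads only, uses exact substring search as the "seed" step, and swaps the bisulfite-sensitive dynamic programming alignment for a simple position-wise check. The function names `collapse`, `bisulfite_compatible`, and `find_hits` are hypothetical.

```python
from typing import List


def collapse(seq: str) -> str:
    """Collapse C into T so converted and unconverted cytosines look alike."""
    return seq.upper().replace("C", "T")


def bisulfite_compatible(read: str, ref: str) -> bool:
    """Return True if `read` could derive from `ref` after bisulfite treatment.

    A reference C may show up as C (methylated) or T (converted) in the read;
    every other position must match exactly (assumption: no sequencing errors).
    """
    if len(read) != len(ref):
        return False
    return all(
        r == g or (g == "C" and r == "T")
        for r, g in zip(read.upper(), ref.upper())
    )


def find_hits(read: str, genome: str) -> List[int]:
    """Seed on the collapsed alphabet, then verify each hit asymmetrically."""
    c_read, c_genome = collapse(read), collapse(genome)
    hits = []
    start = c_genome.find(c_read)
    while start != -1:
        # The collapsed match alone would also accept a read C over a
        # reference T, which bisulfite conversion cannot produce.
        if bisulfite_compatible(read, genome[start:start + len(read)]):
            hits.append(start)
        start = c_genome.find(c_read, start + 1)
    return hits
```

The verification step is what keeps the search asymmetric: a read T over a reference C is tolerated, while a read C over a reference T is rejected even though both collapse to the same string.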

    FusionSeq: a modular framework for finding gene fusions by analyzing paired-end RNA-sequencing data

    We have developed FusionSeq to identify fusion transcripts from paired-end RNA-sequencing. FusionSeq includes filters to remove spurious candidate fusions with artifacts, such as misalignment or random pairing of transcript fragments, and it ranks candidates according to several statistics. It also has a module to identify exact sequences at breakpoint junctions. FusionSeq detected known and novel fusions in a specially sequenced calibration data set, including eight cancers with and without known rearrangements

    Computational identification of regulatory features affecting splicing in the human brain

    RNA splicing has enabled a dramatic increase in species complexity. Splicing occurs in over 95% of mammalian genes, allowing the development of exceptional cellular diversity without an increase in raw gene numbers. This is highlighted by the fact that humans and nematodes have roughly the same number of genes (20,000 human genes versus 19,000 genes in Caenorhabditis elegans). Although the mechanistic process of splicing is now well understood, there remains a multitude of unexplored dynamics that have only become visible with the power of next-generation sequencing (NGS). The human brain is one of the best examples of an intricate cellular structure. Neuronal cell types are incredibly diverse and specialised, regulated through various transcriptional mechanisms. Recently, long genes (150kb+) have been implicated as crucial to neuronal function, and their impairment has been linked to several neurological disorders. I explore this relationship further by showing that long genes are more highly expressed in the brain than in other tissues. Long genes are also distinct in that they are deficient in H3K36me3, a histone mark largely associated with splicing and active transcription. Through analysis of brain RNA-seq data, a novel splicing mechanism known as recursive splicing was identified in long introns. Recursive splice sites (RSS) consist of an intronic 3’ splice site followed immediately by a 5’ splice site. These sites result in a zero-length exon that regulates the use of cryptic promoters, ensuring that only the functional isoform is expressed. This discovery led me to ask whether other non-canonical forms of splicing are common in the brain. Backsplicing is a recently discovered splicing mechanism that is pervasive across the tree of life. It occurs when the 3’ end of a downstream exon is spliced onto the 5’ end of an upstream exon, resulting in a circular RNA molecule (hereafter: circRNA). circRNAs are enriched in neuronal genes and mediated by RNA-binding factors.
I have identified and quantified circRNAs within the brain, finding a large number of highly expressed novel circRNAs. From these findings I identify a subset of highly expressed backsplice junctions that occur between two proximal genes from the same family. In order to understand the function of these splicing reactions, I inspected the splicing features themselves, namely the 5’ and 3’ splice sites and the branchpoint. The branchpoint remains a poorly characterised feature, and until recently very few had been experimentally validated. I explore these features through the ExAC and UCLex consortia, using cumulative variant ratios to annotate invariant positions within the branchpoint and splice sites. Having identified invariant positions, I could then investigate how variation impacts splicing efficiency by integrating whole-exome and RNA sequence data from the GEUVADIS consortium. The findings show that exon expression is a poor indicator of splicing dysfunction, with a threefold lower sensitivity than direct analysis of splice junction reads. I also devise a variant effect score that captures a significant portion of the change in splice site efficiency, enabling improved prediction of deleterious variants. Together, this thesis hints at the massive potential of NGS to investigate the diversity of splicing-related features while identifying novel features that could be implicated in neurological dysfunction
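As a rough illustration of the recursive splice site definition in the abstract above (an intronic 3' splice site followed immediately by a 5' splice site), candidate positions can be listed by scanning an intron for the juxtaposed dinucleotides. This sketch assumes the canonical AG acceptor and GT donor dinucleotides and ignores the surrounding consensus, the polypyrimidine tract, and any read support, so it is not the method used in the thesis. The function name is hypothetical.

```python
from typing import List


def find_recursive_splice_candidates(intron: str) -> List[int]:
    """Return 0-based positions where an AG acceptor directly abuts a GT donor.

    Each reported index is the first base after the AG, i.e. the point where
    the zero-length exon would sit between the two splice sites.
    """
    seq = intron.upper()
    motif = "AGGT"  # assumption: canonical AG|GT dinucleotides only
    positions = []
    i = seq.find(motif)
    while i != -1:
        positions.append(i + 2)
        i = seq.find(motif, i + 1)
    return positions
```

In practice such a scan yields many false candidates, since AGGT is a common tetramer; junction-spanning reads would be needed to support any individual site.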

    Knowledge-Based Reconstruction of mRNA Transcripts with Short Sequencing Reads for Transcriptome Research

    While most transcriptome analyses in high-throughput clinical studies focus on gene-level expression, the existence of alternative isoforms of gene transcripts is a major source of the diversity in the biological functions of the human genome. It is therefore essential to annotate isoforms of gene transcripts for genome-wide transcriptome studies. Recently developed mRNA sequencing technology presents an unprecedented opportunity to discover new forms of transcripts, and at the same time brings bioinformatic challenges due to its short read length and incomplete coverage of the transcripts. In this work, we propose a computational approach to reconstruct new mRNA transcripts from short sequencing reads using reference information on known transcripts in existing databases. The prior knowledge helps to define exon boundaries and fill in the transcript regions not covered by sequencing data. This approach was demonstrated using a deep sequencing data set of human muscle tissue, with transcript annotations in RefSeq as prior knowledge. We identified 2,973 junctions, 7,471 exons, and 7,571 transcripts not previously annotated in RefSeq. 73% of these new transcripts were supported by UCSC Known Genes, Ensembl, or EST transcript annotations. In addition, the reconstructed transcripts were much longer than those from de novo approaches that assume no prior knowledge. These previously unannotated transcripts can be integrated with known transcript annotations to improve both the design of microarrays and the follow-up analyses of isoform expression. The overall results demonstrate that incorporating transcript annotations from genomic databases significantly helps the reconstruction of novel transcripts from short sequencing reads for transcriptome research

    Fast and Accurate mapping of Next Generation Sequencing Data

    Ph.D., Doctor of Philosophy

    Computational Methods for Sequencing and Analysis of Heterogeneous RNA Populations

    Next-generation sequencing (NGS) and mass spectrometry technologies bring unprecedented throughput, scalability and speed, facilitating the study of biological systems. These technologies make it possible to sequence and analyze heterogeneous RNA populations rather than single sequences. In particular, they provide the opportunity to implement massive viral surveillance and transcriptome quantification. However, in order to fully exploit the capabilities of NGS technology, we need to develop computational methods able to analyze billions of reads for the assembly and characterization of sampled RNA populations. In this work we present novel computational methods for cost- and time-effective analysis of sequencing data from viral and RNA samples. In particular, we describe: i) computational methods for transcriptome reconstruction and quantification; ii) a method for mass spectrometry data analysis; iii) a combinatorial pooling method; iv) computational methods for the analysis of intra-host viral populations

    Co-Transcriptional Splicing in Murine Erythroblasts

    Eukaryotic genes contain non-coding sequences called introns. The removal of introns from pre-mRNAs, termed splicing, is carried out by the spliceosome, a multi-megadalton molecular complex of proteins and RNAs. Splicing occurs co-transcriptionally across multiple cell types and species. The Neugebauer lab has developed single-molecule nascent RNA sequencing methods, including single molecule intron tracking (SMIT) and long-read sequencing (LRS) of nascent RNA, to visualize the precursors, intermediates, and products of transcription and splicing in budding and fission yeasts. Using these methods, the lab was able to estimate the kinetics of single intron removal in both yeasts by relating the 3′ end of nascent RNA (the position of RNA Polymerase II) to the progress of the splicing reaction. In both species of yeast, splicing proceeded rapidly and co-transcriptionally. In comparison to yeast, mammalian genes are much more complex: on average they contain eight long introns surrounded by short exons. It was unclear how the presence of many more long introns, often with more poorly conserved splice site sequences, would affect how splicing and transcription are coordinated. Thus, I have optimized new methods to isolate nascent RNA and analyze co-transcriptional splicing in mammalian cells. To determine how splicing is integrated with transcription elongation and 3′ end formation in mammalian cells, I performed long-read sequencing of individual nascent RNAs and PRO-seq during murine erythropoiesis. I chose murine erythroid leukemia (MEL) cells as a model system, as they can be easily differentiated in vitro and they express a subset of erythroid-specific genes at high levels. Many studies of gene expression have historically been carried out in erythroblasts, and the biogenesis of β-globin mRNA, the most highly expressed transcript in erythroblasts, was the focus of many seminal studies on the mechanisms of pre-mRNA splicing.
I isolated nascent, chromatin-associated RNAs from MEL cells before and after induction of terminal erythroid differentiation and performed long-read sequencing on the Pacific Biosciences Sequel platform. Splicing was not accompanied by transcriptional pausing and was detected when RNA polymerase II (Pol II) was within 75 to 300 nucleotides of 3′ splice sites, often during transcription of the downstream exon. Interestingly, several hundred introns displayed abundant splicing intermediates, suggesting that splicing delays can take place between the two catalytic steps of splicing. Overall, splicing efficiencies were correlated among introns within the same transcript, and intron retention was associated with inefficient 3′ end cleavage. Remarkably, a thalassemia patient-derived mutation introducing a cryptic 3′ splice site improves both splicing and 3′ end cleavage of individual β-globin transcripts, demonstrating functional coupling between the two co-transcriptional processes as a determinant of productive gene output. Thus, I conclude that highly expressed pre-mRNAs in MEL cells are largely spliced co-transcriptionally, and that the mammalian spliceosome can assemble and act rapidly on this set of pre-mRNAs. A previously unappreciated level of cross-talk between splicing and 3′ end cleavage efficiencies is involved in erythroid development. Together, this work provides a high-resolution description of mammalian gene expression and shows that short-read RNA sequencing of bulk RNA can conceal coordinated behaviours that can only be observed at the level of individual nascent transcripts