4,066 research outputs found

    Identification and Functional Annotation of Alternatively Spliced Isoforms

    Alternative splicing is a key mechanism for increasing the complexity of the transcriptome and proteome in eukaryotic cells. A large portion of multi-exon genes in humans undergo alternative splicing, and this can have significant functional consequences, as the proteins translated from alternatively spliced mRNA might have different amino acid sequences and structures. The study of alternative splicing events has been accelerated by next-generation sequencing technology. However, reconstruction of transcripts from short-read RNA sequencing is not sufficiently accurate. Recent progress in single-molecule long-read sequencing has given researchers alternative ways to help solve this problem. With the help of both short- and long-read RNA sequencing technologies, tens of thousands of splice isoforms have been catalogued in humans and other species, but relatively few of the protein products of splice isoforms have been characterized functionally, structurally and biochemically. The scope of this dissertation includes using short and long RNA sequencing reads together for transcript reconstruction, and using high-throughput RNA-sequencing data and gene-level Gene Ontology functional annotations to predict functions for alternatively spliced isoforms in mouse and human. In the first chapter, I introduce alternative splicing and discuss existing studies in which next-generation sequencing is used for transcript identification. Then, I define the isoform function prediction problem and explain how it differs from the better-known gene function prediction problem. In the second chapter of this dissertation, I describe our study in which the overall transcriptome of the kidney is studied using both long reads from the PacBio platform and RNA-seq short reads from the Illumina platform. 
We used short reads to validate full-length transcripts found by long PacBio reads, and generated two high-quality sets of transcript isoforms that are expressed in the glomerular and tubulointerstitial compartments. In the third chapter, I describe our generic framework, in which we implemented and evaluated several related algorithms for function prediction of mouse isoforms. We tested these algorithms through both computational evaluation and experimental validation of the predicted ‘responsible’ isoform(s) and the predicted disparate functions of the isoforms of Cdkn2a and of Anxa6. Our algorithm is the first effort to predict and differentiate isoform functions through large-scale genomic data integration. In the fourth chapter, I present the extension of the isoform function prediction study to protein-coding isoforms in human. We used a similar multiple instance learning (MIL)-based approach for predicting the function of protein-coding splice variants in human. We evaluated our predictions using literature evidence for the ADAM15, LMNA/C, and DMXL2 genes. In the fifth and final chapter, I summarize the previous chapters and outline future directions for alternatively spliced isoform reconstruction and function prediction studies. PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/144017/1/ridvan_1.pd

    An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs.

    Reconstructing full-length transcript isoforms from sequence fragments (such as ESTs) is a major interest and challenge for bioinformatic analysis of pre-mRNA alternative splicing. This problem has been formulated as finding traversals across the splice graph, which is a directed acyclic graph (DAG) representation of gene structure and alternative splicing. In this manuscript we introduce a probabilistic formulation of the isoform reconstruction problem, and provide an expectation-maximization (EM) algorithm for its maximum likelihood solution. Using a series of simulated data and expressed sequences from real human genes, we demonstrate that our EM algorithm can correctly handle various situations of fragmentation and coupling in the input data. Our work establishes a general probabilistic framework for splice graph-based reconstructions of full-length isoforms.
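
The abstract above does not give the model's details, so the following is only a minimal sketch of the general EM idea, not the authors' algorithm. It assumes the candidate isoforms (paths through the splice graph) are already enumerated, that each fragment is described only by the set of isoforms it is compatible with, and that fragment length and position effects can be ignored. The function name `em_isoform_proportions` and its parameters are hypothetical.

```python
def em_isoform_proportions(compat, n_isoforms, n_iter=200, tol=1e-9):
    """Estimate isoform proportions by expectation-maximization.

    compat: list of sets; compat[f] holds the indices of the candidate
            isoforms that fragment f is compatible with (an assumption of
            this sketch: compatibility is all we know about a fragment).
    Returns a list of proportions that sums to 1.
    """
    # Start from a uniform guess over the candidate isoforms.
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        # E-step: split each fragment across its compatible isoforms
        # in proportion to the current estimates.
        counts = [0.0] * n_isoforms
        for isoforms in compat:
            total = sum(theta[i] for i in isoforms)
            if total == 0.0:
                continue  # fragment explained by no isoform with support
            for i in isoforms:
                counts[i] += theta[i] / total
        assigned = sum(counts)
        if assigned == 0.0:
            raise ValueError("no fragment is compatible with any isoform")
        # M-step: the new proportions are the normalized expected counts.
        new_theta = [c / assigned for c in counts]
        converged = max(abs(a - b) for a, b in zip(theta, new_theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta
```

For example, calling `em_isoform_proportions([{0}] * 6 + [{1}] * 2 + [{0, 1}] * 4, 2)` apportions the four ambiguous fragments between the two isoforms according to the evidence from the eight unambiguous ones.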

    Genome-wide identification of splicing quantitative trait loci (sQTLs) in diverse ecotypes of Arabidopsis thaliana

    Alternative splicing (AS) of pre-mRNAs contributes to transcriptome diversity and enables plants to generate different protein isoforms from a single gene and/or fine-tune gene expression during different developmental stages and environmental changes. Although AS is pervasive, the genetic basis for differential isoform usage in plants is still emerging. In this study, we performed genome-wide analysis in 666 geographically distributed diverse ecotypes of Arabidopsis thaliana to identify genomic regions [splicing quantitative trait loci (sQTLs)] that may regulate differential AS. These ecotypes come from different microclimatic conditions and are part of the relict and non-relict populations. Although sQTLs were spread across the genome, we observed enrichment for trans-sQTLs (trans-sQTL hotspots) on chromosome one. Furthermore, we identified 911 sQTLs that co-localized with trait-linked single nucleotide polymorphisms (SNPs) identified in the Arabidopsis genome-wide association studies (AraGWAS). Many sQTLs were enriched among circadian clock, flowering, and stress-responsive genes, suggesting a role for differential isoform usage in regulating these important processes in diverse ecotypes of Arabidopsis. In conclusion, the current study provides a deep insight into SNPs affecting isoform ratios/genes and facilitates a better mechanistic understanding of trait-associated SNPs in GWAS studies. To the best of our knowledge, this is the first report of sQTL analysis in a large set of Arabidopsis ecotypes and can be used as a reference to perform sQTL analysis in the Brassicaceae family. Since whole genome and transcriptome datasets are available for these diverse ecotypes, it could serve as a powerful resource for the biological interpretation of trait-associated loci, splice isoform ratios, and their phenotypic consequences to help produce more resilient and high-yield crop varieties.

    Identification of novel genes and proteoforms in Angiostrongylus costaricensis through a proteogenomic approach

    RNA sequencing (RNA-Seq) and mass-spectrometry-based proteomics data are often integrated in proteogenomic studies to assist in the prediction of eukaryote genome features, such as genes, splicing, single-nucleotide variants (SNVs), and single-amino-acid variants (SAAVs). Most genomes of parasitic nematodes are draft versions that lack transcript- and protein-level information and whose gene annotations rely only on computational predictions.

    Gene space completeness in complex plant genomes

    Genome annotations offer ample opportunities to study gene functions, biochemical and regulatory pathways, or quantitative trait loci in plants. Determining the quality and completeness of a genome annotation, and maintaining the balance between them, are major challenges, even for genomes of well-studied model organisms. In this review, we present a historical overview of the complexity in different plant genomes and discuss the hurdles and possible solutions in obtaining a complete and high-quality genome annotation. We illustrate that there is no clear-cut answer to these challenges for different gene types, but provide tips on guiding the iterative process of generating a superior genome annotation, which is a moving target as our knowledge about plant genomics increases and additional data sources become available.

    Revealing missing human protein isoforms based on Ab initio prediction, RNA-seq and proteomics

    Biological and biomedical research relies on comprehensive understanding of protein-coding transcripts. However, the total number of human proteins is still unknown due to the prevalence of alternative splicing. In this paper, we detected 31,566 novel transcripts with coding potential by filtering our ab initio predictions with 50 RNA-seq datasets from diverse tissues/cell lines. PCR followed by MiSeq sequencing showed that at least 84.1% of these predicted novel splice sites could be validated. In contrast to known transcripts, the expression of these novel transcripts was highly tissue-specific. Based on these novel transcripts, at least 36 novel proteins were detected from shotgun proteomics data of 41 breast samples. We also showed that L1 retrotransposons have a more significant impact on the origin of new transcripts/genes than previously thought. Furthermore, we found that alternative splicing is extraordinarily widespread for genes involved in specific biological functions like protein binding, nucleoside binding, neuron projection, membrane organization and cell adhesion. In the end, the total number of human transcripts with protein-coding potential was estimated to be at least 204,950.

    Studies of gene expression in the Parkinson’s disease brain

    Parkinson’s disease (PD) is the second most prevalent neurodegenerative disorder, affecting ~1.8% of the population above 65 years. A combination of genetic and environmental factors contributes to the risk of PD, but the molecular mechanisms underlying its aetiology remain largely unaccounted for. Profiling gene expression in the PD brain can identify molecular processes associated with the pathogenesis and nominate candidate therapeutic targets for further study. Most previous gene expression studies in PD focused on specific hypotheses and were restricted to selected genes of interest, and only a few were performed transcriptome-wide. While in part informative, the results of these studies must be interpreted with caution due to a combination of technical and biological limitations. Factors applying specifically to the study of human bulk brain tissue make it difficult to confidently and accurately determine altered pathways. 1) Bulk brain tissue is composed of multiple cell types, some of which are selectively affected in PD. Variation in cell-type composition across samples introduces noise, while disease-associated changes in the number of neurons and glia introduce systematic gene expression biases between conditions. 2) The complex architecture of neurons complicates sample dissection and can result in variable soma-to-synapse ratios across samples. This variability results in additional noise in expression data since RNA and proteins can undergo axonal transport, with some preferentially localizing to the soma or synapses. Another limitation of previous studies is that gene-level analyses provide only an incomplete perspective on the expression landscape. Regulation at the transcript- and protein-level is often overlooked. The work of this thesis comprises three alternative approaches to gene expression analysis in the PD brain, aiming to overcome these limitations. 
We employed RNA-Seq and mass spectrometry in the prefrontal cortex of PD patients and healthy controls and approached these challenges by profiling expression at transcript-, gene- and protein-level. Considering the described aspects of bulk brain tissue, we adjusted for changes in cellular composition and RNA quality, and guided functional interpretation with the polarized nature of neurons in mind. Our results indicate that the frequently reported downregulation of mitochondrial function is partly driven by cellular composition. Adjusting for cell-type bias instead revealed altered pathways related to protein degradation, further strengthening their involvement in disease pathology. Both differential gene and transcript isoform expression showed enrichment for these. Additionally, we nominated genes that exhibit differential transcript usage events, suggesting alternate regulation at the transcript-level. These candidates can be targeted in future studies to identify functional consequences. Finally, we observed discordance between transcriptome and proteome, which we concluded reflects alterations in PD proteostasis. Specifically, we identified certain proteasomal subunits central to these regulatory changes, providing us with further evidence for the key role of protein degradation in the PD brain. Doctoral thesis

    Proteogenomics refines the molecular classification of chronic lymphocytic leukemia

    Cancer heterogeneity at the proteome level may explain differences in therapy response and prognosis beyond the currently established genomic and transcriptomic-based diagnostics. The relevance of proteomics for disease classifications remains to be established in clinically heterogeneous cancer entities such as chronic lymphocytic leukemia (CLL). Here, we characterize the proteome and transcriptome alongside genetic and ex-vivo drug response profiling in a clinically annotated CLL discovery cohort (n = 68). Unsupervised clustering of the proteome data reveals six subgroups. Five of these proteomic groups are associated with genetic features, while one group is only detectable at the proteome level. This new group is characterized by accelerated disease progression, high spliceosomal protein abundances associated with aberrant splicing, and low B cell receptor signaling protein abundances (ASB-CLL). Classifiers developed to identify ASB-CLL based on its characteristic proteome or splicing signature in two independent cohorts (n = 165, n = 169) confirm that ASB-CLL comprises about 20% of CLL patients. The inferior overall survival in ASB-CLL is also independent of both TP53- and IGHV mutation status. Our multi-omics analysis refines the classification of CLL and highlights the potential of proteomics to improve cancer patient stratification beyond genetic and transcriptomic profiling.