617 research outputs found

    Computational annotation of eukaryotic gene structures: algorithms development and software systems

    Get PDF
    An important foundation for the advancement of both basic and applied biological science is correct annotation of protein-coding gene repertoires in model organisms. Accurate automated annotation of eukaryotic gene structures remains a challenging, open-ended and critical problem for modern computational biology.;The use of extrinsic (homology) information has been shown as a quite successful strategy for this task, though it is not a perfect solution, for a variety of reasons. More recently, gene prediction methods leveraging information present in syntenic genomic sequences have become favorable, though these too, have limitations.;Identifying genes by inspection of genomic sequence alone thoroughly tests our theoretical understanding of the gene recognition process as it occurs in vivo, and where we encounter failure, excellent opportunities for meaningful research are revealed.;Therefore, the continued development of methods not reliant on homology information---the so-called ab initio gene prediction methods---should help to more rapidly achieve a comprehensive understanding of gene content in our model organisms, at least.;This thesis explores the development of novel algorithms in an attempt to advance the current state-of-the-art in gene prediction, with particular emphasis on ab initio approaches.;The work has been conducted with an eye towards contributing open source, well-documented, and extensible software systems implementing the methods, and to generate novel biological knowledge with respect to plant taxa, in particular

    Identification and analysis of noncoding genetic elements in plant genomes

    Get PDF
    The goal of this dissertation is to identify and analyze two of the noncoding genetic elements, microRNAs (miRNAs) and intorns, in plant genomes. miRNAs are a class of short noncoding RNAs of which some are shown to regulate gene expression at the post-transcriptional level by complementary base pairing to their target mRNAs. In the early 2000s, a large number of miRNAs were cloned in animals, plants and viruses. Complementary efforts also sought to identify miRNA genes computationally. In the dissertation, I developed a computational method to identify miRNA genes and their target mRNAs in Arabidopsis. Experiments were then performed to validate some of the new miRNA genes and the miRNA-target interaction. The study facilitates the identification and characterization of conserved and non-conserved miRNAs in plants. Another noncoding element discussed in the dissertation is intron. The removal of introns from precursor mRNAs (pre-mRNAs), which is called pre-mRNA splicing, is essential to produce mature mRNAs and proteins. Alternative splicing (AS) occurs when different patterns of splicing result from the same pre-mRNA. AS is important in regulated gene expression and has various effects on mRNAs and proteins. In the dissertation, I am interested in studying the role of splice site sequences in intron evolution and AS. In one chapter, software is developed to identify conserved intron positions within orthologous genes. I demonstrated its application to a set of plant-specific orthologous genes. In another chapter, I have developed a computational approach to identify transcript-confirmed introns and genes in 15 plant species and analyzed intron evolution in the context of orthology. The results indicate dynamic evolution of introns with different splice sites and the significance of splice site sequences during intron evolution. In a third chapter, by using the transcript-confirmed data, I identified AS introns and events in the 15 plant species and studied their behavior and effect on protein sequences. The findings underscore the important role of splice site sequences in AS regulation. In conclusion, the identification of transcript-confirmed introns and the study of intron evolution and AS provide insight into the significant role of splice site sequences in intron evolution and AS

    SPA: A Probabilistic Algorithm for Spliced Alignment

    Get PDF
    Recent large-scale cDNA sequencing efforts show that elaborate patterns of splice variation are responsible for much of the proteome diversity in higher eukaryotes. To obtain an accurate account of the repertoire of splice variants, and to gain insight into the mechanisms of alternative splicing, it is essential that cDNAs are very accurately mapped to their respective genomes. Currently available algorithms for cDNA-to-genome alignment do not reach the necessary level of accuracy because they use ad hoc scoring models that cannot correctly trade off the likelihoods of various sequencing errors against the probabilities of different gene structures. Here we develop a Bayesian probabilistic approach to cDNA-to-genome alignment. Gene structures are assigned prior probabilities based on the lengths of their introns and exons, and based on the sequences at their splice boundaries. A likelihood model for sequencing errors takes into account the rates at which misincorporation, as well as insertions and deletions of different lengths, occurs during sequencing. The parameters of both the prior and likelihood model can be automatically estimated from a set of cDNAs, thus enabling our method to adapt itself to different organisms and experimental procedures. We implemented our method in a fast cDNA-to-genome alignment program, SPA, and applied it to the FANTOM3 dataset of over 100,000 full-length mouse cDNAs and a dataset of over 20,000 full-length human cDNAs. Comparison with the results of four other mapping programs shows that SPA produces alignments of significantly higher quality. In particular, the quality of the SPA alignments near splice boundaries and SPA's mapping of the 5′ and 3′ ends of the cDNAs are highly improved, allowing for more accurate identification of transcript starts and ends, and accurate identification of subtle splice variations. Finally, our splice boundary analysis on the human dataset suggests the existence of a novel non-canonical splice site that we also find in the mouse dataset. The SPA software package is available at http://www.biozentrum.unibas.ch/personal/nimwegen/cgi-bin/spa.cgi

    Spliced alignment and its application in Arabidopsis thaliana

    Get PDF
    This thesis describes the development and biological applications of GeneSeqer, which is a homology-based gene prediction program by means of spliced alignment. Additionally, a program named MyGV was written in JAVA as a browser to visualize the output of GeneSeqer. In order to test and demonstrate the performance, GeneSeqer was utilized to map 176,915 Arabidopsis EST sequences on the whole genome of Arabidopsis thaliana, which consists of five chromosomes, with about 117 million base pairs in total. All results were parsed and imported into a MySQL database. Information that was inferred from the Arabidopsis spliced alignment results may serve as valuable resource for a number of projects of special scientific interest, such as alternative splicing, non-canonical splice sites, mini-exons, etc. We also built AtGDB (Arabidopsis thaliana Genome DataBase, http://www.plantgdb.org/AtGDB/) to interactively browse EST spliced alignments and GenBank annotations for the Arabidopsis genome. Moreover, as one application of the Arabidopsis EST mapping data, U12-type introns were identified from the transcript-confirmed introns in the Arabidopsis genome, and the characteristics of these minor class introns were further explored

    RNA-Seq analysis of splicing in Plasmodium falciparum uncovers new splice junctions, alternative splicing and splicing of antisense transcripts.

    Get PDF
    Over 50% of genes in Plasmodium falciparum, the deadliest human malaria parasite, contain predicted introns, yet experimental characterization of splicing in this organism remains incomplete. We present here a transcriptome-wide characterization of intraerythrocytic splicing events, as captured by RNA-Seq data from four timepoints of a single highly synchronous culture. Gene model-independent analysis of these data in conjunction with publically available RNA-Seq data with HMMSplicer, an in-house developed splice site detection algorithm, revealed a total of 977 new 5' GU-AG 3' and 5 new 5' GC-AG 3' junctions absent from gene models and ESTs (11% increase to the current annotation). In addition, 310 alternative splicing events were detected in 254 (4.5%) genes, most of which truncate open reading frames. Splicing events antisense to gene models were also detected, revealing complex transcriptional arrangements within the parasite's transcriptome. Interestingly, antisense introns overlap sense introns more than would be expected by chance, perhaps indicating a functional relationship between overlapping transcripts or an inherent organizational property of the transcriptome. Independent experimental validation confirmed over 30 new antisense and alternative junctions. Thus, this largest assemblage of new and alternative splicing events to date in Plasmodium falciparum provides a more precise, dynamic view of the parasite's transcriptome

    Animal, fungi, and plant genome sequences harbour different non-canonical splice sites

    Get PDF
    Frey K, Pucker B. Animal, fungi, and plant genome sequences harbour different non-canonical splice sites. Cells. 2020;9(2): 458.Most protein encoding genes in eukaryotes contain introns which are interwoven with exons. After transcription, introns need to be removed in order to generate the final mRNA which can be translated into an amino acid sequence by the ribosome. Precise excision of introns by the spliceosome requires conserved dinucleotides which mark the splice sites. However, there are variations of the highly conserved combination of GT at the 5' end and AG at the 3' end of an intron in the genome. GC-AG and AT-AC are two major non-canonical splice site combinations which are known for many years. During the last few years, various minor non-canonical splice site combinations were detected with all possible dinucleotide permutations. Here we expand systematic investigations of non-canonical splice site combinations in plant genomes to all eukaryotes by analysing fungal and animal genome sequences. Comparisons of splice site combinations between these three kingdoms revealed several differences such as a substantially increased CT-AC frequency in fungal genomes. In addition, high numbers of GA-AG splice site combinations were observed in two animal species. In depth investigation of splice site usage based on RNA-Seq read mappings indicates a generally higher flexibility of the 3' splice site compared to the 5' splice site

    Comprehensive splice-site analysis using comparative genomics

    Get PDF
    We have collected over half a million splice sites from five species—Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans and Arabidopsis thaliana—and classified them into four subtypes: U2-type GT–AG and GC–AG and U12-type GT–AG and AT–AC. We have also found new examples of rare splice-site categories, such as U12-type introns without canonical borders, and U2-dependent AT–AC introns. The splice-site sequences and several tools to explore them are available on a public website (SpliceRack). For the U12-type introns, we find several features conserved across species, as well as a clustering of these introns on genes. Using the information content of the splice-site motifs, and the phylogenetic distance between them, we identify: (i) a higher degree of conservation in the exonic portion of the U2-type splice sites in more complex organisms; (ii) conservation of exonic nucleotides for U12-type splice sites; (iii) divergent evolution of C.elegans 3′ splice sites (3′ss) and (iv) distinct evolutionary histories of 5′ and 3′ss. Our study proves that the identification of broad patterns in naturally-occurring splice sites, through the analysis of genomic datasets, provides mechanistic and evolutionary insights into pre-mRNA splicing

    Genome-wide analysis of alternative splicing in Chlamydomonas reinhardtii

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Genome-wide computational analysis of alternative splicing (AS) in several flowering plants has revealed that pre-mRNAs from about 30% of genes undergo AS. <it>Chlamydomonas</it>, a simple unicellular green alga, is part of the lineage that includes land plants. However, it diverged from land plants about one billion years ago. Hence, it serves as a good model system to study alternative splicing in early photosynthetic eukaryotes, to obtain insights into the evolution of this process in plants, and to compare splicing in simple unicellular photosynthetic and non-photosynthetic eukaryotes. We performed a global analysis of alternative splicing in <it>Chlamydomonas reinhardtii </it>using its recently completed genome sequence and all available ESTs and cDNAs.</p> <p>Results</p> <p>Our analysis of AS using BLAT and a modified version of the Sircah tool revealed AS of 498 transcriptional units with 611 events, representing about 3% of the total number of genes. As in land plants, intron retention is the most prevalent form of AS. Retained introns and skipped exons tend to be shorter than their counterparts in constitutively spliced genes. The splice site signals in all types of AS events are weaker than those in constitutively spliced genes. Furthermore, in alternatively spliced genes, the prevalent splice form has a stronger splice site signal than the non-prevalent form. Analysis of constitutively spliced introns revealed an over-abundance of motifs with simple repetitive elements in comparison to introns involved in intron retention. In almost all cases, AS results in a truncated ORF, leading to a coding sequence that is around 50% shorter than the prevalent splice form. Using RT-PCR we verified AS of two genes and show that they produce more isoforms than indicated by EST data. All cDNA/EST alignments and splice graphs are provided in a website at <url>http://combi.cs.colostate.edu/as/chlamy</url>.</p> <p>Conclusions</p> <p>The extent of AS in <it>Chlamydomonas </it>that we observed is much smaller than observed in land plants, but is much higher than in simple unicellular heterotrophic eukaryotes. The percentage of different alternative splicing events is similar to flowering plants. Prevalence of constitutive and alternative splicing in <it>Chlamydomonas</it>, together with its simplicity, many available public resources, and well developed genetic and molecular tools for this organism make it an excellent model system to elucidate the mechanisms involved in regulated splicing in photosynthetic eukaryotes.</p
    corecore