963 research outputs found

    Using Expressed Sequence Tags to Improve Gene Structure Annotation

    Finding all gene structures is a crucial step in obtaining valuable information from genomic sequences. It remains a challenging problem, especially for vertebrate genomes such as the human genome. Expressed Sequence Tags (ESTs) provide a tremendous resource for determining intron-exon structures. However, they are short and error-prone, which prevents existing methods from exploiting EST information efficiently. This dissertation addresses three aspects of using ESTs for gene structure annotation. The first aspect is using ESTs to improve de novo gene prediction. Probability models are introduced for EST alignments to genomic sequence in exons, introns, intergenic regions, splice sites and UTRs, representing the EST alignment patterns in these regions. New gene prediction systems were developed by combining the EST alignments with comparative genomics gene prediction systems, such as TWINSCAN and N-SCAN, so that they can predict gene structures more accurately where EST alignments exist without compromising their ability to predict gene structures where no EST exists. The accuracy of TWINSCAN_EST and NSCAN_EST is shown to be substantially better than that of any existing method that does not use full-length cDNA or protein similarity information. The second aspect is using ESTs and de novo gene prediction to guide biological experiments, such as finding full ORF-containing cDNA clones, which provide the most direct experimental evidence for gene structures. A probability model was introduced to guide experiments by summing over gene structure models consistent with EST alignments. The last aspect is a novel EST-to-genome alignment program called QPAIRAGON, which improves alignment accuracy by using EST sequencing quality values. Gene prediction accuracy can be improved by using this new EST-to-genome alignment program. It can also be used for many other bioinformatics applications, such as SNP finding and alternative splice site prediction.

    Integrating alternative splicing detection into gene prediction

    End-to-end learning framework for circular RNA classification from other long non-coding RNAs using multi-modal deep learning.

    Over the past two decades, a circular form of RNA (circular RNA) produced by the splicing mechanism has become the focus of scientific studies due to its major role as a microRNA (miR) activity modulator and its association with various diseases, including cancer. The detection of circular RNAs is therefore vital for continued understanding of their biogenesis and purpose. Prediction of circular RNA can be achieved by first distinguishing non-coding RNAs from protein-coding gene transcripts, then separating short and long non-coding RNAs (lncRNAs), and finally predicting circular RNAs from other lncRNAs. However, available tools to distinguish circular RNAs from other lncRNAs have only reached 80% accuracy, owing to the difficulty of separating circular RNAs from other lncRNAs. A faster, more accurate machine learning method for identifying circular RNAs, one that takes into account the specific features of circular RNA, is therefore essential for the development of systematic annotation. Here we present an End-to-End multimodal deep learning framework, our tool, to classify circular RNA from other lncRNA. It fuses an RCM descriptor, an ACNN-BLSTM sequence descriptor, and a conservation descriptor into high-level abstraction descriptors, where the shared representations across different modalities are integrated. The experiments show that our tool is not only faster than existing tools but also outperforms them by an increase of over 12% in accuracy. Another interesting result, found from analysis of the ACNN-BLSTM sequence descriptor, is that circular RNA sequences share the characteristics of the coding sequence.

    CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction

    CONTRAST is a gene predictor that directly incorporates information from multiple alignments and uses discriminative machine learning techniques to give large improvements in prediction over previous methods.

    Mining the genome for ‘known unknowns’

    Bioinformatics is a multidisciplinary area that combines two major fields: biology and computer science. It is one of the fastest-growing areas of investigation today, and it is fundamental to the processing of data and information from discoveries in genetics. A prominent branch of bioinformatics is gene prediction, where various tools are available to aid researchers and are also available to the general public. Although several gene prediction tools exist, the most widely used date from several years back. They are reliable, but some need optimization and are not very flexible to modify. Tools created in recent years base their models on earlier tools. In this dissertation, a new model is proposed. ORFs are extracted from the protein-coding sequences in a FASTA-formatted file supplied by the user and are compared to a target sequence of the user’s choice. A profile-HMM is used as the model to compare the sequences, returning a log-probability (Logp) value for each ORF compared with the target sequence; the more similar the compared sequences, the better the Logp value. Several scenarios were created in which the match, insert and delete state probabilities were modified to find the best configuration. The Viterbi algorithm was used to train the model, due to its speed. The results were in agreement with expectations: an ORF present in the target sequence had a better Logp value than an ORF from a randomly selected sequence.
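
The ORF extraction step described in this abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function name `find_orfs`, the `min_codons` parameter, the restriction to the forward strand, and the use of ATG with the three standard stop codons are all assumptions made for illustration.

```python
# Hypothetical sketch of forward-strand ORF extraction (not the author's code).
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}  # standard stop codons, assumed here


def find_orfs(seq, min_codons=1):
    """Return sorted (start, end, orf) tuples for the three forward frames.

    `end` is exclusive and the returned ORF includes its stop codon.
    `min_codons` counts the codons before the stop, including the start codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame one codon at a time.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # open an ORF at the first ATG seen
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None  # close the ORF and look for the next ATG
    return sorted(orfs)


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGGG"))
```

Each extracted ORF would then be scored against the target sequence by the profile-HMM; that scoring step is not sketched here because the abstract does not give the model's parameters.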

    Unsupervised and semi-supervised training methods for eukaryotic gene prediction

    This thesis describes new gene finding methods for eukaryotic gene prediction. Current methods for deriving model parameters for gene prediction algorithms are based on curated or experimentally validated sets of genes or gene elements. Assembling these training sets often requires time and additional expert effort, especially for species that are in the initial stages of genome sequencing. Unsupervised training allows determination of model parameters from anonymous genomic sequence. The practical applicability of unsupervised training is critical given the ever-growing rate of eukaryotic genome sequencing. Three distinct training procedures are developed for diverse groups of eukaryotic species. GeneMark-ES is developed for species with strong donor and acceptor site signals, such as Arabidopsis thaliana, Caenorhabditis elegans and Drosophila melanogaster. The second version of the algorithm, GeneMark-ES-2, introduces an enhanced intron model to better describe the gene structure of fungal species that possess relatively weak donor and acceptor splice sites and a well-conserved branch point signal. GeneMark-LE, a semi-supervised training approach, is designed for eukaryotic species with a small number of introns. The results indicate that the developed unsupervised training methods perform well compared to other training methods, as estimated from the set of genes supported by EST-to-genome alignments. Analysis of novel genomes reveals interesting biological findings and shows that several candidate under-annotated and over-annotated fungal species are present in the current set of annotated fungal genomes. Ph.D. Committee Chair: Mark Borodovsky; Committee Member: Jung H. Choi; Committee Member: King Jordan; Committee Member: Leonid Bunimovich; Committee Member: Yury Chernof

    Comparative analysis of plant genomes through data integration

    When we started our research in 2008, several online resources for genomics existed, each with a different focus. TAIR (The Arabidopsis Information Resource) focused on the plant model species Arabidopsis thaliana, with (at that time) little or no support for evolutionary or comparative genomics. Ensembl provided some basic tools and functions as a data warehouse, but it would only start incorporating plant genomes in 2010. However, no online resource at that time provided the data content and tools for plant comparative and evolutionary genomics that we required. As such, the plant community was missing an essential component needed to bring its research to the same level as the biomedicine-oriented research communities. We started to work on PLAZA in order to provide a data resource that could be accessed by the plant community, and which also contained the data content needed to support our research group’s focus on evolutionary genomics. The platform for comparative and evolutionary genomics, which we named PLAZA, was developed from scratch (i.e. not based on an existing database schema, such as Ensembl's). Gathering the data for all species, parsing this data into a common format and then uploading it into the database was the next step. We developed a processing pipeline, based on sequence similarity measurements, to group genes into gene families and subfamilies. Functional annotation was gathered both from the original data providers and through InterPro scans, combined with Interpro2GO. This primary data was then ready to be used in every subsequent analysis. Building such a database was good enough for research within our bioinformatics group, but the goal was to provide a comprehensive resource for all plant biologists with an interest in comparative and evolutionary genomics. Designing and creating a user-friendly, visually appealing web interface, connected to our database, was the next step. 
While the most detailed information is commonly presented in data tables, aesthetically pleasing graphics, images and charts are often used to visualize trends and general statistics, and are also used in specific tools. The design and development of these tools and visualizations is thus one of the core elements of my PhD. The PLAZA platform was designed as a gene-centric data resource, which is easily navigated when a biologist wants to study a relatively small number of genes. However, using the default PLAZA website to retrieve information for dozens of genes quickly becomes very tedious. Therefore a ’gene set’-centric extra layer was developed in which user-defined gene sets could be quickly analyzed. This extra layer, called the PLAZA workbench, functions on top of the normal PLAZA website, meaning that only gene sets from species present within the PLAZA database can be directly analyzed. The PLAZA resource for comparative and evolutionary genomics was a major success, but it still had several issues. We tried to solve at least two of these problems at the same time by creating a new platform. The first issue was the building procedure of PLAZA: adding a single species, or updating the structural annotation of an existing one, requires the total re-computation of the database content. The second issue was the restrictiveness of the PLAZA workbench: through a mapping procedure, gene sets could be entered for species not present in the PLAZA database, but for species without a phylogenetically close relative this approach did not always yield satisfying results. Furthermore, the research in question might focus precisely on the difference between a species present in PLAZA and a close relative not present in PLAZA (e.g. to study adaptation to a different ecological niche). In such a case, the mapping procedure is in itself useless. With the advent of NGS transcriptome data sets for a growing number of species, it was clear that a new challenge had presented itself. 
We designed and developed a new platform, named TRAPID, which could automatically process entire transcriptome data sets using a reference database. The goal was to have the processing done quickly, with the results containing both gene family oriented data (such as multiple sequence alignments and phylogenetic trees) and functional characterization of the transcripts. Major efforts went into designing the processing pipeline so that it would be reliable, fast and accurate.

    Transcript assembly and abundance estimation with high-throughput RNA sequencing

    We present algorithms and statistical methods for the reconstruction and abundance estimation of transcript sequences from high-throughput RNA sequencing ("RNA-Seq"). We evaluate these approaches through large-scale experiments on a well-studied model of muscle development. We begin with an overview of sequencing assays and outline why the short read alignment problem is fundamental to the analysis of these assays. We then describe two approaches to the contiguous alignment problem, one of which uses massively parallel graphics hardware to accelerate alignment, and one of which exploits an indexing scheme based on the Burrows-Wheeler transform. We then turn to the spliced alignment problem, which is fundamental to RNA-Seq, and present an algorithm, TopHat. TopHat is the first algorithm that can align the reads from an entire RNA-Seq experiment to a large genome without the aid of reference gene models. In the second part of the thesis, we present the first comparative RNA-Seq assembly algorithm, Cufflinks, which is adapted from a constructive proof of Dilworth's Theorem, a classic result in combinatorics. We evaluate Cufflinks by assembling the transcriptome from a time course RNA-Seq experiment of developing skeletal muscle cells. The assembly contains 13,689 known transcripts and 3,724 novel ones. Of the novel transcripts, 62% were strongly supported by earlier sequencing experiments or by homologous transcripts in other organisms. We further validated interesting genes with isoform-specific RT-PCR. We then present a statistical model for RNA-Seq, included in Cufflinks, with which we estimate abundances of transcripts from RNA-Seq data. Simulation studies demonstrate that the model is highly accurate. We apply this model to the muscle data, and track the abundances of individual isoforms over development. 
Finally, we present significance tests for changes in relative and absolute abundances between time points, which we employ to uncover differential expression and differential regulation. By testing for relative abundance changes within and between transcripts sharing a transcription start site, we find significant shifts in the rates of alternative splicing and promoter preference in hundreds of genes, including those believed to regulate muscle development.
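
For readers unfamiliar with the Burrows-Wheeler transform that underlies the indexing scheme mentioned in this abstract, here is a deliberately naive Python sketch of the transform and its inverse. It is for illustration only and is not the thesis's aligner: practical read aligners build the transform from a suffix array and search it with an FM-index, and the function names and the `$` sentinel are assumptions.

```python
# Naive Burrows-Wheeler transform, for illustration only (quadratic memory).
def bwt(text, sentinel="$"):
    """Return the last column of the sorted rotations of text + sentinel."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, sentinel="$"):
    """Rebuild the original text by repeatedly prepending and re-sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The row ending in the sentinel is the original text plus sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("GATTACA")
    print(transformed, inverse_bwt(transformed))
```

The transform tends to group identical characters that share a following context, which is what makes the compressed, searchable indexes used for short read alignment possible.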

    Polysomal mRNA Association and Gene Expression in Trypanosoma brucei

    Background: The contrasting physiological environments of Trypanosoma brucei procyclic (insect vector) and bloodstream (mammalian host) forms necessitate the deployment of different molecular processes and, therefore, changes in protein expression. Transcriptional regulation is unusual in T. brucei because the arrangement of genes is polycistronic; however, genes which are transcribed together are subsequently cleaved into separate mRNAs by trans-splicing. Following pre-mRNA processing, the regulation of mature mRNA stability is a tightly controlled cellular process. While many stage-specific transcripts have been identified, previous studies using RNA-seq suggest that changes in overall transcript level do not necessarily reflect the abundance of the corresponding protein. Methods: To better understand the regulation of gene expression in T. brucei, we performed a bioinformatic analysis of RNA-seq on total, sub-polysomal, and polysomal mRNA samples. We further cross-referenced our dataset with a previously published proteomics dataset to identify new protein-coding sequences. Results: Our analyses showed that several long non-coding RNAs are more abundant in the sub-polysome samples, which possibly implicates them in regulating cellular differentiation in T. brucei. We also improved the annotation of the T. brucei genome by identifying new putative protein-coding transcripts that were confirmed by mass spectrometry data. Conclusions: Several long non-coding RNAs are more abundant in the sub-polysome cellular fractions and might play a role in the regulation of gene expression. We hope that these data will be of wide general interest, as well as being of specific value to researchers studying the regulation of gene expression and life stage transitions in T. brucei.

    SUFFIX TREE, MINWISE HASHING AND STREAMING ALGORITHMS FOR BIG DATA ANALYSIS IN BIOINFORMATICS

    In this dissertation, we worked on several algorithmic problems in bioinformatics using mainly three approaches: (a) a streaming model, (b) suffix-tree based indexing, and (c) minwise-hashing (minhash) and locality-sensitive hashing (LSH). Streaming models are useful for large data problems where a good approximation needs to be achieved with limited space usage. We developed an approximation algorithm (Kmer-Estimate) using the streaming approach to obtain a better estimation of the frequency of k-mer counts. A k-mer, a contiguous subsequence of length k, plays an important role in many bioinformatics analyses such as genome distance estimation. We also developed new methods that use the suffix tree, a trie-based data structure, for alignment-free, non-pairwise algorithms for a conserved non-coding sequence (CNS) identification problem. We provided two different algorithms: STAG-CNS to identify exact-matched CNSs and DiCE to identify CNSs with mismatches. Using our algorithms, CNSs among various grass species were identified. A different approach was employed for identification of longer CNSs ( 100 bp, mostly found in animals). In our new method (MinCNE), the minhash approach was used to estimate the Jaccard similarity. Also using LSH, k-mers extracted from genomic sequences were clustered and CNSs were identified. Another new algorithm (MinIsoClust) that also uses minhash and LSH techniques was developed for an isoform clustering problem. Isoforms are generated from the same gene by alternative splicing. As the isoform sequences share some exons but in different combinations, regular sequence clustering methods do not work well. Our algorithm generates clusters for isoform sequences based on their shared minhash signatures. Finally, we discuss de novo transcriptome assembly algorithms and how to improve the assembly accuracy using ensemble approaches. First, we did a comprehensive performance analysis on different transcriptome assemblers using simulated benchmark datasets. 
Then, we developed a new ensemble approach (Minsemble) for the de novo transcriptome assembly problem that integrates isoform clustering using the minhash technique to identify potentially correct transcripts from various de novo transcriptome assemblers. Minsemble identified more correctly assembled transcripts as well as genes compared to other de novo and ensemble methods. Adviser: Jitender S. Deogu
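
The idea of estimating Jaccard similarity between k-mer sets with minhash, which this abstract relies on, can be sketched in a few lines of Python. This is a generic illustration and not the code of MinCNE, MinIsoClust or Minsemble: the function names, the choice of BLAKE2b as the seeded hash, and the default of 128 hash functions are all assumptions.

```python
# Generic minhash sketch over k-mer sets (illustrative, not the thesis code).
import hashlib


def kmers(seq, k):
    """Set of all contiguous substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def _seeded_hash(item, seed):
    """64-bit hash of `item` under hash function number `seed`."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """Minimum hash value per hash function; `items` must be non-empty."""
    return [min(_seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = kmers("ACGTACGTTGCA", 4)
    b = kmers("ACGTACGATGCA", 4)
    print(jaccard(a, b), estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

For each hash function, the probability that two sets share the same minimum equals their Jaccard similarity, so the fraction of agreeing positions is an unbiased estimate whose variance shrinks as `num_hashes` grows. Locality-sensitive hashing then groups sequences by bands of these signatures so that similar sets are likely to be compared.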