Ribonucleic acid (RNA) sequences are polymeric molecules ubiquitous in every living cell. RNA molecules mediate the flow of information from the DNA sequence to most functional elements in the cell. Therefore, it is of great interest in biological and biomedical research to associate RNA molecules with a biological function and to understand the mechanisms of their regulation. The goal of this study is the characterization of the RNA sequence composition of biological samples (the transcriptome) to facilitate the understanding of RNA function and regulation. Traditionally, a similar task has been addressed by algorithms called gene finding systems, which predict RNA sequences (transcripts) from features of the genomic DNA sequence. Because sufficient experimental evidence is lacking for most genes, these systems learn sequence patterns from the few genes with direct evidence and use them to identify many additional genes in the genome.
High-throughput sequencing of RNA (RNA-Seq) has recently become a powerful technology for studying the transcriptome. This technology identifies millions of short RNA fragments (reads of ≈100 letters in length), which provide direct evidence for a large fraction of the genes. However, the analysis of RNA-Seq data faces profound challenges.
Firstly, the distribution of RNA-Seq reads is highly uneven among genes, so that a considerable fraction of genes is covered by very few reads. In addition, the stochastic nature of the technology leads to gaps even for well-covered genes. To accurately predict transcripts in cases with incomplete evidence, we need to combine RNA-Seq evidence with features derived from the genomic DNA sequence. We therefore developed a method to learn the integration of both information sources and implemented this strategy as an extension of the gene finder mGene. The system, now called mGene.ngs, determines close approximations of potentially non-linear transformations for all features on the training set, such that the prediction performance is maximized. With this ability, which is to our knowledge unique among gene finding systems, mGene.ngs can not only learn complex relationships between the two information sources mentioned above, but also gains the flexibility to take many additional information sources into account. mGene.ngs has been independently evaluated within the context of an international competition (RGASP) for RNA-Seq-based reannotation and has shown very favourable performance for two out of three model organisms. Moreover, we generated and analyzed RNA-Seq-based annotations for 20 Arabidopsis thaliana strains to facilitate a deeper understanding of phenotypic variation in this natural plant population.
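To make the idea of approximating a non-linear feature transformation more concrete, the following minimal sketch evaluates a piecewise-linear function defined by a set of support points. It is an illustration only and not the mGene.ngs implementation: the function name `piecewise_linear`, the support points and the score values are hypothetical and chosen by hand, whereas a learning system would fit such values on the training set.

```python
from bisect import bisect_right


def piecewise_linear(x, knots, values):
    """Evaluate a piecewise-linear function at x.

    `knots` are strictly increasing support points and `values` the function
    values at those points. Outside the knot range the function is held
    constant at the nearest boundary value.
    """
    if len(knots) != len(values) or len(knots) < 2:
        raise ValueError("need at least two knots with matching values")
    if x <= knots[0]:
        return values[0]
    if x >= knots[-1]:
        return values[-1]
    # Index of the segment [knots[i], knots[i + 1]) that contains x.
    i = bisect_right(knots, x) - 1
    t = (x - knots[i]) / (knots[i + 1] - knots[i])
    return values[i] + t * (values[i + 1] - values[i])


# Hypothetical example: map a raw read-coverage value to a score.
# These numbers are assumptions for illustration, not learned parameters.
coverage_knots = [0.0, 1.0, 10.0, 100.0]
score_values = [-1.0, 0.0, 0.8, 1.0]

score = piecewise_linear(5.5, coverage_knots, score_values)
```

With enough support points, a function of this form can approximate a non-linear relationship between a raw feature, such as read coverage, and its contribution to a prediction score.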
A second major challenge in transcriptome reconstruction lies in the complexity of the transcriptome itself. A process called alternative splicing generates multiple mature RNA sequences from a single primary RNA sequence by cutting out so-called introns, typically in a tightly regulated manner. The inference algorithms of almost all gene finding systems are limited to predicting transcripts that do not overlap in their genomic region of origin. To overcome this limitation, purely RNA-Seq-based approaches have been developed. However, biologically implausible assumptions or the neglect of available information have often led to unsatisfactory results. A major contribution of this study is the integer optimization-based transcriptome reconstruction approach MiTie. MiTie utilizes a biologically motivated loss function, can take advantage of a priori known genome annotations, and gains predictive power by considering multiple RNA-Seq samples simultaneously. Based on simulated data for the human genome as well as on an extensive RNA-Seq data set for the model organism Drosophila melanogaster, we show that MiTie predicts transcripts significantly more accurately than state-of-the-art methods such as Cufflinks and Trinity.
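The way alternative splicing yields several mature sequences from one primary sequence can be illustrated with a small sketch. It is a toy model under stated assumptions and unrelated to the MiTie implementation: the sequences, the intron coordinates (0-based, half-open intervals) and the function name `splice` are all hypothetical.

```python
def splice(primary, introns):
    """Return the mature sequence obtained by removing introns from primary.

    `introns` is a list of (start, end) pairs, 0-based and half-open,
    which must be non-empty, non-overlapping and within the sequence.
    """
    mature = []
    pos = 0
    for start, end in sorted(introns):
        if start < pos or end > len(primary) or start >= end:
            raise ValueError("introns must be non-empty, non-overlapping and in range")
        mature.append(primary[pos:start])  # keep the exon before this intron
        pos = end                          # skip over the intron
    mature.append(primary[pos:])           # keep the final exon
    return "".join(mature)


# Hypothetical primary transcript: exon1, intron1, exon2, intron2, exon3.
exon1, intron1, exon2, intron2, exon3 = "AUGGC", "GUAAAG", "CCU", "GUCCAG", "UGA"
primary = exon1 + intron1 + exon2 + intron2 + exon3

# Isoform A removes both introns and keeps all three exons.
isoform_a = splice(primary, [(5, 11), (14, 20)])
# Isoform B removes one long intron spanning exon2 (exon skipping).
isoform_b = splice(primary, [(5, 20)])
```

Both isoforms originate from the same genomic region and share exons, which is exactly the situation that methods restricted to non-overlapping transcripts cannot represent.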