2,790 research outputs found

    Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning

    Get PDF
    For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation [1] of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%–13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology

    Fast splice site detection using information content and feature reduction

    Get PDF
    Background: Accurate identification of splice sites in DNA sequences plays a key role in the prediction of gene structure in eukaryotes. Already many computational methods have been proposed for the detection of splice sites and some of them showed high prediction accuracy. However, most of these methods are limited in terms of their long computation time when applied to whole genome sequence data. Results: In this paper we propose a hybrid algorithm which combines several effective and informative input features with the state of the art support vector machine (SVM). To obtain the input features we employ information content method based on Shannon\u27s information theory, Shapiro\u27s score scheme, and Markovian probabilities. We also use a feature elimination scheme to reduce the less informative features from the input data. Conclusion: In this study we propose a new feature based splice site detection method that shows improved acceptor and donor splice site detection in DNA sequences when the performance is compared with various state of the art and well known method

    Discerning Novel Splice Junctions Derived from RNA-Seq Alignment: A Deep Learning Approach

    Get PDF
    Background: Exon splicing is a regulated cellular process in the transcription of protein-coding genes. Technological advancements and cost reductions in RNA sequencing have made quantitative and qualitative assessments of the transcriptome both possible and widely available. RNA-seq provides unprecedented resolution to identify gene structures and resolve the diversity of splicing variants. However, currently available ab initio aligners are vulnerable to spurious alignments due to random sequence matches and sample-reference genome discordance. As a consequence, a significant set of false positive exon junction predictions would be introduced, which will further confuse downstream analyses of splice variant discovery and abundance estimation. Results: In this work, we present a deep learning based splice junction sequence classifier, named DeepSplice, which employs convolutional neural networks to classify candidate splice junctions. We show (I) DeepSplice outperforms state-of-the-art methods for splice site classification when applied to the popular benchmark dataset HS3D, (II) DeepSplice shows high accuracy for splice junction classification with GENCODE annotation, and (III) the application of DeepSplice to classify putative splice junctions generated by Rail-RNA alignment of 21,504 human RNA-seq data significantly reduces 43 million candidates into around 3 million highly confident novel splice junctions. Conclusions: A model inferred from the sequences of annotated exon junctions that can then classify splice junctions derived from primary RNA-seq data has been implemented. The performance of the model was evaluated and compared through comprehensive benchmarking and testing, indicating a reliable performance and gross usability for classifying novel splice junctions derived from RNA-seq alignment

    Prediction of Alternative Splice Sites in Human Genes

    Get PDF
    This thesis addresses the problem of predicting alternative splice sites in human genes. The most common way to identify alternative splice sites are the use of expressed sequence tags and microarray data. Since genes only produce alternative proteins under certain conditions, these methods are limited to detecting only alternative splice sites in genes whose alternative protein forms are expressed under the tested conditions. I have introduced three multiclass support vector machines that predict upstream and downstream alternative 3’ splice sites, upstream and downstream alternative 5’ splice sites, and the 3’ splice site of skipped and cryptic exons. On a test set extracted from the Alternative Splice Annotation Project database, I was able to correctly classify about 68% of the splice sites in the alternative 3’ set, about 62% of the splice sites in the alternative 5’ set, and about 66% in the exon skipping set

    SplicePort—An interactive splice-site analysis tool

    Get PDF
    SplicePort is a web-based tool for splice-site analysis that allows the user to make splice-site predictions for submitted sequences. In addition, the user can also browse the rich catalog of features that underlies these predictions, and which we have found capable of providing high classification accuracy on human splice sites. Feature selection is optimized for human splice sites, but the selected features are likely to be predictive for other mammals as well. With our interactive feature browsing and visualization tool, the user can view and explore subsets of features used in splice-site prediction (either the features that account for the classification of a specific input sequence or the complete collection of features). Selected feature sets can be searched, ranked or displayed easily. The user can group features into clusters and frequency plot WebLogos can be generated for each cluster. The user can browse the identified clusters and their contributing elements, looking for new interesting signals, or can validate previously observed signals. The SplicePort web server can be accessed at http://www.cs.umd.edu/projects/SplicePort and http://www.spliceport.org
    corecore