Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning

Author: Bernhard Schölkopf
Gunnar Rätsch
Hanh Witte
Jagan Srinivasan
Klaus-R Müller
Ralf-J Sommer
Sören Sonnenburg
The Caenorhabditis elegans sequencing consortium
Uwe Ohler
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2007

For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions; we therefore hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the WormBase WS120 annotation [1] of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth in only 10%–13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Fraunhofer-ePrints

Directory of Open Access Journals

PubMed Central

Caltech Authors

MPG.PuRe

Extraction of Transcript Diversity from Scientific Literature

Author: Lars J Jensen
Parantu K Shah
Peer Bork
Philip Bourne
Stéphanie Boué
Publication venue: Public Library of Science
Publication date: 01/01/2005

Transcript diversity generated by alternative splicing and associated mechanisms contributes heavily to the functional complexity of biological systems. The numerous examples of the mechanisms and functional implications of these events are scattered throughout the scientific literature. Thus, it is crucial to have a tool that can automatically extract the relevant facts and collect them in a knowledge base that can aid the interpretation of data from high-throughput methods. We have developed and applied a composite text-mining method for extracting information on transcript diversity from the entire MEDLINE database in order to create a database of genes with alternative transcripts. It contains information on tissue specificity, number of isoforms, causative mechanisms, functional implications, and experimental methods used for detection. We have mined this resource to identify 959 instances of tissue-specific splicing. Our results in combination with those from EST-based methods suggest that alternative splicing is the preferred mechanism for generating transcript diversity in the nervous system. We provide new annotations for 1,860 genes with the potential for generating transcript diversity. We assign the MeSH term “alternative splicing” to 1,536 additional abstracts in the MEDLINE database and suggest new MeSH terms for other events. We have successfully extracted information about transcript diversity and semiautomatically generated a database, LSAT, that can provide a quantitative understanding of the mechanisms behind tissue-specific gene expression. LSAT (Literature Support for Alternative Transcripts) is publicly available at http://www.bork.embl.de/LSAT/

CiteSeerX

Crossref

Directory of Open Access Journals

PubMed Central

MDC Repository

FigShare

Fast splice site detection using information content and feature reduction

Author: AKMA Baten
BCH Chang
C Burge
C Cortes
CE Shannon
D Cai
G Dror
G Ratsch
G Yeo
H Drucker
H Itoh
H Liu
JCaHLS Rajapakse
JSaRD Chuang
L Zhang
M Burset
M Pertea
M Zhang
MB Shapiro
MG Reese
N Cristianini
P Waddell
R Castelo
S Brunak
S Buckingham
S Degroeve
S Salzberg
S Sonnenburg
S Washietl
SA Marashi
SK Halgamuge
SM Hebsgaard
T Golub
T-M Chen
TD Schneider
V Vapnik
XH-F Zhang
Y Saeys
YF Sun
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Accurate identification of splice sites in DNA sequences plays a key role in the prediction of gene structure in eukaryotes. Many computational methods have already been proposed for the detection of splice sites, and some of them have shown high prediction accuracy. However, most of these methods are limited by their long computation time when applied to whole-genome sequence data. Results: In this paper we propose a hybrid algorithm which combines several effective and informative input features with a state-of-the-art support vector machine (SVM). To obtain the input features we employ an information content method based on Shannon's information theory, Shapiro's score scheme, and Markovian probabilities. We also use a feature elimination scheme to remove the less informative features from the input data. Conclusion: In this study we propose a new feature-based splice site detection method that shows improved acceptor and donor splice site detection in DNA sequences when its performance is compared with various state-of-the-art and well-known methods.
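The abstract above names an information content measure based on Shannon's information theory as one source of input features. A minimal sketch of that general idea follows; it is not the authors' feature extraction. The function name `position_information`, the toy donor-site windows, and the uniform background assumption (2 bits maximum for DNA) are all hypothetical choices made for illustration, and no small-sample correction is applied.

```python
import math
from collections import Counter


def position_information(seqs, alphabet="ACGT"):
    """Per-position information content (in bits) of equal-length sequences.

    Computed as log2(|alphabet|) minus the Shannon entropy of the observed
    base frequencies in each column, assuming a uniform background.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    max_bits = math.log2(len(alphabet))
    info = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts[b] for b in alphabet)
        if total == 0:
            # Column holds no recognised bases (e.g. only 'N'): no information.
            info.append(0.0)
            continue
        entropy = 0.0
        for b in alphabet:
            p = counts[b] / total
            if p > 0:
                entropy -= p * math.log2(p)
        info.append(max_bits - entropy)
    return info


# Hypothetical toy windows around a donor site; columns 3 and 4 hold the GT.
donor_windows = ["AAGGTAAGT", "CAGGTGAGT", "AAGGTAAGA", "TCGGTAAGC"]
bits = position_information(donor_windows)
```

Columns that are fully conserved reach the 2-bit maximum, while variable columns score lower, so a vector like `bits` could be used to decide which positions are worth passing to a classifier.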

ePublications@SCU

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

University of Melbourne Institutional Repository

αCP binding to a cytosine-rich subset of polypyrimidine tracts drives a novel pathway of cassette exon splicing in the mammalian transcriptome.

Author: Bahrami-Samani Emad
Duncan-Lewis Christopher
Ji Xinjun
Liebhaber Stephen A
Lin Lan
Park Juw Won
Pherribo Gordon
Xing Yi
Publication venue: eScholarship, University of California
Publication date: 20/02/2016

Alternative splicing (AS) is a robust generator of mammalian transcriptome complexity. Splice site specification is controlled by interactions of cis-acting determinants on a transcript with specific RNA binding proteins. These interactions are frequently localized to the intronic U-rich polypyrimidine tracts (PPT) located 5' to the majority of splice acceptor junctions. αCPs (also referred to as polyC-binding proteins (PCBPs) and hnRNPEs) comprise a subset of KH-domain proteins with high affinity and specificity for C-rich polypyrimidine motifs. Here, we demonstrate that αCPs promote the splicing of a defined subset of cassette exons via binding to a C-rich subset of polypyrimidine tracts located 5' to the αCP-enhanced exonic segments. This enhancement of splice acceptor activity is linked to interactions of αCPs with the U2 snRNP complex and may be mediated by cooperative interactions with the canonical polypyrimidine tract binding protein, U2AF65. Analysis of αCP-targeted exons predicts a substantial impact on fundamental cell functions. These findings lead us to conclude that the αCPs play a direct and global role in modulating the splicing activity and inclusion of an array of cassette exons, thus driving a novel pathway of splice site regulation within the mammalian transcriptome.

PubMed Central

eScholarship - University of California

Accurate splice site prediction using support vector machines

Author: Gabriele Schweikert
Gunnar Rätsch
Jonas Behr
Petra Philips
Sören Sonnenburg
Publication venue: BioMed Central
Publication date: 01/01/2007

Proceeding

CiteSeerX

Crossref

Springer - Publisher Connector

Fraunhofer-ePrints

PubMed Central

MPG.PuRe

Kernel methods in genomics and computational biology

Author: Vert Jean-Philippe
Publication date: 17/10/2005

Support vector machines and kernel methods are increasingly popular in genomics and computational biology, due to their good performance in real-world applications and strong modularity that makes them suitable to a wide range of problems, from the classification of tumors to the automatic annotation of proteins. Their ability to work in high dimension, to process non-vectorial data, and the natural framework they provide to integrate heterogeneous data are particularly relevant to various problems arising in computational biology. In this chapter we survey some of the most prominent applications published so far, highlighting the particular developments in kernel methods triggered by problems in biology, and mention a few promising research directions likely to expand in the future.
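The abstract above notes that kernel methods can process non-vectorial data such as biological sequences. One well-known example of this idea is a k-mer (spectrum) string kernel, sketched below. The abstract does not name this kernel, so it is shown purely as an illustration; the function names `kmer_counts` and `spectrum_kernel` and the default `k=3` are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k=3):
    """Inner product of the k-mer count vectors of two sequences.

    The count vectors are never materialised over all possible k-mers;
    only k-mers present in x are visited, and a Counter returns 0 for
    any k-mer absent from y.
    """
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(n * cy[kmer] for kmer, n in cx.items())
```

Because the function is an inner product in k-mer count space, it is symmetric, and a matrix of pairwise values could in principle be supplied as a precomputed kernel to an SVM implementation that accepts one.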

arXiv.org e-Print Archive

HAL-MINES ParisTech