799 research outputs found

    Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning

    For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions; we therefore hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the WormBase WS120 annotation [1] of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth in only 10%–13% of cases. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.

    The Feature Importance Ranking Measure

    The most accurate predictions are typically obtained by learning machines with complex feature spaces (e.g., those induced by kernels). Unfortunately, such decision rules are hardly accessible to humans and cannot easily be used to gain insights about the application domain. Therefore, one often resorts to linear models in combination with variable selection, thereby sacrificing some predictive power for presumptive interpretability. Here, we introduce the Feature Importance Ranking Measure (FIRM), which, by retrospective analysis of arbitrary learning machines, makes it possible to achieve both excellent predictive performance and superior interpretation. In contrast to standard raw feature weighting, FIRM takes the underlying correlation structure of the features into account. Thereby, it is able to discover the most relevant features, even if their appearance in the training data is entirely prevented by noise. The desirable properties of FIRM are investigated analytically and illustrated in simulations. Comment: 15 pages, 3 figures. To appear in the Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), 200

    KIRMES: kernel-based identification of regulatory modules in euchromatic sequences

    Motivation: Understanding transcriptional regulation is one of the main challenges in computational biology. An important problem is the identification of transcription factor (TF) binding sites in promoter regions of potential TF target genes. It is typically approached with position weight matrix-based motif identification algorithms that use Gibbs sampling or heuristics to extend seed oligos. Such algorithms succeed in identifying single, relatively well-conserved binding sites, but tend to fail when it comes to identifying combinations of several degenerate binding sites, such as those often found in cis-regulatory modules.
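
    As a rough illustration of the position weight matrix (PWM) approach mentioned in this abstract, and not of the KIRMES method itself, the sketch below builds a log-odds PWM from a handful of aligned binding sites and scans a sequence for the best-scoring window. The function names, the pseudocount, the uniform 0.25 background, and the toy sites are all assumptions made for this example.

```python
import math

def build_pwm(sites, pseudocount=1.0, alphabet="ACGT"):
    """Build a log-odds PWM from equal-length aligned sites (uniform background assumed)."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {base: pseudocount for base in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # Log-odds of the column frequency against a 0.25 background.
        pwm.append({base: math.log2((counts[base] / total) / 0.25) for base in alphabet})
    return pwm

def score_window(pwm, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    hits = [
        (score_window(pwm, sequence[start:start + width]), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(hits)

# Toy aligned sites (hypothetical data, for illustration only).
toy_sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
toy_pwm = build_pwm(toy_sites)
score, start = best_hit(toy_pwm, "GGCGTATAATGCGC")
```

    A single PWM scores each position independently, which is consistent with the abstract's remark that such models handle single well-conserved sites better than combinations of degenerate ones.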

    Evaluation of antigens for the serodiagnosis of kala-azar and oriental sores by means of the indirect immunofluorescence antibody test (IFAT)

    Antigens and corresponding sera were collected from travellers with leishmaniasis returning to Germany from different endemic areas of the Old World. The antigenicity of these Leishmania strains, which were maintained in Syrian hamsters, was compared by the indirect immunofluorescence antibody test (IFAT). Antigenicity was demonstrated by antibody titres in 18 sera from 11 patients. The amastigote stages of nine strains of Leishmania donovani and four strains of Leishmania tropica were compared with each other and with the culture forms of two insect flagellates (Strigomonas oncopelti and Leptomonas ctenocephali). Eighteen sera from 11 patients were available for antibody determination with these antigens. The maximal antibody titres in a single serum varied considerably depending on which antigen was used for the test. High antibody titres could be obtained only when Leishmania donovani was employed as the antigen, but considerable differences also occurred between the different strains of this species. The other antigens were weaker. No differences in antigenicity between amastigotes and promastigotes of the same strain were observed. It is important to select suitable antigens. Low titres may be of doubtful specificity and are a poor baseline for the fall in titre which is an essential index of effective treatment.

    mGene.web: a web service for accurate computational gene finding

    We describe mGene.web, a web service for the genome-wide prediction of protein coding genes from eukaryotic DNA sequences. It offers pre-trained models for the recognition of gene structures including untranslated regions in an increasing number of organisms. With mGene.web, users can additionally train the system with their own data for other organisms at the push of a button, a functionality that will greatly accelerate the annotation of newly sequenced genomes. The system is built in a highly modular way, such that individual components of the framework, like the promoter prediction tool or the splice site predictor, can be used autonomously. The underlying gene finding system mGene is based on discriminative machine learning techniques, and its high accuracy has been demonstrated in an international competition on nematode genomes. mGene.web is available at http://www.mgene.org/web; it is free of charge and can be used for eukaryotic genomes of small to moderate size (several hundred Mbp).

    Probabilistic Clustering of Time-Evolving Distance Data

    We present a novel probabilistic clustering model for objects that are represented via pairwise distances and observed at different time points. The proposed method utilizes the information given by adjacent time points to find the underlying cluster structure and obtain a smooth cluster evolution. This approach allows the number of objects and clusters to differ at every time point, and the identities of the objects need not be known. Further, the model does not require the number of clusters to be specified in advance; it is instead determined automatically using a Dirichlet process prior. We validate our model on synthetic data, showing that the proposed method is more accurate than state-of-the-art clustering methods. Finally, we use our dynamic clustering model to analyze and illustrate the evolution of brain cancer patients over time.
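
    To give a feel for how a Dirichlet process prior lets the number of clusters emerge from the data rather than being fixed beforehand, the sketch below draws a random partition from the Chinese restaurant process, a standard way of sampling the partitions a Dirichlet process induces. It is not the time-evolving model described in the abstract; the function name, the concentration value `alpha`, and the fixed seed are assumptions made for this example.

```python
import random

def crp_partition(n_objects, alpha, seed=0):
    """Sample cluster assignments for n_objects from a Chinese restaurant process.

    alpha > 0 is the concentration parameter: larger values favour more clusters.
    """
    rng = random.Random(seed)
    assignments = []
    cluster_sizes = []
    for _ in range(n_objects):
        # Join existing cluster k with weight size_k, or open a new one with weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)   # a new cluster is created
        else:
            cluster_sizes[k] += 1     # an existing cluster grows
        assignments.append(k)
    return assignments, cluster_sizes

# The number of clusters is an outcome of sampling, not an input.
assignments, cluster_sizes = crp_partition(n_objects=50, alpha=1.0)
n_clusters = len(cluster_sizes)
```

    Rerunning with a larger `alpha` tends to yield more clusters, which is the sense in which the prior, rather than the user, controls model complexity.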