2,996 research outputs found
A Factor Graph Approach to Automated GO Annotation
As volume of genomic data grows, computational methods become essential for providing a first glimpse onto gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum.Fil: Spetale, Flavio Ezequiel. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas. Universidad Nacional de Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; ArgentinaFil: Krsticevic, Flavia Jorgelina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas. Universidad Nacional de Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; ArgentinaFil: Roda, Fernando. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas. Universidad Nacional de Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; ArgentinaFil: Bulacio, Pilar Estela. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas. Universidad Nacional de Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Argentin
Recommended from our members
An Overview of the Use of Neural Networks for Data Mining Tasks
In the recent years the area of data mining has experienced a considerable demand for technologies that extract knowledge from large and complex data sources. There is a substantial commercial interest as well as research investigations in the area that aim to develop new and improved approaches for extracting information, relationships, and patterns from datasets. Artificial Neural Networks (NN) are popular biologically inspired intelligent methodologies, whose classification, prediction and pattern recognition capabilities have been utilised successfully in many areas, including science, engineering, medicine, business, banking, telecommunication, and many other fields. This paper highlights from a data mining perspective the implementation of NN, using supervised and unsupervised learning, for pattern recognition, classification, prediction and cluster analysis, and focuses the discussion on their usage in bioinformatics and financial data analysis tasks
Linear and Order Statistics Combiners for Pattern Classification
Several researchers have experimentally shown that substantial improvements
can be obtained in difficult pattern recognition problems by combining or
integrating the outputs of multiple classifiers. This chapter provides an
analytical framework to quantify the improvements in classification results due
to combining. The results apply to both linear combiners and order statistics
combiners. We first show that to a first order approximation, the error rate
obtained over and above the Bayes error rate, is directly proportional to the
variance of the actual decision boundaries around the Bayes optimum boundary.
Combining classifiers in output space reduces this variance, and hence reduces
the "added" error. If N unbiased classifiers are combined by simple averaging,
the added error rate can be reduced by a factor of N if the individual errors
in approximating the decision boundaries are uncorrelated. Expressions are then
derived for linear combiners which are biased or correlated, and the effect of
output correlations on ensemble performance is quantified. For order statistics
based non-linear combiners, we derive expressions that indicate how much the
median, the maximum and in general the ith order statistic can improve
classifier performance. The analysis presented here facilitates the
understanding of the relationships among error rates, classifier boundary
distributions, and combining in output space. Experimental results on several
public domain data sets are provided to illustrate the benefits of combining
and to support the analytical results.Comment: 31 page
Predicting gene function using hierarchical multi-label decision tree ensembles
<p>Abstract</p> <p>Background</p> <p><it>S. cerevisiae</it>, <it>A. thaliana </it>and <it>M. musculus </it>are well-studied organisms in biology and the sequencing of their genomes was completed many years ago. It is still a challenge, however, to develop methods that assign biological functions to the ORFs in these genomes automatically. Different machine learning methods have been proposed to this end, but it remains unclear which method is to be preferred in terms of predictive performance, efficiency and usability.</p> <p>Results</p> <p>We study the use of decision tree based models for predicting the multiple functions of ORFs. First, we describe an algorithm for learning hierarchical multi-label decision trees. These can simultaneously predict all the functions of an ORF, while respecting a given hierarchy of gene functions (such as FunCat or GO). We present new results obtained with this algorithm, showing that the trees found by it exhibit clearly better predictive performance than the trees found by previously described methods. Nevertheless, the predictive performance of individual trees is lower than that of some recently proposed statistical learning methods. We show that ensembles of such trees are more accurate than single trees and are competitive with state-of-the-art statistical learning and functional linkage methods. Moreover, the ensemble method is computationally efficient and easy to use.</p> <p>Conclusions</p> <p>Our results suggest that decision tree based methods are a state-of-the-art, efficient and easy-to-use approach to ORF function prediction.</p
Improving protein fold recognition using the amalgamation of evolutionary-based and structural-based information
Deciphering three dimensional structure of a protein sequence is a challenging task in biological science. Protein
fold recognition and protein secondary structure prediction are transitional steps in identifying the three
dimensional structure of a protein. For protein fold recognition, evolutionary-based information of amino acid
sequences from the position specific scoring matrix (PSSM) has been recently applied with improved results. On
the other hand, the SPINE-X predictor has been developed and applied for protein secondary structure prediction.
Several reported methods for protein fold recognition have only limited accuracy. In this paper, we have
developed a strategy of combining evolutionary-based information (from PSSM) and predicted secondary structure
using SPINE-X to improve protein fold recognition. The strategy is based on finding the probabilities of amino acid
pairs (AAP). The proposed method has been tested on several protein benchmark datasets and an improvement of
8.9% recognition accuracy has been achieved. We have achieved, for the first time over 90% and 75% prediction
accuracies for sequence similarity values below 40% and 25%, respectively. We also obtain 90.6% and 77.0%
prediction accuracies, respectively, for the Extended Ding and Dubchak and Taguchi and Gromiha benchmark
protein fold recognition datasets widely used for in the literature
A study of hierarchical and flat classification of proteins
Automatic classification of proteins using machine learning is an important problem that has received significant attention in the literature. One feature of this problem is that expert-defined hierarchies of protein classes exist and can potentially be exploited to improve classification performance. In this article we investigate empirically whether this is the case for two such hierarchies. We compare multi-class classification techniques that exploit the information in those class hierarchies and those that do not, using logistic regression, decision trees, bagged decision trees, and support vector machines as the underlying base learners. In particular, we compare hierarchical and flat variants of ensembles of nested dichotomies. The latter have been shown to deliver strong classification performance in multi-class settings. We present experimental results for synthetic, fold recognition, enzyme classification, and remote homology detection data. Our results show that exploiting the class hierarchy improves performance on the synthetic data, but not in the case of the protein classification problems. Based on this we recommend that strong flat multi-class methods be used as a baseline to establish the benefit of exploiting class hierarchies in this area
Machine learning methods for omics data integration
High-throughput technologies produce genome-scale transcriptomic and metabolomic (omics) datasets that allow for the system-level studies of complex biological processes. The limitation lies in the small number of samples versus the larger number of features represented in these datasets. Machine learning methods can help integrate these large-scale omics datasets and identify key features from each dataset. A novel class dependent feature selection method integrates the F statistic, maximum relevance binary particle swarm optimization (MRBPSO), and class dependent multi-category classification (CDMC) system. A set of highly differentially expressed genes are pre-selected using the F statistic as a filter for each dataset. MRBPSO and CDMC function as a wrapper to select desirable feature subsets for each class and classify the samples using those chosen class-dependent feature subsets. The results indicate that the class-dependent approaches can effectively identify unique biomarkers for each cancer type and improve classification accuracy compared to class independent feature selection methods. The integration of transcriptomics and metabolomics data is based on a classification framework. Compared to principal component analysis and non-negative matrix factorization based integration approaches, our proposed method achieves 20-30% higher prediction accuracies on Arabidopsis tissue development data. Metabolite-predictive genes and gene-predictive metabolites are selected from transcriptomic and metabolomic data respectively. The constructed gene-metabolite correlation network can infer the functions of unknown genes and metabolites. Tissue-specific genes and metabolites are identified by the class-dependent feature selection method. Evidence from subcellular locations, gene ontology, and biochemical pathways support the involvement of these entities in different developmental stages and tissues in Arabidopsis
- …