4,647 research outputs found
Inference of Functional Relations in Predicted Protein Networks with a Machine Learning Approach
Background: Molecular biology is currently facing the challenging task of functionally characterizing the proteome. The large number of possible protein-protein interactions and complexes, the variety of environmental conditions and cellular states in which these interactions can be reorganized, and the multiple ways in which a protein can influence the function of others require the development of experimental and computational approaches to analyze and predict functional associations between proteins as part of their activity in the interactome. Methodology/Principal Findings: We have studied the possibility of constructing a classifier to combine the output of several protein interaction prediction methods. The AODE (Averaged One-Dependence Estimators) machine learning algorithm is a suitable choice in this case: it provides better results than the individual prediction methods and performs better than the other alternative methods tested in this experimental setup. To illustrate the potential use of this new AODE-based Predictor of Protein InterActions (APPIA) when analyzing high-throughput experimental data, we show how it helps to filter the results of published high-throughput proteomic studies, ranking functionally related pairs in a significant way. Availability: All the predictions of the individual methods and of the combined APPIA predictor, together with the datasets of functional associations used, are available at http://ecid.bioinfo.cnio.es/. Conclusions: We propose a strategy that integrates the main current computational techniques used to predict functional associations into a unified classifier system, specifically focusing on the evaluation of poorly characterized protein pairs. We selected the AODE classifier as the appropriate tool to perform this task. AODE is particularly useful to extract valuable information from large unbalanced and heterogeneous data sets. 
The combination of the information provided by five interaction prediction methods with some simple sequence features in APPIA is useful in establishing reliability values and helpful to prioritize functional interactions that can be further experimentally characterized. This work was funded by the BioSapiens (grant number LSHG-CT-2003-503265) and the Experimental Network for Functional Integration (ENFIN) Networks of Excellence (contract number LSHG-CT-2005-518254), by Consolider BSC (grant number CSD2007-00050) and by the project “Functions for gene sets” from the Spanish Ministry of Education and Science (BIO2007-66855). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Transient protein-protein interface prediction: datasets, features, algorithms, and the RAD-T predictor
BACKGROUND: Transient protein-protein interactions (PPIs), which underlie most biological processes, are a prime target for therapeutic development. Immense progress has been made towards computational prediction of PPIs using methods such as protein docking and sequence analysis. However, docking generally requires high resolution structures of both of the binding partners and sequence analysis requires that a significant number of recurrent patterns exist for the identification of a potential binding site. Researchers have turned to machine learning to overcome some of the other methods’ restrictions by generalising interface sites with sets of descriptive features. Best practices for dataset generation, features, and learning algorithms have not yet been identified or agreed upon, and an analysis of the overall efficacy of machine learning based PPI predictors is due, in order to highlight potential areas for improvement. RESULTS: The presence of unknown interaction sites as a result of limited knowledge about protein interactions in the testing set dramatically reduces prediction accuracy. Greater accuracy in labelling the data by enforcing higher interface site rates per domain resulted in an average 44% improvement across multiple machine learning algorithms. A set of 10 biologically unrelated proteins that were consistently predicted with high accuracy emerged through our analysis. We identify seven features with the most predictive power over multiple datasets and machine learning algorithms. Through our analysis, we created a new predictor, RAD-T, that outperforms existing non-structurally specializing machine learning protein interface predictors, with an average 59% increase in MCC score on a dataset with a high number of interactions. CONCLUSION: Current methods of evaluating machine learning based PPI predictors tend to undervalue their performance, which may be artificially decreased by the presence of unidentified interaction sites. 
Changes to predictors’ training sets will be integral to the future progress of interface prediction by machine learning methods. We reveal the need for a larger test set of well-studied proteins or domain-specific scoring algorithms to compensate for poor interaction site identification on proteins in general.
Land subsidence susceptibility mapping in South Korea using machine learning algorithms
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. In this study, land subsidence susceptibility was assessed for a study area in South Korea using four machine learning models: Bayesian Logistic Regression (BLR), Support Vector Machine (SVM), Logistic Model Tree (LMT) and Alternate Decision Tree (ADTree). Eight conditioning factors, identified as the most important factors affecting land subsidence in the Jeong-am area, were applied in the modelling: slope angle, distance to drift, drift density, geology, distance to lineament, lineament density, land use and rock-mass rating (RMR). About 24 previous land subsidence occurrences were surveyed and used as a training dataset (70% of data) and a validation dataset (30% of data) in the modelling process. Each studied model generated a land subsidence susceptibility map (LSSM). The maps were verified using several appropriate tools, including statistical indices, the area under the receiver operating characteristic (AUROC), and success rate (SR) and prediction rate (PR) curves. The results of this study indicated that the BLR model produced an LSSM with higher accuracy and reliability than the other applied models, even though the other models also gave reasonable results.
The era of big data: Genome-scale modelling meets machine learning
With omics data being generated at an unprecedented rate, genome-scale modelling has become pivotal in its organisation and analysis. However, machine learning methods have been gaining ground in cases where knowledge is insufficient to represent the mechanisms underlying such data or as a means for data curation prior to attempting mechanistic modelling. We discuss the latest advances in genome-scale modelling and the development of optimisation algorithms for network and error reduction, intracellular constraining and applications to strain design. We further review applications of supervised and unsupervised machine learning methods to omics datasets from microbial and mammalian cell systems and present efforts to harness the potential of both modelling approaches through hybrid modelling.
Identifying metabolic enzymes with multiple types of association evidence
BACKGROUND: Existing large-scale metabolic models of sequenced organisms commonly include enzymatic functions which cannot be attributed to any gene in that organism. Existing computational strategies for identifying such missing genes rely primarily on sequence homology to known enzyme-encoding genes. RESULTS: We present a novel method for identifying genes encoding a specific metabolic function based on the local structure of the metabolic network and multiple types of functional association evidence, including clustering of genes on the chromosome, similarity of phylogenetic profiles, gene expression, protein fusion events and others. Using E. coli and S. cerevisiae metabolic networks, we illustrate the predictive ability of each individual type of association evidence and show that significantly better predictions can be obtained based on the combination of all data. In this way our method is able to predict 60% of enzyme-encoding genes of E. coli metabolism within the top 10 (out of 3551) candidates for their enzymatic function, and as the top candidate in 43% of the cases. CONCLUSION: We illustrate that a combination of genome context and other functional association evidence is effective in predicting genes encoding metabolic enzymes. Our approach does not rely on direct sequence homology to known enzyme-encoding genes, and can be used in conjunction with traditional homology-based metabolic reconstruction methods. The method can also be used to target orphan metabolic activities.
High-Throughput, Time-Resolved Mechanical Phenotyping of Prostate Cancer Cells
Worldwide, prostate cancer sits only behind lung cancer as the most commonly diagnosed form of the disease in men. Even the best diagnostic standards lack precision, presenting issues with false positives and unneeded surgical intervention for patients. This lack of clear-cut early diagnostic tools is a significant problem. We present a microfluidic platform, the Time-Resolved Hydrodynamic Stretcher (TR-HS), which allows the investigation of the dynamic mechanical response of thousands of cells per second to a non-destructive stress. The TR-HS integrates high-speed imaging and computer vision to automatically detect and track single cells suspended in a fluid and enables cell classification based on their mechanical properties. We demonstrate the discrimination of healthy and cancerous prostate cell lines based on the whole-cell, time-resolved mechanical response to a hydrodynamic load. Additionally, we implement a finite element method (FEM) model to characterise the forces responsible for the cell deformation in our device. Finally, we report the classification of the two different cell groups based on their time-resolved roundness using a decision tree classifier. This approach introduces a modality for high-throughput assessments of cellular suspensions and may represent a viable application for the development of innovative diagnostic devices.
Markov Models of Amino Acid Substitution to Study Proteins with Intrinsically Disordered Regions
Intrinsically disordered proteins (IDPs) or proteins with disordered regions (IDRs) do not have a well-defined tertiary structure, but perform a multitude of functions, often relying on their native disorder to achieve binding flexibility by changing to alternative conformations. Intrinsic disorder is frequently found in all three kingdoms of life, and may occur in short stretches or span whole proteins. To date, most studies contrasting ordered and disordered proteins have focused on simple summary statistics. Here, we propose an evolutionary approach to study IDPs, and contrast patterns specific to ordered protein regions and the corresponding IDRs. Two empirical Markov models of amino acid substitutions were estimated, based on a large set of multiple sequence alignments with experimentally verified annotations of disordered regions from the DisProt database of IDPs. We applied new methods to detect differences in Markovian evolution and evolutionary rates between IDRs and the corresponding ordered protein regions. Further, we investigated the distribution of IDPs among functional categories, biochemical pathways and their preponderance to contain tandem repeats. Disorder prediction using a phylogenetic Hidden Markov Model based on our matrices showed a performance similar to other disorder predictors.
Aerospace Medicine and Biology, a continuing bibliography with indexes
This bibliography lists 365 reports, articles and other documents introduced into the NASA scientific and technical information system in October 1984.