8,392 research outputs found
Diffusion Component Analysis: Unraveling Functional Topology in Biological Networks
Complex biological systems have been successfully modeled by biochemical and
genetic interaction networks, typically gathered from high-throughput (HTP)
data. These networks can be used to infer functional relationships between
genes or proteins. Using the intuition that the topological role of a gene in a
network relates to its biological function, local or diffusion based
"guilt-by-association" and graph-theoretic methods have had success in
inferring gene functions. Here we seek to improve function prediction by
integrating diffusion-based methods with a novel dimensionality reduction
technique to overcome the incomplete and noisy nature of network data. In this
paper, we introduce diffusion component analysis (DCA), a framework that plugs
in a diffusion model and learns a low-dimensional vector representation of each
node to encode the topological properties of a network. As a proof of concept,
we demonstrate DCA's substantial improvement over state-of-the-art
diffusion-based approaches in predicting protein function from molecular
interaction networks. Moreover, our DCA framework can integrate multiple
networks from heterogeneous sources, consisting of genomic information,
biochemical experiments and other resources, to even further improve function
prediction. Yet another layer of performance gain is achieved by integrating
the DCA framework with support vector machines that take our node vector
representations as features. Overall, our DCA framework provides a novel
representation of nodes in a network that can be used as a plug-in architecture
to other machine learning algorithms to decipher topological properties of and
obtain novel insights into interactomes.Comment: RECOMB 201
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
Herb Target Prediction Based on Representation Learning of Symptom related Heterogeneous Network.
Traditional Chinese Medicine (TCM) has received increasing attention as a complementary approach or alternative to modern medicine. However, experimental methods for identifying novel targets of TCM herbs heavily relied on the current available herb-compound-target relationships. In this work, we present an Herb-Target Interaction Network (HTINet) approach, a novel network integration pipeline for herb-target prediction mainly relying on the symptom related associations. HTINet focuses on capturing the low-dimensional feature vectors for both herbs and proteins by network embedding, which incorporate the topological properties of nodes across multi-layered heterogeneous network, and then performs supervised learning based on these low-dimensional feature representations. HTINet obtains performance improvement over a well-established random walk based herb-target prediction method. Furthermore, we have manually validated several predicted herb-target interactions from independent literatures. These results indicate that HTINet can be used to integrate heterogeneous information to predict novel herb-target interactions
Predicting protein-protein interactions as a one-class classification problem
Protein-protein interactions represent a key step in understanding proteins functions. This is due to the fact that proteins usually work in context of other proteins and rarely function alone. Machine learning techniques have been used to predict protein-protein interactions. However, most of these techniques address this problem as a binary classification problem. While it is easy to get a dataset of interacting protein as positive example, there is no experimentally confirmed non-interacting protein to be considered as a negative set. Therefore, in this paper we solve this problem as a one-class classification problem using One-Class SVM (OCSVM). Using only positive examples (interacting protein pairs) for training, the OCSVM achieves accuracy of 80%. These results imply that protein-protein interaction can be predicted using one-class classifier with reliable accuracy
Prediction of protein-protein interactions using one-class classification methods and integrating diverse data
This research addresses the problem of prediction of protein-protein interactions (PPI)
when integrating diverse kinds of biological information. This task has been commonly
viewed as a binary classification problem (whether any two proteins do or do not interact)
and several different machine learning techniques have been employed to solve this
task. However the nature of the data creates two major problems which can affect results.
These are firstly imbalanced class problems due to the number of positive examples (pairs
of proteins which really interact) being much smaller than the number of negative ones.
Secondly the selection of negative examples can be based on some unreliable assumptions
which could introduce some bias in the classification results.
Here we propose the use of one-class classification (OCC) methods to deal with the task of
prediction of PPI. OCC methods utilise examples of just one class to generate a predictive
model which consequently is independent of the kind of negative examples selected; additionally
these approaches are known to cope with imbalanced class problems. We have
designed and carried out a performance evaluation study of several OCC methods for this
task, and have found that the Parzen density estimation approach outperforms the rest. We
also undertook a comparative performance evaluation between the Parzen OCC method
and several conventional learning techniques, considering different scenarios, for example
varying the number of negative examples used for training purposes. We found that the
Parzen OCC method in general performs competitively with traditional approaches and in
many situations outperforms them. Finally we evaluated the ability of the Parzen OCC
approach to predict new potential PPI targets, and validated these results by searching for
biological evidence in the literature
- …