21,337 research outputs found
Diffusion Component Analysis: Unraveling Functional Topology in Biological Networks
Complex biological systems have been successfully modeled by biochemical and
genetic interaction networks, typically gathered from high-throughput (HTP)
data. These networks can be used to infer functional relationships between
genes or proteins. Using the intuition that the topological role of a gene in a
network relates to its biological function, local or diffusion based
"guilt-by-association" and graph-theoretic methods have had success in
inferring gene functions. Here we seek to improve function prediction by
integrating diffusion-based methods with a novel dimensionality reduction
technique to overcome the incomplete and noisy nature of network data. In this
paper, we introduce diffusion component analysis (DCA), a framework that plugs
in a diffusion model and learns a low-dimensional vector representation of each
node to encode the topological properties of a network. As a proof of concept,
we demonstrate DCA's substantial improvement over state-of-the-art
diffusion-based approaches in predicting protein function from molecular
interaction networks. Moreover, our DCA framework can integrate multiple
networks from heterogeneous sources, consisting of genomic information,
biochemical experiments and other resources, to even further improve function
prediction. Yet another layer of performance gain is achieved by integrating
the DCA framework with support vector machines that take our node vector
representations as features. Overall, our DCA framework provides a novel
representation of nodes in a network that can be used as a plug-in architecture
to other machine learning algorithms to decipher topological properties of and
obtain novel insights into interactomes.Comment: RECOMB 201
Automated data integration for developmental biological research
In an era exploding with genome-scale data, a major challenge for developmental biologists is how to extract significant clues from these publicly available data to benefit our studies of individual genes, and how to use them to improve our understanding of development at a systems level. Several studies have successfully demonstrated new approaches to classic developmental questions by computationally integrating various genome-wide data sets. Such computational approaches have shown great potential for facilitating research: instead of testing 20,000 genes, researchers might test 200 to the same effect. We discuss the nature and state of this art as it applies to developmental research
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
Herb Target Prediction Based on Representation Learning of Symptom related Heterogeneous Network.
Traditional Chinese Medicine (TCM) has received increasing attention as a complementary approach or alternative to modern medicine. However, experimental methods for identifying novel targets of TCM herbs heavily relied on the current available herb-compound-target relationships. In this work, we present an Herb-Target Interaction Network (HTINet) approach, a novel network integration pipeline for herb-target prediction mainly relying on the symptom related associations. HTINet focuses on capturing the low-dimensional feature vectors for both herbs and proteins by network embedding, which incorporate the topological properties of nodes across multi-layered heterogeneous network, and then performs supervised learning based on these low-dimensional feature representations. HTINet obtains performance improvement over a well-established random walk based herb-target prediction method. Furthermore, we have manually validated several predicted herb-target interactions from independent literatures. These results indicate that HTINet can be used to integrate heterogeneous information to predict novel herb-target interactions
Multi-omics integration accurately predicts cellular state in unexplored conditions for Escherichia coli.
A significant obstacle in training predictive cell models is the lack of integrated data sources. We develop semi-supervised normalization pipelines and perform experimental characterization (growth, transcriptional, proteome) to create Ecomics, a consistent, quality-controlled multi-omics compendium for Escherichia coli with cohesive meta-data information. We then use this resource to train a multi-scale model that integrates four omics layers to predict genome-wide concentrations and growth dynamics. The genetic and environmental ontology reconstructed from the omics data is substantially different and complementary to the genetic and chemical ontologies. The integration of different layers confers an incremental increase in the prediction performance, as does the information about the known gene regulatory and protein-protein interactions. The predictive performance of the model ranges from 0.54 to 0.87 for the various omics layers, which far exceeds various baselines. This work provides an integrative framework of omics-driven predictive modelling that is broadly applicable to guide biological discovery
- …