Finding motif pairs in the interactions between heterogeneous proteins via bootstrapping and boosting
Background: Supervised learning and many stochastic methods for predicting protein-protein interactions require both negative and positive interactions in the training data set. Unlike positive interactions, negative interactions cannot be readily obtained from interaction data, so they must be generated. In protein-protein interactions and other molecular interactions as well, taking all non-positive interactions as negative interactions produces too many negative interactions relative to the positive interactions. Random selection from non-positive interactions is unsuitable, since the selected data may not reflect the original distribution of data.
Results: We developed a bootstrapping algorithm for generating a negative data set of arbitrary size from protein-protein interaction data. We also developed an efficient boosting algorithm for finding interacting motif pairs in human and virus proteins. The boosting algorithm showed the best performance (84.4% sensitivity and 75.9% specificity) with balanced positive and negative data sets. The boosting algorithm was also used to find potential motif pairs in complexes of human and virus proteins, for which structural data was not used to train the algorithm. Interacting motif pairs common to multiple folds of structural data for the complexes were shown to be statistically significant. The data set for interactions between human and virus proteins was extracted from BOND and is available at http://virus.hpid.org/interactions.aspx. The complexes of human and virus proteins were extracted from PDB and their identifiers are available at http://virus.hpid.org/PDB_IDs.html.
Conclusion: When the positive and negative training data sets are unbalanced, the result of the prediction model tends to be biased. Bootstrapping is effective for generating a negative data set whose size and distribution are easily controlled. Our boosting algorithm, trained with the balanced data sets generated via the bootstrapping method, could efficiently predict interacting motif pairs from protein interaction and sequence data.
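To make the reported performance figures concrete, the following is a minimal sketch of how sensitivity and specificity are computed from binary labels. The function name and the toy labels are illustrative assumptions and are not taken from the paper; the paper's own evaluation code is not shown in the abstract.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = interacting, 0 = not).

    Hypothetical helper for illustration only.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")

    sensitivity = tp / (tp + fn)  # fraction of true interactions recovered
    specificity = tn / (tn + fp)  # fraction of non-interactions correctly rejected
    return sensitivity, specificity


if __name__ == "__main__":
    # Toy labels, invented for this example.
    truth = [1, 1, 1, 0, 0]
    predicted = [1, 1, 0, 0, 1]
    print(sensitivity_specificity(truth, predicted))
```

Both measures depend on having negative examples at all, which is why the construction of the negative data set matters for the reported numbers.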
Analysis of the understudied parts of the phospho-signalome using machine learning methods
Borgthor Petursson
In order to make decisions and respond appropriately to external stimuli, cells rely on an intricate signalling system. One of the most important and best studied components of this signalling system is the phospho-signalling network. Phosphorylation relays information by adding phosphoryl groups onto substrates such as lipids or proteins, which in turn leads to changes in substrate function. Crucial components of this system include kinases, which phosphorylate the substrate molecule, and phosphatases, which remove the phosphoryl group from the substrate.
To date, even though >100K phosphoproteins have been identified through high throughput experiments, the vast majority of phosphosites are of unknown function, while over a third of kinases have no known substrate (Needham et al., 2019). Furthermore, there is a large study bias in our current knowledge, demonstrated by a disproportionate number of interactions between highly cited kinases and substrates (Invergo and Beltrao, 2018). The vast understudied signalling space, combined with this study bias, makes it difficult to understand the general principles underpinning cell signalling regulation and stresses the need to research the phosphoproteomic signalling system in an unbiased manner.
In this thesis the central aim is to use data-driven and unbiased approaches to study the human phosphoproteomic signalling network. The first chapter describes a project where I co-developed a machine learning model to predict signed kinase-kinase regulatory circuits based on kinase specificities and high throughput phosphoproteomics and transcriptomic data. The network was validated using independent high throughput data and used to identify novel kinase-kinase regulatory interactions. This project was done in collaboration with Brandon Invergo, a postdoc in Pedro Beltrao’s research group.
In the second chapter I expand upon the work done in the first chapter. I used various predictors, such as co-expression, kinase specificities and variables characterising potential kinase-substrate target phosphosites, to predict kinase-substrate relationships and their signs. I then used independent experimental kinase-substrate predictions to validate the predictions and identify high-confidence kinase-substrate relationships, and combined the kinase-substrate predictions with the kinase-kinase regulatory circuits to identify condition-specific signalling networks. To make my method, networks and analyses of phosphoproteomics data easy to use for non-expert users, I also developed the SELPHI2 server, where users can extract biological insight from their datasets. SELPHI2 presents a substantial improvement upon the SELPHI server, which was developed in 2015 by my supervisor, Evangelia Petsalaki.
The third chapter studies the architecture of human cell signalling networks at a whole-cell level. To address the limited predictive power of current models of cell signalling, such as the pathways found in KEGG (Kanehisa, 2019), Reactome (Jassal et al., 2020) and WikiPathways (Slenter et al., 2018), it aims to identify signalling modules from phosphoproteomic data. These data-extracted modules were found to have greater predictive power for independent data sets in terms of the number of significant enrichments. Furthermore, we sought to predict the probability of module co-membership from predictors such as membership within data-driven modules, co-phosphorylation and co-expression.
In summary, the work presented here seeks to explore the understudied phospho-signalling systems through system-wide prediction of kinase-substrate regulation and the identification of phospho-signalling modules through data-driven means.
Algorithms for the analysis of protein interaction networks
Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. By Rohit Singh. Cataloged from PDF version of thesis. Includes bibliographical references (p. 107-117).
In the decade since the human genome project, a major research trend in biology has been towards understanding the cell as a system. This interest has stemmed partly from a deeper appreciation of how important it is to understand the emergent properties of cellular systems (e.g., they seem to be the key to understanding diseases like cancer). It has also been enabled by new high-throughput techniques that have allowed us to collect new types of data at the whole-genome scale. We focus on one sub-domain of systems biology: the understanding of protein interactions. Such understanding is valuable: interactions between proteins are fundamental to many cellular processes. Over the last decade, high-throughput experimental techniques have allowed us to collect a large amount of protein-protein interaction (PPI) data for many species. A popular abstraction for representing this data is the protein interaction network: each node of the network represents a protein and an edge between two nodes represents a physical interaction between the two corresponding proteins. This abstraction has proven to be a powerful tool for understanding the systems aspects of protein interaction. We present some algorithms for the augmentation, cleanup and analysis of such protein interaction networks:
1. In many species, the coverage of known PPI data remains partial. Given two protein sequences, we describe an algorithm to predict if two proteins physically interact, using logistic regression and insights from structural biology. We also describe how our predictions may be further improved by combining with functional-genomic data.
2. We study systematic false positives in a popular experimental protocol, the Yeast 2-Hybrid method. Here, some "promiscuous" proteins may lead to many false positives. We describe a Bayesian approach to modeling and adjusting for this error.
3. Comparative analysis of PPI networks across species can provide valuable insights. We describe IsoRank, an algorithm for global network alignment of multiple PPI networks. The algorithm first constructs an eigenvalue problem that encapsulates the network and sequence similarity constraints. The solution of the problem describes a k-partite graph that is further processed to find the alignment.
4. For a given signaling network, we describe an algorithm that combines RNA-interference data with PPI data to produce hypotheses about the structure of the signaling network. Our algorithm constructs a multi-commodity flow problem that expresses the constraints described by the data and finds a sparse solution to it.
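The eigenvalue problem mentioned for IsoRank can be illustrated for the two-network case with a short power-iteration sketch. It assumes the commonly described pairwise formulation, in which the score of a node pair (i, j) is the sum over neighbouring pairs (u, v) of their score divided by deg(u)·deg(v), blended with a normalised sequence-similarity matrix. The function name, the blending weight `alpha=0.6`, the iteration limits and the toy graphs are illustrative assumptions, and the later step that extracts an alignment from the scores is omitted.

```python
import numpy as np


def isorank_scores(adj1, adj2, seq_sim, alpha=0.6, max_iter=200, tol=1e-10):
    """Pairwise similarity scores R (n1 x n2) via power iteration.

    adj1, adj2: symmetric 0/1 adjacency matrices of the two networks.
    seq_sim:    non-negative (n1 x n2) sequence-similarity matrix with a
                positive sum (assumed input, e.g. alignment scores).
    """
    adj1 = np.asarray(adj1, dtype=float)
    adj2 = np.asarray(adj2, dtype=float)
    e = np.asarray(seq_sim, dtype=float)
    e = e / e.sum()  # normalise the sequence term so it sums to 1

    deg1 = adj1.sum(axis=1)
    deg2 = adj2.sum(axis=1)
    # Inverse degrees; isolated nodes contribute nothing to their neighbours.
    inv1 = np.divide(1.0, deg1, out=np.zeros_like(deg1), where=deg1 > 0)
    inv2 = np.divide(1.0, deg2, out=np.zeros_like(deg2), where=deg2 > 0)
    weights = np.outer(inv1, inv2)  # 1 / (deg(u) * deg(v)) for every pair (u, v)

    r = np.full(e.shape, 1.0 / e.size)  # uniform starting vector
    for _ in range(max_iter):
        # R_ij <- sum over u in N(i), v in N(j) of R_uv / (deg(u) * deg(v))
        network_term = adj1 @ (r * weights) @ adj2.T
        new_r = alpha * network_term + (1.0 - alpha) * e
        new_r /= new_r.sum()
        converged = np.abs(new_r - r).sum() < tol
        r = new_r
        if converged:
            break
    return r


if __name__ == "__main__":
    # Two identical 3-node path graphs (0-1-2) with uninformative sequence scores.
    path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
    scores = isorank_scores(path, path, np.ones((3, 3)))
    print(scores.round(3))
```

In this toy case the pair of central nodes receives the highest score, reflecting the intuition that two nodes are similar when their neighbours are similar.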
Latent Representation and Sampling in Network: Application in Text Mining and Biology.
In classical machine learning, hand-designed features are used for learning a mapping from raw data. However, human involvement in feature design makes the process expensive. Representation learning aims to learn abstract features directly from data without direct human involvement. Raw data can take various forms. A network is one form of data that encodes relational structure in many real-world domains. Therefore, learning abstract features for network units is an important task. In this dissertation, we propose models for incorporating temporal information given as a collection of networks from subsequent time-stamps. The primary objective of our models is to learn a better abstract feature representation of nodes and edges in an evolving network. We show that the temporal information in the abstract features substantially improves the performance of the link prediction task. Besides applying our models to network data, we also employ them to incorporate extra-sentential information in the text domain for learning better representations of sentences. We build a context network of sentences to capture extra-sentential information. This information in the abstract feature representation of sentences substantially improves various text-mining tasks over a set of baseline methods.
A problem with the abstract features that we learn is that they lack interpretability. In real-life applications on network data, for some tasks, it is crucial to learn interpretable features in the form of graphical structures. For this we need to mine important graphical structures along with their frequency statistics from the input dataset. However, exact algorithms for these tasks are computationally expensive, so scalable algorithms are urgently needed. To overcome this challenge, we provide efficient sampling algorithms for mining higher-order structures from networks. We show that our sampling-based algorithms are scalable. They are also superior to a set of baseline algorithms in terms of retrieving important graphical sub-structures and collecting their frequency statistics. Finally, we show that we can use these frequent subgraph statistics and structures as features in various real-life applications. We show one application in biology and another in security. In both cases, we show that the structures and their statistics significantly improve the performance of knowledge discovery tasks in these domains.
Accurate Prediction Methods on Biomolecular Data
With the recent advancements in sequencing technologies, molecular biologists are producing ever-increasing amounts of biomolecular data. Extracting useful information from these massive data sets requires efficient and effective data mining and machine learning methods. In this dissertation, we explore the use of supervised machine learning (ML) to solve some challenging classification problems in molecular biology. First, we devise an ML model for classifying cancer types from very sparse somatic point mutation data. Accumulation of mutations and epigenetic modifications in somatic cells results in various cancers. For this purpose, we propose a method called mClass for efficient feature (gene) ranking that uses clustering, normalized mutual information and logistic regression. We show that somatic mutation data has sufficient discriminative power for cancer type classification. Next, we address the problem of gene essentiality prediction in microbes. Essential genes are important to identify since their function is vital for the survival of the organism. Our proposed deep learning architecture, called DeeplyEssential, exclusively uses features extracted from the primary sequence of genes and their corresponding proteins, to maximize the utility and practicality of the tool. DeeplyEssential achieved state-of-the-art performance over previously proposed methods, and we also expose and study a hidden performance bias that affected previous models. Finally, we consider the problem of predicting the enhancer regions in the human genome from chromatin data. Enhancers contribute to the transcription of target genes. We propose a convolutional neural network framework named Epi2En that takes advantage of epigenetic ChIP-seq data. Epi2En's classification performance is very strong not only in cross-validation experiments but also when testing across different cell lines.
Computational Methods for the Analysis of Genomic Data and Biological Processes
In recent decades, new technologies have made remarkable progress in helping to understand biological systems. Rapid advances in genomic profiling techniques such as microarrays or high-performance sequencing have brought new opportunities and challenges in the fields of computational biology and bioinformatics. Such genetic sequencing techniques allow large amounts of data to be produced, whose analysis and cross-integration could provide a complete view of organisms. As a result, it is necessary to develop new techniques and algorithms that analyse these data reliably and efficiently. This Special Issue collected the latest advances in the field of computational methods for the analysis of gene expression data, and, in particular, the modeling of biological processes. Here we present eleven works selected to be published in this Special Issue due to their interest, quality, and originality.