Confidence in protein interaction networks

Abstract

Protein interaction networks are a commonly used tool in bioinformatics, e.g. for the purposes of gene function prediction or drug target identification. They are built from often heterogeneous and error-prone protein-protein interaction data. In this thesis we study the effects of data uncertainty on the structure of protein interaction networks and on downstream network analysis. Some databases provide confidence scores for protein-protein interactions, and networks are built from the data after a minimum score cut-off, or threshold, is applied. We study the effects of threshold choice on network structure. We argue that robust, biologically-relevant network analysis results should be replicated across networks obtained at different thresholds, and develop a methodology for quantifying this robustness in the context of node metrics. Our results indicate that the same node metrics are robust across a range of protein interaction networks, but are not necessarily robust in synthetic networks. We further investigate uncertain networks as a possible approach to incorporating confidence scores explicitly into network analysis. Uncertain networks are a way of conceptualising the difference between the "true" network of biologically-relevant protein-protein interactions and the observed scored data. We show that any inference on the structure of the "true" network is strongly influenced by assumptions made about the dependence - or lack thereof - between edges in the scored network. Finally, we focus on networks constructed from gene co-expression data. Gene co-expression can be measured in a number of different ways. Moreover, when networks are constructed, different thresholds can be applied to the co-expression values. It is not always clear which network construction method should be preferred. We develop a software package, COGENT, designed to aid network construction choice without the need for external validation data.
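
As an illustration of the thresholding idea described above, the following minimal sketch builds networks from scored interactions at several cut-offs and checks how consistently node degree ranks the nodes across them. It is not the methodology developed in the thesis and does not use COGENT: the toy data in `SCORED_EDGES`, the helper names, the choice of degree as the node metric, and the pairwise rank-agreement measure are all assumptions made for the example.

```python
from itertools import combinations

# Hypothetical scored interactions: (protein_a, protein_b, confidence in [0, 1]).
SCORED_EDGES = [
    ("A", "B", 0.95),
    ("A", "C", 0.90),
    ("A", "D", 0.70),
    ("B", "C", 0.85),
    ("C", "D", 0.40),
    ("D", "E", 0.30),
]


def threshold_network(scored_edges, threshold):
    """Keep only edges whose score is at least `threshold`."""
    return [(u, v) for u, v, score in scored_edges if score >= threshold]


def degree(edges, nodes):
    """Degree of every node in `nodes` (0 for nodes left isolated)."""
    counts = {node: 0 for node in nodes}
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


def rank_agreement(metric_a, metric_b):
    """Fraction of node pairs ordered the same way by both metrics.

    Pairs tied in either metric are skipped; returns None if no pair is left.
    """
    concordant = comparable = 0
    for x, y in combinations(sorted(metric_a), 2):
        diff_a = metric_a[x] - metric_a[y]
        diff_b = metric_b[x] - metric_b[y]
        if diff_a == 0 or diff_b == 0:
            continue
        comparable += 1
        if diff_a * diff_b > 0:
            concordant += 1
    return concordant / comparable if comparable else None


# Use the full node set at every threshold so the metrics stay comparable.
nodes = sorted({n for u, v, _ in SCORED_EDGES for n in (u, v)})
degrees = {
    t: degree(threshold_network(SCORED_EDGES, t), nodes) for t in (0.3, 0.6, 0.8)
}

for t_low, t_high in combinations(sorted(degrees), 2):
    print(t_low, t_high, rank_agreement(degrees[t_low], degrees[t_high]))
```

A value near 1 for a pair of thresholds would suggest that the node ranking is largely insensitive to the cut-off in this toy setting; a real analysis would use actual database scores and a metric of interest in place of the assumed ones.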
