9 research outputs found

    Natalie 2.0: Sparse Global Network Alignment as a Special Case of Quadratic Assignment

    Get PDF
    International audienceData on molecular interactions is increasing at a tremendous pace, while the development of solid methods for analyzing this network data is still lagging behind. This holds in particular for the field of comparative network analysis, where one wants to identify commonalities between biological networks. Since biological functionality primarily operates at the network level, there is a clear need for topology-aware comparison methods. We present a method for global network alignment that is fast and robust and can flexibly deal with various scoring schemes taking both node-to-node correspondences as well as network topologies into account. We exploit that network alignment is a special case of the well-studied quadratic assignment problem (QAP). We focus on sparse network alignment, where each node can be mapped only to a typically small subset of nodes in the other network. This corresponds to a QAP instance with a symmetric and sparse weight matrix. We obtain strong upper and lower bounds for the problem by improving a Lagrangian relaxation approach and introduce the open source software tool Natalie 2.0, a publicly available implementation of our method. In an extensive computational study on protein interaction networks for six different species, we find that our new method outperforms alternative established and recent state-of-the-art methods

    Data-driven network alignment

    Full text link
    Biological network alignment (NA) aims to find a node mapping between species' molecular networks that uncovers similar network regions, thus allowing for transfer of functional knowledge between the aligned nodes. However, current NA methods do not end up aligning functionally related nodes. A likely reason is that they assume it is topologically similar nodes that are functionally related. However, we show that this assumption does not hold well. So, a paradigm shift is needed with how the NA problem is approached. We redefine NA as a data-driven framework, TARA (daTA-dRiven network Alignment), which attempts to learn the relationship between topological relatedness and functional relatedness without assuming that topological relatedness corresponds to topological similarity, like traditional NA methods do. TARA trains a classifier to predict whether two nodes from different networks are functionally related based on their network topological patterns. We find that TARA is able to make accurate predictions. TARA then takes each pair of nodes that are predicted as related to be part of an alignment. Like traditional NA methods, TARA uses this alignment for the across-species transfer of functional knowledge. Clearly, TARA as currently implemented uses topological but not protein sequence information for this task. We find that TARA outperforms existing state-of-the-art NA methods that also use topological information, WAVE and SANA, and even outperforms or complements a state-of-the-art NA method that uses both topological and sequence information, PrimAlign. Hence, adding sequence information to TARA, which is our future work, is likely to further improve its performance

    A Family of Tractable Graph Distances

    Full text link
    Important data mining problems such as nearest-neighbor search and clustering admit theoretical guarantees when restricted to objects embedded in a metric space. Graphs are ubiquitous, and clustering and classification over graphs arise in diverse areas, including, e.g., image processing and social networks. Unfortunately, popular distance scores used in these applications, that scale over large graphs, are not metrics and thus come with no guarantees. Classic graph distances such as, e.g., the chemical and the CKS distance are arguably natural and intuitive, and are indeed also metrics, but they are intractable: as such, their computation does not scale to large graphs. We define a broad family of graph distances, that includes both the chemical and the CKS distance, and prove that these are all metrics. Crucially, we show that our family includes metrics that are tractable. Moreover, we extend these distances by incorporating auxiliary node attributes, which is important in practice, while maintaining both the metric property and tractability.Comment: Extended version of paper appearing in SDM 201

    Applying graph matching techniques to enhance reuse of plant design information

    Full text link
    This article investigates how graph matching can be applied to process plant design data in order to support the reuse of previous designs. A literature review of existing graph matching algorithms is performed, and a group of algorithms is chosen for further testing. A use case from early phase plant design is presented. A methodology for addressing the use case is proposed, including graph simplification algorithms and node similarity measures, so that existing graph matching algorithms can be applied in the process plant domain. The proposed methodology is evaluated empirically on an industrial case consisting of design data from several pulp and paper plants

    Assessing the robustness of genetic codes and genomes

    Full text link
    Deux approches principales existent pour évaluer la robustesse des codes génétiques et des séquences de codage. L'approche statistique est basée sur des estimations empiriques de probabilité calculées à partir d'échantillons aléatoires de permutations représentant les affectations d'acides aminés aux codons, alors que l'approche basée sur l'optimisation repose sur le pourcentage d’optimisation, généralement calculé en utilisant des métaheuristiques. Nous proposons une méthode basée sur les deux premiers moments de la distribution des valeurs de robustesse pour tous les codes génétiques possibles. En se basant sur une instance polynomiale du Problème d'Affectation Quadratique, nous proposons un algorithme vorace exact pour trouver la valeur minimale de la robustesse génomique. Pour réduire le nombre d'opérations de calcul des scores et de la borne supérieure de Cantelli, nous avons développé des méthodes basées sur la structure de voisinage du code génétique et sur la comparaison par paires des codes génétiques, entre autres. Pour calculer la robustesse des codes génétiques naturels et des génomes procaryotes, nous avons choisi 23 codes génétiques naturels, 235 propriétés d'acides aminés, ainsi que 324 procaryotes thermophiles et 418 procaryotes non thermophiles. Parmi nos résultats, nous avons constaté que bien que le code génétique standard soit plus robuste que la plupart des codes génétiques, certains codes génétiques mitochondriaux et nucléaires sont plus robustes que le code standard aux troisièmes et premières positions des codons, respectivement. Nous avons observé que l'utilisation des codons synonymes tend à être fortement optimisée pour amortir l'impact des changements d'une seule base, principalement chez les procaryotes thermophiles.There are two main approaches to assess the robustness of genetic codes and coding sequences. The statistical approach is based on empirical estimates of probabilities computed from random samples of permutations representing assignments of amino acids to codons, whereas, the optimization-based approach relies on the optimization percentage frequently computed by using metaheuristics. We propose a method based on the first two moments of the distribution of robustness values for all possible genetic codes. Based on a polynomially solvable instance of the Quadratic Assignment Problem, we propose also an exact greedy algorithm to find the minimum value of the genome robustness. To reduce the number of operations for computing the scores and Cantelli’s upper bound, we developed methods based on the genetic code neighborhood structure and pairwise comparisons between genetic codes, among others. For assessing the robustness of natural genetic codes and genomes, we have chosen 23 natural genetic codes, 235 amino acid properties, as well as 324 thermophilic and 418 non-thermophilic prokaryotes. Among our results, we found that although the standard genetic code is more robust than most genetic codes, some mitochondrial and nuclear genetic codes are more robust than the standard code at the third and first codon positions, respectively. We also observed that the synonymous codon usage tends to be highly optimized to buffer the impact of single-base changes, mainly, in thermophilic prokaryotes
    corecore