97 research outputs found

    Fair Evaluation of Global Network Aligners

    Biological network alignment identifies topologically and functionally conserved regions between networks of different species. It encompasses two algorithmic steps: a node cost function (NCF), which measures similarities between nodes in different networks, and an alignment strategy (AS), which uses these similarities to rapidly identify high-scoring alignments. Different methods use both different NCFs and different ASs, so it is unclear whether the superiority of a method comes from its NCF, its AS, or both. We previously showed, for MI-GRAAL and IsoRankN, that combining the NCF of one method with the AS of another can yield a new, superior method. Here, we evaluate MI-GRAAL against the newer GHOST to potentially further improve alignment quality. We also address several important questions that have not yet been studied systematically. First, how much of the node similarity information in the NCF should come from sequence data as opposed to topology data? Existing methods decide this more or less arbitrarily, which could affect the resulting alignment(s). Second, when topology is used in the NCF, how large should the neighborhoods of the compared nodes be? Existing methods assume that larger neighborhood sizes are better. We find that MI-GRAAL's NCF is superior to GHOST's NCF, while the performance of the methods' ASs is data-dependent; thus, combining MI-GRAAL's NCF with GHOST's AS could yield a new, superior method for certain data. Also, the amount of sequence information used within the NCF does not affect alignment quality, whereas including topological information is crucial. Finally, larger neighborhood sizes are preferred, but it is often the second-largest size that is superior, and using this size would decrease computational complexity. Together, our results give several general recommendations for a fair evaluation of network alignment methods.
    Comment: 19 pages, 10 figures. Presented at the 2014 ISMB Conference, July 13-15, Boston, MA.
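
    As a concrete illustration of the NCF step, the topology-versus-sequence balance the abstract asks about is typically a convex combination with a weight parameter. The sketch below is a minimal, hypothetical rendering of that idea in Python; the names and the parameter alpha are illustrative and are not MI-GRAAL's or GHOST's actual interfaces.

        # Minimal sketch of an NCF that blends topological and sequence
        # node similarity via a convex combination; alpha controls how
        # much of the similarity information comes from topology.
        def ncf_score(topo_sim, seq_sim, alpha):
            """Combined node similarity; alpha in [0, 1]."""
            return alpha * topo_sim + (1.0 - alpha) * seq_sim

        def similarity_matrix(nodes1, nodes2, topo, seq, alpha=0.8):
            """Pairwise NCF scores that an alignment strategy (AS) would
            consume. topo[(u, v)] and seq[(u, v)] hold precomputed
            topological and sequence similarities in [0, 1]."""
            return {(u, v): ncf_score(topo[(u, v)], seq[(u, v)], alpha)
                    for u in nodes1 for v in nodes2}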

    Experimental Evaluation of Subgraph Isomorphism Solvers

    Subgraph Isomorphism (SI) is an NP-complete problem at the heart of many structural pattern recognition tasks, as it involves finding a copy of a pattern graph in a target graph. In the pattern recognition community, the best-known SI solvers are VF2, VF3, and RI. SI is also widely studied in the constraint programming community, and many constraint-based SI solvers have been proposed since Ullmann's, such as LAD and Glasgow. All these SI solvers can solve some large SI instances, involving graphs with thousands of nodes, very quickly. However, McCreesh et al. have recently shown how to randomly generate SI instances whose hardness can be controlled and predicted, and they have built small instances that are computationally challenging for all solvers. They have also shown that some small instances, which are predicted to be easy and are easily solved by constraint-based solvers, turn out to be challenging for VF2 and VF3. In this paper, we widen this study by considering a large test suite drawn from eight benchmarks. We show that, as expected for an NP-complete problem, the solving time of an instance does not depend on its size, and that some small instances coming from real applications are not solved by any of the considered solvers. We also show that, while RI and VF3 can very quickly solve a large number of easy instances for which Glasgow or LAD need more time, they fail to solve some other instances that are quickly solved by Glasgow or LAD, and they are clearly outperformed by Glasgow on hard instances. Finally, we show that we can easily combine solvers to take advantage of their complementarity.
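
    The closing point about combining solvers is essentially an algorithm portfolio: run complementary solvers on the same instance and keep the first answer. A minimal sketch, assuming each solver is an external executable; the command lines below are placeholders, not the real interfaces of VF3, RI, Glasgow, or LAD.

        # Portfolio sketch: launch several SI solvers in parallel threads
        # and return the output of whichever finishes first.
        import concurrent.futures
        import subprocess

        SOLVERS = {
            "solverA": ["./solverA", "pattern.grf", "target.grf"],  # placeholder
            "solverB": ["./solverB", "pattern.grf", "target.grf"],  # placeholder
        }

        def run_solver(name, cmd):
            result = subprocess.run(cmd, capture_output=True, text=True)
            return name, result.stdout

        def first_answer():
            pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(SOLVERS))
            futures = [pool.submit(run_solver, n, c) for n, c in SOLVERS.items()]
            name, output = next(concurrent.futures.as_completed(futures)).result()
            # A production portfolio would also kill the slower solver
            # processes here; this sketch lets them run to completion.
            pool.shutdown(wait=False)
            return name, output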

    ClouDiA: a deployment advisor for public clouds

    An increasing number of distributed data-driven applications are moving into shared public clouds. By sharing resources and operating at scale, public clouds promise higher utilization and lower costs than private clusters. To achieve high utilization, however, cloud providers inevitably allocate virtual machine instances non-contiguously, i.e., instances of a given application may end up in physically distant machines in the cloud. This allocation strategy can lead to large differences in average latency between instances. For a large class of applications, this difference can result in significant performance degradation, unless care is taken in how application components are mapped to instances. In this paper, we propose ClouDiA, a general deployment advisor that selects application node deployments minimizing either (i) the largest latency between application nodes, or (ii) the longest critical path among all application nodes. ClouDiA employs mixed-integer programming and constraint programming techniques to efficiently search the space of possible mappings of application nodes to instances. Through experiments with synthetic and real applications in Amazon EC2, we show that our techniques yield a 15% to 55% reduction in time-to-solution or service response time, without any need for modifying application code.
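
    To make objective (i) concrete, the sketch below brute-forces the deployment that minimizes the largest latency over communicating node pairs. It only illustrates the objective; ClouDiA itself searches this space with mixed-integer and constraint programming rather than enumeration, and all names here are hypothetical.

        # Toy search for objective (i): minimize the largest latency
        # between application nodes that communicate. latency[(i, j)]
        # is assumed to be defined for both orderings of i and j.
        from itertools import permutations

        def best_deployment(app_edges, instances, latency):
            """app_edges: pairs of app nodes that communicate;
            instances: candidate cloud instances (len >= #nodes)."""
            nodes = sorted({n for edge in app_edges for n in edge})
            best, best_cost = None, float("inf")
            for placement in permutations(instances, len(nodes)):
                where = dict(zip(nodes, placement))
                cost = max(latency[(where[u], where[v])] for u, v in app_edges)
                if cost < best_cost:
                    best, best_cost = where, cost
            return best, best_cost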

    A Parallel, Backjumping Subgraph Isomorphism Algorithm Using Supplemental Graphs

    This registry entry contains a reference to the code, data, and experimental scripts needed to reproduce the subgraph isomorphism paper: Ciaran McCreesh and Patrick Prosser, "A Parallel, Backjumping Subgraph Isomorphism Algorithm Using Supplemental Graphs". To appear at the 21st International Conference on Principles and Practice of Constraint Programming (CP 2015).

    Towards Structural Classification of Proteins based on Contact Map Overlap

    A multitude of measures have been proposed to quantify the similarity between protein 3-D structures. Among these measures, contact map overlap (CMO) maximization has received sustained attention over the past decade because it offers a fine estimation of the natural homology relation between proteins. Despite this large involvement of the bioinformatics and computer science communities, the performance of known algorithms remains modest: due to the complexity of the problem, they stall on relatively small instances and are not applicable to large-scale comparison. This paper offers a clear improvement over past methods in this respect. We present a new integer programming model for CMO and propose an exact branch-and-bound (B&B) algorithm with bounds computed by solving a Lagrangian relaxation. The efficiency of the approach is demonstrated on a popular small benchmark (the Skolnick set, 40 domains). On this set our algorithm significantly outperforms the best existing exact algorithms, while providing lower and upper bounds of better quality. Some hard CMO instances have been solved for the first time, within reasonable time limits. From the running times and the relative gaps (the relative difference between upper and lower bounds), we obtained the correct classification for this test. These encouraging results led us to design a harder benchmark to better assess the classification capability of our approach. We constructed a large-scale set of 300 protein domains (a subset of the ASTRAL database) that we call Proteus 300. Using the relative gap of each of the 44,850 pairs as a similarity measure, we obtained a classification in very good agreement with SCOP. Our algorithm thus provides a powerful classification tool for large structure databases.
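
    For orientation, the CMO objective itself is simple to state: given the contact maps (sets of residue pairs in spatial contact) of two proteins and an alignment between their residues, count the contacts of one protein whose images under the alignment are contacts of the other. The sketch below scores a given alignment; the hard part, maximizing this score over order-preserving alignments, is what the paper's B&B algorithm with Lagrangian bounds addresses. Names are illustrative.

        # CMO score of a fixed alignment: count contacts of protein 1
        # that are mapped onto contacts of protein 2.
        def cmo_score(contacts1, contacts2, alignment):
            """contacts1, contacts2: sets of frozenset residue pairs;
            alignment: dict from residues of protein 1 to protein 2."""
            score = 0
            for pair in contacts1:
                i, j = tuple(pair)
                if i in alignment and j in alignment:
                    if frozenset((alignment[i], alignment[j])) in contacts2:
                        score += 1
            return score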

    Querying Graph Databases

    Real-life data can often be modeled as graphs, in which nodes represent objects and edges indicate their relationships. Large graph datasets are common in many emerging applications. To fully exploit the wealth of information encoded in graphs, systems for managing and analyzing graph data are critical. To address the need for complex analysis of graph data, this thesis presents a graph querying toolkit, called Periscope/GQ. This toolkit is built on top of a commodity RDBMS. It provides a uniform schema for storing graphs and supports various graph query operations. Users can easily combine several operations to perform complex analysis on graphs. The key feature of Periscope/GQ is its support for sophisticated graph query operations beyond simple ones like node/edge selection and path search. In particular, this thesis focuses on two classes of sophisticated queries: graph matching and graph summarization. The database community has largely focused on exact graph matching problems. However, due to the noisy and incomplete nature of real graph datasets, approximate rather than exact graph matching is required. This thesis presents a novel approximate graph matching technique, called SAGA. SAGA employs a flexible graph similarity model and utilizes an index-based matching algorithm to efficiently evaluate matching queries. SAGA is effective and efficient for small query graphs (with tens of nodes and edges), but is expensive when applied to large query graphs (with hundreds to thousands of nodes and edges). To handle large query graphs, TALE is proposed. TALE employs a novel indexing technique, which achieves high pruning power and scales linearly with database size. The matching algorithm uses the index to first match the important nodes in the query, and then extends them to produce large graph matches. Graph summarization techniques are useful for understanding the underlying characteristics of graphs. To summarize large graphs, this thesis introduces an aggregation method. This method produces summary graphs by grouping nodes based on user-selected node attributes and relationships. It further allows users to control the resolutions of summaries, and provides "drill-down" and "roll-up" abilities to navigate through summaries at different resolutions.
    Ph.D. thesis, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/61640/1/ytian_1.pd
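
    The aggregation method for summarization lends itself to a short sketch: group nodes by a user-selected attribute, collapse each group into a super-node, and count the edges between groups. The code below is a minimal, hypothetical rendering of that roll-up idea, not Periscope/GQ's actual schema or API.

        # Attribute-based graph summarization ("roll-up"): nodes with the
        # same attribute value become one super-node; super-edge weights
        # count the original edges between the groups.
        from collections import Counter

        def summarize(node_attrs, edges, attribute):
            """node_attrs: {node: {name: value}}; edges: iterable of pairs."""
            group = {n: attrs[attribute] for n, attrs in node_attrs.items()}
            return Counter(
                tuple(sorted((group[u], group[v]))) for u, v in edges
            )  # {(group_a, group_b): edge count}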

    Certifying Correctness for Combinatorial Algorithms by Using Pseudo-Boolean Reasoning

    Over the last decades, dramatic improvements in combinatorial optimisation algorithms have significantly impacted artificial intelligence, operations research, and other areas. These advances, however, are achieved through highly sophisticated algorithms that are difficult to verify and prone to implementation errors that can cause incorrect results. A promising approach to detecting wrong results is to use certifying algorithms that produce not only the desired output but also a certificate or proof of correctness of the output. An external tool can then verify the proof to determine that the given answer is valid. In the Boolean satisfiability (SAT) community, this concept is well established in the form of proof logging, which has become the standard solution for generating trustworthy outputs. The problem is that there are still some SAT solving techniques for which proof logging is challenging and not yet used in practice. Additionally, there are many formalisms more expressive than SAT, such as constraint programming, various graph problems, and maximum satisfiability (MaxSAT), for which efficient proof logging is out of reach for state-of-the-art techniques. This work develops a new proof system building on the cutting planes proof system and operating on pseudo-Boolean constraints (0-1 linear inequalities). We explain how such machine-verifiable proofs can be created for various problems, including parity reasoning, symmetry and dominance breaking, constraint programming, subgraph isomorphism and maximum common subgraph problems, and pseudo-Boolean problems. We implement and evaluate the resulting algorithms and a verifier for the proof format, demonstrating that the approach is practical for a wide range of problems. We are optimistic that the proposed proof system is suitable for designing certifying variants of algorithms in pseudo-Boolean optimisation, MaxSAT, and beyond.
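
    Two core cutting-planes rules over pseudo-Boolean constraints are easy to state: adding two constraints, and dividing a constraint by a positive integer with rounding up, which is sound because every literal takes value 0 or 1. The toy sketch below shows just this algebra, with a constraint represented as a coefficient map plus a degree; it is not the thesis's proof format or verifier, and it assumes the normalized form in which all coefficients are nonnegative.

        # Toy cutting-planes steps on constraints sum(a_i * l_i) >= d,
        # represented as ({literal: coeff}, degree) with coeffs >= 0.
        import math

        def add(c1, c2):
            """Addition of two constraints (literal cancellation such as
            x + ~x = 1 is omitted for brevity)."""
            (a1, d1), (a2, d2) = c1, c2
            coeffs = dict(a1)
            for lit, a in a2.items():
                coeffs[lit] = coeffs.get(lit, 0) + a
            return coeffs, d1 + d2

        def divide(c, k):
            """Division by k > 0 with ceiling; sound for 0-1 literals."""
            coeffs, degree = c
            return ({lit: math.ceil(a / k) for lit, a in coeffs.items()},
                    math.ceil(degree / k))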