A basic assumption of molecular biology is that proteins sharing close
three-dimensional (3D) structures are likely to share a common function and in
most cases derive from the same ancestor. Computing the similarity between two
protein structures is therefore a crucial task and has been extensively
investigated. Evaluating the similarity of two proteins can be done by finding
an optimal one-to-one matching between their components, which is equivalent to
identifying a maximum weighted clique in a specific "alignment graph". In this
paper we present a new integer programming formulation for solving such clique
problems. The model has been implemented using the ILOG CPLEX Callable Library.
In addition, we designed a dedicated branch and bound algorithm for solving the
maximum cardinality clique problem. Both approaches have been integrated in
VAST (Vector Alignment Search Tool), a software tool for aligning protein 3D
structures that is widely used at NCBI (National Center for Biotechnology
Information). The original VAST clique solver uses the well-known Bron and
Kerbosch algorithm (BK). Our computational results on real-life protein
alignment instances show that our branch and bound algorithm is up to 116 times
faster than BK for the largest proteins.

Andonov, Rumen

Malod-Dognin, Noël

Yanev, Nicola

English

arXiv

Computing the similarity between two protein structures is a crucial task in molecular biology, and has been extensively investigated. Many protein structure comparison methods can be modeled as maximum weighted clique problems in specific k-partite graphs, referred to here as alignment graphs. In this paper we present both a new integer programming formulation for solving such clique problems and a dedicated branch and bound algorithm for solving the maximum cardinality clique problem. Both approaches have been integrated in VAST, a software tool for aligning protein 3D structures that is widely used at the National Center for Biotechnology Information and whose original clique solver uses the well-known Bron and Kerbosch algorithm (BK). Our computational results on real protein alignment instances show that our branch and bound algorithm is up to 116 times faster than BK.

HAL-CentraleSupelec

Solving Maximum Clique Problem for Protein Structure Similarity


HAL Descartes

HAL-Rennes 1

Hal-Diderot

INRIA a CCSD electronic archive server

Serdica J. Computing 4 (2010), 93–100

SOLVING MAXIMUM CLIQUE PROBLEM FOR PROTEIN STRUCTURE SIMILARITY*

Noël Malod-Dognin, Rumen Andonov, Nicola Yanev

Abstract. Computing the similarity between two protein structures is a crucial task in molecular biology, and has been extensively investigated. Many protein structure comparison methods can be modeled as maximum weighted clique problems in specific k-partite graphs, referred to here as alignment graphs. In this paper we present both a new integer programming formulation for solving such clique problems and a dedicated branch and bound algorithm for solving the maximum cardinality clique problem. Both approaches have been integrated in VAST, a software tool for aligning protein 3D structures that is widely used at the National Center for Biotechnology Information and whose original clique solver uses the well-known Bron and Kerbosch algorithm (BK). Our computational results on real protein alignment instances show that our branch and bound algorithm is up to 116 times faster than BK.

ACM Computing Classification System (1998): G.2.1, G.2.2.

Key words: protein structure comparison, maximum clique, k-partite graphs, integer programming, branch and bound.

*This work is supported by the ANR project PROTEUS "ANR-06-CIS6-008", by the Brittany Region and by the Bulgarian NSF project DO 02-359/2008.

1. Introduction. A fruitful assumption in molecular biology is that proteins of similar three-dimensional (3D) structures are likely to share a common function and in most cases derive from the same ancestor. Understanding and computing physical similarity of protein structures is one of the keys for developing protein based medical treatments, and thus it has been extensively investigated [11]. Evaluating the similarity of two protein structures can be done by finding an optimal order-preserving matching (also called alignment) between their components.
We show that finding such alignments is equivalent to solving maximum clique problems in specific k-partite graphs, referred to here as alignment graphs. In this context, we present a new integer programming model for solving the maximum weighted clique problem in alignment graphs. In addition, we propose a dedicated branch and bound algorithm (B&B) for the maximum clique problem. Both approaches have been integrated and validated in VAST [7] (Vector Alignment Search Tool), a software tool for aligning protein 3D structures widely used at the National Center for Biotechnology Information¹, and compared to the original VAST clique solver, which is based on the Bron and Kerbosch algorithm (BK) [5]. The results obtained on real protein structure comparison instances show that our B&B algorithm is up to 116 times faster than BK, and thus clearly demonstrate the usefulness of our dedicated algorithm.

2. Clique problems and protein structure similarity. In this paper, we focus on grid-like graphs, which we define as follows. An m × n alignment graph G = (V, E) is a graph in which the vertex set V is depicted by an (m-row) × (n-column) array T, where each cell T[i][k] contains at most one vertex i.k from V (note that for both arrays and vertices, the first index stands for the row number, and the second for the column number). Two vertices i.k and j.l can be connected by an edge (i.k, j.l) ∈ E only if i < j and k < l. It is easily seen that the m rows form an m-partition of G, and that the n columns also form an n-partition. As in the general case, a clique in G is a subset of V such that any two vertices in it are connected by an edge.

Various clique problems can be formulated in such a graph. The Maximum Clique problem (MCC) consists in finding in G a clique of maximum cardinality, denoted by MCC(G). MCC is one of the first problems shown to be NP-complete [8]. If we associate with each vertex i.k a weight $S_{ik}$, and with each edge (i.k, j.l) a weight $C_{ikjl}$, then other maximum clique problems arise.
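As a minimal sketch of the alignment graph definition above, the following Python fragment builds the edge set from a list of (row, column) vertices and checks whether a subset of vertices is a clique. The function names and the `compatible` predicate are hypothetical and only stand in for whatever compatibility rule a given alignment method uses; the paper does not prescribe this representation.

```python
from itertools import combinations


def build_alignment_graph(vertices, compatible):
    """Return (V, E) for an alignment graph.

    vertices   : iterable of (row, col) pairs, at most one vertex per cell.
    compatible : hypothetical predicate telling whether two matchings
                 may coexist; it is an assumption of this sketch.
    """
    V = sorted(set(vertices))
    E = set()
    # V is sorted, so each pair is visited with (i, k) before (j, l).
    for (i, k), (j, l) in combinations(V, 2):
        # An edge is allowed only if it is order preserving: i < j and k < l.
        if i < j and k < l and compatible((i, k), (j, l)):
            E.add(((i, k), (j, l)))
    return V, E


def is_clique(subset, E):
    """True if every pair of vertices in `subset` is joined by an edge."""
    return all((a, b) in E for a, b in combinations(sorted(subset), 2))


V, E = build_alignment_graph([(1, 1), (2, 1), (2, 2), (3, 3)],
                             lambda a, b: True)
print(len(E))
print(is_clique([(1, 1), (2, 2), (3, 3)], E))
```

Because every edge goes from a lower row and column to a higher row and column, a clique can use at most one vertex per row and per column, which is the one-to-one, order-preserving matching property used in the rest of the paper.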
Among the weighted variants, the most general is the Maximum Weighted Clique problem (MWC), which consists in finding the clique having the maximum sum of vertex and edge weights. Its particular cases (MCC, the clique with maximum sum of vertex weights, and the clique with maximum sum of edge weights) have been extensively investigated [1, 4, 6].

¹ http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml

From a general point of view, two proteins $P_1$ and $P_2$ can be represented by their ordered sets of components $N_1$ and $N_2$, and estimating their similarity can be done by finding an optimal matching between the elements of $N_1$ and $N_2$. In [2], we show that such matchings can be represented in an $|N_1| \times |N_2|$ alignment graph G = (V, E), where each row corresponds to an element of $N_1$ and each column corresponds to an element of $N_2$. A vertex i.k is in V (i.e. matching i ↔ k is possible) only if elements $i \in N_1$ and $k \in N_2$ are compatible, and this compatibility can be represented by a weight $S_{ik}$. An edge (i.k, j.l) is in E if and only if (i) i < j and k < l, for order preserving, and (ii) matching i ↔ k is compatible with matching j ↔ l. Again, this compatibility can be represented by a weight $C_{ikjl}$. A feasible matching of $P_1$ and $P_2$ is then a clique in G. There is a multitude of alignment methods and they differ mainly by (i) the nature of the elements of $N_1$ and $N_2$, (ii) the compatibility definitions between elements and between pairs of matched elements, and (iii) the kind of maximum clique to find in G. For example, in VAST, $N_1$ and $N_2$ contain 3D vectors representing the secondary structure elements of $P_1$ and $P_2$. Matching i ↔ k is possible if vectors i and k have similar norms and correspond either both to α-helices or both to β-strands. Finally, matching i ↔ k is compatible with matching j ↔ l only if the pair of vectors (i, j) from $P_1$ can be well superimposed in 3D space with the pair of vectors (k, l) from $P_2$. The longest alignment corresponds to MCC(G).

3. Integer programming model for MWC. By using the properties of our alignment graphs, we designed a new integer programming (IP) model (whose formulation is very different from [10, 3]) for solving the maximum weighted clique problem, where the weights are all in $\mathbb{R}$. To each vertex i.k ∈ V (in row $i \in V_1$ and column $k \in V_2$), we associate a binary variable $x_{ik}$ such that:

$$x_{ik} = \begin{cases} 1 & \text{if vertex } i.k \text{ is in the clique,} \\ 0 & \text{otherwise.} \end{cases}$$

We also associate to each edge (i.k, j.l) ∈ E a binary variable $y_{ikjl}$ such that:

$$y_{ikjl} = \begin{cases} 1 & \text{if edge } (i.k, j.l) \text{ is in the clique,} \\ 0 & \text{otherwise.} \end{cases}$$

The goal is to find a clique which maximizes the sum of its vertex weights and the sum of its edge weights. This leads to the objective function:

$$Z_{MWC} = \max \sum_{i.k} S_{ik}\, x_{ik} + \sum_{(i.k,\, j.l)} C_{ikjl}\, y_{ikjl}. \tag{1}$$

The one-to-one matching implies special ordered set constraints. In each row $i \in V_1$, at most one vertex can be chosen (2), and the same holds for the columns (3).

$$\sum_{k} x_{ik} \le 1, \quad \forall i \in V_1. \tag{2}$$

$$\sum_{i} x_{ik} \le 1, \quad \forall k \in V_2. \tag{3}$$

These special ordered set constraints lead to compact formulations of the relations between vertices and edges. Denote by $d^+_{col}(i.k)$ the set of columns l, l > k, such that ∃(i.k, j.l) ∈ E. In a similar way, $d^-_{col}(i.k)$ is the set of columns l, l < k, such that ∃(j.l, i.k) ∈ E. $d^+_{row}(i.k)$ is the set of rows j, j > i, such that ∃(i.k, j.l) ∈ E. And finally, $d^-_{row}(i.k)$ is the set of rows j, j < i, such that ∃(j.l, i.k) ∈ E. Edge-driven activations of vertices can be formulated with (4), (5), (6) and (7):

$$x_{ik} \ge \sum_{j} y_{ikjl}, \quad \forall i.k \in V,\ \forall l \in d^+_{col}(i.k). \tag{4}$$

$$x_{jl} \ge \sum_{i} y_{ikjl}, \quad \forall j.l \in V,\ \forall k \in d^-_{col}(j.l). \tag{5}$$

$$x_{ik} \ge \sum_{l} y_{ikjl}, \quad \forall i.k \in V,\ \forall j \in d^+_{row}(i.k). \tag{6}$$

$$x_{jl} \ge \sum_{k} y_{ikjl}, \quad \forall j.l \in V,\ \forall i \in d^-_{row}(j.l). \tag{7}$$

Vertex-driven activations of edges can be formulated with (8) and (9):

$$\sum_{i} x_{ik} + \sum_{j} x_{jl} - \sum_{i,j} y_{ikjl} \le 1, \quad \forall k \in V_2,\ \forall l \in V_2,\ k < l. \tag{8}$$

$$\sum_{k} x_{ik} + \sum_{l} x_{jl} - \sum_{k,l} y_{ikjl} \le 1, \quad \forall i \in V_1,\ \forall j \in V_1,\ i < j. \tag{9}$$

This IP formulation is an improved version of the one that we proposed in [9].

4. Branch and Bound approach for MCC.
We present here a new branch and bound algorithm for solving the MCC problem in the previously defined alignment graph G = (V, E). Let us first introduce some notions and notations. A successor of a vertex i.k ∈ V is an element of the set $\Gamma^+(i.k) = \{j.l \in V \text{ s.t. } (i.k, j.l) \in E,\ i < j \text{ and } k < l\}$. Similarly, a predecessor of a vertex i.k ∈ V is an element of the set $\Gamma^-(i.k) = \{j.l \in V \text{ s.t. } (j.l, i.k) \in E,\ j < i \text{ and } l < k\}$. $G_{\Gamma^+(i.k)}$ and $G_{\Gamma^-(i.k)}$ denote the subgraphs of G induced by the vertices in $\Gamma^+(i.k)$ and in $\Gamma^-(i.k)$. A feasible path in G is an ordered sequence "$i_1.k_1, i_2.k_2, \ldots, i_t.k_t$" of vertices of V, such that $\forall n \in [1, t-1]$, $(i_n.k_n, i_{n+1}.k_{n+1}) \in E$ and $i_n < i_{n+1}$, $k_n < k_{n+1}$.

Branching: Each node of the B&B tree is characterized by a pair $(C, C_{and})$, where C is the clique under construction and $C_{and}$ is the set of candidate vertices to be added to C. All B&B nodes can also access $C_{best}$, the best clique found so far during the exploration of the B&B tree (initially set to ∅). Starting from the root node (∅, V), the successors of a B&B node $(C, C_{and})$ are the nodes $(C \cup \{i.k\},\ C_{and} \cap \Gamma^+(i.k))$, for all vertices $i.k \in C_{and}$. Branching follows the lexicographic increasing order (row first).

Fathoming: For a given B&B node $(C, C_{and})$ and a current best clique $C_{best}$, we denote by $MCC_{i.k}(G)$ the maximum cardinality clique in G containing vertex $i.k \in C_{and}$. If $|MCC_{i.k}(G)| \le |C_{best}|$, then we do not miss the solution by discarding i.k from $C_{and}$. Furthermore, denote by $C_{i.k}$ the best clique that can be found by branching on the vertex i.k, and let $MCC_{i.k}(G_{C_{and}})$ be the maximum cardinality clique in $G_{C_{and}}$ (the subgraph of G induced by the vertices in $C_{and}$) containing i.k. It is easily seen that $|C_{i.k}| = |C| + |MCC_{i.k}(G_{C_{and}})|$.
Any vertex $i.k \in C_{and}$ such that $|MCC_{i.k}(G_{C_{and}})| \le |C_{best}| - |C|$ leads to non-interesting leaves, and thus can be removed from $C_{and}$.

Bounds: We are not going to compute $|MCC_{i.k}(G)|$ or $|MCC_{i.k}(G_{C_{and}})|$, but we replace them with upper bounds based on feasible paths. Denote by P(G) the longest (in terms of vertices) feasible path in G. Note that computing |P(G)| can be done by dynamic programming in O(|E|) time. For any vertex i.k ∈ V, we denote by $P_{i.k}(G)$ the longest feasible path in G containing i.k, such that for any vertex j.l ≠ i.k in the feasible path, j.l is connected to i.k (i.e. (i.k, j.l) ∈ E or (j.l, i.k) ∈ E). By definition, $P_{i.k}(G) = P(G_{\Gamma^-(i.k)}) \cup \{i.k\} \cup P(G_{\Gamma^+(i.k)})$, and $|P_{i.k}(G)| = |P(G_{\Gamma^-(i.k)})| + 1 + |P(G_{\Gamma^+(i.k)})|$. It is easily seen that $|MCC_{i.k}(G)| \le |P_{i.k}(G)|$ for all i.k ∈ V. Similarly, $|MCC_{i.k}(G_{C_{and}})| \le |P_{i.k}(G_{C_{and}})|$ for all $i.k \in C_{and}$. Thus any vertex $i.k \in C_{and}$ such that (i) $|P_{i.k}(G)| \le |C_{best}|$, or (ii) $|P_{i.k}(G_{C_{and}})| \le |C_{best}| - |C|$, can be safely removed from $C_{and}$.

5. Results. All results were obtained on a PC with an Intel Pentium 4™ CPU at 3 GHz. The IP based solver (MIP) was implemented with ILOG CPLEX 10.0, and the B&B solver was implemented in C. These two clique solvers were compared to BK² [5]. All algorithms were used to solve maximum cardinality clique problems. The comparison was performed on real protein structure comparison instances. We used two different benchmarks³ which significantly differ by the number of secondary structure elements (SSE) per protein chain. The first benchmark, the Skolnick set, contains 40 small protein chains having from 5 to 20 SSEs. The second benchmark, the S2 set, contains 36 long protein chains having from 51 to 87 SSEs.

² VAST's clique solver, BK, returns all maximal cliques in a graph and thus can be used to solve any kind of clique problem.
³ The full description of both benchmarks is available at: https://www.irisa.fr/symbiose/old/softwares/resources/proteus300
Note that for the Skolnick set, we only considered the 170 instances leading to alignment graphs having at least 100 vertices. Table 1 presents the characteristics of the corresponding alignment graphs. One peculiarity is their low density, less than 20% for the Skolnick set and less than 6% for the S2 set.

Table 1. Characteristics of the alignment graphs

Set name    Number of vertices        Number of edges             Density
            (min, average, max)       (min, average, max)         (min, average, max)
Skolnick    100, 158.92, 208          886, 2368.69, 3547          0.16, 0.18, 0.20
S2          1390, 2384.97, 5582       45278, 144206.44, 604793    0.03, 0.05, 0.06

Figure 1 compares the time needed by MIP to that of BK on the 170 Skolnick instances. On average, MIP is 3.35 times slower than BK. This is not surprising, since dedicated solvers are expected to be faster than general purpose solvers (CPLEX in this case). This observation motivated us to go further in developing a fast special purpose clique solver.

[Figure 1: scatter plot of the instances with the line y = x; axes: BK time in sec. (log scale) and MIP time in sec. (log scale).]
Fig. 1. MIP vs BK running time comparison on the Skolnick set. For each instance the execution time of BK is plotted on the x-axis, while that of MIP is depicted on the y-axis. All points are above the y = x line (i.e. BK is always faster than MIP).

Figure 2 compares the time needed by B&B to that of BK on set S2. We observed that B&B is on average 15.57 times faster than BK, and on the biggest instances (where both proteins contain more than 80 SSEs), it is up to 116.7 times faster. Such big instances are solved by B&B in less than 79 seconds (25 sec. on average) while BK needs up to 2660 seconds (1521 sec. on average).

[Figure 2: scatter plot of the instances with the line y = x; axes: B&B time in sec. and BK time in sec. (log scale).]
Fig. 2. B&B vs BK running time comparison on the S2 set. The execution time of B&B is presented on the x-axis, while that of BK is on the y-axis (in log scale). Any point above the y = x line is an instance for which B&B is faster than BK.

6. Conclusion. We presented a new IP model for solving the maximum weighted clique problem arising in the context of protein structure comparison, which was implemented and validated on a small benchmark. We also presented a new dedicated B&B algorithm for the maximum cardinality clique problem. The computational results show that on big instances, our B&B is significantly faster than the Bron and Kerbosch algorithm (up to 116 times for the largest proteins). In the near future, we intend to study the behavior of the proposed algorithms on arbitrary graphs, conveniently transformed into grid graphs in a preprocessing step.

REFERENCES

[1] Abello J., P. M. Pardalos, M. G. C. Resende. On maximum clique problems in very large graphs. Ext. Mem. Alg., 1999, 119–130.
[2] Andonov R., N. Yanev, N. Malod-Dognin. An efficient Lagrangian relaxation for the contact map overlap problem. In: Proceedings of WABI'08, Lecture Notes in Computer Science, Vol. 5251, Springer, Berlin/Heidelberg, 2008, 162–173.
[3] Balas E., S. Ceria, G. Cornuejols, G. Pataki. Polyhedral methods for the maximum clique problem. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 26 (1996), 11–28.
[4] Bomze I. M., M. Budinich, P. M. Pardalos, M. Pelillo. The maximum clique problem. Handbook of Combinatorial Optimization, 1999.
[5] Bron C., J. Kerbosch. Algorithm 457: finding all cliques of an undirected graph. Commun. ACM, 16 (1973), No 9, 575–577.
[6] Busygin S. A new trust region technique for the maximum weight clique problem. Discrete Appl. Math., 154 (2006), No 15, 2080–2096.
[7] Gibrat J-F., T. Madej, S. H. Bryant. Surprising similarities in structure comparison. Current Opinion in Structural Biology, 6 (1996), No 3, 377–385.
[8] Karp R. M. Reducibility among combinatorial problems. Complexity of Computer Computations, 6 (1972), 85–103.
[9] Malod-Dognin N., R. Andonov, N. Yanev, J-F. Gibrat. Modèle de PLNE pour la recherche de cliques de poids maximal. In: ROADEF 2008, 307–308.
[10] Pardalos P. M., G. P. Rodgers. A branch and bound algorithm for the maximum clique problem. Comput. Oper. Res., 19 (1992), No 5, 363–375.
[11] Sierk M. L., G. J. Kleywegt. Déjà vu all over again: Finding and analyzing protein structure similarities. Structure, 12 (2004), No 12, 2103–2111.

N. Malod-Dognin, R. Andonov
IRISA–Université de Rennes 1
Campus de Beaulieu
35042 Rennes Cedex, France
e-mail: nmaloddg@irisa.fr, randonov@irisa.fr

N. Yanev
Faculty of Mathematics and Informatics
University of Sofia
and
Institute of Mathematics and Informatics
Acad. G. Bonchev Str., Bl. 8
1113 Sofia, Bulgaria
e-mail: choby@math.bas.bg

Received November 2, 2009
Final Accepted February 4, 2010
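The branching, fathoming and bounding rules of Section 4 can be illustrated with a short Python sketch. It is not the authors' C implementation, which is not shown in the paper: the function name `max_cardinality_clique`, the tuple representation of vertices and the convention that each edge is stored as an ordered pair (a, b) with a before b in both row and column are assumptions made here for illustration. The path bound is recomputed from scratch at every node, which is simple but slower than an incremental scheme.

```python
def max_cardinality_clique(V, E):
    """Branch and bound for a maximum cardinality clique (sketch).

    V : iterable of (row, col) vertices.
    E : set of ordered pairs (a, b), a = (i, k), b = (j, l), i < j and k < l.
    """
    succ = {v: set() for v in V}   # successors, Gamma+ in the text
    pred = {v: set() for v in V}   # predecessors, Gamma- in the text
    for a, b in E:
        succ[a].add(b)
        pred[b].add(a)

    def longest_path(vs):
        """Number of vertices on a longest feasible path inside `vs`."""
        length = {}
        # Edges always point to lexicographically larger vertices, so a
        # reverse sorted sweep sees every successor before its origin.
        for v in sorted(vs, reverse=True):
            length[v] = 1 + max(
                (length[w] for w in succ[v] if w in vs), default=0)
        return max(length.values(), default=0)

    def path_bound(v, vs):
        """|P_v| restricted to `vs`: predecessors' path + 1 + successors' path."""
        return longest_path(vs & pred[v]) + 1 + longest_path(vs & succ[v])

    all_vertices = set(V)
    global_bound = {v: path_bound(v, all_vertices) for v in all_vertices}
    best = []

    def explore(clique, cand):
        nonlocal best
        if len(clique) > len(best):
            best = list(clique)
        # Fathoming: keep only candidates passing rules (i) and (ii).
        kept = {v for v in cand
                if global_bound[v] > len(best)
                and path_bound(v, cand) > len(best) - len(clique)}
        # Branching in lexicographic increasing order (row first).
        for v in sorted(kept):
            explore(clique + [v], kept & succ[v])

    explore([], all_vertices)
    return best


V = [(1, 1), (1, 2), (2, 2), (2, 3), (3, 3), (3, 4)]
E = {((1, 1), (2, 2)), ((2, 2), (3, 3)), ((1, 1), (3, 3)),
     ((1, 2), (2, 3)), ((2, 3), (3, 4))}
print(max_cardinality_clique(V, E))
```

In this small instance the vertices (1, 2), (2, 3), (3, 4) form a feasible path of three vertices but not a clique, since (1, 2) and (3, 4) are not connected. The path length is therefore only an upper bound, and it is enough to discard those vertices once a clique of size three has been found.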
