Abstract-Search of discrete spaces is important in combinatorial optimization. Such problems arise in artificial intelligence, computer vision, operations research, and other areas. For realistic problems, the search spaces to be processed are usually huge, necessitating long computation times, pruning heuristics, or massively parallel processing. We present an algorithm that reduces the computation time for graph matching by employing both branch-and-bound pruning of the search tree and massively-parallel search of the as-yet-unpruned portions of the space. Most research on parallel search has assumed that a multiple-instructionstream/multiple-data-stream (MIMD) parallel computer is available. Since massively parallel single-instruction-stream/multiple-datastream (SIMD) computers are much less expensive than MIMD systems with equal numbers of processors, the question arises as to whether SIMD systems can efficiently handle state-space search problems. We demonstrate that the answer is yes, and in particular, that graph matching has a natural and efficient implementation on SIMD machines.
INTRODUCTION

Overview
In many application areas, combinatorial search problems are common. These include computer vision, artificial intelligence, code optimization in compilers, and computeraided design. In practical situations, the sizes of the search spaces are so large that one often has to settle for solutions that are either suboptimal or based on more simplifying assumptions than would be ideal. In this paper, we give an algorithm that combines several techniques to speed up a particular optimization problem: the inexact matching of two graphs. The techniques we use include branch-andbound search, heuristics for computing good lower bounds on the distance between the two graphs, and massive parallelism. In addition, we formulate the algorithm to take advantage of single-instruction-stream/multiple-data-stream parallelism rather than the more flexible but much more expensive multiple-instruction-stream/multiple-data-stream sort of parallel system. This means that massively parallel processing can be more economically applied using our algorithm than would be the case with another algorithm.
Although we did address the issue of load balancing to some extent, the importance of our contribution lies in the formulation of our search and pruning methods to the massively parallel SIMD environment, and the integration of these three techniques to speed up graph matching.
In the following paragraphs, we describe the application area that motivated our study: a problem in computer vision.
Motivating Application
Model-based image analysis consists of constructing a structural description of a portion of an image, and then comparing this description to a model from a database of possible models. The determination of the probable identity of the objects in the image from the database of models involves two levels of search. The outer level is the search through the database to select those models that are most likely to match the structures extracted from the image. Then, finding the best such model requires an inner-level search, which is the process of evaluating how closely a model corresponds to the extracted structures. Most of the parallel approaches to model-based recognition in the literature (see Section 2) find the most likely model through a process of constraint satisfaction. These techniques have successfully reduced the cost of finding the best model; however, they are problem-specific rather than general techniques.
Relational matching is a general approach to modelbased recognition that has been used by a number of researchers in the computer vision field [1] , [2] , [37] , [38] , [39] , [40] , [41] , [42] , [43] . Relational matching is a particular process for comparing structures which are represented as relational models. A relational model consists of a set of primitive parts and a description of the relationships among them. Two relational models are equivalent when a bijection exists between the pieces of the first model and the pieces of the second that preserves all the relationships. Two relational models are similar if there is a one-one mapping from the pieces of the first model onto the pieces of the second that preserves most of the relationships. Similarity can be measured using the "relational distance" between two relational models [2] . When the relationships are binary, this reduces to the problem of finding a mapping between the vertex sets of two graphs that minimizes the value of a distance measure.
Although our relational matching algorithm was motivated by applications in computer vision, it is not limited to this domain. It is a fully general graph matching algorithm limited neither to particular types of graphs, to particular applications, nor to particular types of computers.
Problem Formulation
In this paper, we focus on the efficient solution of the graph matching problem in a parallel computing environment. The goal of the matching process is to find the best match between two graphs: G 1 , which we sometimes call the unit graph, and G 2 , which we sometimes call the label graph. (This terminology was introduced in [1] .) We assume in our work that G 1 and G 2 have the same number of vertices. The best match can be considered as a permutation of the vertices of G 1 so that as few of the permuted graph's edges as possible differ from those of G 2 . For each possible permutation, the edges of G 1 and G 2 are compared, and an error of one is recorded for each unmatched edge. An unmatched edge can be an edge of G 1 with no corresponding edge in G 2 or an edge in G 2 with no corresponding edge in G 1 .
A simple example graph-matching problem is shown in Fig. 1 . In this example we seek a permutation of the set {0, 1, 2, 3} of vertices of G 1 that best aligns G 1 with G 2 . In this case there are three best permutations, each of which has an error of one. In the case of the permutation [0, 1, 3, 2] this is because the edge (3, 1) OE G 1 does not have a matching edge in G 2 ; similar mismatches occur for the other two permutations.
In Fig. 2 , the tree of possible permutations and partial permutations for this problem is shown. Each branch of the tree represents the assignment of an element of the set to another (possibly the same) element of the set. At the first level, the element 0 is assigned either to itself, to 1, to 2, or to 3. At the second level, the possible assignments for element 1 triple the number of possible alternatives. At the third level, the number of possible alternatives doubles, whereas at the fourth level, the assignments for element 3 are fixed by the assignments of elements 0, 1, and 2. There are a total of 4! = 24 possible joint assignments (permutations) and each corresponds with one leaf node of the tree.
The number of possible vertex permutations can be very large, growing as the factorial of the number of vertices of the unit graph. The number of possible mappings for a 15 vertex graph-matching problem is 360,360 times the number of mappings for a 10 vertex problem; the increased number of processors on a parallel machine can be quickly overcome by the problem size. Without clever means of finding cutoffs quickly, this number grows so rapidly that any standard backtracking scheme will take too long to be useful. One technique for pruning the search space is the use of a heuristic estimate of the cost of matching the unmatched portions of the graphs. This technique is called "forward checking" and we have described it in a previous paper [3] .
The forward checking method employs a structure called the forward checking matrix (FCM) in its pruning algorithm. The rows of this matrix are indexed by unmatched Fig. 1 . Two graphs, G 1 and G 2 , to be matched. vertices in G 1 and the columns by unmatched vertices in G 2 ; its elements hold the independent error caused by mapping a presently unmatched vertex of G 2 .
The main contribution of this paper is to present a new parallel algorithm for determining a minimum-distance correspondence between two graphs. In doing this, the algorithm finds the relational distance between two graphs.
The paper is organized as follows. In the next section, we discuss the related literature in constraint satisfaction, object recognition, and parallel object recognition. In Section 3, we give a formal definition of our distance metric. In Section 4, we describe the SIMD parallel computing environment, the search algorithm, and how we manage the search process on the processor array. In Section 5, we present the results of our experiments on a 1,024 processor MasPar as well as on a DecStation serial processor. Section 6 contains the conclusions from our work. Detailed parallel pseudocode is included as an appendix.
RELATED WORK
Graph matching plays an important role in various applications, such as computer vision, and the problem is closely related to constraint satisfaction, consistent labeling, and relaxation labeling. Solution approaches include "hierarchical synthesis" [4] , in which one first looks for subgraph matches and then tries to extend them to larger mappings, and "inexact matching" [2] , in which a distance measure called the relational distance between graphs is defined and used to determine the best mapping between two graphs. Constraint satisfaction has been studied extensively in computer vision and artificial intelligence [5] , [6] , [7] , [8] , [9] , [10] , [11] , [12] . However, the object recognition comunity has done a great deal of work on matching that goes further than traditional constraint satisfaction does. For example, Bolles and Horaud [13] detected subgraph isomorphisms by finding maximal cliques in an association graph, and Umeyama [14] developed a polynomial-time method based on the eigendecomposition of the adjacency matrix of a graph; this technique obtains good results when the graphs are close to each other. The graph matching problem has also been attacked with strategies based on the generation and evaluation of hypotheses [15] , [16] , [17] , [18] .
Parallel methods have been studied for graph matching and other combinatorial optimization problems in both the computer vision and artificial intelligence communities, as well as in the computing community more generally. For example, parallel object recognition based upon tree search has been done on a Connection Machine CM-1 [19] . Geometric hashing can speed up graph matching for 3D vision problems [20] . Parallel hypothesis generation and testing for object recognition was demonstrated in [21] , and other parallel object recognition methods are given in [22] , [23] , [24] .
More general work on parallel search includes parallelizing the processing of individual nodes in the tree [25] , tree decomposition [26] , [27] , [28] , parallelizing alpha-beta search [29] , and "parallel window search" [30] . An overview can be found in [31] .
Combinatorial search on SIMD parallel architectures is problematical in part because search spaces are often irregular either in shape or in the relative importance of their subspaces, and also in part because the manipulation of graph structures typically involves manipulation of pointers or other symbolic representations. Consequently, it would seem necessary for the processors of a parallel machine to have autonomy at the level of SPMD architectures (single program/multiple data streams). Several authors have shown that SIMD systems are indeed capable of doing data-parallel combinatorial search. Iterative deepening search for solutions to the well-known 15-puzzle on an SIMD system was studied [32] , [33] , where the primary focus was on load balancing. A more theoretical study of the load balancing problem for SIMD combinatorial search can be found in [34] , in which optimal conditions are derived for triggering load-balancing phases of the computation; they also test their method using the 15-puzzle on a Connection Machine CM-2 computer.
The branch-and-bound method for combinatorial search is attractive whenever the objective function being minimized lends itself to easy computation of (relatively high) lower bounds. Computing distances between graphs fits nicely into this framework, as we will show later. One way to take advantage of parallelism for branch-and-bound search is to give different bounds to different processors in such a way that one processor is guaranteed to eventually find a solution, but other processors are made likely to find a solution quickly [35] . We note that parallel methods for branch-and-bound search can benefit from parallelization on massively-parallel computers in two other ways. First, processors can divide up the search space and process it more quickly together. Second, whenever one processor succeeds in improving the lower bound on the objective function, it can share it with all the others and thus enable them all to prune off more of their own subspaces from the search.
APPROACH
Basic Definitions
Let G 1 = ·V 1 , E 1 Ò and G 2 = ·V 2 , E 2 Ò be two directed graphs to be matched. We assume for now that
The sets of edges E 1 and E 2 are therefore both subsets of V 1 ¥ V 1 , although we will often speak of E 2 as a subset of V 2 ¥ V 2 . We use both V 1 and V 2 in our presentation in order to make it clear which graph's vertices we are referring to.
We define the degree of mismatch d of a permutation p as the number of edges in G 1 that are not mapped by p to edges in G 2 plus the number of edges in G 2 not mapped to edges in G 1 .
The distance between two graphs is defined as the minimum d that can be obtained for any permutation p of the vertices.
The problem of finding a permutation of one graph's vertices to achieve a minimum distance from another graph is combinatorial in nature, and can be solved by a bruteforce backtracking tree search. Each iteration of the search answers the following question: Are there any unexplored paths from the root to a leaf going through the current node that might have a degree of mismatch less than the best value found so far? If there is such a path, the search process will move down the tree by adding an assignment for one more vertex to the partial permutation that corresponds to the next node along the path from the root to the current node, and then the question is asked again, etc. If there is no such unexplored path, the search moves back towards the root by removing the most recently added construct of the permutation. 
Bound on the Degree of Mismatch
Given a partial permutation p m , the search process needs to find a good lower bound for the degree of mismatch of all complete permutations containing p m , so that the search can be pruned as soon as possible. In order to provide a method for finding this lower bound, we need to examine the possible sources of mismatches. A mismatching edge of G 1 in G 2 according to p m can be classified into one of three groups: 1) an edge between a pair of nodes in G 1 that have assigned image nodes in G 2 under p m , 2) an edge between a node assigned an image node by p m and a node not assigned an image node by p m , and 3) an edge between two nodes not assigned image nodes by p m .
In keeping with the terminology in our previous paper, the number of mismatching edges of the first type is called the past-past error, the number of the second type the past-future error, and the number of the third type the future-future error.
Since the past-past error involves only the vertices of the graphs G 1 and G 2 whose correspondences are specified by p m , it can always be computed exactly by comparing the edges of the graphs. The past-future error is the error resulting from comparing those edges with exactly one incident vertex in V m 1 with corresponding edges in G 2 . The error resulting from these potential assignments can be placed into a data structure called the forward checking matrix (FCM).
The FCM has one row for each unassigned vertex of G 1 and one column for each unassigned vertex of G 2 . The entry FCM[i, j] represents the accumulated error for the mapping of the ith vertex of V 1 to the jth vertex of V 2 . In practice, we maintain the FCM as an n ¥ n array, even though as a permutation is built up, the numbers of unassigned vertices in V 1 and V 2 get smaller and smaller. Each time we add another vertex assignment to the partial permutation, we simply ignore one more row and one more column of the n ¥ n array.
At the beginning of the search, the n 2 coefficients of the FCM are initialized to zero, meaning that each potential vertex assignment has no error associated with it as yet. The values in the FCM are updated by the forward checking procedure, which is called every time a vertex assignment is added or removed from the partial permutation. The value of FCM[i, j] is incremented for each unmatchable edge in either G 1 or G 2 resulting from the addition of an assignment i ‫ۋ‬ j to the permutation being constructed. An example of an unmatchable edge would be an edge (v i , v k ) in E 1 , for which no unmatched vertex l OE V 2 can be found for which ( j , l) is a corresponding edge in E 2 . By estimating a lower bound for the minimum error permutation in this matrix we estimate the lower bound for the error of any complete permutation containing p m . We call this estimation process "solving the FCM"; it is equivalent to solving a weighted bipartite matching problem.
The future-future error is the number of mismatched edges between unassigned vertices. The future-future error constitutes a major portion of the potential distance between the graphs when the partial permutation is small and pruning is most effective. It cannot be exactly computed, because neither vertex incident to an edge is fixed; however, some information from the future-future error can be added to the forward checking matrix. If there are differences in the in-degree or out-degree between a vertex of G 1 and a vertex of G 2 , then at least one edge mismatch must occur when these vertices are paired in the permutation. Although the estimated future-future error is often far smaller than the actual error that results from a full permutation, combining it with the FCM allows much earlier cutoffs than when using the past-past and past-future errors only.
For future-future edges, using a simple counting mechanism can provide further information about error that would accumulate below a node were its descendants actually to be explored.
For example, suppose there are 12 edges between four unmapped G 2 vertices. These edges are counted at their source and sink vertices, providing an in-degree and outdegree for each unmapped vertex. These counts are expressed as (in-degree, out-degree) pairs for each G 2 vertex, e.g., (7, 2) , (3, 9) , (2, 0), and (0, 1). Suppose the unmapped G 1 vertices have (in-degree, out-degree) pairs of (3, 5), (6, 0), (1, 2) , and (0, 3). Performing speculative mappings between G 1 vertices and G 2 vertices reveals edge deficiencies using either in-degree or out-degree. A table of all possible mappings for the vertices above gives the following deficiency information: from. The techniques involved are fundamentally the same.) Choosing the minimum path through the tables yields errors of two and six, respectively. Since any mapping must involve both the sources and sinks of edges, the greater of these can be used as a minimum bound. The errors from both of the tables cannot be combined at full value, since the tables reflect the same edge twice: once at its source vertex and again at its sink vertex. However, the tables may be combined after first multiplying them by scalars whose sum is 1.0, producing a result that is not greater than choosing the higher cost table. This is an advantageous approach because the cost of separately calculating errors for in-edges and out-edges to see which provides the better result is prohibitive. In this linear combination, both types of edges can contribute to the error calculation.
The useful information from both the past-future and future-future errors is stored in tables with G 2 vertices indexing the columns and G 1 vertices indexing the rows. They are combined into a dense combined error matrix (CEM) before the more intensive work of finding a minimum path is performed. The CEM is used to determine the lower bounds on error for each child of a node. Since the children of a node correspond to the next G 1 vertex being matched to each available G 2 vertex in turn, a minimum bound for the combined future error for each child can be calculated as the minimum path through the CEM if the top row (the next G 1 vertex) is matched to each column (each child). For instance, if the following is the CEM:
then the four children of the current node would correspond to G 2 vertices "A," "B," "C," and "D" each being mapped to G 1 vertex "a." To calculate the future error of child (A, a), we take the error of three from the child's place in the table and add the minimum path value (allowing a row to be chosen by only one column) through the remaining rows and columns. This path would be (B, d) (C, b) (D, c), resulting in an error of seven; the total minimum error for the (A, a) mapping is 10. The other children would have errors of six, 12, and 10, respectively. We call this process "solving" the matrix. This lower bound on future error can then be added to the past error of the current node to determine if it is possible for the child mapping to have an error under the current threshold. If not, the child can be eliminated. To solve the CEM by finding the minimum path requires its own combinatorial search. Shapiro and Haralick [1] gave a method to estimate the solution by choosing for each row the column with minimum value without breaking ties where two rows pick the same column. In a previous paper [3] , we described two improvements on this algorithm. The first improvement was to calculate the higher error that must be incurred when more than one G 2 vertex chose the same G 1 vertex. The G 2 vertex with the greatest difference between its first and second choices would keep the contested G 1 vertex, and the other G 2 vertices would select their next-best choices. This second choice might result in further conflicts which were left unresolved; leaving the conflicts unresolved results in a lower error being projected from the matrix than would be the case if the minimum path were found. We found that the number of conflicts grew with the size of the CEM; one G 1 vertex might be preferred by all the G 2 vertices. This tendency resulted in a value far below that of the minimum path. Since the CEM is largest at the top of the tree (when the number of unmapped vertices is greatest), the technique is weakest when the potential savings from pruning is greatest.
The second improvement was to reduce these conflicts by making the rows that are not chosen more competitive. This is done by calculating a value to subtract from each row; we found the most effective value for medium to large sized tables to be the fourth largest value in the row. This results in a leveling of the values in the matrix. The number of G 1 vertices selected by some G 2 vertex is increased, so that the resulting prospective mapping more closely resembles a minimum path. After the error of the prospective mapping is calculated, the sum of the values subtracted from the rows is added back.
THE PARALLEL ALGORITHM
We implemented the parallel graph-matching algorithm on a MasPar MP-1 machine. In the following section, we describe the MasPar environment and explain the parallel algorithm in detail.
The MasPar Environment
The MasPar MP-1 is a single instruction multiple data (SIMD) parallel processing computer consisting of an array control unit (ACU) and a relatively coarse-grained grid of processing elements (PEs) also referred to as "processors." Our machine has 1,024 PEs, as compared to the 256,000 1-bit PEs on the Connection Machine CM-2. Each four-bit-wide processor has 32 32-bit registers on-chip and 64KB of offchip RAM. The ACU memory is termed "singular" and is readily accessible to any PE, while the memory on the PEs is termed "plural." The PEs are arranged in a 32 ¥ 32 grid (array), indexed according to their position. Communication between PEs is accomplished either 1) neighbor-to-neighbor: Each PE can address the memory of its eight neighbors using the "X-net" interconnection scheme, 2) randomly via a global router, or 3) via the ACU: The ACU may poll from the array and broadcast to the array.
Performing a branch-and-bound search with multiple instruction, multiple data (MIMD) machines has been shown to be no less efficient, in the sense of number of nodes expanded, than on a serial processor [36] . An SIMD processor is substantially less flexible; the speed-up for a search (in the sense of real time) compared to a serial machine is a fraction of the number of processors working on the search. However, the lower cost and the appropriateness of the SIMD system for lower-level image processing would likely make this architecture readily available for the higher-level tasks as well. A discussion of the changes to our algorithm that were required when it was moved from serial hardware to the MasPar is given following the explanation of the algorithm.
Summary of the Parallel Algorithm
Much of the parallel algorithm for the MasPar involves identical logic to that of a serial algorithm; we implemented a serial version of the algorithm for comparison in our experiments. The algorithm presented below has the form of a sandwich: Steps unique to the parallel environment occur on either side of steps that a serial search would perform. 
Initializing the Processor Array
As implied by its classification, an SIMD implementation must have multiple data streams for the single instruction stream to manipulate. In our data-parallel approach, the first task of parallelization is to partition the search space among the processors. In order to explain the partitioning algorithm, we first describe the search tree. To represent all possible mappings of G 1 vertices to G 2 vertices, we employ a search tree in which each horizontal level represents a G 1 vertex and each node of that level represents a potential G 2 vertex to correspond to it (but at the second and deeper levels, there generally are multiple nodes for each possible G 2 vertex). The root of the tree will have N branches, since the first G 1 vertex may be mapped to any of the G 2 vertices. Each of these branches will have N -1 branches, since a G 2 vertex cannot be simultaneously mapped to more than one G 1 vertex. Continuing this process down N levels results in N! leaf nodes or possible mappings, and each root-to-leaf path is a unique permutation of the G 2 vertices. Therefore, the set of all paths from root to leaf in an N vertex mapping problem is equivalent to the set of all permutations of the integers from zero to N -1, a fact we use to maintain the distinction between neighboring subproblems that may be present on widely separated processors. Using permutations, we have developed a means of representing the needed partitions of the search space in a highly efficient manner.
A subtree (a subproblem) can be completely designated by a pair ·H, bÒ, where H is the permutation representing the path from the root to the subproblem's left-most leaf, and b is defined as the depth of the root of the subtree that represents the subproblem.
The portion of the permutation from the root to a subtree root is fixed for a subproblem. The branches from the subtree root are the set of all permutations of the G 2 vertices not used in the fixed portion. Each node of the tree will have as many branches (children) as the distance between it and the bottom level (this distance is N -b). (Equivalently, the number of G 2 vertices still available is the distance to the leaf, and any of the available G 2 vertices can be mapped to the next G 1 vertex.) This is recursively true as we move down the tree: The G 2 vertices that make up any path from a node to a descendant leaf are identical to the children of the node.
Each processor is initialized with a queue of subproblems in the following way. We start by calculating how many subproblems are required to provide all processors with work. Each processor can have an active job and several jobs waiting in a queue. Let q be the number of entries allocated for each processor's queue. Let n be the number of processors on the machine. Then nq is the total number of queue entries allocated on the machine. Partitioning the tree evenly by expanding all the branches at the top level will create N subproblems. Expanding the next level will result in N(N -1) subproblems, and so on. Therefore it is simple to calculate N b , the number of equal-sized subproblems we can easily generate that will fit into the nq job queue entries, and b, the number of levels of expansion required to produce those N b subproblems. Since all the branches at a given level will be expanded, all subproblems are initially the same size. Note, however, that some fraction of the PEs are likely to have one more job in their queues than the others, since the number of subproblems is not likely to be a multiple of the number of processors. So the task is to partition the tree at level b, so that each processor has roughly the same amount of work to do.
For an illustration of the job allocation process, take the number of processors to be six, the queue size to be five, and the number of vertices in the graphs to be mapped as five. The tree can be split at the second level to produce twenty jobs, and still fit within the 30 available queue entries. Using the terms above, n = 6, q There are two approaches to achieving this initial job distribution: 1) make it the responsibility of the serial host computer, or 2) devise a parallel algorithm such that the SIMD machine can do it by itself.
Since the latter is more interesting, we describe such a method below. Each processor will calculate the left-most permutation of the job that belongs in each of its queue entries. Each queue entry is initialized with the left-most permutation of the whole tree; in our example, this would be [0, 1, 2, 3, 4]. This permutation also represents the N children of the root of the search tree. We use both meanings of the permutation in our initialization process: At any iteration i, the first i slots of the permutation are fixed, describing a vertical path in the tree, while the remaining slots of the permutation describe the children of the ith node. The first iteration of our process is for each queue entry to calculate which branch of the root node it belongs to, establishing the element in the first slot of the permutation. The method of calculation will be explained below. The occupant of the first slot is swapped with the new first element; for example, if three was calculated, the initial permutation would be modified into [3|1, 2, 0, 4], which contains a mark showing what choices have been made so far (the elements to the left of the vertical bar). In the second iteration, each queue entry calculates which of the children of the first node it belongs to. The resulting element is again swapped with the occupant of the second slot; for example if two was calculated, the resulting marked permutation is [3, 2|1, 0, 4]. The calculation can be repeated again with the queue entries calculating which child belongs in its third slot, and so on. In our example, we need only assign the first two elements, so we are done after the second iteration. The final step establishes the left-most branch of the subtree by writing the unused elements into the remaining slots in ascending order. In the example, this would result in [3, 2, 0, 1, 4].
Each processor will calculate which elements it should select for its permutation using a unique number that is systematically modified as successive elements are assigned through the iterative process. Let x i represent this identifier at the ith iteration. The initial value x 0 is made to be a unique identifier for each queue entry; x 0 is obtained by adding the unique processor ID (with the first processor having an ID of 0) to the product of the number of processors times the position of the entry in the queue (with the first queue entry being at position 0). In the example, the third queue entry for the fifth processor would have an x 0 value of 16, because x 0 = 4 + (6 * 2).
The selection of the first element is calculated for each queue entry by taking x 0 modulo N. Any surplus queue spaces (those above N b ) are left empty. This results in N b /N queue entries selecting each child of the root node. Our example case would map element 1 to 1, since 16 modulo 5 equals 1, with the resulting marked permutation being [1|0, 2, 3, 4] . In our example system, four queue entries (those with x 0 equal to 1, 6, 11, and 16) would assign themselves to the branch of the tree starting with one.
The second element is chosen by first calculating x 1 as the greatest integer less than or equal to x 0 /N. This partitions the queue entries so that the first N entries (which have each selected successive children from the root node) have x 1 equal to 0, the next N entries have x 1 equal to 1, and so on, with the last N entries having a value of N -1 for their x 1 . The element belonging to the second slot of the permutation is calculated by taking x 1 modulo N -1. This calculation allows each queue entry to select which child of the first node it should map to the second G 1 vertex. Since there are N -1 branches from each level-one node, and by definition N -1 is a factor of N b , there will be one queue entry for each available child of the first level node. In our example case, x 0 was 16, so x 1 is 3, which indicates that this entry should take the element from the permutation with index 3, which is the element 4 (since our index starts with 0 at the current position). Now the example queue entry would be [1, 4|2, 3, 0] .
Our small example is finished after the second iteration, since b = 2, so the cleanup step of assigning unused elements in ascending order is performed, with the resultant permutation being [1, 4, 0, 2, 3] . For actual problems, the algorithm can continue to map elements as necessary. The first step in deriving the (i + 1)th mapping is to take x i as the greatest integer less than or equal to The (i + 1)th element's index is x i mod N -i. This cycle is repeated until level b of the tree is reached; each level-b node will be represented by one and only one queue entry in the processor array.
The Search
In order to have the processors of the SIMD explore the search space in parallel, we have developed a simple algorithm that a processor can use to generate all leaf nodes (permutations) in a subproblem.
The process is to inspect the vector describing the cur- 1, 2, 5, 3, 4 ].
An initial mismatch threshold is input with the graphs to be matched. If a permutation, or all possible permutations below a given node, can be determined to have a degree of mismatch above the threshold, the node and its descendants can be eliminated from further consideration. If a complete permutation is found which has a degree of mismatch below the threshold, its degree of mismatch becomes the new threshold, and the permutation is stored as the new incumbent (best permutation known so far to this processor). The processors are scanned for improvements in the lowest degree of mismatch found, and a global best permutation and new threshold are established. This process continues until all permutations have been checked or pruned away.
The branch-and-bound search process consists of four components:
1) selection of subproblems, 2) expansion of subproblems, 3) termination testing and updating of the incumbent, and 4) elimination tests.
The search is performed through iterating these steps. The selection of the next subproblem is done through a simple ordering of nodes; each processor investigates a subproblem from left to right, without evaluating which branch may have the best chance for providing a minimal degree of mismatch. Therefore, the process of selection of the next subproblem is blind; if a branch is not eliminated by the elimination tests, it is expanded. Suppose a processor is at a node in the "middle" of its subproblem; that is, the node is at a level c such that b < c < N -1. The basic search process performed by each processing element is an iteration of a loop that either expands a child node (moving down the tree), expands the next node on the same level (moving up to the parent and back down to a sibling to the right), or closes a node (moving up the tree to the parent). The first two options require determining whether the node expanded provides a complete permutation (the "termination test"). If it does not provide a complete permutation, the algorithm must evaluate whether any children of the node should be eliminated (the "elimination test"). If a complete permutation is available, the relational distance associated with this permutation is checked against the threshold, and the local incumbent permutation is updated as appropriate. When moving up the tree, the level reached is compared against b, the depth of the subproblem root, to determine whether the subproblem has been completed.
The heart of the elimination process is to evaluate the children of the current node to determine whether further work needs to be done. The three types of error are combined for children of the current node to determine whether any can be eliminated from further consideration: 1) error from edges between mapped vertices (past-past error), 2) error from edges between a mapped vertex and an unmapped vertex (past-future error), and 3) error from edges between unmapped vertices (futurefuture error).
Each processor has its own FCM which has columns corresponding to G 2 vertices and rows corresponding to G 1 vertices. Due to the restricted amount of memory available to each processor, we do not retain separate FCMs for each tree node or each level of the search. Instead, each processor updates a single table of its own as it moves through its subproblem's search tree; a traversal of a single arc in the tree generally affects only a small portion of the FCM. Therefore, the process is one of updating the tablemodifying it when we move down the tree and updating it in reverse when moving back up the tree.
All of the branch-and-bound functions described above can be performed in parallel, since each processor's subtree can be processed in isolation from other parts of the search tree. Besides having more processors doing the work, the chief benefit from the parallel approach is having quick updating of all processors with a new threshold as soon as it is found. After a constant number of cycles (we have found 200 is acceptable for large graphs), the processors that have finished their jobs can extract a fresh job from their queues, reinitialize their tables, and continue the search process. Once a processor has completed the work in its queue, it remains idle until the next load-balancing cycle arrives.
Load Balancing
The mechanism we developed for load balancing has two phases. The first phase is for the processors with jobs remaining in their queues to move them to processors that are idle. The second phase is performed when, after this redistribution of jobs, a significant number of processors (we found this level to be best set at about 20%) still have no work to do. In this case, the processors that have jobs split them into equal parts, one for each unexpanded child of the job's root node. This new multiplicity of jobs may overfill the processor's job queue, in which case the final job will be all the remaining children. Next, the queue-sharing cycle is repeated, spreading jobs throughout the processing array. Our original plan was to split the tree as much as necessary to keep the processors constantly active. However, we were surprised by the additional work created by splitting the tree. The processors were far more active, but they had far more work to do, and the overall time was lengthened.
To evenly spread the work from processors that have untouched jobs in their queue to those with no work at all, we have each processor look to its neighbor to the east. This is done toroidally, that is, the eastern-most column of processors considers the western-most column of the processor array as its neighbor. If the neighbor of a processor has more work than the original processor, it moves a job to its own queue. This process is repeated c times, where c is the number of columns in the processor array. This sharing is then repeated with each processor looking to its neighbor to the south r times, with r equal to the rows of the processor array. These two cycles insure that no processor will have more than one more job in its queue than any other.
As mentioned above, a job can be completely designated by the pair ·H, bÒ, where H is defined as the permutation representing the path from the root to the job's leftmost leaf, and b is defined as the depth of the node which is the root of the subtree which represents the subproblem. To split the job, b is incremented, and a new permutation H is defined for each child of the original branch-node. If the job being split is the active job, new permutations H are only created for the children to the left of the child involved in the current search. Since this process is creating copies of H, the fixed portion of the tree is being replicated for each child. The number of copies and the amount of fixed tree being copied are inverses, and the number of fixed nodes added is greatest at the point half way from the root to the leaf. When a job is loaded from the queue, it is initialized at the root of the tree; it must perform a move down the tree along the fixed path until it reaches b. These added moves, when multiplied over all the processors splitting jobs, quickly outweighed the benefit of always having all the processors active. We have found superior performance when we allow up to 20% of the processors to become idle before splitting up jobs.
Parallel Implementation Considerations
The following are the main limitations of the MasPar hardware and their effects on our algorithm.
1) SIMD Execution:
Because the ACU sequentially broadcasts instructions to the processor array, branches and loops that are dependent on plural data will cause some processors to be idle while others are active; consequently, more statements will be executed from the code than on a serial machine (e.g., all the statements in an "if º else" structure will be executed). One of the challenges in designing an SIMD algorithm is to minimize these situations in the code, maximizing the number of active processors. One technique is to add a small number of initializing statements to manipulate the plural data, aligning the processors and eliminating the need for separate branches to handle the different situations. 2) Four-Bit Processing Elements: Because of the small size of the processor, plural data is stored as eight-bit characters whenever possible. Because of the multiple cycles required for each step and the lower-level technology, we found program execution time to be about 27 times longer on a single PE operating as a serial processor than on a DecStation 5000 serial machine. This hardware performance level reduced the real-time theoretical hardware advantage of the MasPar from a ratio of 1,024:1 to something closer to 40:1.
The processing time for real number arithmetic is substantially longer than integer arithmetic; however, our algorithm does not use any real variables.
3) Off-Chip PE Memory: Memory access speeds were found to vary greatly. Let T r denote the time required for a PE to access an on-chip register. PE access to offchip RAM is 4T r . PE access to RAM using an indirect lookup with a RAM pointer is 8T r . This constraint can be ameliorated in some situations by explicitly naming register variables in code where one would normally use a "for" loop and an array with an incrementing index. When arrays are necessary, they should be as dense as possible to minimize lookups. It is also possible to overlap the RAM access time with calculations by making an assignment to a register from an array several instructions before the value will be used. 4) Inter-PE Communication: The time required for neighbor-to-neighbor communication is very little more than local RAM access time (4T r ), and posed no constraint. Communication between PEs using the Router without conflicting destinations takes about 85T r and was not used in our algorithm (conflicting destinations can slow the process tremendously). We found that a PE could receive a value broadcast from the ACU in time T r /2, faster than from any other source, probably because it is using the same path that instructions use. For the ACU to scan the array (a reduction operation) for a minimum (or maximum) value takes approximately 50T r . These reduction operations are necessary to detect new thresholds for pruning the search tree, and their value quickly outweighs their cost, especially in the initial phases of the search.
EXPERIMENTS AND RESULTS
The results of our experiments with graph matching are presented in two areas: the number of search tree nodes expanded (or the number of "moves"), and the real time required to complete a given search. The number of moves required to solve a given problem is highly data-dependent and anomalies are common [36] . For instance, if the vertices of the graph have homogenous characteristics and the best mapping has a relational distance close to the error of many mappings, fewer cut-offs will be found and the search will take longer than for a graph with sharp differentiations. Moves are counted when expanding a branch, moving across the tree to a sibling, as well as when moving up the tree. Therefore, the number of moves is higher than the number of search tree nodes expanded; the figure is intended to be used as a comparison with the serial algorithm. Our experiments were performed using the test graphs listed in Fig. 3 .
We implemented a serial version of our algorithm on a DecStation 5000 and tested several of the mappings on both it and the 1,024 processor MasPar. For the larger graphs the time required on the serial machine became prohibitive; the table shows n/r for no results in these cases. The number of parallel moves are expressed in three ways: The first column shows the number of move cycles performed by the algorithm; the second shows the number of potential nodes explored in that number of cycles (Column 1 * Number of Processors); and the third shows the actual number of moves made on the tree by all processors (Column 2 -Idle Processor Cycles). Fig. 4 shows the results in terms of moves for several graphs. When graphs of different sizes are compared, the smaller graph is enlarged with "dummy" vertices without edges, so the problem size is always the size of the larger graph. The real times taken for the problems in Fig. 4 are shown in Fig. 5 .
These results show that the increased overhead and slower processing rates make the MasPar a liability for smaller problems. It is only in searches involving more than 16 vertices that the MasPar consistently outperforms the DecStation in real time terms. However, once that line is passed, the performance of the serial processor falls far behind the MasPar.
CONCLUSIONS
We have presented an algorithm for finding the relational distance between two graphs. It searches the space of all permutations of the N vertices of one graph to find a best correspondence between the nodes (and edges) of one graph with those of the other. The algorithm achieves speed through a combination of features: branch-and-bound search, good bound estimation through the use of the forward-checking matrix, efficient generation of permutations by the processors of the SIMD system, and massive parallelism. Furthermore, by formulating the algorithm to run on the SIMD class of parallel machine, we are able to take advantage of the economies that such machines offer over MIMD systems.
It might be possible to improve our algorithm further by 1) making the bound-computation heuristics even better through deeper lookahead in the forward-checking matrix, 2) incorporating the more sophisticated load-balancing methodology of Karypis and Kumar [34] where additional testing of a triggering condition is done in order to balance the load precisely when it becomes too unbalanced.
Both of these enhancements, however, incur additional overhead, and only additional experimentation will be able to determine whether they bring savings in practical situations. 
APPENDIX ALGORITHM DESCRIPTION IN PSEUDOCODE
Pseudocode for Next Leaf Generation from Permutation
function NextPerm(array H[0 .. N -1, int b) i := N -2; while ë (i ≥ b) do if ë (H[i] < H[i + 1]) then[N -1]) whileí ( j £ N -1) do if ë (H[i] < H[j]then ë Swap(H[i],
ACKNOWLEDGMENTS
This research was supported in part by U.S. National Science Foundation Grants Number IRI-9023977 and CDA-9123308, by NASA/CESDIS, and by the Boeing Company. Robert Allen received the MS degree in computer science from Stanford University in 1995. His studies focused on distributed systems, machine learning, and databases. He is currently a software engineer at the Hewlett-Packard Company's Palo Alto site. His interests include object-oriented languages and the research and development of information systems to support design and manufacturing.
Luigi Cinque (S'85, M'90, SM'96) received his PhD degree in physics (and computer science) from the University of Napoli in 1983. After a few years at the Artificial Intelligence Laboratory (Selenia S.p.A.), where he worked on artificial intelligence, expert systems, and knowledgebased vision systems, in 1990, he joined the Department of Computer Science at the University "La Sapienze" of Rome, as an assistant professor. There he works on object recognition, parallel architectures, and algorithms for image processing and computer vision. In 1992, he was a visiting scientist at the University of Washington (Seattle) working on parallel machines. His current interests include computer vision, parallel image processing and architectures, pattern recognition, 3D object recognition, and spatial database systems. Dr. Cinque is a senior member of the IEEE, and a member of the ACM, the Pattern Recognition Society, and the American Association of Artificial Intelligence. He is an editorial board member of Pattern Recognition. 
