This paper presents an efficient algorithm to detect the global topological similarity between two circuits. By applying the proposed circuit similarity algorithm in an incremental design flow, IDUCS (incremental design using circuit similarity), the design and optimization effort in the previous design iterations is automatically captured and can be used to guide the next design iteration. IDUCS is able to identify the similarity between the original netlist and the modified one with aggressive resynthesis, which might destroy the naming and local structures of the original netlist. This is superior to the existing design preservation approaches such as naming and local topological matching. Furthermore, IDUCS simply inserts a plugin for circuit similarity detection, and therefore preserves the "push-button" feature, significantly simplifying the engineering complexity of incremental tasks. As a case study, we perform the proposed IDUCS process to generate the placement for a logically resynthesized netlist based on the placement of the original netlist and the circuit similarity between the original and the modified logic-level netlists. The experimental results show our IDUCS-based placement is 28X faster than versatile place and route (VPR) with comparable wire length and estimated critical delay.
INTRODUCTION
In a typical field programmable gate array (FPGA) design cycle, a series of synthesis iterations needs to be performed before delivering the final design. The recompilation time for these iterations heavily affects the time-to-market of a product. There are several phases of a design process in which iterative repetitions are common, including the initial checks of the register transfer level (RTL) code, constraint verification, timing closure, and in-system debugging [6]. Each of these steps requires a time-consuming resynthesis of the FPGA design. Incremental design methodology has been devised to save recompilation time by maintaining the essential properties across consecutive iterations [9, 12, 23, 1, 25].
The key to incremental design methodology is design preservation [22] , i.e., maximally preserving and taking advantage of the engineering effort in the previous design iterations. A commonly employed method for design preservation is to partition a design and avoid a recompilation of unchanged partitions in the next iteration. This method can yield a significant reduction in iteration time but due to strict hierarchical boundaries, synthesis cannot perform any cross-boundary optimizations where a partition exists. To break this hard hierarchy boundary constraint for improving the quality of the design, Xilinx SmartGuide [22] employs naming and local topological matching to identify the correspondence between two netlists resulting from the previous and the current iterations, respectively. Based on this correspondence, the layout from the previous iteration can be reused in the current iteration, leading to better quality and saving the recompilation time. However, in modern synthesis algorithms (e.g., ABC [4] ), the internal node boundaries are usually destroyed by structural hashing (transforming a logic network into an and-inverter graph (AIG)), and aggressive optimization and logic restructuring performed in the netlist make it difficult to produce the naming matching between the original and modified netlists. Consequently, the local topological matching based on the naming matching also becomes less effective.
In this paper, we present IDUCS, an enhanced incremental design flow for FPGAs that uses circuit similarity. In contrast to Xilinx SmartGuide [22], the circuit similarity identifies a correspondence between the original and modified netlists based on a global topology matching. Based on circuit similarity, the placement and routing of the modified netlist can be derived from the layout of the original netlist obtained in the previous iteration. Unlike many existing algorithms for incremental designs [12, 1], which require radical changes to existing computer-aided design (CAD) tools, we have developed a plugin that preserves the "push-button" feature in the commercial FPGA CAD tools.
A key insight used in IDUCS is that incremental functional changes at the RTL or logic level are small, and they generally result in a "similar" topology of the modified netlist compared with the original one [17]. To quantitatively represent such a similarity, we adapt graph similarity [26], a widely applied technique in the social network and cheminformatics domains, to measure the topological similarity of two circuits. We present an iterative algorithm to compute the circuit similarity between the modified and original netlists, and identify the correspondence of nodes/edges.
The IDUCS flow is shown in Figure 1 . IDUCS produces an initial placement and routing solution for the modified logic-level netlist, based on the layout of the original netlist and the circuit similarity between the original and modified logic-level netlists. Based on this initial solution, an efficient refinement is then performed as a fine-grain tuning for further improvement of the layout quality. Note that such a refinement procedure does not require a new implementation since the existing placement and routing tools can be used with less optimization strength (e.g., lower initial temperature in the simulated annealing-based placement or fewer iterations in the negotiation-based routing). The essential information obtained from the previous design iteration is automatically captured and quantified by a runtime-efficient "similarity detection" phase.
To verify the effectiveness of IDUCS, we have applied it to placement acceleration, one of the most time-consuming design phases, in a multi-pass design. We have used IDUCS to generate the placement for a logically resynthesized netlist based on the placement of the original netlist and the circuit similarity between the original and the modified logic-level netlists. Tested on the 20 largest MCNC benchmark circuits [24], experimental results show that IDUCS produces a much higher quality initial placement than VPR's [20] initial placement in terms of bounding box costs and delay costs. Our IDUCS-based placement is 28X faster on average than the VPR placement, while producing comparable wire length and critical delay. The results suggest a large potential to accelerate other incremental design phases, including routing and verification, using circuit similarity.
The remainder of this paper is organized as follows. Section 2 illustrates the overall IDUCS flow with an example. Section 3 describes the circuit similarity algorithm. Section 4 presents a placement case study to experimentally demonstrate the efficiency and the effectiveness of IDUCS. The paper is concluded in Section 5.
MOTIVATING EXAMPLE
Following the flow in Figure 1, we use an example to illustrate how IDUCS generates a placement based on the layout results obtained from the previous design iteration. In the first design iteration, given a logic-level network G shown in Figure 2(a), where each node denotes a look-up table (LUT) and each edge denotes an interconnection between LUTs, the placement (Figure 3(a)) of network G can be obtained by performing a time-consuming and highly optimized placement (e.g., VPR). Suppose a change of RTL code is made due to a bug found after the first iteration, and the RTL and logic-level synthesis is performed in the following iteration, resulting in a modified network, G′, as shown in Figure 2(b). To produce the placement of network G′, IDUCS first computes the similarity between networks G and G′, and finds the correspondence of nodes in these two networks (Figure 3(a) right). Based on such node correspondence, the initial placement (Figure 3(b)) of network G′ can be determined using the placement of network G (Figure 3(a)).
Note that the detection of similarity and the correspondence of two networks is generally much faster than re-placing the entire network. Therefore, the IDUCS-based approach is more efficient than the from-scratch design flow, which re-places the entire circuit. Furthermore, the naming matching-based correspondence will not work in this example since only two nodes (node 7 and node 8) out of nine internal nodes have the same names in the original and the modified networks. On the other hand, IDUCS employs a topological similarity detection technique and is able to identify a more comprehensive correspondence between the two networks. In general, the IDUCS-based flow contains the following two phases:
1. Detection of the similarity between two networks and the correspondence of the components (e.g., nodes and edges) in them;
2. Refinement of the results inferred based on the previous design iteration and the detected similarity, e.g., resolving overlaps in the initial placement and congestion in the initial routing.
Section 3 and Section 4 will detail these two aspects, respectively.
CIRCUIT SIMILARITY

Review of Graph Similarity
Given two graphs (or networks), there are multiple ways to define their similarity. The characteristics of commonly used measures of similarity are summarized in Table 1 , where column "Global Topo" indicates whether a measure considers the global topological information, which is important to find the correspondence between nodes of two graphs. Some measures have already been used for FPGA [10] . Our IDUCS employs the iterative method, which has relatively low computational complexity and considers the global topological information.
Different algorithms, including similarity flooding [18] , simRank [7] , and the coupled node-edge [26] , have been proposed to compute the graph similarity based on the iterative definition. In this work, we use an iterative graph similarity algorithm for molecular graphs [14] , which takes advantage of graph sparsity, one of the properties of a circuit graph. Table 2 describes all frequently used variables in this algorithm.
The iterative similarity algorithm is summarized in Algorithm 1. In each iteration (t), the algorithm computes the similarity score, $X^{(t)}_{i,j}$, for each pair of node i in graph G and node j in graph G′.
The similarity score of a node pair is a real value between 0 and 1. The higher the similarity score of a node pair is, the more likely these two nodes are matched together. This score is updated based on the values of their adjacent node pairs obtained in the previous iteration and the predefined inter-similarity between two nodes/edges. The predefined similarity is used to capture non-topological connections between two graphs. The algorithm terminates when the difference between the total similarity scores in two consecutive iterations is smaller than ϵ, or the number of iterations reaches an upper bound M.
Table 1: Characteristics of commonly used measures of graph similarity.
| Measure | Description | Complexity | Global Topo |
| --- | --- | --- | --- |
| Graph isomorphism [15] | Identifying a bijection between the nodes of two graphs which preserves (directed) adjacency. | NP-Hard | Yes |
| Edit distance [5] | Given a cost function on edit operations (e.g., addition/deletion of nodes and edges), determine the minimum cost transformation from one graph to another. | | |
| Common subgraph [13] | Identifying the 'largest' isomorphic subgraphs of two graphs. | NP-Hard | Yes |
| Iterative methods [21] | Two graph elements (e.g., edges or nodes) are similar if their neighborhoods are similar. | Cubic | Yes |
| Statistical methods [16] | Assessing aggregate measures of graph structure (e.g., degree distribution, diameter, betweenness measures). | Linear | No |
Table 2: Frequently used variables.
| Variable | Description |
| --- | --- |
| $X^{(t)}_{i,j}$ | Similarity score between node i in graph G(V) and node j in graph G′(V′) |
| | The set of all adjacent nodes of node v |
| π | |
| M | An upper bound for number of iterations |
| $k_v : V \to V'$ | A predefined inter-similarity between two nodes |
| $k_e : E \to E'$ | A predefined inter-similarity between two edges |
| in(v) | The set of all adjacent nodes that have an edge entering node v |
| out(v) | The set of all adjacent nodes that have an edge leaving node v |
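To make the iterative definition concrete, the following is a minimal sketch of a neighborhood-based similarity iteration on undirected graphs. It is not Algorithm 1 itself: the update rule (each pair keeps its own score and adds the scores of all adjacent pairs, followed by normalization by the largest entry) is an assumed simplification, and the names `iterative_similarity`, `adj_g`, `adj_h`, `eps`, and `max_iters` are hypothetical. Only the initialization to 1, the [0, 1] score range, and the two stopping conditions (threshold ϵ and upper bound M) follow the text.

```python
def iterative_similarity(adj_g, adj_h, eps=1e-6, max_iters=100):
    """Neighborhood-based similarity between two undirected graphs.

    adj_g, adj_h: dict mapping each node to the set of its neighbors.
    Returns a dict mapping (node in G, node in H) to a score in [0, 1].
    The update rule is an assumed simplification, not the paper's Algorithm 1.
    """
    # All pairs start with score 1.
    x = {(i, j): 1.0 for i in adj_g for j in adj_h}
    for _ in range(max_iters):  # upper bound M on the number of iterations
        new = {}
        for i in adj_g:
            for j in adj_h:
                # A pair is similar if its neighborhoods are similar; keeping
                # the pair's own score avoids oscillation on bipartite graphs.
                new[(i, j)] = x[(i, j)] + sum(
                    x[(u, v)] for u in adj_g[i] for v in adj_h[j]
                )
        top = max(new.values())
        new = {pair: score / top for pair, score in new.items()}
        # Stop when the total score changes by less than eps.
        delta = abs(sum(new.values()) - sum(x.values()))
        x = new
        if delta < eps:
            break
    return x
```

On two three-node paths, the center-to-center pair ends up with the highest score. Note that this version loops over every pair of every neighborhood, so it does not exploit the graph sparsity that the algorithm adopted in this work relies on.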
Circuit Similarity Detection
Algorithm 1 is designed for undirected molecular graphs [14] , and the computational complexity is too expensive to handle real circuits. In this subsection, we first adapt Algorithm 1 to consider a directed circuit graph and then present two techniques to significantly improve both time and space efficiency of the circuit similarity detection.
Algorithm 1: Similarity of G and G′
One unique constraint for circuit similarity detection in incremental design is that the matching of the corresponding primary inputs (PIs) and primary outputs (POs) of the two circuits must be guaranteed. Therefore, the similarity score for a pair of corresponding PI/PO nodes is set to 1 and is not updated during the iteration. As a result, such a predefined PI/PO matching effectively provides extra hints for the iterative similarity detection process and generates better matching between the two circuits. Intuitively, for those node pairs close to PI/PO nodes, higher scores will be obtained because of the propagation of the constant similarity score set in PI/PO node pairs. Note that other hints such as internal registers and naming matching information obtained in logic synthesis can also be used as the predefined matching to enhance both the quality and speed of the circuit similarity detection. For those internal nodes without predefined similarity, we replace $k_v$ with $X^{(t)}_{i,j}$, and $k_e$ with 1. Instead of updating similarity scores based on all the neighbors, we can perform the update for edges that leave the nodes and edges that enter the nodes, separately. More specifically, given the two graphs, we initialize the similarity scores of all pairs of nodes to 1. In each iteration, for |in (v
i,j is updated as follows
In our experiments, we find that α = 0.75 consistently produces a high-quality matching. For the two circuit graphs in Figure 2, the obtained similarity score matrix is shown in Table 3 (PI/PO nodes are not shown). Clearly, the topologically similar node pairs (e.g., node 7 in graph G and node 7 in graph G′) have scores close to 1. This matrix describes a complete bipartite graph, where the weight associated with each edge denotes the similarity score of two nodes. We can now compute a maximum matching in this bipartite graph to obtain a node matching between the two graphs. The min-cost network flow [19] is used to compute the maximum matching in our experiments, and the resulting node matching is given in Figure 3(a) right.
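The step from a score matrix to a node matching can be illustrated with a small sketch. The experiments use a min-cost network flow solver; the version below instead finds the maximum-weight assignment by exhaustive search, which is only practical for toy inputs but shows what the matching optimizes. The function name `best_matching` and the dictionary-based score format are hypothetical.

```python
from itertools import permutations


def best_matching(scores, rows, cols):
    """Exhaustive maximum-weight bipartite matching for small inputs.

    scores: dict mapping (row node, col node) to a similarity score;
            missing pairs (e.g., pruned ones) count as 0.
    rows, cols: lists of nodes, with len(rows) <= len(cols).
    Returns a dict mapping every row node to a distinct col node.
    """
    if len(rows) > len(cols):
        raise ValueError("expected len(rows) <= len(cols)")
    best, best_total = {}, float("-inf")
    # Try every way of assigning distinct columns to the rows.
    for perm in permutations(cols, len(rows)):
        total = sum(scores.get((r, c), 0.0) for r, c in zip(rows, perm))
        if total > best_total:
            best, best_total = dict(zip(rows, perm)), total
    return best
```

A greedy choice of the single highest score can be suboptimal: with scores (a, x) = 0.9, (a, y) = 0.8, (b, x) = 0.85, and (b, y) = 0.1, the sketch pairs a with y and b with x, which has a larger total than starting from the top-scoring pair (a, x).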
Performance Enhancement
In practice, it is infeasible to compute the similarity scores of all |V |·|V ′ | node pairs for large circuits. In this subsection, we present two pruning techniques to reduce the number of pairs that need to be updated so that we can reduce both the runtime and storage. Support Constraint. Two internal nodes are less likely to be matched if they share few predefined matchings in their supports. A support of a node is the set of nodes with predefined matchings in the transitive fanin or fanout cone of this node. For example, the nodes with predefined matchings are PIs and POs in two graphs in Figure 2 
where β ∈ (0, 1] is a constant. If the support constraint of the two nodes is not satisfied, we do not update their similarity score in the iteration. For example, if β = 1, we only keep the pairs of nodes that have exactly the same supporting PIs and POs. 54 node pairs in Figure 2 can be pruned.
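The support-based pruning can be sketched as follows. The exact inequality is not reproduced in this text, so the sketch assumes a Jaccard-style overlap test (shared support divided by combined support must be at least β), which agrees with the β = 1 example of identical supports but should be read as an assumption. It also assumes that corresponding PIs and POs carry the same names in both circuits. The names `reachable`, `support`, and `passes_support` are hypothetical.

```python
def reachable(start, edges):
    """All nodes reachable from start by following edges (start excluded)."""
    seen, stack = set(), [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def support(node, fanin, fanout, anchors):
    """Nodes with predefined matchings (e.g., PIs and POs) in the
    transitive fanin or fanout cone of node."""
    cone = reachable(node, fanin) | reachable(node, fanout)
    return cone & anchors


def passes_support(sup_a, sup_b, beta):
    """Assumed form of the constraint: overlap ratio of at least beta."""
    union = sup_a | sup_b
    if not union:
        return True  # no anchors on either side, nothing to compare
    return len(sup_a & sup_b) / len(union) >= beta
```

A pair that fails `passes_support` would simply be skipped in the score update, so its score never needs to be stored.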
Level Constraint. If only combinatorial resynthesis is involved in an incremental design process, we can convert a circuit into a directed acyclic graph (DAG) by removing all registers and adding the register inputs (outputs) as POs (PIs). Given a DAG, a topological sort and reverse topological sort can label each internal node v with two values (shown above each node in Figure 2) , i.e., level(v) and rlevel(v), where level(v) (rlevel(v)) denotes the length of the longest path from PIs (node v) to node v (POs). Two nodes with significantly different (level, rlevel) values are less likely to be matched. Formally, for two nodes v ∈ G and v ′ ∈ G ′ , the level constraint requires
where $B_l$ and $B_r$ are two nonnegative constant integers. For example, if $B_l$ and $B_r$ are both set to one, 22 node pairs can be pruned.
For each circuit, we run two logic synthesis algorithms, one with the ABC command "if -K 4" and the other with "if -K 4; imfs" (an area-oriented resynthesis engine which destroys the internal name matching [2]), and generate two logic-level netlists. Figure 4 compares the number of node pairs that need to be updated in the iterative similarity with the following five schemes: (a) without pruning ("no pruning"), (b) using a weak level constraint-based pruning ("$B_l = B_r = 1$"), (c) using a strong level constraint-based pruning ("$B_l = B_r = 0$"), (d) using a weak support constraint-based pruning ("β=0.5"), and (e) using a strong support constraint-based pruning ("β=1"). As shown in Figure 4, our pruning techniques reduce the number of node pairs by 3 to 4 orders of magnitude compared with the total number of node pairs. More specifically, the strong level constraint-based pruning ("$B_l = B_r = 0$") and the strong support constraint-based pruning ("β=1") can prune around 90% and 99% of the node pairs, respectively.
Table 4 shows the similarity score matrix obtained after applying these two pruning techniques to the similarity detection of G and G′ in Figure 2. Clearly, the iterative circuit similarity with pruning results in a very sparse matrix, while the most significant elements in this matrix are well preserved. For example, $(V_{11}, V'_8)$ and $(V_9, V'_7)$ are pruned due to the support and level constraints, respectively. Nevertheless, the most useful node pairs are preserved after our pruning, and the same node matching is obtained as with the completely computed similarity score matrix shown in Table 3. As a result of the sparsity of the similarity matrix, the maximum matching algorithm (min-cost network flow) is significantly faster.
In Section 4, we will show that these pruning techniques do not degrade the quality of the similarity detection and node matching when we apply the circuit similarity to the proposed IDUCS flow.
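The level-based pruning described earlier in this subsection can be sketched in a similar way. The sketch computes level(v) and rlevel(v) as longest-path lengths over a DAG and then compares two nodes; the comparison assumes the constraint bounds the absolute differences of level and rlevel by $B_l$ and $B_r$, which is inferred from the description rather than copied from the original formula. The names `longest_path_levels`, `reverse_edges`, and `passes_level` are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def longest_path_levels(preds):
    """Longest-path length from any source to each node of a DAG.

    preds: dict mapping a node to an iterable of its predecessors.
    Sources (e.g., PIs) get level 0.
    """
    level = {}
    # static_order() yields every node after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        level[v] = 1 + max((level[u] for u in preds.get(v, ())), default=-1)
    return level


def reverse_edges(preds):
    """Turn a predecessor map into a successor map."""
    succ = {v: [] for v in preds}
    for v, us in preds.items():
        for u in us:
            succ.setdefault(u, []).append(v)
    return succ


def passes_level(v, w, lev_g, rlev_g, lev_h, rlev_h, b_l, b_r):
    """Assumed form: level and rlevel differ by at most b_l and b_r."""
    return (abs(lev_g[v] - lev_h[w]) <= b_l
            and abs(rlev_g[v] - rlev_h[w]) <= b_r)
```

Calling `longest_path_levels(reverse_edges(preds))` yields rlevel, since the longest path to a PO in the original graph is the longest path from a source in the reversed one.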
Circuit Similarity-based Placement
Following Figure 1 , circuit similarity detection can be employed to discover the topological correspondence between the original netlist and the modified netlist. Such information is then used to improve the efficiency of time-consuming CAD phases including placement, routing and verification.
We use the proposed circuit similarity algorithm to speed up placement, which is one of the most time-consuming phases in FPGA design cycle. More specifically, given an original network G, its placement can be computed by performing a highly-optimized but time-consuming placement (e.g., VPR). For another network, G ′ , which is modified due to optimization in an incremental iteration, the similarity between networks G and G ′ is firstly computed, and a matching of corresponding nodes is obtained afterwards.
For each matched pair of nodes (V, V′), node V′ is assigned the same coordinates as node V. Hence, the node matching between the two networks gives a good candidate for the initial placement of the modified network G′. In turn, based on such node correspondence, a high-quality final layout of network G′ can be obtained more efficiently with a further refinement (e.g., low-temperature simulated annealing) of the initial placement results.
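The coordinate transfer can be sketched in a few lines. Matched nodes inherit the coordinates of their counterparts; how unmatched nodes are handled is not specified in the text, so the sketch makes the assumption that they are dropped onto the remaining free sites in the order given and left for the refinement step to improve. The name `seed_placement` and its arguments are hypothetical.

```python
def seed_placement(new_nodes, matching, old_placement, sites):
    """Build an initial placement for the modified netlist.

    new_nodes: nodes of the modified network G'.
    matching: dict mapping a node of G' to its matched node in G.
    old_placement: dict mapping a node of G to its (x, y) site.
    sites: all legal (x, y) sites, used for unmatched nodes (assumption).
    """
    placement, used, pending = {}, set(), []
    for v in new_nodes:
        xy = old_placement.get(matching.get(v))
        if xy is not None and xy not in used:
            placement[v] = xy  # inherit the matched node's coordinates
            used.add(xy)
        else:
            pending.append(v)  # unmatched, or its site is already taken
    free = (s for s in sites if s not in used)
    for v in pending:
        placement[v] = next(free)  # raises StopIteration if no site is left
    return placement
```

The result would then be handed to the existing placer as its starting point, with a low initial temperature so that annealing only refines it.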
A CASE STUDY ON PLACEMENT
As a case study on the application of the proposed IDUCS flow, in this section we perform experiments which employ circuit similarity to improve the efficiency of the placement phase in the incremental design flow.
Experimental Settings
In this work, we consider an island-style FPGA architecture, which includes an array of clustered logic blocks (CLBs) interconnected by programmable routing. As shown in Figure 5 , two CAD flows are compared in our experiments. Both flows include two design iterations and they share the first iteration, which starts from a logic-level netlist (BLIF). A technology mapping (using ABC command "if -K 4") is first performed on this netlist to map it into a 4-LUT-based network. After the mapping, T-VPack [20] is performed with "no cluster" parameter to generate a CLB-based network, where each CLB contains one LUT and one flip-flop. The timing-driven placement in VPR is then used to produce the placement result (".p"), and the timing-driven routing with a detailed timing analysis is finally performed.
In the second iteration, we perform a logic-level optimization on the mapped netlist using the following ABC script "rwsat2": st; rw -l; b -l; rw -l; rf -l; fraig; rw -l; b -l; rw -l; rf -l, where each command (alias) is a logic optimization in ABC. For example, "st" (structural hashing) aggressively destroys the initial boundaries among internal nodes (LUTs); "rw" (rewrite) and "rf" (refactor) reconstruct the network by reducing the AIG size and level; "fraig" (functionally reduced AIG) changes the current network structure and transforms it into a functionally reduced AIG [3]. Hence, in the modified netlist, the name matching among the nodes is not preserved and the structure of the network is changed. We employ the proposed circuit similarity to discover the node correspondence purely based on topological information of the original and the "rwsat2"-modified netlists.
Starting from the modified netlist, we compare the following two flows: (a) the IDUCS flow and (b) the from-scratch flow, as shown in Figure 5. Flow (b) uses VPR to re-place the entire modified netlist. Flow (a) first computes the circuit similarity between the original and the modified netlists and uses it to generate an initial placement, which is further refined by a low-temperature annealing process using the VPR placement (the initial temperature is set to 0.1 in VPR). As stated in Section 3, based on different pruning settings and annealing parameters, we develop two versions of circuit similarity. A high-quality version, CS, uses β = 0.5, $B_l = B_r = 1$ and inner_num = 1. A turbo version, CS-t, uses β = 1, $B_l = B_r = 0$ and inner_num = 0.1. Both CS and CS-t are evaluated in our experiments.
Our proposed circuit similarity algorithm is implemented in C and evaluated on the 20 largest MCNC benchmarks. All results are averaged over five runs and were collected on a Linux server with a dual-core 2.19GHz CPU and 5GB memory. The CS2 package [8] is used to solve the min-cost network flow for the maximum matching problem. Table 5 shows the characteristics of the logic-level netlist before (column "original") and after "rwsat2" optimization (column "rwsat2"). CIs (COs) include the PIs (POs) and register outputs (inputs).
Experimental Results
Quality of the initial placement. Table 6 compares the initial placement generated by the proposed circuit similarity (columns "CS" and "CS-t") with the one generated by VPR (a random initial placement) in terms of bb cost (bounding box cost) and delay cost, two key measures of the placement process. Both CS and CS-t produce an initial placement of much better quality than VPR's. For example, compared to VPR's initial results, CS reduces the bb cost and delay cost by 40% and 31%, respectively. This result shows that the topological node correspondence extracted by the circuit similarity algorithm indeed discovers the intrinsic connection between the original and modified logic-level netlists, and thus provides reliable guidance for generating the placement of the modified netlist.
Quality of the final placement. Table 7 compares the final placement results produced by flow (a) (including CS and CS-t) and flow (b) shown in Figure 5. Final bb cost, final delay cost and estimated critical delay are compared between the circuits produced by the two versions of flow (a) and flow (b). As shown in Table 7, for bb cost and delay cost, both versions of IDUCS produce quality very close to the results produced by the from-scratch flow. For critical delay, CS and CS-t reduce it by 4% and 1%, respectively, compared to the from-scratch flow. The comparison between CS and CS-t shows the effectiveness of the proposed pruning techniques (in Section 3.3). CS-t, with aggressive pruning and significantly lower annealing effort, still produces a placement of comparable quality to CS and VPR.
Runtime comparison. Table 7 also compares the runtime of the placement (column "Placement runtime (s)") of different flows. Note that a timeout is invoked if IDUCS takes longer than the original netlist. It shows that CS-t achieves a 28X speedup on average (up to 93X), compared with the from-scratch VPR placement. Due to space limitations, a detailed breakdown of the runtime for each circuit is not shown. Since computing the similarity between two circuits is much faster than re-placing them from scratch, more speedup is expected when applying IDUCS to larger circuits. In practice, one can use CS-t as a quick estimation of the solution quality for an iteration during the incremental design. If the quality is within a satisfactory range, the VPR placement can be performed for a better quality.
CONCLUSIONS AND FUTURE WORK
In this paper, we have presented IDUCS, an enhancement to the incremental FPGA design flow using circuit similarity. The engineering effort from the previous design iterations is captured by the proposed circuit similarity detection algorithm. Using placement as a case study, we experimentally demonstrate the effectiveness of the proposed IDUCS. Compared with VPR placement in a two-pass design process, our IDUCS-based placement is 28X (up to 93X) faster while preserving the wire length and delay. The speedup is achieved because of the high-quality initial placement generated based on circuit similarity.
In the future, we will integrate the predefined matchings (e.g., the naming matching) into our IDUCS to further enhance both the efficiency and the quality of the design. In addition, we will study the effectiveness of applying our IDUCS to the routing and verification for FPGAs.
