Abstract-This paper proposes an efficient, architecture-independent packing method for commercial FPGAs. All architecture-specific logic of a commercial FPGA, such as carry-chain arithmetic and x-LUTs, is pre-designed as reference circuits according to the target architecture. Because contemporary FPGA architectures are complex, enumerating all reference circuits in a fine-grained manner is impractical. To overcome this problem, a coarse-grained approach is adopted. Using a constraint satisfaction problem technique, the proposed method matches the pre-designed reference circuits in the given user logic circuit. Transformation from a matched reference circuit to a pre-packed cluster is simplified by several specifically designed instructions. In the next stage, directly connected FFs are absorbed into the pre-packed clusters. The last stage packs the remaining LUTs and FFs into clusters in a delay-based manner. The method is architecture independent and can be applied to other commercial FPGAs as long as the pre-designed reference circuits are modified accordingly. Comparisons with the commercial tool ISE MAP and the academic tool PAM MAP show the effectiveness of the proposed method.
I. INTRODUCTION
Contemporary commercial field-programmable gate arrays (FPGAs) consist of a cluster of configurable logic blocks (CLBs) formed by look-up tables (LUTs) and flip-flops (FFs), as well as arithmetic circuitry, configurable I/O blocks (IOBs) and specialised hard IP blocks. For example, a SLICE, which is half of a CLB, in the latest Xilinx Virtex-7 FPGA family contains four six-input LUTs, eight FFs, carry-chain arithmetic logic and other circuitry. It is widely acknowledged that FPGAs are slower, less area-efficient and less power-efficient than custom ASICs [1]. However, the programmability of FPGAs gives them the advantage of a short time to market. As a result, they have been widely used in a variety of applications such as domestic communications and automotive electronics.
Packing, which falls between technology mapping and placement, is an extremely important step of the FPGA computer-aided design (CAD) flow. This step is most commonly regarded as packing LUTs and FFs together to form clusters [2]. However, in commercial FPGAs, packing is the step in which the various logic gates of the technology-mapped circuit, including not only LUTs and FFs but also other logic gates, are mapped to the FPGA fabric according to the available hardware resources. Packing algorithms are well studied in the literature for the academic FPGA model, in which a cluster consists of several basic logic elements (BLEs). Each BLE has one LUT and one FF. The FF can optionally be bypassed to implement combinational logic only. Local interconnect is available for realising fast paths within the cluster. The output of the LUT/FF drives both local interconnect and general interconnect. Inputs to the cluster come from general interconnect [2].
The earliest work based on the academic FPGA model proposed an area-driven packing algorithm (VPack) in an early version of the versatile placement and routing (VPR) CAD tool [3]. It used the simplest graph pattern matching to pack LUTs and registers into BLEs in the first step, and packed BLEs into clusters in the second step. Marquardt further extended the work carried out by Betz to perform timing-driven packing (T-VPack) [4] and to improve speed and density. Recently, Verilog-to-routing (VTR) [5], the latest version of VPR, was proposed, in which hard IP cores are supported in the packing stage.
Tom et al [6] proposed a non-uniform depopulation technique (Un/DoPack), which runs the FPGA CAD flow twice. The first iteration is the regular CAD flow. In the second iteration, packing uses the layout result of the first iteration and depopulates the congested regions. While reducing the channel width, Un/DoPack, like other depopulation-based packing approaches, incurs an increase in total area and critical path delay.
T-NDPack [7] proposed an objective cost function with consideration of the criticality in terms of delay and routability simultaneously, which consequently reduces the channel width requirements and the depth of the critical path. However, it incurs logic area overhead. It was claimed that minimum channel width and critical path delay were reduced by 11.07% and 2.89% respectively while increasing the number of CLBs by 13.28% compared to T-VPack.
Easwaran et al proposed a routability-driven, power-aware packing method (W-T-VPack) [8] that introduces a new packing cost function based on predicted individual net length. It was claimed that W-T-VPack outperforms T-RPack [9] and iRAC [10] in terms of energy by 11.23% and 9.07%, respectively.
Rajavel et al proposed a many-objective FPGA circuit packing strategy (MO-Pack) [11] that minimised the channel width and the energy of a circuit implementation without incurring any overhead on critical path delay.
Yang et al proposed yet another many-objective FPGA packing method (YAMO-Pack) [12]. It was claimed that YAMO-Pack outperforms iRAC and MO-Pack in terms of channel width by 38.8% and 42.2%, respectively, and in terms of delay by 11.8% and 11.5%, respectively. However, it requires more CPU time, although the increase was considered acceptable.
All the methods mentioned above target the academic FPGA model, which is significantly simpler than the architectures used in commercial FPGAs. Ahmed et al [13] from Xilinx reported an architecture-specific packing method for Virtex-5 FPGAs. However, it can only be used for Xilinx FPGA devices. Shao et al developed the area-driven, architecture-independent PAM MAP algorithm [14]. The architecture they used differs from the academic model, but the algorithm targets area reduction only. To the best of our knowledge, no timing-driven, architecture-independent packing method has been published for commercial FPGAs.
The remainder of the paper is organized as follows. Section II gives details of the Virtex-7 FPGA architecture, which is used in the experiments for demonstration. Constraint satisfaction packing techniques and the specifically designed instructions are given in Section III. Section IV compares the results of the proposed method with those of other tools. Conclusions are given in Section V.
II. VIRTEX-7 FPGA CLB ARCHITECTURE
To show the complexity of contemporary commercial FPGA architectures, the Virtex-7 FPGA is reviewed in this section. This architecture is used in the evaluation experiments for demonstration purposes. A Virtex-7 logic block, referred to as a CLB, comprises two SLICEs (SLICEL and SLICEM) and a switch matrix. SLICEL and SLICEM are identical, except that the LUTs in a SLICEL are used for logic only, whereas those in a SLICEM can also be used to implement memory cells. The switch matrix allows for connections from a SLICE back to the same SLICE, between the two SLICEs, as well as into rows and columns of general interconnect. Each SLICE contains four 6-input LUTs and eight flip-flops. The LUTs in Virtex-7 are implemented as what Xilinx calls true 6-LUTs, rather than being constructed from smaller LUTs that can optionally be combined via multiplexers. The outputs of two true 6-LUTs, either in the top half or in the bottom half of a SLICE, can be combined into one 7-LUT via the multiplexer F7MUX. Two 7-LUTs can function in one SLICE at the same time. In addition, the two 7-LUTs can be further combined via the multiplexer F8MUX to form an 8-LUT in one SLICE. The outputs of both the 7-LUT and the 8-LUT can be registered individually. Fig. 1 shows the architecture of a SLICE of Virtex-7 FPGAs.
Finding an isomorphism of a graph $G_1 = (V_1, E_1)$ in a graph $G_2 = (V_2, E_2)$ is equivalent to a constraint satisfaction problem [16]. A variable $x_i$ is associated with each vertex $v_i \in V_1$, and all variables take values in the domain $V_2$. Let $n$ be the cardinality of $V_1$. Finding a sub-graph isomorphism is then equivalent to finding a complete assignment satisfying the following structure constraint:
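As an illustration of this formulation, the following Python sketch searches for a complete assignment by simple backtracking. It assumes the usual form of the structure constraint, namely that distinct variables take distinct values and that every edge of $G_1$ is mapped onto an edge of $G_2$; the function name `find_subgraph_match` and the toy graphs are hypothetical and are not taken from the proposed tool.

```python
def find_subgraph_match(v1, e1, v2, e2):
    """Return one assignment V1 -> V2 that keeps every edge of E1, or None."""
    e2 = set(e2)
    order = list(v1)  # one CSP variable per vertex of G1

    def consistent(assign, var, val):
        # Assumed all-different rule: two variables never share a value.
        if val in assign.values():
            return False
        # Assumed structure rule: an edge of G1 must land on an edge of G2.
        for a, b in e1:
            if a == var and b == var:
                if (val, val) not in e2:
                    return False
            elif a == var and b in assign:
                if (val, assign[b]) not in e2:
                    return False
            elif b == var and a in assign:
                if (assign[a], val) not in e2:
                    return False
        return True

    def backtrack(i, assign):
        if i == len(order):
            return dict(assign)  # complete assignment found
        var = order[i]
        for val in v2:  # every variable takes values in the domain V2
            if consistent(assign, var, val):
                assign[var] = val
                result = backtrack(i + 1, assign)
                if result is not None:
                    return result
                del assign[var]
        return None

    return backtrack(0, {})


# Toy graphs (hypothetical): G1 is a single edge a -> b.
match = find_subgraph_match(
    ["a", "b"], [("a", "b")],
    ["x", "y", "z"], [("y", "z"), ("z", "x")],
)
no_match = find_subgraph_match(
    ["a", "b"], [("a", "b"), ("b", "a")],
    ["x", "y", "z"], [("y", "z"), ("z", "x")],
)
```

Plain backtracking is used here only for readability; a practical matcher would normally prune the domains before and during the search.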
The packing problem is similar to the isomorphic matching problem. A user circuit $C$ can be described by a directed graph $G_1 = (V_1, E_1)$, where each vertex $v_i \in V_1$ corresponds to a component, a primary input or a primary output in $C$, and each directed edge $e_i \in E_1$ corresponds to a wire connecting two different vertices in $C$. The given circuits are a set of configurable circuits implementing different types of logic functions; from the packing point of view they are known as reference circuits, and each can likewise be described by a directed graph. Each directed graph $G_2 = (V_2, E_2)$ corresponds to a reference circuit. These configurable circuits are constructed manually in advance according to the available FPGA hardware logic resources. The packing algorithm identifies all isomorphic matches in a user design circuit according to the set of given reference circuits.
To match reference circuits in a user design circuit, several constraints are applied. The type constraint must be satisfied so that each vertex is matched to a vertex of exactly the same type, such as LUT or FF. The start constraint applies to the outgoing edge from a vertex. Similarly, the end constraint applies to the incoming edge to a vertex. These two constraints are used for matching one particular edge of the graph, i.e., from one type of logic gate to another. The input constraint and the output constraint are used for primary inputs and primary outputs, respectively. The shared input constraint identifies shared inputs, and is used when more than one sink net is shared by two pins.
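The type, start and end constraints can be pictured with a short Python sketch. It reads the start and end constraints as requiring that the driver type and the sink type of a user edge agree with those of the reference edge, which is an interpretation of the description above rather than the tool's actual definition; all names and the toy netlist are hypothetical.

```python
def initial_domains(ref_types, user_types):
    """Type constraint: a reference vertex maps only to user vertices of its type."""
    return {
        ref: [u for u, utype in user_types.items() if utype == rtype]
        for ref, rtype in ref_types.items()
    }


def edge_allowed(ref_edge, user_edge, ref_types, user_types):
    """Start/end constraints (assumed reading): driver and sink types must agree."""
    (ref_src, ref_dst), (usr_src, usr_dst) = ref_edge, user_edge
    start_ok = ref_types[ref_src] == user_types[usr_src]  # outgoing side
    end_ok = ref_types[ref_dst] == user_types[usr_dst]    # incoming side
    return start_ok and end_ok


# Hypothetical 7-LUT reference circuit: two 6-LUTs feeding an F7MUX.
REF_TYPES = {"A": "LUT6", "B": "LUT6", "M": "F7MUX"}
# Hypothetical user netlist fragment.
USER_TYPES = {"u1": "LUT6", "u2": "LUT6", "u3": "F7MUX", "u4": "FF"}

domains = initial_domains(REF_TYPES, USER_TYPES)
lut_to_mux = edge_allowed(("A", "M"), ("u1", "u3"), REF_TYPES, USER_TYPES)
lut_to_ff = edge_allowed(("A", "M"), ("u1", "u4"), REF_TYPES, USER_TYPES)
```

Filtering the domains by type before the search starts keeps the candidate set small, since most vertices of a user netlist cannot match an F7MUX or a carry element at all.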
As long as the reference circuits represent all the functionalities that the FPGA hardware resources can implement, the matching can always find a feasible packing solution. However, it is impossible to enumerate all reference circuits for a complex contemporary FPGA, which makes isomorphism-based packing impractical. Consider the case of two 6-LUTs and a 2-to-1 multiplexer F7MUX forming one 7-LUT in one SLICE. Even if sequential outputs are ignored, there are already four cases, as shown in Fig. 2. Hence, four reference circuits must be constructed in order to match all these patterns. If sequential outputs are considered, the number of combination patterns increases significantly. It is therefore crucial to select proper reference circuits, so that fewer reference circuits are needed while still covering all the functionalities that a SLICE of a contemporary FPGA can implement.
To reduce the number of reference circuits, the proposed method constructs reference circuits from combinational logic only. Although sequential logic is not included in the reference circuits, it is dealt with after graph pattern matching, in the second step of the packing method. This reduces not only the complexity of each individual reference circuit but also the number of reference circuits. Different logic structures that perform a similar function are categorized as one function type. For example, there are four different ways to form a 7-LUT in one SLICE, as shown in Fig. 2(a), Fig. 2(b), Fig. 2(c) and Fig. 2(d), respectively. The graph shown in Fig. 2(a) is a subset of the graph shown in Fig. 2(d). The graphs shown in Fig. 2(b) and Fig. 2(c) are also subsets of the graph shown in Fig. 2(d). Therefore, those four graphs are considered as one function type, and each function type has only one reference circuit. The directed graph of a reference circuit is modified by inserting a virtual primary input (VPI) at the input of a vertex and a virtual primary output (VPO) at the output of a vertex, as shown in Fig. 3.
After a reference circuit is matched in a given user design circuit by the graph constraint satisfaction technique, the matched circuit must be transformed into a pre-packed cluster. This process involves creating a new cluster, connecting wires, disconnecting wires and specifying configurations such as buffer, MUX, LUT and FF. A key observation is that, for a given reference circuit, the wire connections for the newly created cluster and the configuration settings never alter. In addition, the transformation processes of different reference circuits follow the same steps; only the net connections and the configuration values differ between the created clusters.
Therefore, each step in the process can be treated as an instruction, and the whole process amounts to executing instructions one after another. For a different target architecture, the reference circuits differ and must be modified accordingly. However, the execution of the instructions is the same for every architecture. The designed instructions are architecture independent, simple but effective, as shown in TABLE I.
5. Copy properties from the 6-LUTs and set properties by using the following instructions, in which "INIT" is the 6-LUT initial value and "NAME" is the 6-LUT instance name.
copy_property (A6LUT, INIT, slice_a, A6#LUT)
copy_property (B6LUT, INIT, slice_a, B6#LUT)
copy_property (A6LUT, NAME, slice_a, ANAME)
copy_property (B6LUT, NAME, slice_a, BNAME)
set_property (slice_a, FXLUT::TRUE)
6. Set configurations by using the following instructions, in which AOUTMUX, AUSED and BUSED are SLICE configurations of the Xilinx Virtex-7 FPGA device.
set_configuration (slice_a, AOUTMUX::F7)
set_configuration (slice_a, A6#LUT::A6#LUT)
set_configuration (slice_a, AUSED::0)
set_configuration (slice_a, BUSED::0)
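The idea of executing instructions one after another can be sketched as a tiny interpreter in Python. The instruction names and arguments follow the listing for steps 5 and 6, but their exact semantics are inferred from that listing: the target cluster (slice_a) is passed implicitly, `KEY::VALUE` is read as a key and a value, and the INIT values and instance names in `MATCH` are invented for illustration.

```python
def copy_property(match, cluster, src, prop, dest):
    # Copy one property of a matched user instance into the new cluster.
    cluster["properties"][dest] = match[src][prop]


def set_property(cluster, spec):
    key, value = spec.split("::")
    cluster["properties"][key] = value


def set_configuration(cluster, spec):
    key, value = spec.split("::")
    cluster["config"][key] = value


def execute(program, match):
    """Run the instructions in order and return the pre-packed cluster."""
    cluster = {"properties": {}, "config": {}}
    for op, *args in program:
        if op == "copy_property":
            copy_property(match, cluster, *args)
        elif op == "set_property":
            set_property(cluster, *args)
        elif op == "set_configuration":
            set_configuration(cluster, *args)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return cluster


# Matched user instances for the reference vertices (hypothetical values).
MATCH = {
    "A6LUT": {"INIT": "0x8000000000000000", "NAME": "u_and6"},
    "B6LUT": {"INIT": "0xFFFFFFFFFFFFFFFE", "NAME": "u_or6"},
}

# Steps 5 and 6 of the listing, with slice_a as the implicit target.
PROGRAM = [
    ("copy_property", "A6LUT", "INIT", "A6#LUT"),
    ("copy_property", "B6LUT", "INIT", "B6#LUT"),
    ("copy_property", "A6LUT", "NAME", "ANAME"),
    ("copy_property", "B6LUT", "NAME", "BNAME"),
    ("set_property", "FXLUT::TRUE"),
    ("set_configuration", "AOUTMUX::F7"),
    ("set_configuration", "A6#LUT::A6#LUT"),
    ("set_configuration", "AUSED::0"),
    ("set_configuration", "BUSED::0"),
]

slice_a = execute(PROGRAM, MATCH)
```

Because the program is plain data, retargeting to another architecture would, under these assumptions, mean supplying a different instruction list per reference circuit while leaving `execute` unchanged.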
For timing considerations, the constraint satisfaction graph-matching technique described above is used only in the first stage of the proposed packing method. In this stage, only architecture-specific combinational logic is matched and packed for a given user design. As a result, the input to the second stage is a netlist consisting of pre-packed combinational clusters, hard IP blocks, LUTs and FFs. The second stage packs selected FFs into the pre-packed combinational clusters; an FF is selected if it is directly driven by the output of a cluster. In other words, an FF driven by the output of another block, such as a LUT or an FF, is ignored. This is known as the FF absorption stage. The stage repeatedly packs selected FFs into clusters until no more FFs can be selected. After this stage completes, the netlist consists of pre-packed combinational clusters, pre-packed sequential clusters, hard IP blocks, LUTs and FFs. The final stage packs the remaining LUTs and FFs into clusters in a delay-based manner, similar to MO-Pack [11] and YAMO-Pack [12].
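The FF absorption stage can be sketched in Python as follows. The netlist representation (dictionaries keyed by instance name, with `d` and `q` net names per FF) and the per-cluster capacity of eight FFs are assumptions made for this example, not details of the proposed implementation.

```python
def absorb_ffs(clusters, ffs, capacity=8):
    """Pack FFs driven directly by a cluster output; return the FFs left over."""
    # Map each combinational cluster output net to the cluster that drives it.
    driver = {
        net: name
        for name, cluster in clusters.items()
        for net in cluster["outputs"]
    }
    remaining = dict(ffs)
    changed = True
    while changed:  # repeat until no more FF can be selected
        changed = False
        for ff, pins in list(remaining.items()):
            owner = driver.get(pins["d"])
            # An FF driven by a LUT or another FF has no owner and is ignored.
            if owner is not None and len(clusters[owner]["ffs"]) < capacity:
                clusters[owner]["ffs"].append(ff)
                del remaining[ff]
                changed = True
    return remaining


# Hypothetical netlist: ff1 sits on the cluster output, ff2 follows ff1,
# and ff3 is driven by an unrelated net.
clusters = {"slice_a": {"outputs": {"n1"}, "ffs": []}}
ffs = {
    "ff1": {"d": "n1", "q": "n2"},
    "ff2": {"d": "n2", "q": "n3"},
    "ff3": {"d": "n9", "q": "n4"},
}
left_over = absorb_ffs(clusters, ffs)
```

In this sketch the output of an absorbed FF is not treated as a new cluster output, so ff2 stays unpacked, in line with the rule that an FF driven by another FF is ignored.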
The pseudocode of the proposed algorithm is outlined as follows.
IV. RESULTS
The proposed method was implemented in C++ and developed under Microsoft Visual Studio 2010. The experiments were run on a PC with a 2.4 GHz Intel CPU and 4 GB of RAM.
To verify the effectiveness of the proposed method, design circuits in register-transfer-level Verilog format were chosen from the benchmark suite in the Quartus II university interface program (QUIP) [17]. The selected designs are architecture independent and can be logically optimised by the Xilinx commercial logic synthesis tool XST. Xilinx ISE MAP and the proposed method were then applied to the output of XST to pack logic into Xilinx Virtex-7 FPGA SLICEs. The device xc7k160t-fbg676-3 was chosen for demonstration. It can be seen that the proposed method achieves results comparable to those of Xilinx ISE MAP. It should be noted that, since the proposed method is architecture independent, it can also be used for Altera FPGA architectures as long as the pre-designed reference circuits are modified accordingly.
Other published methods such as iRAC [10], MO-Pack [11] and YAMO-Pack [12] are not comparable because they target the academic FPGA model. The method presented in [13] is not comparable either, because the test suite used is from industry and is not available. Therefore, PAM MAP [14] is chosen for comparison, since PAM MAP is architecture independent and can also target Virtex-7. The comparison results are shown in TABLE III. It can be seen that the proposed method outperforms PAM MAP in terms of area and delay in all tested cases, achieving, on average, 6% and 11% improvement, respectively.
V. CONCLUSIONS
The latest FPGAs contain composite logic blocks with LUTs, FFs, MUXs and other arithmetic circuitry. Packing design elements into the available logic resources is an extremely complex problem. In this paper, an architecture-independent packing method for commercial FPGA devices is proposed. The proposed method has three stages. In the first stage, the constraint satisfaction problem technique of graph matching is used to implement specific logic, such as 7-LUTs, 8-LUTs and carry-chain arithmetic logic, from the given user design circuit. The second stage packs the selected FFs into the pre-packed combinational clusters. In the third stage, a delay-based method deals with the unclustered LUTs and FFs. The experimental results show that the proposed approach achieves speed similar to that of the Xilinx commercial tool ISE MAP. The proposed algorithm also outperforms the area-driven, architecture-independent PAM MAP, achieving average improvements of 6% and 11% in area and speed, respectively.
