Routing delays dominate other delays in current FPGA designs. We have proposed a novel Globally Asynchronous Locally Synchronous (GALS) FPGA architecture called the GAPLA to deal with this problem. In the GAPLA architecture, the FPGA area is divided into locally synchronous blocks, and the communications between them go through asynchronous I/O interfaces. An automatic design flow is developed for the GAPLA architecture. Starting from a behavioral description, a design is partitioned into smaller modules and fit to GAPLA synchronous blocks. The asynchronous communications between modules are then synthesized. The CAD flow is parameterized in modeling the GAPLA architecture. By manipulating the parameters, we can study different factors of the designed GAPLA architecture. Our experimental results show that an average performance improvement of 20% could be achieved by the GAPLA architecture.
INTRODUCTION
Routing delays have become a major roadblock for FPGA performance, and the situation will only get worse as technology continues to scale and FPGA chips continue to grow. Long routes not only increase the wire delay itself but also pass through more routing switch boxes, which adds further delay. For example, the Xilinx Virtex-II XC2V8000 FPGA has a corner-to-corner interconnect delay of around 15 ns [1]. Different approaches to solving this problem have been proposed. [2] and [3] pipeline the long interconnect delay, and [1] proposes a synthesis flow that allows long interconnects to run for several clock cycles. In those approaches, interconnects are treated as circuit components instead of conventional wires. The interconnect retiming registers can be very expensive in area, which makes such FPGAs several times bigger than conventional FPGAs.
The author is currently with Mentor Graphics Co.

Using asynchronous design is another possible solution. Asynchronous design provides average-case performance. In terms of interconnect delays, performance is dictated by the average of the interconnect delays rather than by the worst one. Hence the use of long routes does not necessarily lead to a significant performance penalty. We designed the GAPLA: a novel Globally Asynchronous Locally Synchronous Programmable Logic Array architecture. GALS systems can be seen as synchronous logic blocks wrapped in asynchronous I/O interfaces. Interconnects inside each block are short and fast, which allows the synchronous logic to run at higher speed. Interconnects between synchronous logic blocks have longer delays, but they do not affect the clock speed of the logic blocks and only come into the picture when there are communications between synchronous blocks. Therefore, a performance improvement can be expected.
ICCAD '06,
An automatic design flow for the GAPLA architecture has also been developed [5]. Starting from a behavioral circuit description, a design is first partitioned into smaller modules, where each module can fit into one synchronous block (also called an asynchronous island). Then the control sequences for asynchronous communications between modules are generated and put into each module. After that, a coarse-grained placer is used to place the modules onto the GAPLA chip space, and connections between islands are routed. Each module is then synthesized by calling existing FPGA tools. The CAD flow is designed as an auxiliary tool to study the GAPLA architecture. It is parameterized in modeling the architecture. Therefore, by changing the values of the architectural parameters, we can study the effect of these factors. In this paper, we report the results of our study of the GAPLA architecture using the above CAD flow.

RELATED WORK
Several asynchronous FPGA architectures have been proposed in the last decade [6, 7, 8, 9, 11]. Several of these architectures adopt the GALS concept. STACC [7] is loosely based on Sutherland's micropipeline design. The clock signal of the data array is replaced by the handshaking control signals of the timing array in a micropipeline-like structure. PCA [8] is a self-reconfigurable programmable logic architecture consisting of a layer of logic array and a layer of built-in facilities. Data communication between logic blocks is realized by a wormhole message passing mechanism through the built-in facilities, which can be expensive in time. Royal et al. propose another GALS FPGA architecture [11] and the idea of using a GALS architecture to limit the impact of long interconnect wire delay on the total FPGA performance.
But to our knowledge, no CAD tools have been proposed for those FPGA architectures. In [13], an automatic synthesis flow is proposed for the highly pipelined asynchronous FPGA of [9], which follows a different asynchronous design style from GALS. In [12], an automatic methodology to produce GALS systems is proposed. But the main focus of [12] is to automatically generate the GALS system from a higher-level circuit description, while in our research the GALS interface is a built-in feature of the GAPLA FPGA. Our research focuses on finding the optimal values for a set of architectural parameters.
THE GAPLA ARCHITECTURE
Figure 1 gives the basic building tile of the GAPLA architecture, called an asynchronous island. The GAPLA architecture is a mesh of asynchronous islands. Each island contains a synchronous logic block and 4 asynchronous wrappers. Each wrapper contains a local clock generator and I/O port controllers. The structure of the synchronous logic block can be any of the conventional FPGA structures, but the size of each synchronous logic block must be big enough to implement reasonable functions. In our design, we adopt the Virtex-II logic array structure. The 4 clock signals generated by the clock generators are all distributed into the synchronous logic block. Logic tiles inside the synchronous block can freely choose to connect to one of these clock signals. The logic controlled by the same clock signal is called a clock domain. Thus, the size and shape of each clock domain of the GAPLA architecture are programmable within the limit of a synchronous logic block. The routing resources between asynchronous islands contain horizontal and vertical routing channels for both data and handshaking control signals. Adjacent asynchronous islands are also directly connected to enable fast communications. Please refer to [4] for details of the architecture design.
The execution time of an application mapped onto the GAPLA FPGA architecture consists of two parts: computation time and communication time. Computation time is the time for synchronous logic blocks to finish the programmed computations. Communication time is the time consumed by the asynchronous communications between logic blocks. The best performance of the GAPLA architecture is reached at the best tradeoff between communication time and computation time. The architectural parameters which affect this tradeoff are as follows.
The size of a synchronous logic block. A large logic block means more operations can be put into one clock domain, which generally decreases the local clock speed and increases the overall computation time. But the communication time will decrease, since more communications will be done synchronously inside a clock domain.
The number of asynchronous I/O ports for each asynchronous island. Increasing the number of I/Os will increase the area overhead but will relax the I/O constraints during the application partitioning process, which could improve the logic usage of the logic block and the performance. Since our I/O port controllers are very simple and have small layout areas, we can afford to add more I/O ports as long as doing so benefits the system performance.
The number of global routing channels. This factor affects not only the routability of the GAPLA architecture but also the performance: if the routing resources are limited, the routing might be congested and need to detour, which increases communication time.
CAD FLOW
The CAD flow is developed to automatically implement designs on the GAPLA FPGA. It is also used to investigate the architecture design. As mentioned above, the performance of the GAPLA is affected by three groups of parameters, and the CAD flow models the architecture based on them. By giving these parameters different values, we can compare the implementation results of a set of benchmarks and learn the effect of these parameters. Figure 2 shows a block diagram of the overall CAD design flow. The grey boxes are modules we proposed, and the white boxes are modules using existing CAD tools. Partitioning is required first if the design is bigger than a given asynchronous island. Controls for asynchronous communications are then added to each module to ensure functional correctness. After that, each module is synthesized and placed and routed using existing FPGA design tools. Also, all the modules are fed into a coarse-grained placer and router, which places each module onto the GAPLA chip space and finishes the global routing between modules. Finally, a simulation model of the design implemented on the GAPLA architecture is formed for performance evaluation. If all the performance constraints are met, the design is accomplished. In the following subsections, we briefly explain each functional module.
Partitioning
When partitioning a design into modules, we try to minimize the communication time between partitioned modules. Partitioning is also conducted under two constraints: the area constraint, where the area of each module must be less than the given area of a synchronous block; and the I/O constraint, where the number of input/output ports must be less than the given number of input/output ports per asynchronous island. To calculate the asynchronous communication time, we first build a CDFG representation of the design, and edges in the CDFG are given communication weights. The communication weight of an edge consists of two parts: its "communication frequency" and its "length". The communication frequency of an edge is defined as:
l(e) is used to localize all the interconnects. An edge which spans more control steps has a better chance of being mapped to a long interconnect. Therefore, it should be more likely to be mapped to an asynchronous communication channel. The final weight of an edge for the partitioning is a weighted combination of the factors f(e) and l(e):
where α and β are user-defined coefficients with α + β = 1, max(f(e)) is the maximum communication frequency of all edges, and min(l(e)) is the minimum length of all edges.
The GAPLA architecture allows multiple I/O ports of the same clock domain to be active at the same time. In this case, the time overheads of these asynchronous communications overlap, which leads to performance benefits. The partitioning algorithm should take advantage of this and partition the system in a way that maximizes the overlap among asynchronous communications. But because partitioning is done before the actual system timing information is obtained, an estimation method is required. We use the factor N, the number of control steps where data transmissions across partitions are required, to represent this.
Thus, the overall cost function of the partitioning algorithm is formed as:
where nodes i and j belong to different partitions.
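As a rough illustration of how the two edge factors might be combined, the sketch below assumes one normalization that is consistent with the description in this subsection but is not taken from the paper: the frequency is scaled by max(f(e)), the length enters as min(l(e))/l(e) so that longer edges receive a smaller weight (and are therefore cheaper to cut into asynchronous channels), and beta = 1 - alpha. The function name `edge_weights` and the sample values are hypothetical.

```python
def edge_weights(freq, length, alpha=0.5):
    """Combine communication frequency f(e) and length l(e) into one weight.

    Assumed form (an illustration, not the paper's formula):
        w(e) = alpha * f(e) / max(f) + beta * min(l) / l(e),  beta = 1 - alpha
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    f_max = max(freq.values())
    l_min = min(length.values())
    return {
        e: alpha * freq[e] / f_max + beta * l_min / length[e]
        for e in freq
    }


# Hypothetical edges: (source, sink) -> frequency / length in control steps.
freq = {("a", "b"): 4, ("b", "c"): 1}
length = {("a", "b"): 1, ("b", "c"): 4}
weights = edge_weights(freq, length, alpha=0.5)
```

Under this assumed form, a frequent and short edge such as ("a", "b") gets the largest weight, so a partitioner that minimizes the weight of cut edges would tend to keep its end nodes in the same clock domain.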
A simulated annealing algorithm is used as the partitioning algorithm. After one iteration of partitioning, each partition is synthesized separately (logic synthesis only, without placement and routing). The partitions that meet the area and I/O constraints are treated as individual partitions in the final result. The partitions that still violate the area and I/O constraints are further partitioned using the above algorithm.
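The annealing step can be sketched as follows. This is a toy version under stated assumptions, not the authors' implementation: the cost is only the total weight of cut edges (the N term and the I/O constraint are omitted), areas are abstract units, and the names `cut_cost` and `anneal_partition` as well as the cooling schedule are hypothetical. The recursive re-partitioning of violating partitions is not shown.

```python
import math
import random


def cut_cost(part, weights):
    """Sum of weights of edges whose end nodes lie in different partitions."""
    return sum(w for (u, v), w in weights.items() if part[u] != part[v])


def anneal_partition(nodes, weights, area, max_area, n_parts=2,
                     t0=1.0, cooling=0.95, steps=2000, seed=0):
    """Toy simulated annealing: move one node at a time between partitions."""
    rng = random.Random(seed)
    # Round-robin start; assumed to satisfy the area limit.
    part = {n: i % n_parts for i, n in enumerate(nodes)}
    cost = cut_cost(part, weights)
    best, best_cost = dict(part), cost
    temp = t0
    for _ in range(steps):
        node = rng.choice(nodes)
        old, new = part[node], rng.randrange(n_parts)
        if new == old:
            continue
        used = sum(area[n] for n in nodes if part[n] == new)
        if used + area[node] > max_area:  # reject moves that break the area limit
            continue
        part[node] = new
        new_cost = cut_cost(part, weights)
        delta = new_cost - cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(part), cost
        else:
            part[node] = old  # undo the rejected move
        temp = max(temp * cooling, 1e-6)
    return best, best_cost


# Hypothetical four-node design with unit areas.
nodes = ["a", "b", "c", "d"]
weights = {("a", "b"): 5, ("c", "d"): 5, ("b", "c"): 1}
area = {n: 1 for n in nodes}
best, best_cost = anneal_partition(nodes, weights, area, max_area=3)
```

The returned partition never exceeds the area limit and never costs more than the starting assignment, because the best state seen so far is tracked separately from the current state.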
Asynchronous Communication Control
To add asynchronous communications, we need to provide a proper sequence of control values for the control signals of the corresponding asynchronous I/O port controllers. The asynchronous handshaking process is automatically managed by the built-in asynchronous FSMs inside the I/O port controllers based on these signals. Therefore, we need to know at what cycle an asynchronous communication should take place. To gather this information, operations inside each module are scheduled first.
We use an As-Soon-As-Possible (ASAP) algorithm to schedule each module in order to get the best performance. Because of the architecture design, the inter-module asynchronous communications block both the sender's and the receiver's operations. Thus, a deadlock could occur after scheduling. Deadlock occurs when two or more communicating processes are waiting for each other's data in order to continue executing. It can be solved by constructing a Communication Dependency Graph (CDG) [14]. A CDG contains all the communication nodes of the system, and a directed arc exists between two communication nodes in a CDG iff there is a sequential dependency between the two nodes in any of the processes. The dependencies of communication nodes in the constructed CDG are enforced in all the processes of the system by adding dummy control edges to the processes. After adding these edges, an ASAP scheduling algorithm is used to schedule each process, and deadlock is avoided.
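The effect of a dummy control edge on an ASAP schedule can be illustrated with a small sketch. It assumes unit-latency operations and a single process; the function name `asap_schedule` and the operation names are hypothetical, and the construction of the CDG itself is not shown. A cycle in the dependencies, which in this setting would correspond to a deadlock, is reported as an error.

```python
from collections import defaultdict, deque


def asap_schedule(ops, deps):
    """Return the earliest control step of each operation (unit latency).

    deps is a list of (before, after) pairs. Raises ValueError if the
    dependencies contain a cycle.
    """
    succ = defaultdict(list)
    indeg = {op: 0 for op in ops}
    for before, after in deps:
        succ[before].append(after)
        indeg[after] += 1
    step = {op: 0 for op in ops}
    ready = deque(op for op in ops if indeg[op] == 0)
    done = 0
    while ready:
        op = ready.popleft()
        done += 1
        for nxt in succ[op]:
            # An operation starts one step after its latest predecessor.
            step[nxt] = max(step[nxt], step[op] + 1)
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                ready.append(nxt)
    if done != len(ops):
        raise ValueError("cyclic dependencies: schedule would deadlock")
    return step


ops = ["add1", "send_a", "recv_b", "add2"]
data_deps = [("add1", "send_a"), ("recv_b", "add2")]
plain = asap_schedule(ops, data_deps)
# Dummy control edge taken from a (hypothetical) CDG arc send_a -> recv_b.
ordered = asap_schedule(ops, data_deps + [("send_a", "recv_b")])
```

Without the dummy edge, `recv_b` is scheduled in step 0; with it, `recv_b` is pushed behind `send_a`, so this process honors the communication order that another process depends on.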
The outputs of the scheduling are cycle-accurate descriptions of each module. After scheduling, we know exactly at what cycle data needs to be sent to or received from other modules. Therefore, the control signals for the asynchronous communications can be added accordingly.
Module Placement and Routing
The placer places each module onto an asynchronous island. The optimization goal during placement is to minimize the total communication cost between modules. Since each communication edge carries a communication weight, as explained before, the goal of the placer is thus to:
where nodes i and j belong to different modules and D_ij is the distance between the clock domains where nodes i and j are placed.
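One plausible reading of this placement goal is a sum of edge weight times distance over all edges that cross modules. The sketch below assumes that form and measures D_ij as the Manhattan distance between island grid coordinates; both are assumptions for illustration, and the names `placement_cost`, `module_of`, and `position` are hypothetical.

```python
def placement_cost(weights, module_of, position):
    """Sum of weight * Manhattan distance over edges that cross modules."""
    total = 0.0
    for (i, j), w in weights.items():
        mi, mj = module_of[i], module_of[j]
        if mi == mj:
            continue  # intra-module edges stay synchronous and cost nothing here
        (xi, yi), (xj, yj) = position[mi], position[mj]
        total += w * (abs(xi - xj) + abs(yi - yj))
    return total


# Hypothetical design: three nodes in two modules placed on an island grid.
weights = {("n1", "n2"): 2.0, ("n2", "n3"): 1.0, ("n1", "n3"): 0.5}
module_of = {"n1": "A", "n2": "A", "n3": "B"}
position = {"A": (0, 0), "B": (2, 1)}
cost = placement_cost(weights, module_of, position)
```

A placer would evaluate such a cost for candidate placements and prefer the one with the smallest value.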
Since the effect of global routing on the system performance is greatly reduced, a simple and fast line-search based router [15] is used to route the asynchronous communications onto the global routing channels of the GAPLA FPGA. The inter-module routing resources are represented by two matrices, Horizontal Routing Sources (HRS) and Vertical Routing Sources (VRS). The two matrices can be initialized at run time to model different configurations of the GAPLA architecture. Nets are picked one at a time from the netlist and routed. For multiple-terminal nets, the two terminals with the longest Manhattan distance are routed first.
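The selection of the first terminal pair for a multiple-terminal net can be sketched as below. The sketch assumes terminals are given as (x, y) island coordinates and uses a brute-force search over all pairs; the function name `farthest_pair` is hypothetical, and the line-search routing itself is not shown.

```python
from itertools import combinations


def farthest_pair(terminals):
    """Return the two terminals of a net with the largest Manhattan distance."""
    def dist(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1])
    return max(combinations(terminals, 2), key=lambda pq: dist(*pq))


# Hypothetical four-terminal net.
net = [(0, 0), (1, 3), (5, 0), (2, 2)]
first_pair = farthest_pair(net)
```

For this net the pair (1, 3) and (5, 0) is selected, with a Manhattan distance of 7.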
Performance Simulation
Simulation is used to estimate the performance of the design implemented on the GAPLA architecture. From the synthesis results of each module, the clock frequency of each module is obtained. From the module placer and router, the placement position of each module and the interconnect delays between modules are obtained. This information, together with the cycle-accurate VHDL descriptions of each module, is fed into our VHDL simulation model of the GAPLA. The GAPLA simulation model contains the models for the pausible local clock generators, the I/O port controllers, and the inter-island routing channels. The local clock generators are programmed to the corresponding clock frequencies of the modules. The modules are wrapped by the asynchronous interfaces and connected through the routing channels, composing a simulation model of the circuit implementation. Input traces are then read into the model, and the performance can be observed.
ARCHITECTURE PARAMETERS
Studying Methodology
To study these parameters, we need to implement a set of benchmarks on different configurations of them. Our benchmark set consists of 12 synthetic benchmarks generated by TGFF [10]. TGFF generates Directed Acyclic Graphs (DAGs) with different numbers of nodes and connectivity intensities. We then assign each node an arithmetic operation and generate a VHDL description from each graph as a benchmark. Thus, all the benchmarks are computation-intensive ones with little or no control flow, which is the case for most FPGA applications. To exclude the factor of multipliers, only addition and subtraction operations are assigned. Table 1 gives the statistics for the benchmark set and their implementation results on a Virtex-II FPGA.
Studying the three parameters at the same time is prohibitively time-consuming. Therefore, the parameters are determined one by one. The size of a logic block is studied first since it is the single most important parameter for the GAPLA architecture. To do that, the number of I/Os per clock domain and the routing capacities between modules are assumed to be infinite. Thus, the I/O constraints during partitioning are lifted and the routing process is unnecessary, since the routing distance can be estimated as the Manhattan distance between terminals. After the size of a logic block is determined, the number of I/Os can be determined by looking into the partitioning results since, as explained before, the number of I/Os per clock domain could be fairly large without incurring a huge area overhead. After that, the global routing capacity is determined by running the router on the after-placement benchmarks, with the first two parameters fixed, on the GAPLA FPGA. The experimental results are explained in the following subsections.
Figure 3: Average performance improvements on the benchmark set for different logic block sizes.
Size of a Synchronous Logic Block
For every benchmark, 6 different sizes for a logic block are tried: 36, 64, 100, 144, 256, and 400 CLBs. The performance improvements of all 12 benchmarks are summarized in Figure 4. The average performance improvement for all 12 benchmarks is given in Figure 3. From the results, the GAPLA FPGA with a logic block size of 256 CLBs delivers the biggest performance improvement on average over all 12 benchmarks. Thus, the size of a logic block is chosen to be 256 CLBs. The results also show that the GAPLA FPGA cannot give a sound performance improvement for small applications like ex1 to ex4. If only the last 8 benchmarks are considered, the average performance improvement could be more than 28%. The average performance improvement for the last 8 benchmarks is also shown in Figure 3.
I/Os Per Asynchronous Island
As mentioned before, the area of an I/O port controller is relatively small. Therefore, a reasonably large number of I/Os can be integrated. In the last subsection, the size of a logic block is chosen as 256 CLBs, and during the experiments the number of I/Os per clock domain is considered to be always sufficient. The actual I/O requirements for the benchmark set with a logic block size of 256 CLBs are summarized in Table 2.
From the results, the average number of I/Os required per [...] an average of 16 data wires per communication channel, as in our benchmarks. Therefore, the total number of data wires per asynchronous wrapper comes to 256.
Routings Between Asynchronous Islands
Previous experiments assume that the global asynchronous routing channels are sufficient and therefore each net can use the shortest connection route. In this subsection, we study the impact of asynchronous routing channels on the performance of the GAPLA FPGA architecture. The first two sets of architectural parameters are fixed, namely 256 CLBs per logic block, and 8 input ports, 8 output ports, and 256 data wires per asynchronous wrapper. After routing, the asynchronous communication time is more accurate because it is based on the actual routing path. The experiments are conducted for different routing configurations (in terms of the number of asynchronous communication channels). The experimental results are given in Table 4. In the table, "P.I." represents the performance improvement compared to the synchronous implementation on a Virtex-II FPGA. The entries marked with "-" mean that the routing is incomplete under the corresponding configuration.
The results show that if the global routing structure has fewer than 20 channels, some of the benchmarks cannot be successfully routed. From 20 channels on, all the benchmarks can be routed, and increasing the number of asynchronous tracks has only a slight impact on the system performance (less than 1 percent on average). Since increasing the number of global asynchronous routing channels greatly increases the area overhead of the GAPLA architecture, we choose 20 channels for the global routing structure. We also assume that, on average, each asynchronous channel has 16 bits of data wires. Therefore, the total number of global data wires is 320.
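The wire counts quoted in this section follow from simple multiplication, as the short check below shows. It assumes that every I/O port carries one channel of 16 data wires, which matches the totals stated in the text; the constant names are hypothetical.

```python
WIRES_PER_CHANNEL = 16   # average data wires per channel (from the text)
INPUT_PORTS = 8          # per asynchronous wrapper
OUTPUT_PORTS = 8         # per asynchronous wrapper
GLOBAL_CHANNELS = 20     # chosen number of global routing channels

# Assumption: one 16-wire channel per I/O port.
wrapper_wires = (INPUT_PORTS + OUTPUT_PORTS) * WIRES_PER_CHANNEL
global_wires = GLOBAL_CHANNELS * WIRES_PER_CHANNEL
```

This gives 256 data wires per asynchronous wrapper and 320 global data wires.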
Area Overhead
Having chosen the parameters, we estimated the area overhead of the GAPLA architecture by implementing the building components of the architecture in silicon. The area overhead is estimated at 19.9%. (The detailed estimation is not shown due to page limitations.)
CONCLUSIONS
We studied three architectural parameters of the GAPLA architecture, namely the size of each synchronous logic block, the number of I/Os per asynchronous island, and the number of routing channels between islands, using the parameterized CAD tools. From the experimental results, the following values are chosen for these parameters: 256 CLBs per logic block; 8 input ports, 8 output ports, and 256 data wires per asynchronous wrapper; and 20 global routing channels with 320 global data wires. The area overhead of the GAPLA architecture using this configuration is around 19.9%. The average performance improvement for all the benchmarks is 17.8%. If only the large benchmarks, which are suitable for the GAPLA FPGA, are considered, the average performance improvement on the last 8 benchmarks is 25.4%.
