For multi-gigahertz designs in nanometer technologies, data transfers on global interconnects take multiple clock cycles. In this paper, we propose a regular distributed register (RDR) micro-architecture for multi-cycle on-chip communication. An RDR architecture structurally consists of a two-dimensional array of islands, each of which contains a cluster of computational logic and local register files. We also propose a new synthesis methodology based on the RDR architecture. Novel layout-driven architectural synthesis algorithms have been developed for RDR. Application of these algorithms to several real-life benchmarks demonstrates 44% improvement on average in terms of the clock period and 37% improvement on average in terms of the final latency. 
INTRODUCTION
There are two important inflection points in the development of deep sub-micron (DSM) process technologies. The first is when the average interconnect delay exceeds the gate delay, which happened in the mid-1990s and led to the timing closure problem. The second is when a signal can no longer reach every part of the chip in a single clock cycle, which is happening now. It has been shown in [1] that, even with aggressive interconnect optimization techniques (such as buffer insertion and wire sizing), 7 clock cycles are still needed to go from corner to corner for the predicted die size in the 0.07µm technology generation, assuming a 5GHz clock, based on NTRS'97 [2]. Although the exact number of clock cycles may vary given the recent update of the roadmap [3], this clearly suggests that multi-cycle on-chip communication is a necessity in multi-gigahertz synchronous designs. However, most existing design tools address only the first problem and completely lack consideration of multi-cycle communication, which puts further system performance increases at risk.
To address the multi-cycle communication problem, one can explore the following design methodologies:
1. Asynchronous design: The state transitions of an asynchronous design are triggered by events instead of periodic clocks. This lets asynchronous designs operate correctly regardless of the delays on gates and wires [4]. However, due to the lack of design tools and the performance overhead, it applies only to a very limited class of circuits. In general, it is unclear whether asynchronous designs can yield high performance in practice.
2. Globally asynchronous locally synchronous (GALS) design [5]: In a GALS design, all major modules are designed in accordance with proven synchronous clocking disciplines. Each module is run from its own local clock. Data exchange between any two modules strictly follows a full handshake protocol. GALS aims to combine the advantages of synchronous and asynchronous design methodologies. However, the overhead of the "self-timed wrapper" may compromise both the performance and the area of the design.
3. Synchronous design with multi-cycle communication: Synchronous design is still by far the most popular design methodology. It is well understood and supported by a mature CAD toolset. However, the ever-increasing gap between gate delays and interconnect delays requires handling multi-cycle communication, which the traditional synchronous design flow cannot do. This paper focuses on synchronous designs and proposes a way to handle multi-cycle communication systematically.
proposed is the distributed-register architecture which helps to explicitly separate the long interconnect delays from logic delays. Under this architecture, [11] performed an integrated resource sharing and placement to eliminate the slack time violation due to the interconnect delays. Note that the irregular structure used by both [10] and [11] may cause difficulty for interconnect delay estimation. Regular circuit and layout structures [12] can be employed to avoid this problem. Generally, regular structure facilitates predictability and simplifies the implementation process.
In this paper, we present a new synthesis methodology for synchronous designs with multi-cycle communication. Our contributions are as follows: (i) we propose a regular distributed register (RDR) micro-architecture which offers high regularity and direct support of multi-cycle communication; (ii) we propose a synthesis methodology and develop novel architectural synthesis algorithms which efficiently synthesize behavior-level input onto the RDR architecture.
The remainder of the paper is organized as follows. Section 2 introduces the RDR architecture. Section 3 sketches our synthesis methodology and the algorithms for RDR architecture. The experimental results are shown in Section 4, followed by the conclusions and future work in Section 5.
REGULAR DISTRIBUTED REGISTER ARCHITECTURE
In this section, we propose a regular distributed register (RDR) micro-architecture. It provides high regularity and direct support of multi-cycle communication over global interconnects. 
Figure 1. A 2×3 island-based RDR architecture
An RDR architecture consists of a two-dimensional array of islands. The size of each island is chosen such that intra-island computation and communication can be done in a single clock cycle. In other words, the data obtained from a local register can be processed by a certain functional unit, and then be stored to a local register within only one cycle. Figure 1 illustrates a 2×3 island-based RDR architecture. Figure 2 details the structure of a single island, which consists of the following components: 
Figure 2. Components of a single island
As discussed in [10] , one of the advantages of distributed register architecture over centralized register architecture is that it can achieve a short clock period and effectively reduce the overall performance degradation due to the interconnect delay.
Since we distribute registers to each island, the delays of long wires do not lengthen the clock period. The potential drawback of this approach is that it may demand extra communication cycles for inter-island data transfers. Fortunately, this cost can be contained by a smart coarse placement that hides as many critical data transfers as possible inside islands. The regularity of the RDR architecture gives the placement a meaningful estimate of interconnect delays.
The RDR architecture has the added advantage that, by varying the size of the basic island, we can target different clock periods and systematically explore the cycle time vs. latency tradeoff.
Given a target clock period, the following formula shows how to compute the geometrical dimension of a basic island:
where T_clk is the target clock period, D_logic is the largest logic delay, D_opt-int(x) is a function that estimates the interconnect delay over a distance x, W_i is the island width, H_i is the island height, and D_intra-island is the average intra-island delay. The average intra-island delay should be no greater than the largest logic delay D_logic plus the worst-case interconnect delay, which is approximately 2×D_opt-int(W_i+H_i) (i.e., the estimated interconnect delay over a corner-to-corner round trip within an island). Figure 3 shows an RDR architecture with a 12×12 island-based array for a 5GHz design in 70nm technology by 2008 [2]. We assume a chip area of 620 mm² (24.9mm × 24.9mm) in which a signal on a wire can travel up to 7.52mm within one clock cycle under interconnect optimization. A total of 7 clock cycles is needed to cross the chip. Based on the above formula, we derive the base dimension of each island as W_i=H_i=2.08mm.
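The figures quoted for this example can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check, not the island-sizing formula itself: it assumes that "crossing the chip" means a Manhattan corner-to-corner path (width plus height), and it observes that dividing the chip side evenly among 12 islands gives the stated island dimension. All variable names are illustrative.

```python
import math

CHIP_SIDE_MM = 24.9         # chip is 24.9mm x 24.9mm
REACH_PER_CYCLE_MM = 7.52   # distance a signal covers in one clock cycle
ISLANDS_PER_SIDE = 12       # 12x12 island array

chip_area_mm2 = CHIP_SIDE_MM ** 2

# Assumption: corner-to-corner distance is measured as width + height.
corner_to_corner_mm = 2 * CHIP_SIDE_MM
cycles_to_cross = math.ceil(corner_to_corner_mm / REACH_PER_CYCLE_MM)

# Assumption: the chip side is split evenly among the islands.
island_side_mm = CHIP_SIDE_MM / ISLANDS_PER_SIDE

print(f"chip area       : {chip_area_mm2:.0f} mm^2")
print(f"cycles to cross : {cycles_to_cross}")
print(f"island side     : {island_side_mm:.2f} mm")
```

Under these assumptions the numbers agree with the text: about 620 mm² of area, 7 cycles to cross the chip, and an island side of roughly 2.08mm.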
PLACEMENT-DRIVEN ARCHITECTURAL SYNTHESIS USING RDR ARCHITECTURE
In this section, we present our architectural synthesis system for the RDR architecture, named MCAS. We first introduce the overall design flow in Section 3.1, followed by a motivational example in Section 3.2. We then present the key modules of the MCAS system, including the scheduling-driven placement, the placement-driven simultaneous rescheduling and rebinding, and the datapath & FSM generation.
Overall Design Flow
Figure 4 shows the overall synthesis flow of the MCAS system. MCAS starts with a synthesizable behavioral C or VHDL description. An RDR architecture specification is also required (including the island structure, functional unit library and delay table). The target clock period is given as well and is used in the subsequent synthesis steps. If the final design cannot meet the clock period requirement, we adjust the island size of the RDR architecture and perform another iteration, using a binary search over the clock period.
We first generate the control data flow graph (CDFG) from the behavioral descriptions. In the next step, we obtain the resource allocation from a force-directed scheduling algorithm [13] using the critical path length as the timing constraint. Then we perform an initial functional unit binding and derive an interconnected component graph from the bound CDFG.
After that, the interconnected component graph is fed to the scheduling-driven placement, which provides the location (i.e., island index) of each functional unit. The scheduling-driven placement algorithm is discussed in Section 3.3. Based on this physical information, we perform simultaneous rescheduling and rebinding on the CDFG. This algorithm is presented in Section 3.4. At the backend, all of the scheduling and binding information is back-annotated to the CDFG and fed to the datapath & FSM generation module. A datapath in structural VHDL format and controllers in behavioral FSM style are generated. This module is discussed in Section 3.5.
The synthesis system finally generates RT-level VHDL files for logic synthesis and outputs floorplan constraints and multi-cycle path constraints for placement & routing. 
A Motivational Example
In this subsection, we use a motivational example to illustrate the advantage of using multi-cycle communication and the need for the consideration of multi-cycle communication during architectural synthesis. Figure 5 is a data flow graph (DFG) extracted from a discrete cosine transform (DCT) algorithm [14]. In this DFG, nodes 1, 2, 5, 6, 9 and 10 are addition or subtraction operations, and nodes 3, 4, 7, 8, 11 and 12 are multiplication operations. In this example, we assume that the delay of a multiplication operation is 2 ns and that of an addition or a subtraction operation is 1 ns.
Figure 5. (a) Schedule and binding without consideration of interconnect delays; (b) Layout of wirelength-driven placement
In traditional architectural synthesis approaches, interconnect delay is assumed to be negligible compared with the functional unit delay, which is no longer realistic in the DSM era. Without consideration of interconnect delays, the DFG is scheduled in 6 clock cycles with an estimated clock period of 2 ns. The total schedule latency is 12 ns. Two multipliers and two ALUs are allocated. The nodes drawn in the same pattern are bound to the same functional unit.
However, interconnect may introduce extra delays on the DFG edges after place & route. Figure 5 (b) shows the layout produced by a wirelength-driven placement. Each box represents a functional unit, and the numbers inside the box denote the DFG nodes bound to that functional unit. The horizontal wires represent short interconnects with a delay of 1 ns. The vertical wires represent long interconnects with a delay of 2 ns. The interconnect delays are back-annotated to the DFG edges. On the DFG edges in Figure 5 (a), a solid line represents a long interconnect delay, and a dashed line represents a short interconnect delay. The introduction of interconnect delay lengthens the actual clock period to 4 ns, resulting in a final schedule latency of 24 ns.
Observe that in Figure 5, the interconnect delay has significantly compromised the final latency. We can minimize the negative impact of interconnect delay by using our RDR architecture to allow multi-cycle communication. Figure 6 shows the rescheduled result based on fixed placement and binding under the assumption that interconnect delays can span more than one clock cycle. The resulting clock period is 2 ns. Although the schedule grows to 9 clock cycles, the total schedule latency is reduced to 18 ns. Note that in the figure, a short line on a dashed edge indicates that a 1 ns interconnect delay is merged with a 1 ns operation into the same clock cycle.
The following subsections will demonstrate that the latency can be further reduced if we consider the multi-cycle communication during scheduling and binding, which are two crucial steps of architectural synthesis. 
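The latency figures in this example all follow from multiplying the number of clock cycles by the clock period. A minimal sketch that recomputes them from the cycle counts and periods quoted above (the scenario labels are ours, for illustration only):

```python
def latency_ns(cycles: int, clock_period_ns: float) -> float:
    """Total schedule latency = number of clock cycles x clock period."""
    return cycles * clock_period_ns

# (cycles, clock period in ns) as quoted in the motivational example.
scenarios = {
    "no interconnect delay considered": (6, 2),
    "single-cycle, after placement": (6, 4),
    "multi-cycle, rescheduled": (9, 2),
}

latencies = {name: latency_ns(*params) for name, params in scenarios.items()}
for name, value in latencies.items():
    print(f"{name:34s} {value:5.0f} ns")
```

The comparison makes the tradeoff explicit: multi-cycle communication spends three extra cycles but keeps the 2 ns clock, which is cheaper than stretching every cycle to 4 ns.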
Scheduling-Driven Placement
In previous work, rescheduling based on fixed binding and placement is used to reduce scheduling latency [10] . However, the effect of scheduling on placement has been rarely studied.
In Figure 6, we have seen that long interconnect delays lie on the critical path of the DFG. A pure wirelength-driven placement may produce a poor solution with a long critical path. To address this problem, we propose a scheduling-driven placement algorithm, in which scheduling guides the placement toward a solution with minimal total schedule latency.
Using the same example from Figure 6, Figure 7 shows that by applying a scheduling-driven placement with critical path awareness, the DFG can be scheduled in 8 clock cycles and the total schedule latency can be reduced to 16 ns.
The placement operates on the interconnected component graph, with node set N* and edge set E*, which is derived from the bound CDFG G. Nodes in N* represent the functional units to which operation nodes in G are bound, such as ALUs, multipliers, dividers, etc. Edges in E* represent the data transfers between these nodes. These edges are annotated with a delay D(e) corresponding to the physical delay between the functional units. The goal is to place the nodes of N* so that the total schedule latency of G is minimized.
We integrate scheduling with a simulated annealing (SA) based coarse placement algorithm [15]. At every temperature of the SA process, a fast list scheduling is performed on G, in place of classical timing analysis, to identify the critical edges in E* and assign higher weights to them. By reducing the weighted wirelength, we try to hide as many critical data transfers as possible in intra-island communication, and let the non-critical data transfers go through inter-island, multi-cycle communication over global interconnects.
Initially, we define the bin structure of the coarse placement to be the given island structure. At each temperature during the SA process, once the scheduling-based timing analysis has been performed, the criticalities of the corresponding nets are obtained and converted to net weights. Our net weighting method is similar to [16]. The criticality of an edge is defined to be
crit(e) = 1 - slack(e)/L
where L is the schedule latency and slack(e) is the edge slack produced by the list-scheduling algorithm.
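As a small illustration of this definition, the sketch below evaluates crit(e) for a few edges. The edge names, slack values and latency are hypothetical, and the conversion from criticality to an actual net weight is deliberately left out because the text only states that it is similar to [16].

```python
def criticality(slack: float, latency: float) -> float:
    """crit(e) = 1 - slack(e)/L, where L is the schedule latency."""
    if latency <= 0:
        raise ValueError("schedule latency L must be positive")
    return 1.0 - slack / latency

# Hypothetical edge slacks (in clock cycles) from a list-scheduling pass.
L = 8
slacks = {"e1": 0, "e2": 2, "e3": 8}

crit = {edge: criticality(s, L) for edge, s in slacks.items()}
for edge, value in crit.items():
    print(f"{edge}: slack={slacks[edge]} crit={value:.2f}")
```

An edge with zero slack has criticality 1, and an edge whose slack equals the whole schedule latency has criticality 0, so the measure ranks edges by how little scheduling freedom they have.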
After the placement, the functional units that are placed in the same bin will be clustered into the LCC of the corresponding island.
Placement-Driven Simultaneous Rescheduling and Rebinding
In [17], functional unit binding is performed simultaneously with floorplanning to estimate the quality of the floorplan. In [18], floorplanning is used to estimate the layout after scheduling and allocation. The limitation of both [17] and [18] is that they only optimize the clock period, without performing rescheduling to reduce the number of clock cycles. A concurrent scheduling and binding algorithm based on a given floorplan is proposed in [9]. It uses the concept of dynamic critical path list-scheduling (CPLS) introduced by [19]. The algorithm schedules the ready operations in descending order of critical path length, and simultaneously binds the operations to functional units in such a way that the binding incurs the least increase in total schedule latency. However, this algorithm does not consider potential resource competition during scheduling and may produce a suboptimal solution. Figure 9 illustrates this limitation. According to the algorithm in [9], both nodes 3 and 4 will be ready and compete for ALU1 in clock cycle 3. Since they have the same priority (i.e., critical path length), either one of them may be chosen and bound to ALU1. If node 3 is scheduled first, the DFG will be scheduled in 3 clock cycles. However, if node 4 is scheduled first and bound to ALU1, we will end up with a DFG scheduled in 4 clock cycles.
To overcome this problem, we propose a new algorithm based on a force-directed list-scheduling framework [13]. It integrates simultaneous rebinding, and tries to minimize the schedule latency with consideration of interconnect delays.
The first step of our algorithm is node deferral. The node with the least force is deferred. The critical path length (CPL) and the earliest start time (EST) are used as the secondary and tertiary priority functions to break ties. The nodes are deferred one by one until enough functional units are available. In the second step, the remaining ready nodes are scheduled and bound in decreasing order of CPL, with EST used to break ties. It is possible that some nodes cannot be scheduled in the earliest clock cycle due to resource competition. In the third step, if there are spare resources available, the previously deferred nodes are revisited and scheduled in the current clock cycle in the reverse order of deferral. After that, the algorithm proceeds to the next iteration until all nodes are scheduled and bound.
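The three steps can be sketched for a single control step as follows. This is a simplified illustration under several assumptions, not the MCAS implementation: force, CPL and EST are taken as precomputed inputs, the tie-breaking directions (defer the smaller CPL first, then the later EST; schedule the earlier EST first) are our guess, and the hypothetical `can_bind` predicate stands in for the interconnect-aware binding check.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ReadyNode:
    name: str
    force: float  # primary key for deferral (least force is deferred first)
    cpl: int      # critical path length
    est: int      # earliest start time


def schedule_one_step(
    ready: List[ReadyNode],
    free_units: int,
    can_bind: Callable[[ReadyNode], bool] = lambda node: True,
) -> Tuple[List[str], List[str]]:
    """Return (scheduled, postponed) node names for the current control step."""
    # Step 1: defer least-force nodes until the rest fit the free units.
    pool = sorted(ready, key=lambda n: (n.force, n.cpl, -n.est))
    deferred: List[ReadyNode] = []
    while len(pool) > free_units:
        deferred.append(pool.pop(0))

    # Step 2: schedule the remaining nodes by decreasing CPL, then EST.
    scheduled: List[str] = []
    for node in sorted(pool, key=lambda n: (-n.cpl, n.est)):
        if len(scheduled) < free_units and can_bind(node):
            scheduled.append(node.name)

    # Step 3: fill spare units with deferred nodes, last deferred first.
    for node in reversed(deferred):
        if len(scheduled) >= free_units:
            break
        if can_bind(node):
            scheduled.append(node.name)

    postponed = [n.name for n in ready if n.name not in scheduled]
    return scheduled, postponed


nodes = [
    ReadyNode("a", force=0.1, cpl=3, est=0),
    ReadyNode("b", force=0.5, cpl=5, est=0),
    ReadyNode("c", force=0.9, cpl=4, est=0),
]
print(schedule_one_step(nodes, free_units=2))
print(schedule_one_step(nodes, free_units=2, can_bind=lambda n: n.name != "c"))
```

In the second call, node c cannot be bound in this step, so the spare unit is handed back to the deferred node a, which is the situation the third step is meant to cover.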
Datapath & FSM Generation
After the previous phases, the binding and scheduling information is back-annotated to the CDFG's edges and nodes. The backend of our architectural synthesis system extracts this information to construct the datapath and controllers. The datapath, including instances of functional units, registers and steering logic, is generated as a structural VHDL file. This step also generates floorplan information and multi-cycle constraints for RDR synthesis flows. The floorplan information is used to constrain the placement location of every instance in the datapath. The multi-cycle constraints correspond to the multi-cycle communication paths between registers, and are used to guide the physical design tools in optimizing the clock period.
In each island, an FSM controller is generated to control the instances inside the island. These distributed controllers of different islands have identical state transition diagrams, but different output signals. The VHDL files for the datapath and the controllers, together with the floorplan and multi-cycle path constraints, are fed into the logic synthesis and physical design tools to produce the final design layout.
EXPERIMENTAL RESULTS
We implemented our MCAS system in C++ in a UNIX environment. To obtain the final performance results, Altera's Quartus II version 2.2 [20] is used to implement the datapath in a real FPGA device, Stratix™ EP1S40F1508C5. All of the pipelined multipliers are implemented in the dedicated DSP blocks of the Stratix™ device. We set the target clock frequency to 200 MHz and use the default compilation options. We use LogicLock™ to constrain every instance to its corresponding island, and set multi-cycle path constraints for the multi-cycle communication paths.
For comparison, we also set up two alternative flows. Figure 10 shows the three flows, labeled 1, 2 and 3. Flow 3 is our MCAS flow discussed in Section 3.1. The simplest flow (flow 1 in Figure 10) uses the traditional scheduling algorithm based on fixed binding information. Like flow 3, flow 2 is based on the RDR architecture and the location information provided by the scheduling-driven placement. However, flow 2 only performs scheduling for the given binding, instead of the simultaneous rebinding and scheduling of flow 3. The same list-scheduling algorithm is applied in all three flows. The three flows converge in the later synthesis phases.
We have tested the three flows on a set of real-life benchmarks, which include several different DCT algorithms, such as Planar Rotation (PR), WANG, LEE and DIR, and several DSP programs such as MCM, HONDA, CHEM, and U5ML12. All of the benchmarks are from [21]. In the experiments, we applied a 7×4 RDR architecture for small designs (PR, WANG, LEE, DIR, MCM, and HONDA), and a 4×4 architecture for CHEM and U5ML12. In Table 2, we list the control step numbers (CS), the clock periods (CP) reported by Quartus II, and the total latencies (Lat, the product of CS and CP) produced by the three flows. Because they account for interconnect delay, flows 2 and 3 introduce more cycles for the communication between registers. Compared with flow 1, flows 2 and 3 produce 14% more cycles. However, since flows 2 and 3 separate communication from computation and apply multi-cycle path constraints to the communication paths, the delays of the individual paths in the final layout are reduced, resulting in much smaller clock periods (more than a 40% reduction).
We also illustrate the total latencies in Figure 11, where the three bars in every group represent the results from flows 1, 2 and 3 respectively. Compared to the traditional flow, our architectural synthesis based on the RDR approach (flows 2 and 3) reduces the final latencies of the designs by 35% and 37% respectively. It can also be seen that flow 3 achieves better latency than flow 2. This supports our claim that scheduling-driven placement can reduce schedule latency and that simultaneous rescheduling and rebinding can further improve design performance. Table 3 lists the resources used by the different design flows in terms of LUTs and registers. It can be seen that flows 2 and 3 introduce less than 20% LUT overhead, but more than 100% register overhead, since our RDR architecture uses more registers than the traditional approach. The increased number of registers also increases the complexity of the steering logic, such as multiplexors, which then contributes an observable portion of the area in the final layout, especially for an FPGA design.
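Since latency is the product of control steps and clock period, the reported averages can be loosely cross-checked against each other. The sketch below combines the 14% cycle increase with the clock period improvements quoted in the paper (more than 40%, and 44% on average in the abstract). It is only a rough consistency check: averages taken over benchmarks do not multiply exactly, and the function name is ours.

```python
def latency_change(cycle_increase: float, period_reduction: float) -> float:
    """Relative change in latency (CS x CP) given relative changes in CS and CP."""
    return (1.0 + cycle_increase) * (1.0 - period_reduction) - 1.0

# 14% more cycles combined with a 40% or 44% shorter clock period.
for reduction in (0.40, 0.44):
    change = latency_change(0.14, reduction)
    print(f"CP reduction {reduction:.0%} -> latency change {change:+.1%}")
```

With a 44% shorter clock period the estimate is a latency reduction of about 36%, in line with the 35% and 37% reported for flows 2 and 3.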
CONCLUSIONS & FUTURE WORK
We have proposed a novel RDR architecture to support multi-cycle on-chip communication in multi-gigahertz designs. Compared with several existing methodologies, the regularity of the RDR architecture facilitates the predictability of interconnect delays at the higher design levels. An architectural synthesis system using the RDR architecture has been developed. The experimental results on Altera's Stratix™ device have demonstrated the effectiveness of our proposed architecture, design methodology, and synthesis algorithms.
In the future, we will extend our architectural synthesis system to support control-intensive applications. Many problems, such as variable renaming and allocation, distributed controller generation, etc., will be studied further. In addition, we have observed that the steering logic has a great impact on the performance and area of the final layout, and we will consider optimizing it in the future synthesis flow.
