Crosstalk delay within an on-chip bus can induce severe transmission performance penalties. The Bus-grouping Asynchronous Transmission (BAT) scheme is proposed to mitigate this performance degradation. Furthermore, because the transition distribution on some types of buses exhibits distinct spatial locality, we use this locality to optimize the BAT. For the implementation, we propose the Differential Counter Cluster (DCC) synchronization mechanism to synchronize the data transmission, and Delay Active Shielding (DAS) to protect critical signals from crosstalk and optimize the routing area overhead. The BAT scales with bus width at little extra implementation complexity. Its effectiveness is evaluated on the on-chip buses of a superscalar microprocessor simulator using the SPEC CPU2000 benchmarks. When applied to a 64-bit on-chip instruction bus, the BAT scheme improves performance by more than 55%, 10%, and 30% compared with the conservative approach, the Codec approach, and the Variable Cycle Transmission (DYN) approach, respectively, at the expense of 13% routing area overhead.
Introduction
With technology being pushed to nanometer geometries, the coupling capacitance of interconnect buses is growing to dominate the total capacitance [1], which causes severe propagation delay on on-chip buses [2] and thereby creates a performance bottleneck in many on-chip systems.
Many approaches have been proposed to tackle this problem. They can be classified into two major categories: Shielding [3], [4] and Codec [5], [6].
The Shielding approaches include the "passive shield" and the "active shield" [3]. Both exploit the effective Miller capacitance to reduce the total effective capacitance between neighboring bus wires. In general, the active shield achieves better shielding performance than the passive shield at the cost of more complex layout design and higher power dissipation, and both impose considerable area overhead.
Although shielding schemes can protect a few critical signal wires from crosstalk, in many situations their high area overhead makes them inappropriate for large numbers of interconnect wires. The Codec approaches are based on a key observation: crosstalk delay depends on the pattern transition. Some Crosstalk-Sensitive (CS) transitions induce longer delay, while the other Crosstalk-Insensitive (CI) transitions induce shorter delay. Thus, the crosstalk delay can be reduced by encoding the CS patterns into CI patterns [5].
To adopt the Codec approaches, two primary questions need to be answered: 1) How much overhead (the number of extra bits) do they impose to make the bus immune to the crosstalk delay penalty? 2) How can a practical codec method be constructed? For the first question, B. Victor et al. have proved through comprehensive mathematical analysis that the lower bound of the overhead is about 44% [6], which implies significant area inefficiency. For the second question, apart from search-and-select approaches over the full code space, there are few comprehensive mathematically based construction methods. In other words, it is very hard to construct a practical codec strategy, especially for a large bus width (such as 64 bits or wider), where search-and-select approaches are too time-consuming for practical use.
Besides the above conventional schemes, there are several new schemes:
Delay-line [7]: Observing that skewing the switching time of CS transitions can also reduce the effective Miller capacitance and thereby mitigate the crosstalk delay, M. Ghoneima et al. presented the Delay-line scheme. However, it needs many complicated synchronizing and calibrating operations.
Variable Cycle Transmission (DYN) [8]: DYN handles data transmission through variable cycle assignment: CS transitions are assigned more transmission cycles, while CI transitions are assigned fewer. If the bus is relatively narrow, DYN can generally improve performance significantly. Unfortunately, as the bus width increases to meet higher bandwidth demands, CS transitions become more common, which makes the dynamic cycle assignment scheme less efficient. This phenomenon is a typical "cask effect."
To sum up, according to how they deal with CS transitions, the above approaches either completely eliminate the occurrence of CS transitions [5], [6], or mitigate their negative effects [3], [7], [8].
The former comes at the expense of significant area overhead (extra wire placement and codec logic), which is unacceptable in many situations due to the limited on-chip area and strict layout design, while the latter comes at the cost of a more complex transmission structure (and, of course, some moderate area overhead). The complexity of the transmission structure, however, might be reduced to an acceptable level by developing some sophisticated techniques.

Copyright © 2008 The Institute of Electronics, Information and Communication Engineers
In this paper, we alleviate the transmission performance degradation by employing the "mitigate" strategy. We propose a new scheme: Bus-grouping Asynchronous Transmission (BAT), by extending the DYN approach.
The key idea of the BAT is illustrated in Fig. 1. Each row denotes a pattern transition, and the shaded ovals denote the Crosstalk-Sensitive (CS) parts of these transitions. In this example, there are six pattern transitions, and each contains at least one CS part. The transmission time is determined by the CS parts.
With the DYN approach, every one of these transitions is a CS transition, so the approach fails to improve transmission performance. Assume that CS-free transitions consume time t and CS-containing transitions consume time T, where t < T. In the example, we need 6T to complete the transmission of these patterns. However, if we divide the bus into two subgroups, not all sub-transitions encounter CS transitions (denoted as the striped transitions in Fig. 1(b)). We then employ the DYN approach on the two groups of sub-transitions separately. The more essential change in the way of transmission is that the two groups of sub-patterns are transmitted independently (asynchronously) on the grouped bus and assembled at the receiving end. In the example, supposing the buffer capacity at the receiving end is sufficient, we require only max{3T + 3t, 4T + 2t} = 4T + 2t to complete the six transitions, so the transmission performance is improved.

In particular, the main contributions of this work are as follows:
• We propose the Bus-grouping Asynchronous Transmission (BAT) scheme. By dividing the original bus into sub-buses of proper granularity through carefully placed shielding wires and applying the DYN approach [8] to them, the potential performance can be exploited as much as possible.
• We use the locality of the Crosstalk Factor (CF) to optimize the BAT scheme. We find that some types of buses, such as instruction buses, have distinct spatial locality of crosstalk effects. To evaluate the impact of crosstalk effects, we present the CF metric. By analyzing the CF distribution of a bus, we can optimize the BAT scheme.
• We present the Differential Counter Cluster (DCC) synchronization mechanism and the Delay Active Shielding (DAS) scheme for the implementation of the BAT scheme. The DAS scheme combines the Active Shield scheme [3] and the Delay-line scheme [7]. Both DCC and DAS are highly efficient and can be easily implemented.
The rest of this paper is organized as follows: Sect. 2 analyzes the transmission performance characteristics of deep sub-micron (DSM) buses and presents the proposed BAT scheme. Section 3 presents the BAT implementation. The evaluation results are shown in Sect. 4. Finally, Sect. 5 concludes this paper.
Proposed BAT Scheme

Bus Model
A DSM bus model is illustrated in Fig. 2. $C_I$ denotes the capacitance between two adjacent wires and $C_L$ denotes the capacitance between a wire and the substrate. The delay factor is a function of the ratio $\lambda = C_I / C_L$ [9].
Sotiriadis et al. have presented a comprehensive analysis of delay estimation for coupled lines and multiple drivers based on the Elmore delay [5], [10]. The essential conclusions are depicted in Table 1.

Table 1 note: The victim lines belong to boundary lines (the left line is the victim line).

The transitions are divided into four classes according to the level of the delay factor. As Li et al. did in [8], we assign transmission cycles to the different types of transitions according to their classes.
For instance, if a transition is included in Class-II, then this transmission will consume two cycles.
"Cask Effect" and Bus-Grouping
Consider an n-bit bus and a sequence of k patterns $P_1, P_2, \ldots, P_k$.
The transition between two consecutive patterns $P_i$ and $P_{i+1}$ is denoted as $P_i \to P_{i+1}$. As indicated in Table 1, this transition is evaluated as one of four transition classes: Class-I, Class-II, Class-III, and Class-IV (listed in the 4th column of Table 1).
The corresponding delay vector of the transition is $D_i = \{d_1^i, d_2^i, \ldots, d_n^i\}$, where $d_j^i$ ($1 \le j \le n$) denotes the theoretically required number of cycles to transmit the jth bit of $P_{i+1}$. Clearly, the number of cycles required to transmit $P_{i+1}$ after $P_i$ is the maximum element of $D_i$:
$$\max\{d_1^i, d_2^i, \ldots, d_n^i\} \qquad (1)$$
For example, the transition between the two patterns {0 1 0 0 1 0 0 1}→{0 0 1 1 0 1 0 0} is {− ↓ ↑ ↑ ↓ ↑ − ↓}, and its delay vector D is {1 3 2 2 4 3 1 1}. According to Table 1, this transition belongs to Class-IV, so we need four cycles to complete this transmission.
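The example above can be traced with a short Python sketch. It derives the per-line transition symbols from the two patterns and takes the maximum of the delay vector. The function names are illustrative only, the symbols `-`, `^`, and `v` stand in for "no change", "rising", and "falling", and the delay vector is copied from the text rather than computed, since the classification rules of Table 1 are not reproduced here.

```python
def transition_symbols(prev_pattern, next_pattern):
    """Return '-', '^' (rising) or 'v' (falling) for every bus line."""
    symbols = []
    for old_bit, new_bit in zip(prev_pattern, next_pattern):
        if old_bit == new_bit:
            symbols.append("-")
        elif new_bit > old_bit:
            symbols.append("^")
        else:
            symbols.append("v")
    return symbols


def required_cycles(delay_vector):
    """Cycles needed for the whole pattern: the slowest line decides."""
    return max(delay_vector)


prev_pattern = [0, 1, 0, 0, 1, 0, 0, 1]
next_pattern = [0, 0, 1, 1, 0, 1, 0, 0]
# Delay vector taken verbatim from the text (assumed to follow Table 1).
delay_vector = [1, 3, 2, 2, 4, 3, 1, 1]

print(" ".join(transition_symbols(prev_pattern, next_pattern)))
print(required_cycles(delay_vector))
```

Only the fifth line needs four cycles, yet the whole pattern is charged four cycles, which is the situation the next paragraph calls the cask effect.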
Cask Effect: From Eq. (1) we find that if any single element of $D_i$ equals the delay factor of the "worst" transition, the whole pattern transmission has to endure the most conservative situation, even though the transitions on the other lines might be very crosstalk insensitive.
In order to mitigate the "cask effect," we propose the Bus-grouping Asynchronous Transmission (BAT) scheme: divide every pattern into sub-patterns at the sending end, transmit these sub-patterns asynchronously, and then assemble them into the original pattern at the receiving end.
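To evaluate transmission performance, we develop a metric $T_{total}$: the total time to transmit the k patterns ($P_1, P_2, \ldots, P_k$), where
$$T_{total} = \sum_{i} \max_{1 \le j \le n} d_j^i$$
and the sum runs over the transitions of the sequence.
Assume the bus is divided into two sub-buses at the sth bit, so that the corresponding two sub-patterns are $\{v_1^i, \ldots, v_s^i\}$ and $\{v_{s+1}^i, \ldots, v_n^i\}$.
Supposing there are sufficient buffers at the receiving end, we need only $T'_{total}$ to transmit the k patterns, where
$$T'_{total} = \max\Big\{ \sum_{i} \max_{1 \le j \le s} d_j^i,\ \sum_{i} \max_{s < j \le n} d_j^i \Big\}$$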
The condition $T'_{total} < T_{total}$, where $T'_{total}$ is the total time with bus grouping, is always satisfied in practical situations; thus, the performance can be improved by adopting the BAT scheme. Furthermore, it can be inferred that dividing the original bus into finer-grained sub-buses achieves higher transmission performance, at the expense of more area overhead and higher implementation complexity. This is an intrinsic tradeoff among performance, area overhead, and implementation complexity.
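A minimal Python sketch of this comparison follows, under stated assumptions: the ungrouped total is the sum of the per-transition maxima, and the grouped total is the largest per-sub-bus sum, assuming unlimited receive buffers. The delay matrix, the values T = 4 and t = 1, and the function names are hypothetical. The sketch also reuses the per-line delays unchanged after the split, whereas real shielding wires would alter the delays of lines next to a group boundary.

```python
def total_cycles(delay_matrix):
    """Ungrouped bus: every transition waits for its slowest line."""
    return sum(max(row) for row in delay_matrix)


def grouped_total_cycles(delay_matrix, split_points):
    """Grouped bus: sub-buses run independently; the slowest one sets the total."""
    width = len(delay_matrix[0])
    edges = [0, *split_points, width]
    per_group = [
        sum(max(row[lo:hi]) for row in delay_matrix)
        for lo, hi in zip(edges, edges[1:])
    ]
    return max(per_group)


T, t = 4, 1  # hypothetical cycle counts for CS-containing and CS-free transitions
left_cs = [T, t, t, t, t, t, t, t]    # CS part only in the left half
right_cs = [t, t, t, t, t, t, T, t]   # CS part only in the right half
both_cs = [T, t, t, t, t, t, T, t]    # CS parts in both halves
delays = [left_cs, right_cs, both_cs, left_cs, right_cs, right_cs]

print(total_cycles(delays))               # 6T
print(grouped_total_cycles(delays, [4]))  # max(3T + 3t, 4T + 2t)
```

The six hypothetical transitions are arranged so that the two halves need 3T + 3t and 4T + 2t, matching the counts quoted for the Fig. 1 example in the introduction.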
Optimizing BAT
Dividing a bus equally into identical sub-buses is a reasonable grouping strategy if the transmitted data is roughly uniformly distributed. However, if the data distribution exhibits some kind of spatial locality, this equal grouping strategy is not efficient enough. Taking advantage of the locality, we can achieve higher performance through Unequal Division: dividing the original bus into several sub-buses of different widths.
We define the Crosstalk Factor (CF) to describe the spatial locality. A line with a higher CF tends to suffer more severe crosstalk delay. The CF is reflected in the delay matrix, which is composed of the vectors $D_i$. The CF of the jth line is denoted as
In our simulations of 10 SPEC CPU2000 benchmarks, we observe that the CF distributions of different benchmarks follow similar trends on the same bus, whereas the CF distributions on different buses differ greatly (such as between the instruction bus and the data bus in the experiment). Furthermore, compared with the CF locality on the data bus, the CF locality on the instruction bus is more distinctive, as shown in Fig. 3.
There are four prominent CF peaks in Fig. 3(a), which result from the format of the adopted MIPS-like instructions. In the experiment, we adopt a three-operand instruction format, which is popular in modern superscalar architectures. This format has four "activity fields": three operand fields and one opcode field.
For buses whose transmitted data exhibits distinct locality, we can use the locality to realize a more efficient grouping strategy than identical grouping: group the bus into sub-buses at the points where the CF distribution presents peaks. As shown in Fig. 4, we divide the instruction bus into sub-buses of different widths and the data bus into sub-buses of the same width.
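The peak-guided grouping can be sketched in Python as follows. Because the exact CF formula is not reproduced in this text, the sketch assumes, purely for illustration, that the CF of line j is the mean of column j of the delay matrix. The threshold `min_cf`, the function names, and the sample matrix are likewise hypothetical.

```python
def crosstalk_factor(delay_matrix):
    """Assumed CF: mean delay of each line over all recorded transitions."""
    rows = len(delay_matrix)
    return [sum(column) / rows for column in zip(*delay_matrix)]


def cf_peaks(cf, min_cf):
    """Indices of local maxima of the CF profile that reach min_cf."""
    peaks = []
    for j, value in enumerate(cf):
        left = cf[j - 1] if j > 0 else float("-inf")
        right = cf[j + 1] if j + 1 < len(cf) else float("-inf")
        if value >= min_cf and value > left and value >= right:
            peaks.append(j)
    return peaks


# Hypothetical delay matrix: 3 transitions on an 8-bit bus.
delay_matrix = [
    [1, 3, 1, 1, 1, 1, 4, 1],
    [1, 4, 1, 1, 2, 1, 3, 1],
    [1, 2, 1, 1, 1, 1, 4, 1],
]

cf = crosstalk_factor(delay_matrix)
print(cf_peaks(cf, min_cf=2.0))
```

In a design flow, a profile like `cf` would come from a high-level simulation trace, and the detected peaks would serve as candidate positions around which to place the grouping wires.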
The bus grouping can be realized by placing several carefully designed shielding lines into the original bus. Although the shielding line placement seems application-specific, the basic idea can be extended to general situations, because in practical design processes the CF distribution can be obtained by running a high-level system simulator (such as one implemented in C/C++ or SystemC). The CF locality can thus be extracted in an early design phase and then used to guide high-performance bus fabrication.
Implementation of BAT
The BAT design diagram is illustrated in Fig. 5. It is composed of the CDA, VSG, DCC, and Assembler components. The DCC component is responsible for synchronizing the transmission on the sub-buses. The CDA analyzes the data transitions and conveys the crosstalk information to the VSG module. The VSG module, combining this with the synchronization information generated by the DCC module, synthesizes a set of data valid-indicating (VI) signals that tell the receiving end when to sample the sub-patterns. There is a one-cycle slack between the leading CDA operation and the trailing VSG operation. At the receiving end, the Assembler reassembles the sub-patterns taken apart at the sending end into the original patterns as long as none of the buffers is empty. The following shows more details about the DCC synchronization mechanism. Additionally, because it is important to protect the VI signals from crosstalk caused by other signals, we detail the proposed shielding scheme, Delay Active Shielding (DAS).
DCC Synchronization
The buffers at the receiving end must be guaranteed not to overflow. This guarantee can be provided at the sending end, since we know the difference in the number of patterns transmitted on every pair of sub-buses.
Assuming the length of these buffers is L, a set of bidirectional counters, ranging from −L to L, is required. These counters, initialized to zero, record the difference in the number of patterns transmitted on every pair of sub-buses. For instance, C(i, j) is such a counter, monitoring the transmission state of the ith group and the jth group. When a valid sub-pattern transmission is completed on the ith group sub-bus, C(i, j) is increased by one for j = 1, 2, ..., i − 1, i + 1, ..., n; when a valid sub-pattern transmission is completed on the jth group sub-bus, C(i, j) is decreased by one for i = 1, 2, ..., j − 1, j + 1, ..., n. When C(i, j) overflows, the ith group sub-bus stops transmitting and holds its state. When C(i, j) underflows, the jth group sub-bus stops and holds its state. We employ a counter "tree" to implement this counter cluster. The synchronization logic can be expressed as follows ('OF' is short for 'OverFlow,' 'UF' is short for 'UnderFlow,' and '+' means logical OR):
• Hold the ith sub-bus if and only if {OF(C(i, 1)) + OF(C(i, 2)) + ... + OF(C(i, n))} is true (over all j ≠ i);
• Hold the jth sub-bus if and only if {UF(C(1, j)) + UF(C(2, j)) + ... + UF(C(n, j))} is true (over all i ≠ j).
The number of counters required for a bus divided into g sub-buses is $C_g^2 = g(g-1)/2$. For instance, $C_4^2 = 6$ counters are required for a bus grouped into 4 sub-buses.
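The counter cluster can be modeled behaviorally in a few lines of Python. This is a software sketch under stated assumptions, not the hardware counter tree: it keeps one counter per unordered pair of groups (i < j), reads the counter with the opposite sign for the reversed pair, and treats a counter that has reached +L or −L as the overflow or underflow condition that forces a hold. The class and method names are hypothetical.

```python
from itertools import combinations


class DifferentialCounterCluster:
    """Pairwise up/down counters keeping sub-buses within L patterns of each other."""

    def __init__(self, groups, buffer_len):
        self.groups = groups
        self.limit = buffer_len
        # One counter per unordered pair (i, j) with i < j: g*(g-1)/2 in total.
        self.counters = {pair: 0 for pair in combinations(range(groups), 2)}

    def lead(self, a, b):
        """Patterns completed on group a minus patterns completed on group b."""
        return self.counters[(a, b)] if a < b else -self.counters[(b, a)]

    def must_hold(self, group):
        """True if `group` is already L patterns ahead of some other group."""
        return any(
            self.lead(group, other) >= self.limit
            for other in range(self.groups)
            if other != group
        )

    def complete(self, group):
        """Record one valid sub-pattern completed on `group`."""
        if self.must_hold(group):
            raise RuntimeError(f"group {group} must hold its state")
        for i, j in self.counters:
            if i == group:
                self.counters[(i, j)] += 1  # C(i, j) counts up for group i
            elif j == group:
                self.counters[(i, j)] -= 1  # C(i, j) counts down for group j


dcc = DifferentialCounterCluster(groups=4, buffer_len=2)
dcc.complete(0)
dcc.complete(0)
print(len(dcc.counters), dcc.must_hold(0), dcc.must_hold(1))
```

In this model, group 0 is held after running two patterns ahead of the others with L = 2, and it is released once every other group has completed a sub-pattern.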
DAS Scheme
Our goal is to develop a shielding scheme for the VI signals that is not only robust but also area-efficient. Here, "area" mainly refers to the area occupied by the VI wires and grouping wires. Figure 6 illustrates the Active-Shield scheme [3]. The shielded line, the VI line, is in the middle. The shielding lines are driven by the same signal with different strength (because the shielding wires are narrower and have a smaller load capacity than the shielded wire). The transition types of the VI signal therefore belong to Group-2, whose delay factor is just 1 (shown in Table 1), so the VI signal is well shielded.
Furthermore, can these active-shielding wires be reused as grouping wires to reduce the extra wire overhead?
The answer is yes, as long as some modifications are made to the original Active-Shield scheme. Without them, the signal transitions on the data lines would be influenced by the VI signal transitions, which might lead the receiving end to capture unstable data. Inspired by the Delay-line scheme [7], we address this problem by skewing the transitions of the VI signals away from those of data transmission and capturing. In the implementation, the positive (or negative) clock edge triggers data transmission at the sending end and data capture at the receiving end, and the negative (or positive) edge triggers the transmission of the VI signals. This implementation is feasible because the crosstalk factor of the VI line is just 1 (as shown in Table 1), which is less than half of the crosstalk factor of Class-II transitions. Therefore, half a cycle is enough to set up the VI signal at the receiving end. With these modifications, the VI signal transmission is delayed by half a cycle relative to the data transmission, which greatly mitigates the crosstalk between the bus wires and the VI signal wires that are reused as grouping wires. We name the modified shielding scheme the Delay Active-Shield (DAS) scheme. Figure 7 illustrates the DAS timing sequence.
Finally, the BAT scheme scales with the bus width at little extra implementation complexity, which makes it easy to adopt in situations where the bus is relatively wide (such as 64 bits or wider) and transmission performance is critical.
Evaluation
The performance of the BAT is evaluated using the SimpleScalar 3.0 tool set [11] on 10 SPEC CPU2000 benchmarks [12]. First, since the instruction fetch delay is the performance bottleneck of modern superscalar CPUs [13], we study the data flow between the Level-1 instruction cache (L1-icache) and the instruction buffer unit. As in [13], a Harvard architecture is adopted. Second, we study the data transmission on the data bus connecting the Level-1 data cache (L1-dcache) and the datapath. The widths of the instruction bus (I-bus) and data bus (D-bus) are 64 bits and 32 bits, respectively.
Performance Comparisons
We use the mean Required Transmission Cycle (RTC) to evaluate transmission performance. We compare the BAT scheme against several other representative crosstalk mitigation schemes: Codec (CDC) [6], Passive-Shield (PSD) [4], Active-Shield (ASD) [4], Variable Cycle Transmission (DYN) [8], and the original conservative (ORI) approach.
The CDC approach transforms transitions from Class-IV or Class-III into Class-I or Class-II; therefore, a two-cycle period is required to complete one pattern transmission. The PSD approach interleaves the original bus wires with shielding wires, so any original transition is transformed into Class-II. Therefore, the RTCs of both the CDC and PSD approaches are two cycles. The ASD approach transforms all original transitions into Class-I, which implies the best performance (an RTC of one cycle) but imposes the worst area overhead of all the mentioned approaches.
The DYN approach employs variable cycle transmission [8], so its RTC varies from one cycle to four cycles.
In addition, the RTC of the ORI approach is four cycles. (Note that the ORI approach may in reality transmit one pattern per cycle, but the period of that "one cycle" equals the period of the four cycles mentioned here.)
BAT Applied to Instruction Bus
First, we simulate the transmission process on the instruction bus (I-bus). The CF distribution is shown in Fig. 3(a). The buffer size is configured to 8 words (4 × 64-bit). Figure 8 shows how the RTC varies with the number of subgroups. The average RTC of the 10 benchmarks is reduced to 2.57 using the DYN approach (which is equivalent to the 1-Group configuration in our approach). We can further reduce it to 1.79 with the 4-Group configuration and even to 1.62 with the 8-Group configuration. However, when the number of groups exceeds 4, the marginal performance gain is unattractive, so the 4-Group configuration is the best trade-off between performance speedup and area overhead.
Compared with the ORI, DYN [8], and Codec [6] approaches, the average performance improvement of the BAT scheme with the 4-Group configuration is 55.3%, 30.4%, and 10.5%, respectively.
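These percentages can be cross-checked with a few lines of Python, assuming that "performance improvement" means the relative reduction in mean RTC. The RTC values are the ones stated in the text (4 cycles for ORI, 2.57 for DYN, 2 for Codec, and 1.79 for 4-Group BAT); the function name is illustrative.

```python
def improvement_percent(baseline_rtc, new_rtc):
    """Relative RTC reduction of new_rtc versus baseline_rtc, in percent."""
    return (baseline_rtc - new_rtc) / baseline_rtc * 100.0


BAT_4_GROUP_RTC = 1.79
baseline_rtcs = {"ORI": 4.0, "DYN": 2.57, "Codec": 2.0}

improvements = {
    name: improvement_percent(rtc, BAT_4_GROUP_RTC)
    for name, rtc in baseline_rtcs.items()
}
for name, value in improvements.items():
    print(f"{name}: {value:.2f}%")
```

Under this assumption the three values come out within rounding distance of the reported 55.3%, 30.4%, and 10.5%, which suggests that the reported figures are indeed relative RTC reductions.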
BAT Applied to Data Bus
We applied BAT to the data bus (D-bus) with the identical (equal-width) group configuration. The buffer size is configured to 4 words (4 × 32-bit). The result is shown in Fig. 9. From this figure, we find that the intrinsic crosstalk effect within the D-bus is milder than that within the I-bus, so the D-bus RTC is lower than the I-bus RTC on average. Although the improvement of BAT on the D-bus is not as significant as on the I-bus, we still gain a 12.5% performance improvement on average over the DYN scheme with the 4-Group configuration.
Furthermore, from the I-bus and D-bus experimental results, we can infer that the BAT scheme is more suitable for wide buses and crosstalk-intensive situations.
Overhead Analysis
The overhead consists of two parts: 1) time overhead, caused by extra circuit logic on the transmission path, and 2) area overhead, caused not only by the extra circuit logic but also by some extra shielding wires. The time overhead is insignificant because only the DCC module, which adds just the delay of a "transmission" gate, is on the critical transmission path. The area overhead mostly results from a) extra shielding wires, b) extra buffers, and c) the transmission logic. From the experimental results, we find that b) and c) are not substantial for modern VLSI. Moreover, the original on-chip buffers can be reused as the required buffers with minor modifications in many cases. Among the three sources of area overhead, a) is the most important one and must be dealt with, especially when the bus is routed using scarce top-level metal resources. The bus routing area consists of not only the metal wire area but also the space between neighboring wires. Empirically, as Li et al. assumed in [8], the wire width is set equal to the spacing when we compute the area overhead. In addition, since we are concerned only with the relative overhead, the length of the bus is irrelevant.
For the ASD approach, the active shielding wire is a little "fatter" than the original data wire for the sake of manufacturing considerations. Generally, the active shielding wire is two to three times as wide as the ordinary wire [3], and the resulting area overhead of BAT is about 13% in the most conservative situation (we adopt the most conservative value, 3). Although we use the most conservative value to evaluate the normalized area, the routing area of BAT is still far more efficient than that of the CDC, PSD, and ASD approaches. Furthermore, except for the ASD approach, which occupies the most routing area, our BAT provides the best performance compared with the other four approaches. The variation of area overhead among the different approaches is shown in Table 2. The performance metric, RTC, is also listed in this table for comparison.
Finally, we study the impact of buffer size on transmission performance. Our experimental results show that the performance improvement is not proportional to the buffer size, as indicated by Fig. 10. Overly large buffers do not lead to significant performance improvement (but cost chip area), which implies that a set of small buffers is enough to synchronize data reception without sacrificing BAT performance. Figure 10 suggests that a buffer of 4 to 8 words is the best choice.

We implement the BAT structure for a 64-bit bus in Verilog HDL and synthesize it using the Synopsys Design Compiler, targeting a UMC 0.18 μm technology. The bus is configured as 4 groups, and the receiving-end buffers are configured to 4 words (4 × 32-bit). The total overhead (the sum of the combinational and non-combinational circuit area of the CDA, VSG, DCC, and Assembler) is about 81,540 μm², which is acceptable.
Conclusions
This paper presents a new on-chip bus transmission scheme: Bus-grouping Asynchronous Transmission (BAT). BAT can significantly mitigate the delay effects of crosstalk-sensitive transitions and thereby accelerate data transmission. Furthermore, from our experimental observations, we find that the Crosstalk Factor distribution on some types of buses, such as instruction buses, has distinct spatial locality. This characteristic can be used to optimize BAT performance. For the BAT implementation, two efficient techniques are presented: the Differential Counter Cluster (DCC) synchronization mechanism and Delay Active Shielding (DAS). We evaluate the effectiveness of the BAT scheme on the on-chip buses of a modern microprocessor using the SPEC CPU2000 benchmarks. When applied to an on-chip instruction bus, the proposed scheme improves performance by 55.3% compared with the original pessimistic approach, which always assumes the worst case, at the expense of 13% routing area overhead.
