In deep sub-micron technology, the crosstalk effect between adjacent wires has become an important issue, especially on long on-chip buses. This effect increases delay and power consumption and, in the worst case, produces incorrect results. In this paper, we propose a de-assembler/assembler structure to eliminate the undesirable crosstalk effect on bus transmission. By taking advantage of the prefetch process, in which the instruction/data fetch rate is always higher than the instruction/data commit rate in high-performance processors, the proposed method hardly reduces performance. In addition, when the bus width is 128 bits, the required number of extra bus wires is only 7, compared with the 85 needed in [6].
INTRODUCTION
In deep sub-micron technology, coupling capacitance between interconnects is the dominant factor in the total wire capacitance. Its effect arises when a signal and its neighboring wire switch in different directions. This effect, crosstalk, adds delay and power consumption to a signal. Even worse, in some cases it may cause a circuit to malfunction. Thus, elimination of crosstalk has become a very important design issue. Since a bus structure lays a number of wires in parallel over a long distance, the crosstalk problem is especially salient in buses.
Two major categories of crosstalk elimination approaches have been proposed. One category is designed for power consumption, and its objective is to minimize the total crosstalk in all wires. Spacing and shielding [1] are two well-known approaches in this category. Other approaches such as [2, 3, 4] are also designed to reduce the total crosstalk. The other category is designed for performance, and its objective is to minimize the maximum crosstalk effect among all wires. Kuo et al. [5] proposed techniques at the post-compiler level for performance improvement. In addition, [6] and [7] use bus-encoding methods to achieve this goal. Both [6] and [7] encode data to be crosstalk free before it is transmitted on the bus. At the receiving end of the bus, decoder logic restores the original data. In this paper, we focus on the second problem, i.e., the elimination of certain data transmission patterns so that the maximum crosstalk effect is minimized. In this regard, Victor et al. [6] proved theoretically that the maximum wire number for encoding an n-bit bus is log Fn+2, where Fn is the nth number of the Fibonacci sequence. These bus encoding methods become impractical when the number of bus lines becomes large. For example, a 128-bit bus will be encoded with 171 wires in theory and with 213 wires in practice. High-performance processors such as superscalar and VLIW architectures usually have wide buses, so such methods are not appropriate for them.
In this paper, a new bus structure is proposed for wide bus architectures in high-performance processors. To hide memory latencies, high-performance processors commonly use prefetching, which fetches instructions or data into buffers before they are used by the processor. By inserting a de-assembler and an assembler at the sending and receiving ends of the bus, respectively, certain transmission patterns that cause undesirable crosstalk can be eliminated. Moreover, our method takes advantage of the prefetch process, in which the instruction/data fetch rate is always higher than the instruction/data commit rate. Therefore, our approach incurs almost no penalty in terms of dynamic instruction count.
The rest of this paper is organized as follows. Section 2 describes the crosstalk model. Section 3 gives our motivation. Section 4 presents our novel bus architecture. Section 5 shows the experiment results. Finally, Section 6 concludes this paper.
CROSSTALK MODEL
There are two kinds of capacitance associated with a single wire. One is the capacitance C_ground between the wire and ground, and the other is the coupling capacitance C_couple between the wire and its neighboring wires. The total capacitance C_total of a signal wire is calculated as follows:
C_total = C_ground + n · C_couple
where n depends on the types of coupling with its neighboring wires. A more detailed analysis of the effect of C_total on delay can be found in [9].
The coupling capacitance of a wire can be classified into four types, 1C, 2C, 3C and 4C, according to the C_couple of two wires [7]. The crosstalk effect on a single wire (the victim) depends on the signal transitions of its neighboring wires (the aggressors). We use a triple (w_{i-1}, w_i, w_{i+1}) to represent the wire signal pattern at a certain time, where w_i represents the victim while w_{i-1} and w_{i+1} are the aggressors. Table 1 lists the transition patterns of each type.
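As an illustration of the classification, the following minimal Python sketch computes the factor n for the victim wire of one triple transition. It assumes the commonly used rule that each aggressor contributes 0 when it switches in the same direction as the victim, 1 when exactly one of the two wires switches, and 2 when they switch in opposite directions; this rule and the name coupling_factor are assumptions made for the example, not definitions quoted from Table 1.

```python
def coupling_factor(before, after):
    """Return n in C_total = C_ground + n * C_couple for the victim wire.

    `before` and `after` are triples (w_{i-1}, w_i, w_{i+1}) of 0/1 values.
    Assumed rule: an aggressor adds 0 (same direction as the victim),
    1 (only one of the two wires switches) or 2 (opposite directions).
    """
    # +1 for a rising wire, -1 for a falling wire, 0 for a static wire.
    swing = [a - b for b, a in zip(before, after)]
    victim = swing[1]
    return abs(victim - swing[0]) + abs(victim - swing[2])


if __name__ == "__main__":
    # Victim rises while both aggressors fall: the 4C case.
    print(coupling_factor((1, 0, 1), (0, 1, 0)))
    # Victim rises, one aggressor falls, the other is static: the 3C case.
    print(coupling_factor((0, 0, 1), (0, 1, 0)))
```

Under this rule a static victim never exceeds 2C, so 3C and 4C can only arise when the victim itself switches against at least one aggressor.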
MOTIVATION
In order to study 1) the relationship between the instruction/data fetch rate and the instruction/data commit rate and 2) the percentage of 3C and 4C crosstalk patterns incurred in bus transmission, we profiled the transmission on the instruction bus for the DSPstone benchmark. Experiments were performed using SimpleScalar 3.0 [10], and an out-of-order 4-issue superscalar architecture was used to simulate speculative fetching. Figure 2 shows the percentage of committed instructions relative to the total fetched instructions for the different examples in the benchmark set. The number of committed instructions is only about 30% to 40% of the total number of fetched instructions for all examples. In other words, the instruction fetch rate is much higher than the instruction commit rate in bus transmission. Table 2 gives the second profiling result. The column labeled bits of instruction gives the total number of bits of fetched instructions, and the column labeled bits of 3C and 4C gives the number of bits with 3C and 4C crosstalk patterns. The column labeled ratio of 3C and 4C shows the ratio of 3C and 4C crosstalk bits to the total fetched bits. The table shows that the ratio of 3C and 4C crosstalk patterns is very low. Since 3C and 4C crosstalk affects only a small portion of the total transmitted data but causes a serious delay penalty, we propose a de-assembler and an assembler structure at the two ends of the bus to eliminate these two types of crosstalk.
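The second profiling step can be sketched as follows. This is a simplified illustration, not the instrumentation used with the simulator: it assumes the bus trace is available as a list of integers (one bus word per cycle), uses the same per-neighbor coupling rule as the standard 0C to 4C model, ignores the missing neighbor of the two edge wires, and counts bits per transition. The names wire_classes and ratio_3c4c are hypothetical.

```python
def wire_classes(prev, cur, width):
    """Coupling factor (0..4) of every wire when the bus goes prev -> cur."""
    # +1 rising, -1 falling, 0 static, for bit positions 0..width-1.
    swings = [((cur >> k) & 1) - ((prev >> k) & 1) for k in range(width)]
    classes = []
    for k, v in enumerate(swings):
        n = 0
        if k > 0:                      # neighbor on one side, if it exists
            n += abs(v - swings[k - 1])
        if k < width - 1:              # neighbor on the other side
            n += abs(v - swings[k + 1])
        classes.append(n)
    return classes


def ratio_3c4c(trace, width):
    """Fraction of transmitted bits that see 3C or 4C crosstalk."""
    bad = total = 0
    for prev, cur in zip(trace, trace[1:]):
        bad += sum(1 for n in wire_classes(prev, cur, width) if n >= 3)
        total += width
    return bad / total if total else 0.0


if __name__ == "__main__":
    # Toy 4-bit trace: one fully alternating transition, then no change.
    print(ratio_3c4c([0b0101, 0b1010, 0b1010], 4))
```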
THE DE-ASSEMBLER AND ASSEMBLER TECHNIQUES
We develop a bus structure to de-assemble/assemble data on a bus such that 3C and 4C crosstalk patterns are eliminated. Figure 3 shows the overall architecture. 
Basic Idea
In our technique, a bus is first partitioned into several channels, channel_1, channel_2, ..., channel_n. Data transmitted on a channel is referred to as a data segment, denoted data_{t,i}, where t is the time stamp and i is the channel position index. Each data segment is regarded as a basic data transmission unit. Figure 4 illustrates how our de-assembling and assembling mechanisms work at cycle T_t. In Figure 4, data_{t,1} represents the data segment prepared to be sent on the first channel position in the current cycle, and data_{t-1,1} represents the data segment sent on the first channel position in the previous cycle. The data segments data_{t-1,i} sent in the previous cycle T_{t-1} are stored in registers in the de-assembler. Before data are sent, data_{t,i} and data_{t-1,i} are checked to see whether any 3C or 4C crosstalk occurs. If no 3C or 4C crosstalk is found, data_{t,i} is transmitted on channel_i. Otherwise, data_{t,i} is shifted to the next channel position, channel_{i+1}, and a data segment of all 0's (or all 1's), called an NOP segment, is inserted onto channel_i. Once data_{t,i} is shifted to channel_{i+1}, it must be checked against data_{t-1,i+1} to see whether any crosstalk occurs between them. The checking continues until data_{t,i} finds a position channel_j where data_{t,i} and data_{t-1,j} incur no crosstalk, or until it reaches the last channel of the bus. Data segments data_{t,i} that cannot be sent during the current cycle T_t because of NOP segment insertion are deferred to the next cycle T_{t+1}. For example, in Figure 4, assume that data_{t,1} incurs 3C/4C crosstalk with both data_{t-1,1} and data_{t-1,2}. Then data_{t,1} is shifted two channel positions and is sent at position channel_3. Since the data segments are shifted two channel positions, data_{t,n-1} and data_{t,n} are sent in the next transmission cycle T_{t+1}.
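The shifting procedure can be modeled in software as in the sketch below. It is a behavioral illustration under stated assumptions rather than the hardware design: segments are Python integers, the NOP segment is all 0's and is marked with None, and the function conflicts is a conservative stand-in for the 3C/4C check that flags any two adjacent wires of a channel switching in opposite directions (which is necessary for 3C or 4C to occur). Crosstalk across channel boundaries is not modeled here. The names conflicts and deassemble_cycle are hypothetical.

```python
NOP = None  # marker for an inserted NOP segment (driven as all 0's)


def conflicts(prev_bits, cur_bits):
    """True if any two adjacent wires switch in opposite directions.

    Assumption: used as a conservative stand-in for the 3C/4C check.
    """
    rising = cur_bits & ~prev_bits
    falling = prev_bits & ~cur_bits
    return bool(rising & (falling << 1)) or bool(rising & (falling >> 1))


def deassemble_cycle(prev_sent, pending, n_channels):
    """Place pending segments on the channels for one cycle.

    prev_sent: bit patterns driven on each channel in the previous cycle.
    pending:   segments waiting to be sent, in order.
    Returns (sent, leftover); sent[i] is a segment or NOP, and leftover
    holds the segments deferred to the next cycle.
    """
    sent = [NOP] * n_channels
    pos = 0      # next free channel position
    taken = 0    # number of pending segments placed in this cycle
    for seg in pending:
        # Shift past every channel whose previous data clashes with seg;
        # skipped channels keep their NOP.
        while pos < n_channels and conflicts(prev_sent[pos], seg):
            pos += 1
        if pos == n_channels:
            break              # ran off the bus: defer the rest
        sent[pos] = seg
        pos += 1
        taken += 1
    return sent, pending[taken:]


if __name__ == "__main__":
    prev = [0b0101, 0b0000, 0b0000]
    sent, leftover = deassemble_cycle(prev, [0b1010, 0b0001, 0b0011], 3)
    print(sent, leftover)
```

In this toy run the first segment clashes with the data previously sent on the first channel, so an NOP takes its place, the remaining segments shift by one position, and the last segment is deferred to the next cycle.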
As for the assembler, it must remove all inserted NOP segments and pack the valid data segments to form complete instructions, as shown in Figure 4. After the packing, the assembler informs the processor of the number of complete instructions transmitted during the current cycle. Data segments that cannot be packed into a complete instruction are stored in a buffer queue to wait for the next assembling step.
The worst case of transmission happens when 3C or 4C crosstalk occurs between data_{t,1} and every data segment transmitted at cycle T_{t-1}. The bus is then filled entirely with NOP segments in the current cycle. Since NOP segments do not cause crosstalk with any data pattern in the next transmission cycle, all data segments can then be sent without incurring any 3C/4C crosstalk patterns. Therefore, the worst case doubles the number of transmission cycles, alternating one cycle of data segments with one cycle of NOP segments.
Insertion of Separation Bits
Since crosstalk may occur across the boundary of two adjacent channels, shielding wires have to be inserted between every pair of channels. Moreover, a mechanism is required to distinguish whether an all-0's (or all-1's) pattern is an NOP segment or a real data segment. For these two purposes, our separation bits, s, are designed as follows.
We say that a set of bit patterns is a crosstalk free cyclic if no pair of patterns in the set incurs 3C/4C crosstalk. For example, the set of patterns (000, 001, 100, 101, 111) is a crosstalk free cyclic. Hence, in addition to acting as state-remembering bits, the separation bits must be designed to form a crosstalk free cyclic. The appropriate separation bits are chosen to form a (|s|+2)-bit crosstalk free cyclic, where |s| is the length of the separation bits and the 2 additional bits are the last bit of data_{t,i} and the first bit of data_{t,i+1}.
Since there are 4 patterns for the combination of data_{t,i} and data_{t,i+1}, and two more patterns are needed to tell whether data_{t,i} is an NOP segment or a data segment, we need to find a set of codes that is a crosstalk free cyclic of size at least 6. For |s| = 1, the maximum crosstalk free cyclic has only five codes (000, 001, 100, 101, 111), which are not enough to accommodate 6 different patterns. Let the size of s be increased to 2. The maximum number of crosstalk free cyclic codes is now over 6. In fact, for |s| = 2, there is more than one choice; Table 3 shows all possible choices. For example, when the NOP segment is designed to be the all-0's pattern, two codes for the s bits can be used. One is to have s = 10 when data_{t,i} is a data segment and s = 00 when data_{t,i} is an NOP segment. Similarly, if the NOP segment is designed to be the all-1's pattern, two codes for the s bits, (00, 10) and (01, 11), can be used. Figure 5 is an example that uses the all-0's pattern as the NOP segment and the (10, 00) pair as the selected codes for s. In this case, the first two patterns on the left, (0-1-0-0) and (0-1-0-1), indicate that data_{t,i} is a real data segment, and the two patterns on the right, (0-0-0-0) and (0-0-0-1), indicate that data_{t,i} is an NOP segment. Moreover, the six patterns form a crosstalk free cyclic. Finally, one special condition applies to the last channel position, channel_n. Since the last channel has no adjacent channel channel_{n+1}, only one bit is required to indicate whether the data sent on the last channel position is an NOP segment.
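The counting argument above can be checked by brute force, as in the following sketch. It assumes that two patterns are compatible when switching between them never makes two adjacent wires move in opposite directions, which is a conservative reading of "no 3C/4C crosstalk"; the function names are hypothetical. The six 4-bit patterns are written in the order (last bit of data_{t,i}, s, first bit of data_{t,i+1}) with s = 10 for data and s = 00 for an all-0's NOP, which is an assumption about how Figure 5 orders the bits.

```python
from itertools import combinations


def compatible(p, q):
    """True if switching between p and q never moves two adjacent wires
    in opposite directions (assumed, conservative 3C/4C-free test)."""
    rising = q & ~p
    falling = p & ~q
    return not (rising & (falling << 1) or rising & (falling >> 1))


def is_crosstalk_free(patterns):
    """True if every pair of patterns in the set is compatible."""
    return all(compatible(p, q) for p, q in combinations(patterns, 2))


def max_free_set_size(nbits):
    """Size of the largest crosstalk free set of nbits-wide patterns."""
    best = 0

    def extend(chosen, start):
        nonlocal best
        best = max(best, len(chosen))
        for p in range(start, 1 << nbits):
            if all(compatible(p, q) for q in chosen):
                extend(chosen + [p], p + 1)

    extend([], 0)
    return best


if __name__ == "__main__":
    print(is_crosstalk_free([0b000, 0b001, 0b100, 0b101, 0b111]))
    print(max_free_set_size(3))  # |s| = 1
    print(max_free_set_size(4))  # |s| = 2
    six = [0b0100, 0b0101, 0b1100, 0b1101, 0b0000, 0b0001]
    print(is_crosstalk_free(six))
```

Under this compatibility test the search reports a maximum of 5 patterns for 3 bits and 8 patterns for 4 bits, in line with the statement that one separation bit is not enough while two are.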
The De-assembler and Assembler Architectures
To check in parallel, rather than sequentially, whether crosstalk occurs between the data segment to be sent in the current cycle and the data segment sent in the previous cycle, we design a parallel checking architecture. In this section, we describe our de-assembler and assembler architectures. The de-assembler architecture is shown in Figure 6. In this example, the width of the whole bus is 128 bits and the width of each channel is set to 32 bits. Hence, bits 127 to 96 are grouped as channel_1, bits 95 to 64 are grouped as channel_2, and so on, and the total number of channels is 4.
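For this configuration, the number of extra wires follows from the separation-bit scheme of the previous subsection: two separation bits between each pair of adjacent channels and one bit after the last channel. The short sketch below works through that arithmetic; the function name extra_wires and its parameters are hypothetical, and values for other channel widths are derived from the same rule rather than read from the result tables.

```python
def extra_wires(bus_width, channel_width, sep_bits=2):
    """Extra wires needed for separation bits.

    sep_bits wires sit between each pair of adjacent channels, and the
    last channel needs a single bit to flag NOP versus data.
    """
    n_channels = bus_width // channel_width
    return sep_bits * (n_channels - 1) + 1


if __name__ == "__main__":
    # 128-bit bus, 32-bit channels: 4 channels, 2 * 3 + 1 extra wires.
    print(extra_wires(128, 32))
```

For the 128-bit bus with four 32-bit channels this gives the 7 extra wires reported for our method.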
To detect whether crosstalk occurs between the current data segment data_{t,i} and the data sent on channel_i in the previous cycle, two logic elements named data_reg and cross_detector are designed. The register data_reg_i stores the data segment sent on channel_i in the previous cycle. For each channel_i, cross_detector_{i,j}, where j ranges from 1 to i, is combinational logic that checks whether data_reg_i and data_{t,j} induce crosstalk. Note that, in order to check in parallel whether data_{t,i} can be sent on channel_k for k from i to n, one or more cross_detector_{i,j} units are designed for each channel position channel_i. Each data_reg_j is checked against all data segments data_{t,i} to be sent, for i from 1 to j, as shown in Figure 6.
Next, all the output signals of the cross_detector_{i,j} units are sent to a logic element named Sel_logic. With inputs from all cross detectors, Sel_logic decides which data segment is to be sent on channel_i. The output of Sel_logic is then passed to the first-level multiplexor, MUX1_i, whose inputs are data_{t,j} for j from 1 to i and the NOP segment. This multiplexor selects the data segment or NOP segment to be sent. Finally, the outputs of the cross_detector_{i,j} units are also sent to the second-level multiplexor, MUX2_i, which determines the separation bits. Taking data_{t,2} as an example, two crosstalk detectors, cross_detector_{2,1} and cross_detector_{2,2}, detect whether crosstalk occurs between data_reg_2 and data_{t,1}, and between data_reg_2 and data_{t,2}. The outputs of cross_detector_{1,1}, cross_detector_{2,1} and cross_detector_{2,2} are sent to Sel_logic. The outputs of Sel_logic are then used as the select signal of MUX1_2. The inputs to MUX1_2 include data_{t,1}, data_{t,2} and the NOP segment. Finally, MUX2_2 is used to choose the separation bits.
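The data flow through the detectors and the selection logic can be mimicked in software as follows. This is a behavioral sketch under assumptions, not the synthesized logic: all detector outputs are computed first (as the parallel hardware would), the selection is then derived from those outputs alone, the clash test is the conservative adjacent-opposite-switching check, and the separation bits use the (10, 00) pair with an all-0's NOP. The single bit of the last channel is omitted because its encoding is not specified in the text. All function names are hypothetical.

```python
def conflicts(prev_bits, cur_bits):
    """Assumed clash test: two adjacent wires switch in opposite directions."""
    rising = cur_bits & ~prev_bits
    falling = prev_bits & ~cur_bits
    return bool(rising & (falling << 1)) or bool(rising & (falling >> 1))


def cross_detectors(data_regs, cur):
    """cd[(i, j)], 1 <= j <= i <= n: does data_reg_i clash with data_{t,j}?"""
    n = len(data_regs)
    return {(i, j): conflicts(data_regs[i - 1], cur[j - 1])
            for i in range(1, n + 1) for j in range(1, i + 1)}


def sel_logic(cd, n):
    """For each channel i, return j (send data_{t,j}) or 0 (send NOP)."""
    select = []
    j = 1  # index of the next data segment still waiting for a channel
    for i in range(1, n + 1):
        if not cd[(i, j)]:
            select.append(j)
            j += 1
        else:
            select.append(0)
    return select


def separation_bits(select):
    """s bits after channels 1..n-1: 10 for data, 00 for an all-0's NOP."""
    return ["10" if j else "00" for j in select[:-1]]


if __name__ == "__main__":
    regs = [0b0101, 0, 0, 0]    # data sent in the previous cycle
    cur = [0b1010, 1, 2, 3]     # data_{t,1} .. data_{t,4}
    cd = cross_detectors(regs, cur)
    sel = sel_logic(cd, 4)
    print(len(cd), sel, separation_bits(sel))
```

With n channels this arrangement uses n(n+1)/2 detectors, i.e. 10 for the four-channel example, because channel_i can only receive data_{t,j} with j no larger than i.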
At the receiving side of the bus, an assembler is designed to remove the NOP segments. The architecture of the assembler is shown in Figure 7. The input of the assembler is a set of data segments interleaved with separation bits. First, a DSel_logic element is constructed to determine which incoming data are data segments and to which channel position each should be passed. The inputs to DSel_logic contain two kinds of signals. One is the separation bits, which record the information needed to distinguish a data segment from an NOP segment. The other is the number of data segments left unpacked from the previous cycle. Finally, an instruction control unit is designed to detect how many instructions are packed. The result of the instruction control unit is then sent to the processor.
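A behavioral sketch of the assembler is given below. It assumes that the separation bits have already been decoded into a per-channel flag (True for a data segment, False for an NOP segment), that segments arrive most significant first, and that an instruction consists of a fixed number of segments; the class name Assembler and its interface are hypothetical.

```python
from collections import deque


class Assembler:
    """Drops NOP segments and packs valid segments into instructions."""

    def __init__(self, segs_per_instr, seg_width):
        self.segs_per_instr = segs_per_instr
        self.seg_width = seg_width
        self.queue = deque()  # valid segments not yet packed

    def receive(self, channels):
        """channels: list of (is_data, bits) pairs in channel order.

        Returns the instructions completed in this cycle; segments that
        do not yet form a complete instruction stay in the queue.
        """
        for is_data, bits in channels:
            if is_data:  # decoded separation bits mark real data
                self.queue.append(bits)
        done = []
        while len(self.queue) >= self.segs_per_instr:
            word = 0
            for _ in range(self.segs_per_instr):
                word = (word << self.seg_width) | self.queue.popleft()
            done.append(word)
        return done


if __name__ == "__main__":
    asm = Assembler(segs_per_instr=2, seg_width=8)
    # One NOP and three data segments: one instruction completes,
    # one segment waits in the queue.
    print(asm.receive([(False, 0), (True, 0xAB), (True, 0xCD), (True, 0x12)]))
    print(asm.receive([(True, 0x34), (False, 0), (False, 0), (False, 0)]))
```

The length of the returned list plays the role of the count that the instruction control unit reports to the processor.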
EXPERIMENT RESULTS
To demonstrate the effectiveness and efficiency of our method, a set of experiments was conducted. The sim-outorder simulator from SimpleScalar 3.0 [10], extended with our de-assembler and assembler architecture, is used to simulate an out-of-order superscalar architecture without caches. We take the instruction bus as the demonstration example. Each instruction is 32 bits long, and four instructions are issued in parallel, so the total bus width is 128 bits. We adopt DSPstone as the benchmark set.
The first experiment measures how many extra cycles are needed to execute a program. Table 4 shows the results. The columns labelled TCC and pen are the total cycle count of the original circuit and the cycle penalty using our architecture, respectively. There is almost no cycle count overhead (less than 1%) for the 8-bit, 16-bit and 32-bit channel sizes. In the worst case, the cycle count overhead is only 0.21% (dot product when the channel size is 32).
The second experiment measures the extra wire and area overhead. The area overhead includes the extra wires required for separation bits and the area of the de-assembler/assembler. Table 5 compares our results with Victor's memoryless approach [6]. Four cases with different channel sizes using our method (4-bit, 8-bit, 16-bit and 32-bit per channel) and the two cases (theoretical and practical) from Victor's paper are shown. The results show that the wider the bus, the more effective our approach becomes. For example, when the bus width is 128 and the channel size is 32, the number of extra wires using our method is only 7, compared with the 59 and 85 needed for the theoretical and practical cases, respectively, in Victor's paper. For the area overhead of the de-assembler and the assembler, we choose the case of a 128-bit bus with 32 bits per channel. The two logic circuits are designed in Verilog and synthesized with the Synopsys Design Compiler. Table 6 compares our results with Victor's memoryless approach [6]. The gate count is obtained by synthesizing the circuits using only NOR gates and inverters, and the area is obtained with the TSMC 0.13µm cell library. The result shows that the de-assembler in our design takes more area than the encoder in Victor's approach [6]. This overhead comes mainly from the logic for the crosstalk detectors. In addition, registers are needed in our approach because the de-assembler has to store the data segments transmitted in the previous cycle.
The third experiment measures how much performance improvement can be obtained by eliminating 3C and 4C crosstalk. The result is simulated with Spice [12] for the case of a 128-bit bus with 32 bits per channel. The values of C_ground and C_couple in the different technologies are obtained from the Berkeley predictive technology model (BPTM) [11]. Table 7 shows the simulation result. In this table, the first column gives the process technology (65nm and 90nm). The second column gives the bus length (3mm and 5mm). The third to seventh columns report the wire delay without and with crosstalk. The next two columns report the critical path delay of the de-assembler and the assembler. All delay values are normalized to the wire delay without crosstalk (0C). The last column reports the improvement ratio of our design, calculated as
1 − (2C wire delay + de-assembler delay + assembler delay) / (4C wire delay)
In the overall improvement rate, ori_tcc and new_tcc denote the total transmission cycle counts of the original circuit and the new circuit, respectively, and rate denotes the clock length reduction rate for the 65nm and 90nm technologies. Figure 8 shows that the improvement rate for the different cases reaches 35% in 90nm technology and about 40% in 65nm technology. It also shows that the performance improvement is less significant when the channel size is smaller. In addition, the improvement rate increases as the process scales down.
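The improvement ratio reported in the last column of Table 7 can be evaluated as in the short sketch below. The function name and the numbers in the usage example are hypothetical and chosen only to make the arithmetic easy to follow; they are not values from Table 7.

```python
def improvement_ratio(delay_2c, deassembler_delay, assembler_delay, delay_4c):
    """1 - (2C wire delay + de-assembler delay + assembler delay) / 4C wire delay.

    All delays are expected in the same (for example normalized) unit.
    """
    return 1 - (delay_2c + deassembler_delay + assembler_delay) / delay_4c


if __name__ == "__main__":
    # Hypothetical normalized delays, for illustration only.
    print(improvement_ratio(3.0, 0.5, 0.5, 8.0))
```

The ratio is positive only when the 2C wire delay plus the logic delay of the two added blocks stays below the 4C wire delay of the original bus.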
CONCLUSION
In this paper, we have proposed a new bus structure to eliminate the 3C/4C crosstalk effect during data transmission. By inserting a de-assembler and an assembler at the sending and receiving ends of the bus, respectively, certain transmission patterns that cause undesirable crosstalk can be eliminated. We take advantage of the prefetch process, in which the instruction/data fetch rate is always higher than the instruction/data commit rate in high-performance processors. According to the experimental results, our method achieves a performance improvement rate of about 40% in 65nm technology at the expense of a small number of extra wires compared with the original design.
