Abstract-The crosstalk delay associated with global on-chip interconnects becomes more severe in deep submicrometer technology, and hence can greatly affect the overall system performance. Based on a delay model proposed by Sotiriadis et al., transition patterns over a bus can be classified according to their delays. Using this classification, crosstalk avoidance codes (CACs) have been proposed to alleviate the crosstalk delays by restricting the transition patterns on a bus. In this paper, we first propose a new classification of transition patterns, and then devise a new family of CACs based on this classification. In comparison to the previous classification, our classification has more classes and the delays of its classes do not overlap, both leading to more accurate control of delays. Our new family of CACs includes some previously proposed codes as well as new codes with reduced delays and improved throughput. Thus, this new family of CACs provides a wider variety of tradeoffs between bus delay and efficiency. Finally, since our analytical approach to the classification and CACs treats the technology-dependent parameters as variables, our approach can be easily adapted to a wide variety of technologies.
I. INTRODUCTION

A recent International Technology Roadmap for Semiconductors (ITRS) [1] has shown a troubling trend: while gate delay decreases with scaling, global wire delay increases. This is because, with the process technologies scaling down into deep submicrometer (DSM), the crosstalk delay becomes dominant in global wire delay due to the increasing coupling capacitance between adjacent wires. Hence, the crosstalk delay has become a serious bottleneck of the overall system performance.
The analytical model proposed by Sotiriadis et al. [2], [3], which is a widely used delay model, gives upper bounds on the delay of all wires on a bus. According to [2] and [3], the delay of the kth wire (k ∈ {1, 2, . . . , m}) of an m-bit bus is given by

\[
T_k = \begin{cases}
\tau_0\left[(1+\lambda)\Delta_1^2 - \lambda\Delta_1\Delta_2\right], & k = 1 \\
\tau_0\left[(1+2\lambda)\Delta_k^2 - \lambda\Delta_k(\Delta_{k-1}+\Delta_{k+1})\right], & 1 < k < m \\
\tau_0\left[(1+\lambda)\Delta_m^2 - \lambda\Delta_m\Delta_{m-1}\right], & k = m
\end{cases} \tag{1}
\]

where λ is the ratio of the coupling capacitance between adjacent wires to the ground capacitance, τ0 is the propagation delay of a wire free of crosstalk, and Δ_k is 1 for a 0 → 1 transition, −1 for a 1 → 0 transition, or 0 for no transition on the kth wire. In this model, the delay of the kth wire depends on the transition patterns of at most three wires, k − 1, k, and k + 1 only. The transition patterns over these three wires can be classified based on (1) into five classes, denoted by Di for i = 0, 1, 2, 3, 4, and the patterns in Di have a worst case delay (1 + iλ)τ0. This classification enables one to limit the worst case delay over a bus by restricting the patterns transmitted on the bus. That is, by avoiding all transition patterns in Di for i > i0, one can achieve a worst case delay of (1 + i0 λ)τ0 over the bus. Based on this principle, crosstalk avoidance codes (CACs) of different worst case delays have been proposed (see [4]-[6]). For example, forbidden overlap codes (FOCs), forbidden transition codes (FTCs), forbidden pattern codes (FPCs), and one lambda codes (OLCs) achieve a worst case delay of (1 + 3λ)τ0, (1 + 2λ)τ0, (1 + 2λ)τ0, and (1 + λ)τ0, respectively. Based on (1), a worst case delay of τ0 can be achieved by assigning two protection wires to each data wire [5]. Other types of CACs, such as those with equalization [7] or 2-D CACs [8], have been proposed in the literature. For CACs, since the area and power consumption of their encoders/decoders (CODECs) are all overheads, the complexities of the CODECs are important to the effectiveness of CACs. Thus, efficient CODECs have been proposed for CACs [9]-[11].
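As an illustration of how the three-wire classification works, the following minimal sketch enumerates the transition patterns on wires k − 1, k, and k + 1 and groups them by the index i in the worst case delay (1 + iλ)τ0. It assumes the interior-wire form of the model in [2] and [3], namely τ0[(1 + 2λ)Δ_k² − λΔ_k(Δ_{k−1} + Δ_{k+1})]; the function name `delay_class` and the encoding of transitions as +1, −1, and 0 are illustrative choices, not part of the original model description.

```python
from itertools import product

def delay_class(left, mid, right):
    """Return i such that the delay of the middle wire is (1 + i*lambda)*tau0.

    Arguments are transitions: +1 for 0->1, -1 for 1->0, 0 for no transition.
    Assumes the interior-wire expression of the three-wire model.
    Returns None when the middle wire does not switch.
    """
    if mid == 0:
        return None
    # (1 + 2*lam)*mid**2 - lam*mid*(left + right), using mid**2 == 1
    return 2 - mid * (left + right)

# Group all switching patterns into the classes D0..D4.
classes = {}
for left, mid, right in product((-1, 0, 1), repeat=3):
    i = delay_class(left, mid, right)
    if i is not None:
        classes.setdefault(i, []).append((left, mid, right))

for i in sorted(classes):
    print(f"D{i}: {len(classes[i])} patterns")
```

Under this assumption, restricting a bus to patterns with index at most i0 is what bounds the worst case delay by (1 + i0 λ)τ0.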
The classification of transition patterns based on the model in [2] and [3] has two drawbacks. First, the model in [2] and [3] has limited accuracy because of its dependence on only three wires: the model overestimates the delays of patterns in D1-D4, while it underestimates the delays of patterns in D0. For this reason, the scheme with a worst-case delay of τ 0 in [5] is invalid since its actual delay is much greater. Second, the actual delay ranges in some classes overlap with others. This, plus the overestimation of delays for D1-D4, implies that the delays of existing CACs are not tightly controlled. These drawbacks motivate us to include more wires and to classify the transition patterns without overlapping delay ranges.
In [12] , we have proposed a new analytical five-wire delay model. Two extra neighboring wires are included in the delay model [12] , and the delay of the middle wire of five neighboring wires is determined by the transition patterns on all five wires. This five-wire model has better accuracy than the model in [2] and [3] for Di for i = 0, 1, 2, 3, 4 [12] . This paper confirms that using more wires leads to improved accuracy.
There are two main contributions in this paper.
1) We approximate the crosstalk delay in a five-wire model and propose a new classification of transition patterns.
2) We propose a family of CACs based on our classification.
The work in this paper is different from previous works, including our previous works, in several aspects.
1) Although the delay approximation in this paper is also based on a five-wire model, it is different from that in our previous work [12]. The delay approximation in this paper is carried out by extending the approach in [13] from a three-wire model to a five-wire one.
2) Our classification of transition patterns is different from that in [2] and [3] [based on (1)] in two aspects. First, our classification has seven classes as opposed to five based on (1). Second, while the delays of some classes overlap for the classification based on (1), all classes in our classification have nonoverlapping delays. These two key differences allow us to have a more accurate control of delays for transition patterns.
3) Our new family of CACs is also different from previously proposed CACs, all of which are based on the classification in [2] and [3] [based on (1)]. While some codes in this new family are shown to be the same as existing CACs (OLCs, FPCs, and FOCs), this family also includes new codes that achieve smaller worst case delays and higher throughputs than OLCs, which have the smallest worst case delays among all existing CACs.
The rest of this paper is organized as follows. In Section II, we first propose our classification and compare it with that in [2] and [3] . We then present our new family of CACs in Section III and compare their performance with existing CACs in Section IV. Some concluding remarks are provided in Section V.
II. INTERCONNECT DELAYS AND CLASSIFICATION
A. Interconnect Modeling
Since the functionality and performance in DSM technology are greatly affected by the parasitics, distributed RC models are widely employed to analyze on-chip interconnects. In this paper, we consider the distributed RC model of five wires shown in Fig. 1 , where V i (x, t) denotes the transient signal at time t and position x (0 ≤ x ≤ L) over wire i for i ∈ {1, 2, 3, 4, 5}, r and c denote the resistance and ground capacitance per unit length, respectively. Also, λc denotes the coupling capacitance per unit length between two adjacent wires. The value of λ depends on many factors, such as the metal layer in which we route the bus, the wire width, the spacing between adjacent wires, and the distance to the ground layer. We consider a uniformly distributed bus with the same parameters r , c, and λ for all the wires. 
B. Derivation of Closed-Form Expressions
When determining the delay of a wire, the model in [2] and [3] considers only the effects of either one or two neighboring wires [see (1) ]. To address the drawbacks of the model in [2] and [3] described above, additional neighboring wires need to be accounted for. In our delay derivation below, whenever possible we consider four neighboring wires of a wire, i.e., two neighboring wires on each side, to determine its delay. To approximate the delay of a side wire (wires 1, 2, n − 1, or n) of an n-wire bus, three neighboring wires are considered. This is because the side wires are affected by fewer neighboring wires. This scheme is similar to the model in [2] and [3] and appears to work well. We focus on the 50% delay, which is defined as the time required for the unit step response to reach 50% of its final value.
In [13] , the crosstalk of two coupled lines was described by partial differential equations (PDEs), and a technique for decoupling these highly coupled PDEs was introduced by using eigenvalues and corresponding eigenvectors. In this paper, we extend this approach from a three-wire model to a five-wire one. Specifically, we first use the technique in [13] to decouple the PDEs that describe the crosstalk of four coupled wires, then solve these independent PDEs for closed-form expressions, and finally approximate the delays of each wire.
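The PDEs characterizing five wires with length L are given by

\[
\frac{\partial^2}{\partial x^2}\mathbf{V}(x,t) = \mathbf{R}\mathbf{C}\,\frac{\partial}{\partial t}\mathbf{V}(x,t)
\]

where R = diag{r r r r r}, V(x, t) = [V_1(x, t) V_2(x, t) V_3(x, t) V_4(x, t) V_5(x, t)]^T, and

\[
\mathbf{C} = c\begin{bmatrix}
1+\lambda & -\lambda & 0 & 0 & 0 \\
-\lambda & 1+2\lambda & -\lambda & 0 & 0 \\
0 & -\lambda & 1+2\lambda & -\lambda & 0 \\
0 & 0 & -\lambda & 1+2\lambda & -\lambda \\
0 & 0 & 0 & -\lambda & 1+\lambda
\end{bmatrix}.
\]

The eigenvalues of C/c are given by p_1 = 1, p_2 = 1 + ((3 − √5)/2)λ, p_3 = 1 + ((5 − √5)/2)λ, p_4 = 1 + ((3 + √5)/2)λ, and p_5 = 1 + ((5 + √5)/2)λ. Their corresponding eigenvectors are denoted by e_i for i = 1, . . . , 5. With a technique for decoupling PDEs similar to [13], the coupled PDEs are transformed into independent PDEs, one for each eigenvalue p_i, which are then solved in closed form. The resulting output signal of wire 3, V_3(L, t), is given by a sum of a constant and three exponent terms. Then the 50% delay of wire 3 can be evaluated by solving V_3(L, t) = 0.5V_dd.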
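For side wires, the PDEs characterizing four wires with length L are given by

\[
\frac{\partial^2}{\partial x^2}\mathbf{V}(x,t) = \mathbf{R}\mathbf{C}\,\frac{\partial}{\partial t}\mathbf{V}(x,t)
\]

where R = diag{r r r r}, V(x, t) = [V_1(x, t) V_2(x, t) V_3(x, t) V_4(x, t)]^T, and

\[
\mathbf{C} = c\begin{bmatrix}
1+\lambda & -\lambda & 0 & 0 \\
-\lambda & 1+2\lambda & -\lambda & 0 \\
0 & -\lambda & 1+2\lambda & -\lambda \\
0 & 0 & -\lambda & 1+\lambda
\end{bmatrix}.
\]

The eigenvalues of C/c are given by p_1 = 1, p_2 = 1 + (2 − √2)λ, p_3 = 1 + 2λ, and p_4 = 1 + (2 + √2)λ. Their corresponding eigenvectors are denoted by e_i for i = 1, . . . , 4. By decoupling the PDEs for side wires in the same way, we obtain closed-form expressions V_1(L, t) and V_2(L, t) for wires 1 and 2. Then the 50% delays of wires 1 and 2 can be evaluated by solving V_i(L, t) = 0.5V_dd for i = 1, 2.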
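The eigenvalues used in the decoupling step can be sanity-checked numerically. The sketch below assumes that C/c is the tridiagonal matrix implied by the model of Fig. 1 (diagonal entries 1 + λ for the two outer wires and 1 + 2λ for the others, off-diagonal entries −λ) and that the candidate eigenvalues are the closed forms listed in the code. The helper names `tridiag_det` and `coupling_diag` are hypothetical. Each candidate p is accepted if det(C/c − pI) vanishes up to rounding error.

```python
import math

def tridiag_det(diag, off, x):
    """det(A - x*I) for a symmetric tridiagonal A with main diagonal `diag`
    and constant off-diagonal entry `off`, via the three-term recurrence."""
    prev, cur = 1.0, diag[0] - x
    for d in diag[1:]:
        prev, cur = cur, (d - x) * cur - off * off * prev
    return cur

def coupling_diag(n, lam):
    """Diagonal of C/c for n wires (assumed): 1+lam at the ends, 1+2*lam inside."""
    return [1 + lam] + [1 + 2 * lam] * (n - 2) + [1 + lam]

lam = 12.24  # coupling factor reported for the 45-nm evaluation
s5, s2 = math.sqrt(5), math.sqrt(2)

# Candidate closed-form eigenvalues of C/c (five-wire and four-wire models).
five = [1, 1 + (3 - s5) / 2 * lam, 1 + (5 - s5) / 2 * lam,
        1 + (3 + s5) / 2 * lam, 1 + (5 + s5) / 2 * lam]
four = [1, 1 + (2 - s2) * lam, 1 + 2 * lam, 1 + (2 + s2) * lam]

residual5 = max(abs(tridiag_det(coupling_diag(5, lam), -lam, p)) for p in five)
residual4 = max(abs(tridiag_det(coupling_diag(4, lam), -lam, p)) for p in four)
print(residual5, residual4)  # both should be at rounding-error level
```

The same check can be repeated for other values of λ, since the closed forms are linear in λ.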
C. Pattern Classification
First, we consider the classification of transition patterns over five wires with respect to the delay of the middle wire (wire 3). In this paper, we use "↑" to denote a transition from 0 to the supply voltage V_dd (normalized to 1), "-" no transition, and "↓" a transition from V_dd to 0. We first focus on patterns with a ↑ transition on wire 3 in a five-wire bus and derive V_3(L, t) for each pattern as described in Section II-B. There are 3^4 = 81 different transition patterns, which can be partitioned into 25 subclasses as shown in Table I according to the expressions of the output signals on wire 3. All transition patterns in each subclass have the same expression V_3(L, t). The coefficients for all 25 subclasses are shown in columns 3-5 of Table II. Then the expressions V_3(L, t) of all patterns in the 25 subclasses are evaluated for their 50% delays. By grouping subclasses with close delays into one class, we can divide the 81 transition patterns into seven classes Ci for i = 0, 1, . . . , 6 shown in Table II. For all 25 subclasses, evaluated and simulated delays are provided in columns 6 and 7 of Table II, respectively. For all seven classes, the difference between the evaluated delay and the simulated delay in Table II is small.
All evaluations and simulations are based on a free PDK 45-nm CMOS technology with 10 metal layers [14]. We assume that the top two metal layers, layers 9 and 10, are used for routing global interconnects, and that metal layer 8 is used as the ground layer. An interconnect model in [15] is used for parasitic extraction. For a 5-mm bus in the top metal layer, the key parasitics, i.e., resistance, ground capacitance, and coupling capacitance, are given by R = 68.75 Ω, C_gnd = 41.32 fF, and C_couple = 505.68 fF, respectively. The bus is modeled by a distributed RC model as shown in Fig. 1 with 100 segments. The two important parameters used in our delay approximation are τ0 = 0.5RC_gnd = 1.42 ps and λ = C_couple/C_gnd = 12.24. Since the crosstalk delay on the bus constitutes a major part of the whole delay, the delays introduced by buffers are ignored. We assume that ideal step signals are applied on the bus directly. The closed-form expressions are evaluated for 50% delays via MATLAB, and the simulation is done by HSPICE.
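The two technology parameters follow directly from the extracted parasitics. The short calculation below reproduces them, assuming the resistance is expressed in ohms and the capacitances in femtofarads; the variable names are illustrative.

```python
R = 68.75              # total wire resistance (assumed unit: ohms)
C_GND = 41.32e-15      # ground capacitance in farads
C_COUPLE = 505.68e-15  # coupling capacitance in farads

tau0 = 0.5 * R * C_GND   # crosstalk-free propagation delay in seconds
lam = C_COUPLE / C_GND   # coupling factor (dimensionless)

print(f"tau0 = {tau0 * 1e12:.2f} ps, lambda = {lam:.2f}")
# prints "tau0 = 1.42 ps, lambda = 12.24"
```

Because λ is well above 1 for this technology, the coupling term dominates the delay of any pattern whose neighbors switch in the opposite direction.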
From Table II, it can be easily verified that C5 and C6 are the same as D3 and D4 in [2] and [3], respectively. That is, the middle three wires of the transition patterns in C5 (C6, respectively) constitute D3 (D4, respectively). The transition patterns in D0, D1, and D2 are divided into five classes C0-C4 in our classification.
Note that the coefficients c_i for i = 0, 1, 2 of the expression of wire 3 are independent of the technology and are determined by the transition pattern alone. For a given pattern, the coefficients c_i are fixed, and the delay is a function of τ0 and λ. Since the ratio t/τ0 appears in the exponent term, varying τ0 scales the delays in all classes by the same factor. Thus, the classification does not depend on τ0. The coupling factor λ, however, can affect the delays of different classes differently. In the following, we verify our classification for technologies with different coupling factors, λ = 1, 2, . . . , 13, and show the results in Fig. 2. Different classes are denoted by different line styles. Each class contains multiple lines, each of which represents a subclass. Patterns in each subclass have the same delay. For λ ≥ 3, the ranges of delays in all classes do not overlap. Also, the delay in each subclass increases linearly with λ. This implies that our classification is valid provided that the coupling factor λ is at least 3.
Then, we consider the classification of transition patterns over four wires with respect to the delays of the side wires. We classify patterns by considering the worst case delays of wires 1 and 2, respectively. Note that the classification with respect to the delays of wires 4 and 5 would be the same by symmetry. We first focus on patterns with a ↑ transition on wire 2 in a four-wire bus. There are 3^3 = 27 different transition patterns. As described in Section II-B, we first derive the expressions V_2(L, t) of these 27 patterns shown in Table III. By evaluating these patterns for their 50% delays, we group patterns with close delays into one class, and form five classes jC for j = 0, 1, 2, 3, 4 as shown in Table III. Then, we focus on patterns with a ↑ transition on wire 1. There are 3^3 = 27 different transition patterns. As described in Section II-B, we first derive the expressions V_1(L, t) of these 27 patterns shown in Table IV. By evaluating these patterns for their 50% delays, we group patterns with close delays into one class and form three classes jC for j = 0, 1, 2 as shown in Table IV. When both wires 1 and 2 have transitions, the delay on wire 2 is larger than that of wire 1, which can be verified from Tables III and IV. In this case, we focus on the delay of wire 2. When only wire 1 has a transition, we focus on the delay of wire 1. The difference between the evaluated delay and the simulated delay is small, as shown in Tables III and IV, with one exception (the pattern ↑↑↓↑ in 1C in Table III), which does not change our classification.
From Tables III and IV, the classes 3C and 4C of our classification are exactly the same as D3 and D4 in [2] and [3], respectively. The classes 1C and 2C of our classification are subsets of D1 and D2 in [2] and [3], respectively. The class 0C is a subset of D0 ∪ D1 in [2] and [3].
Similar to the classification of middle wires, we conclude that the classification on the side wires does not depend on τ0. To verify our classification for technologies with different coupling effects, we consider coupling factors λ = 1, 2, . . . , 13, and show the results in Fig. 3. Each class contains multiple lines, each of which represents a pattern in Tables III and IV. For λ ≥ 1, the ranges of delays in all classes do not overlap. Also, the delay in each subclass increases linearly with λ. This implies that our classification on side wires is valid provided that the coupling factor λ is at least 1.
In addition to being a finer classification, the new classification has no overlapping delays among different classes.
Fig. 4 of [19] compares the simulated delays of different classes based on the classification in [2] and [3] and our new classification. In [19, Fig. 4], the grey bars identify the minimum and maximum simulated delays in every class. Note that only two extremes are important, and not all delay values in the grey bars are achievable by some transition patterns. In [19, Fig. 4(a)], the thick line segments denote the upper bounds for delay of each class based on (1). The upper bounds by the model in [2] and [3] overestimate the delays of D1 through D4 and underestimate the delay of D0. As shown in [19, Fig. 4(a)], the actual delays in D0, D1, and D2 overlap with each other. Some patterns with smaller delays have the potential to transmit information at a higher speed, but are
III. NEW MEMORYLESS CACS
A. Previous CAC Design
CACs reduce the crosstalk delay for on-chip global interconnects by encoding a k-bit data word (x_1 x_2 · · · x_k) into an n-bit codeword (c_1 c_2 · · · c_n). Two kinds of CACs, i.e., CACs with memory and memoryless CACs, have been investigated in the literature [16]. CACs with memory need to store all codebooks corresponding to different codewords (c_1 c_2 · · · c_n), since the encoding depends on the data word (x_1 x_2 · · · x_k) as well as the preceding codeword. In contrast, memoryless CACs require a single codebook to generate codewords for transmission, because the encoding depends on the data word only. Hence, memoryless CACs are simpler to implement than CACs with memory. We focus on memoryless CACs in this paper.
The codebook of a memoryless CAC satisfies the property that each codeword must be able to transition to every other codeword in the codebook within the required worst case delay. Most memoryless CACs in the literature are based on the model in [2] and [3]. The key idea is to eliminate undesirable patterns for transmission. Existing memoryless CACs include OLCs, FPCs, FTCs, and FOCs [4]-[6], [17], which achieve a worst case delay of (1 + λ)τ0, (1 + 2λ)τ0, (1 + 2λ)τ0, and (1 + 3λ)τ0, respectively. As mentioned above, the scheme that was proposed to achieve a worst case delay of τ0 is invalid since the model in [2] and [3] underestimates the delays for D0. Thus, OLCs achieve the smallest worst case delay (1 + λ)τ0 among existing CACs.
There exist several methods to obtain a memoryless codebook based on pattern pruning, transition pruning, or recursive construction. The pattern pruning technique is quite straightforward, and gives a codebook with a smaller worst case delay by eliminating some patterns. For example, FOCs cannot have both 010 and 101 patterns around any bit position, and FPCs are free of 010 and 101 patterns [17]. The transition pruning technique [6] is based on graph theory. This method first builds a transition graph with all possible codewords as nodes and all valid transitions as edges, and then finds a maximum clique. A clique is defined as a subgraph where every pair of nodes is connected with an edge. A maximum clique is defined as a clique of the largest possible size in a given graph. Since every pair of nodes is connected, a maximum clique in this graph constitutes a memoryless codebook with the largest size. The codebook generation method is based on exhaustive search. Although it is easy to get a maximum clique from a transition graph with a small n, the complexity increases rapidly with n. This is because the number of edges in an n-bit transition graph is upper-bounded by 2^(n−1)(2^n − 1), which increases exponentially with n. In fact, finding a maximum clique is an NP-hard problem [18]. The recursive technique constructs an (n + 1)-bit codebook from an n-bit codebook [4], [5]. Since for a small n a largest codebook can be obtained easily via the second method, a codebook for an n-wire bus can be constructed recursively.
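The transition pruning technique can be sketched for a small bus as follows. This is only an illustration under stated assumptions: validity of a transition is judged with the three-wire model of [2] and [3] (a wire with delay (1 + iλ)τ0 has class index i), wire 1 is taken as the most significant bit, and the exhaustive clique search is a generic branch-and-bound routine, not the specific procedure of [6]. All function names are hypothetical.

```python
from itertools import combinations

def worst_class(a, b, n):
    """Largest i such that some wire has delay (1 + i*lambda)*tau0 when
    codeword a is followed by codeword b (three-wire model, assumed)."""
    d = [((b >> (n - 1 - k)) & 1) - ((a >> (n - 1 - k)) & 1) for k in range(n)]
    worst = 0
    for k in range(n):
        if d[k] == 0:
            continue  # a quiet wire contributes no delay
        nbrs = [d[j] for j in (k - 1, k + 1) if 0 <= j < n]
        worst = max(worst, len(nbrs) - d[k] * sum(nbrs))
    return worst

def max_clique(nodes, adjacent):
    """Exhaustive branch-and-bound maximum clique search (exponential time)."""
    best = []
    def extend(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = clique
        for idx, v in enumerate(candidates):
            if len(clique) + len(candidates) - idx <= len(best):
                return  # cannot beat the best clique found so far
            extend(clique + [v],
                   [w for w in candidates[idx + 1:] if adjacent(v, w)])
    extend([], list(nodes))
    return best

def largest_codebook(n, max_class):
    """Largest memoryless codebook whose transitions all stay within max_class."""
    def valid(a, b):
        return worst_class(a, b, n) <= max_class
    return max_clique(range(2 ** n), valid)

def fpc_codewords(n):
    """Pattern pruning: n-bit words containing neither 010 nor 101."""
    words = (format(w, f"0{n}b") for w in range(2 ** n))
    return [int(s, 2) for s in words if "010" not in s and "101" not in s]

print(len(largest_codebook(3, 2)))  # largest 3-bit codebook with delay <= (1+2*lam)*tau0
fpc5 = fpc_codewords(5)
print(len(fpc5), all(worst_class(a, b, 5) <= 2 for a, b in combinations(fpc5, 2)))
```

For n = 3 the search returns a codebook of six codewords, and the five-bit FPC word set forms a clique of 16 codewords in the corresponding transition graph. The search cost grows quickly with n, which is the motivation for the recursive construction.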
B. CAC Design With New Classification
Since our classification of patterns is different from that in [2] and [3] , the CAC designs should be reconsidered with our new classification. In the following, we first introduce a recursive method for codebook construction under different constraints, and then derive the size of codebooks.
In this paper, we use the recursive method to obtain a memoryless codebook for the following two reasons. First, it is complex to apply the pattern pruning technique, since our new classification is based on transitions over five wires, and it is not clear which patterns have larger worst case delays and should be removed. Second, it is hard to find a maximum clique for a transition graph with a large n. In our method, we first start with a five-bit codebook, obtained by searching for maximum cliques in a five-wire bus, and then build an (n + 1)-bit codebook by appending "0" and "1" to codewords of an n-bit codebook while satisfying delay constraints.
Our new classifications partition patterns over five adjacent wires into seven classes, C0-C6, and patterns over four adjacent wires into five classes, 0C-4C. Similar to the CAC design based on the model in [2] and [3] , the new classifications are conducive to the design of CACs by eliminating undesirable transition patterns with large worst case delays.
To get valid five-bit codebooks, we first assume that the allowed patterns are from C0 to Ci for i = 0, 1, . . . , 6 in our classification for middle wires. Then, for the side wires, we assume patterns are from 0C to jC based on the classification for side wires. Under these two assumptions, there are many configurations of constraints, which are referred to as (Ci, jC), where i ∈ {0, 1, . . . , 6} and j ∈ {0, 1, . . . , 4}.
Since the worst case delay of a bus is determined by the largest delays among all wires, for an n-bit (n ≥ 5) bus under (Ci, jC) we require that the worst case delays on middle wires and side wires are close enough. By our classifications, we find 0C is close to C0; 1C close to C2 and C3; 2C close to C4; 3C close to C5; and 4C close to C6. Hence, among all configurations of constraints (Ci, jC), we only focus on (C0, 0C), (C2, 1C), (C3, 1C), (C4, 2C), (C5, 3C), and (C6, 4C). When n ≤ 4, the constraint Ci cannot be enforced. Hence, the constraint (Ci, jC) reduces to jC. The constraint (C0, 0C) appears to be too restrictive, and hence we do not investigate it in this paper. The last configuration (C6, 4C) is trivial, since it allows arbitrary transitions.
In the following, we propose a scheme for finding an n-bit codebook C_(Ci, jC)(n). For simplicity, we denote C_(Ci, jC)(n) as C(n) when there is no ambiguity about the constraint. First, for a five-wire bus under constraint (Ci, jC), a pattern transition graph is obtained. We search the graph for the largest five-bit codebooks. One or two five-bit codebooks of maximum sizes exist for each constraint in Table V, where we denote an n-bit binary codeword (c_1 c_2 · · · c_n) as a decimal number Σ_{i=1}^{n} c_i 2^(n−i) for simplicity. In [6], a bit boundary in a set of codewords is said to be 01-type if only codewords with 00, 01, and 11 are allowed across that boundary, and a bit boundary is said to be 10-type when only codewords with 00, 10, and 11 are allowed across that boundary. It is shown that the largest clique for a given constraint has alternating boundary types. Thus, there are two largest cliques. Similarly, from Table V, we conjecture that the largest codebooks have alternating constraints, i.e., C_5^0 and C_5^1, for every five consecutive wires. For constraint (C4, 2C), only one maximum five-bit codebook exists. We assume C_5^1 is the same as C_5^0 for constraint (C4, 2C). Since we have two types of constraints, two largest codebooks for each constraint can be obtained, except for (C4, 2C), where the two codebooks are the same. Then we apply Algorithm 1 to obtain C(n). In the initialization, we pick a five-bit codebook C_5 = C_5^0. Then, the algorithm recursively appends one bit to the codewords in the codebook in each iteration. The recursive construction allows us to derive the size of the codebooks. Let V (
Then, for n ≥ 5, the number of codewords in an n-bit bus is equal to counting the valid transitions and is given by
In the following, we first focus on constraints (C3, 1C), (C4, 2C), and (C5, 3C) . The codes based on these constraints are shown to have the same codebooks as OLCs, FPCs, and FOCs, respectively. Then, we consider constraint (C2, 1C) , which would lead to codes with a smaller delay at the expense of a lower code rate. Several lemmas and theorems about the aforementioned codebooks and their sizes have been established below. All the proofs are straightforward, and hence omitted for conciseness. See the extended version [19] of this paper for more details.
C. Codes Under (C3, 1C)
The OLCs have a worst case delay (1 + λ)τ0. According to [17], the worst case delay (1 + λ)τ0 can be achieved if and only if the transitions ↑↓×, -↑-, and ↑-↑ plus their symmetric and complement versions (e.g., ↑↓× and ×↓↑ are symmetric, and -↓- is the complement of -↑-) are avoided, where ↑, ↓, ×, and - denote 0 → 1, 1 → 0, don't care, and no transition, respectively. The first constraint of avoiding ↑↓× ensures that a transition between any two codewords does not cause opposite transitions on adjacent wires. This condition is referred to as a forbidden transition (FT) condition. The second constraint of avoiding -↑- ensures that 2C patterns are removed. This constraint ensures that two adjacent bit boundaries cannot both be 01-type or 10-type, and is referred to as a forbidden adjacent boundary pattern condition [17]. The last two forbidden patterns (FPs) give the constraint that the patterns 010 and 101 do not appear in any codeword, which is referred to as an FP condition [17]. Codes satisfying these necessary and sufficient conditions are called OLCs. We denote the largest OLC codebook size for an n-bit bus as G_n, and G_n is given by

\[
G_n = G_{n-1} + G_{n-5}
\]

with initial conditions G_1 = 2, G_2 = 3, G_3 = 4, G_4 = 5, and G_5 = 7 [20]. With our classification, we explore codes under constraint (C3, 1C). From Table V, the two largest five-bit codebooks are given by C_5^0 = {0, 3, 14, 15, 24, 30, 31} and C_5^1 = {0, 1, 7, 16, 17, 28, 31}. An n-bit codebook C(n) can be obtained via Algorithm 1. The number of codewords is given by
where V is a 7-D all-1 vector and D (C3,1C) is a 7×7 expansion matrix. We further establish that the largest codebook sizes under constraint (C3, 1C) satisfy the following recursion.
with initial conditions |C (C3,1C) (n)| = 7, 9, 12, for n = 5, 6, 7, respectively.
In fact, we can further relate these codes with OLCs by the following. 
D. Codes Under (C4, 2C)
The (1 + 2λ) codes have a worst case delay of (1 + 2λ)τ0. No necessary and sufficient condition is known for a code to be a (1 + 2λ) code. Two sufficient conditions, FT and FP, are known, which lead to two families of (1 + 2λ) codes, i.e., FTCs and FPCs, respectively. The size of an FTC codebook for an n-wire bus is given by F_(n+2), where F_n is the Fibonacci sequence that satisfies F_(n+2) = F_(n+1) + F_n and has initial conditions F_1 = F_2 = 1 [6]. The FPCs for an n-wire bus have a larger codebook size 2F_(n+1) [4].
With our classification, we explore codes under constraint (C4, 2C). From Table V , only one largest five-bit codebook is found C 0 5 = {0, 1, 3, 6, 7, 12, 14, 15, 16, 17, 19, 24, 25, 28, 30 , 31}. An n-bit codebook C(n) can be obtained via Algorithm 1 by setting C 1 5 = C 0 5 . The number of codewords is given by
where Since FPCs and our codes under (C4, 2C) can be obtained by excluding D3 plus D4 patterns and C5 plus C6 patterns, respectively, Theorem 3.2 is not surprising given that C5 and C6 are the same as D3 and D4, respectively. Theorem 3.2 implies that results in the literature regarding FPCs are also applicable to codes under constraint (C4, 2C).
E. Codes Under (C5, 3C)
The (1 + 3λ) codes have a worst case delay of (1 + 3λ)τ0, which can be achieved if and only if ↓↑↓ and ↑↓↑ are avoided. So the necessary and sufficient condition for the (1 + 3λ) codes is that the codebook cannot have both 010 and 101 appearing centered around any bit position, which is referred to as a forbidden overlap (FO) condition. Codes satisfying the FO condition are called FOCs. It is shown that the size of the largest FOC codebook for an n-bit bus is given by T_(n+2), where T_n = T_(n−1) + T_(n−2) + T_(n−3) is the tribonacci number sequence with initial conditions T_1 = 1, T_2 = 1, and T_3 = 2 [17].
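The closed-form codebook sizes quoted for FTCs, FPCs, and FOCs can be tabulated with a few lines of code. The sketch below uses only the recurrences and initial conditions stated in the text; the function names are illustrative.

```python
def fibonacci(n):
    """F_n with F_1 = F_2 = 1."""
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a

def tribonacci(n):
    """T_n with T_1 = T_2 = 1 and T_3 = 2."""
    a, b, c = 1, 1, 2
    for _ in range(n - 1):
        a, b, c = b, c, a + b + c
    return a

def codebook_sizes(n):
    """Sizes of the FTC, FPC, and FOC codebooks for an n-wire bus."""
    return {"FTC": fibonacci(n + 2),
            "FPC": 2 * fibonacci(n + 1),
            "FOC": tribonacci(n + 2)}

print(codebook_sizes(5))  # {'FTC': 13, 'FPC': 16, 'FOC': 24}
```

For n = 5, the FPC and FOC sizes of 16 and 24 agree with the sizes of the five-bit codebooks listed for constraints (C4, 2C) and (C5, 3C).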
With our classification, we explore codes under constraint (C5, 3C). Two largest five-bit codebooks C 0 5 = {0, 1, 2, 3, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 24, 25, 26, 27, 28, 30 , 31} and C 1 5 = {0, 1, 3, 4, 5, 6, 7, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25, 28, 29, 30 , 31} are found. Via Algorithm 1, an n-bit codebook C(n) can be obtained. The number of codewords is given by 
F. Codes Under (C2, 1C)
With our classification, we explore codes under constraint (C2, 1C). From Table V 
G. Pruned Codes Under (C2, 1C)
For (C2, 1C), the restriction on the side wires is more relaxed than that on the middle wires, which results in larger worst case delays for the side wires. Hence, we prune the CACs under constraint (C2, 1C) by removing codewords with larger delays on the side wires in order to achieve a smaller worst case delay. Since the pruned codes have a smaller delay than OLCs, we call these pruned CACs improved OLCs (IOLCs). We obtain IOLCs by first finding an n-bit codebook via Algorithm 1 as in Section III-F, and then pruning the codebook with Algorithm 2. To prune the codebook C(n), we depending on whether n is odd or even.
The pruning algorithm for CACs under (C2, 1C) on an n-bit bus is shown in Algorithm 2. By pruning all codewords c n in C(n), the algorithm removes codewords with larger delay on side wires. With Algorithm 2, we get an n-bit IOLC under constraint (C2, 1C), and its size is given by
, and D is the same as that in (9) . Note that W 1 and W 2 are used instead of V, because of the pruning of valid patterns on side wires. We further establish that the largest codebook sizes of IOLCs satisfy the following recursion. 
IV. PERFORMANCE EVALUATION
In this section, we evaluate the performance of CACs based on our classification with extensive simulations, and compare them with existing CACs. Each CAC has two key performance metrics: delay and rate. The delay of a CAC is the worst case delay when the codewords from the CAC are transmitted over the bus. Codebook size and code rate are often used to measure the overhead of CACs. The codebook size of a CAC is simply the number of codewords. Suppose a CAC of size M is transmitted over an n-bit bus; then its rate is defined as (log_2 M)/n. A CAC of rate k/n implies that n − k extra wires are used in addition to k data wires so as to reduce the crosstalk delay. Hence, the code rate measures the area and power overhead of CACs: the higher the rate, the smaller the overhead. Obviously, there is a tradeoff between the code rate and delay of a CAC: typically a lower rate code is needed to achieve a smaller delay. To measure the overall effects of both rate and delay, we also define the throughput of a CAC as the ratio of code rate to delay. The assumptions for this definition are: 1) the clock rate of the bus is determined by the inverse of the worst case delay and 2) the throughput of the bus is linearly proportional to k, which is the number of data wires.
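The three metrics defined above translate directly into code. The following sketch implements the rate (log_2 M)/n, the throughput as rate divided by worst case delay, and the throughput gain of one code over another on the same bus. The numbers in the usage lines are hypothetical and serve only to exercise the functions.

```python
import math

def code_rate(codebook_size, n):
    """Rate of a CAC with `codebook_size` codewords on an n-bit bus."""
    return math.log2(codebook_size) / n

def throughput(codebook_size, n, delay):
    """Code rate divided by worst case delay."""
    return code_rate(codebook_size, n) / delay

def throughput_gain(size_a, delay_a, size_b, delay_b, n):
    """Throughput of code A relative to code B on the same n-bit bus."""
    return throughput(size_a, n, delay_a) / throughput(size_b, n, delay_b)

# Hypothetical example: two codes with 16 codewords each on an 8-bit bus,
# one with half the worst case delay of the other.
rate = code_rate(16, 8)
gain = throughput_gain(16, 2.0, 16, 4.0, 8)
print(rate, gain)
```

When two codes have equal size, the throughput gain reduces to the inverse ratio of their worst case delays, so a smaller codebook can still yield a gain above 1 if its delay reduction outweighs the loss in rate.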
Since codes under (C3, 1C), (C4, 2C), and (C5, 3C) have exactly the same codebooks as OLCs, FPCs, and FOCs, their delay, rate, and throughput are also the same. Under constraint (C2, 1C), we propose two kinds of codes, i.e., unpruned codes and pruned codes (IOLCs). In the following, we compare their performance with OLCs in [5] with extensive simulations.
To compare the worst case delays of our IOLCs, unpruned (C2, 1C) codes, and OLCs, we simulate two buses, i.e., a 10-bit bus and a 16-bit bus, with all transitions between any two codewords in their codebooks, and obtain the worst case delay of each wire. The simulation environment has been explained in Section II-C. Both buses have a length of 5 mm, with τ_0 = 1.42 ps and λ = 12.24. For a 10-bit bus, the worst case delays of our IOLC, the unpruned (C2, 1C) code, and an OLC are 10.14, 13.50, and 14.84 ps, respectively. The worst case delays of our IOLC and the unpruned (C2, 1C) code are thus 31.67% and 9.03% smaller than that of the OLC, respectively. For a 16-bit bus, the worst case delays of our IOLC, the unpruned (C2, 1C) code, and an OLC are 10.40, 13.92, and 16.11 ps, respectively. The worst case delays of our IOLC and the unpruned (C2, 1C) code are thus 35.44% and 13.59% smaller than that of the OLC, respectively. See the extended version [19] of this paper for additional information.
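The percentage reductions quoted above follow directly from the reported worst case delays. The following Python sketch reproduces that arithmetic; the function and variable names are our own, and only the delay values are taken from the text.

```python
def relative_reduction(delay_ps: float, reference_ps: float) -> float:
    """Percentage by which delay_ps is smaller than reference_ps."""
    return (reference_ps - delay_ps) / reference_ps * 100.0


# Worst case delays in ps, as reported in the text for the two simulated buses.
WORST_CASE_DELAYS_PS = {
    10: {"IOLC": 10.14, "unpruned (C2, 1C)": 13.50, "OLC": 14.84},
    16: {"IOLC": 10.40, "unpruned (C2, 1C)": 13.92, "OLC": 16.11},
}

reductions = {}
for n_bits, delays in WORST_CASE_DELAYS_PS.items():
    olc_delay = delays["OLC"]
    for name in ("IOLC", "unpruned (C2, 1C)"):
        value = round(relative_reduction(delays[name], olc_delay), 2)
        reductions[(n_bits, name)] = value
        print(f"{n_bits}-bit bus, {name}: {value}% smaller than OLC")
```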
In all simulations, our IOLCs have better delay performance than OLCs. Although both IOLCs and unpruned (C2, 1C) codes have almost the same code rate and better delay performance than OLCs, the delay performance of IOLCs is much better than that of the unpruned (C2, 1C) codes. With a more advanced technology, where the coupling effect is more significant, the improvement achieved by our IOLCs is larger.
Table VI compares the codebook sizes of our IOLCs, unpruned (C2, 1C) codes, and OLCs [5], and shows the throughput gain with respect to OLCs. The throughput gain of our CACs with respect to OLCs is the ratio of the throughput of our CACs to that of OLCs. The codebook sizes of the three codes are close. In all cases, the difference in the number of bits between our IOLCs and unpruned (C2, 1C) codes is within one bit. The difference in the number of bits between our IOLCs and OLCs [5] is within two bits for n ≤ 16. With respect to throughput, our IOLCs always have a greater throughput than OLCs, and their throughput gain ranges from 1.02 to 1.55 for an n-wire bus (5 ≤ n ≤ 16). The unpruned (C2, 1C) codes have better throughput than OLCs in some cases, and their throughput gain ranges from 0.78 to 1.10 for an n-wire bus (5 ≤ n ≤ 16). When unpruned (C2, 1C) codes have a lower throughput than OLCs, IOLCs can be used instead.
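As a minimal sketch of how the throughput gain could be computed from a codebook size and a worst case delay, consider the Python function below. It assumes that throughput is (log_2 M)/n divided by the worst case delay and that both codes use the same n-wire bus, so n cancels in the ratio. The function name is ours, and the numbers in the usage lines are hypothetical; they are not entries of Table VI.

```python
import math


def throughput_gain(size_new: int, delay_new: float,
                    size_ref: int, delay_ref: float) -> float:
    """Ratio of the throughput of a new code to that of a reference code.

    Both codes are assumed to use the same n-wire bus, so the bus width n
    cancels and only codebook sizes and worst case delays remain.
    """
    bits_ratio = math.log2(size_new) / math.log2(size_ref)
    return bits_ratio * (delay_ref / delay_new)


# Hypothetical inputs (assumption): equal codebook sizes, so the gain
# reduces to the ratio of the two worst case delays.
print(throughput_gain(64, 10.0, 64, 15.0))

# Hypothetical inputs (assumption): equal delays, half as many data bits.
print(throughput_gain(16, 1.0, 256, 1.0))
```

A gain above 1 means that the shorter delay more than compensates for any loss in codebook size; a gain below 1 means that it does not.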
Our IOLCs and unpruned (C2, 1C) codes provide additional options for the tradeoff between code rate and code delay. In addition to achieving higher throughputs, the new CACs are also appropriate for interconnects where the delay is of top priority.
It has been shown that the encoding and decoding of OLCs, FPCs, and FOCs based on numeral systems have quadratic complexity [11]. Since codes under (C3, 1C), (C4, 2C), and (C5, 3C) have exactly the same codebooks as OLCs, FPCs, and FOCs, their CODECs also have quadratic complexity. It is also expected that the encoding and decoding of our IOLCs and unpruned (C2, 1C) codes have quadratic complexity, since the codebooks of our IOLCs and unpruned (C2, 1C) codes are proper subsets of those of OLCs.
We remark that the simulation results in Sections II-C and IV are all based on a 45-nm CMOS technology. We have also run the same set of simulations based on a 0.1-μm technology (omitted for brevity). Across the two sets of simulation results, the main conclusions of this paper and the key features of our proposed classification and CACs remain the same. For instance, the delays of the patterns in different classes do not overlap regardless of the technology. The proposed CACs based on the new classification are also the same. This demonstrates that our approach to delay classification and CACs is applicable to a wide variety of technologies. The reason is that, in our approach, the dependency of the crosstalk delay on the technology is represented by two parameters, i.e., the propagation delay τ_0 of a wire free of crosstalk and the coupling factor λ. Since our analytical approach to the classification and CACs treats these two parameters as variables, it can be easily adapted to a wide variety of technologies.
V. CONCLUSION
In this paper, we proposed a new classification of transition patterns. The new classification has finer classes, and the delays of different classes do not overlap. Hence, the new classification is conducive to the design of CACs. To illustrate this, we designed a family of CACs with different constraints. Some codes of the family are the same as existing codes, namely OLCs, FPCs, and FOCs. We also proposed two new CACs with a smaller worst case delay and better throughput than OLCs. Since our analytical approach to the classification and CACs treats the technology-dependent parameters as variables, our approach can be easily adapted to a wide variety of technologies.
