Abstract-Increasing demand for high-speed inter-chip interconnects requires faster links that consume less power. Channel coding can be used to lower the required signal-to-noise ratio for a specific bit error rate in a channel. There are numerous codes that can be used to approach the theoretical Shannon limit, which is the maximum information transfer rate of a communication channel for a particular noise level. However, the complexity of these codes prohibits their use in high-speed inter-chip applications. A low-complexity signaling scheme is proposed here. This method can achieve 3-5-dB coding gain over uncoded four-level pulse amplitude modulation (PAM). The receiver for this signaling scheme, along with a regular 4-PAM receiver, was designed and implemented in a 0.18-µm standard CMOS technology. Experimental results show that the receiver is functional up to 2.5 Gb/s. This was verified with a bit error rate tester (BERT), which showed error-free operation at a 2.5-Gb/s channel transfer rate. The entire receiver for this scheme consumes 22 mW at 2.5 Gb/s and occupies an area of 0.2 mm².
required SNR can be shown to be 18.4 dB [1]. This means that if the signal amplitude is 100 mV, the standard deviation of the permitted noise in the system could be as high as 12 mV. To reduce the BER, in general, one needs to either increase the signal amplitude or reduce the noise by using special circuit techniques, such as higher current density and larger devices [2]. Both solutions require more power and/or area. Since off-chip drivers can consume up to 70% of the power of a large pin-count digital chip [3], reducing the power consumed by interconnect circuitry is extremely important.
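The relation between the quoted SNR and the permitted noise can be checked with a short calculation. The following is a minimal sketch, assuming that the SNR is defined as the ratio of signal amplitude to noise standard deviation, expressed as 20 log10; the function name is hypothetical.

```python
def noise_sigma_for_snr(amplitude_v, snr_db):
    """Noise standard deviation that gives snr_db for a given signal amplitude.

    Assumes SNR(dB) = 20 * log10(amplitude / sigma).
    """
    return amplitude_v / (10 ** (snr_db / 20))


sigma_v = noise_sigma_for_snr(0.100, 18.4)
print(f"permitted noise sigma: {sigma_v * 1e3:.1f} mV")
```

With a 100-mV amplitude and an 18.4-dB SNR, the result is close to the 12 mV stated in the text.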
Recently, there have been several attempts to reduce the number of required signal paths in chip-to-chip communications by using incremental signaling or by transmitting three bits over four signal paths [4], [5]. Multilevel signaling can also be used as an alternative to reduce the number of required signal paths in a link. Furthermore, for a given data rate, multilevel signaling can be used to reduce the channel symbol rate, inter-symbol interference (ISI), and crosstalk [1], [6]. The potential benefits of four-level pulse amplitude modulation (4-PAM) signaling for increasing data rates in physical short-bus systems have been shown in [7]-[9]. However, transmitted power is often increased to compensate for the impact of multilevel signaling on bit error rate (BER). Therefore, the use of coding schemes to reduce the required power is becoming more appealing.
There is still a significant gap between the Shannon limit and the data rates of the current state-of-the-art designs [10] . Introducing some redundancy at the transmitter (channel coding) can be used as an attempt to approach the Shannon limit and to find a low-power scheme [11] , [12] . In general, finding good codes is not a difficult task and randomly generated codes with a large block size can form codes that are close to the Shannon limit. However, the real problem is the complexity of these coding schemes. Although encoding is always a rather simple task, the decoding complexity increases exponentially with the block size, which can quickly result in impractical codes [13] . Therefore, instead of making the code more and more complex, the search should focus on finding low-complexity codes with good coding gain. In chip-to-chip communication applications, where high-speed implementation is the main concern, this becomes even more important.
Several suitable coding schemes for inter-chip communications were proposed in [14] by using 6-PAM instead of 4-PAM. In this paper, we propose a novel coding scheme that provides significant coding gain without the need to expand the modulation to 6-PAM. This is important in peak-power-limited applications. Moreover, it simplifies the implementation of the transceiver, which is extremely important at high data rates. The proposed method is explained in Section II. Section III compares simulation results of this method with those of a conventional 4-PAM scheme. Section IV introduces different architectures for implementation of the proposed signaling scheme. A high-speed, low-complexity analog implementation of this method is explained in Section V. Finally, a receiver for this signaling scheme, along with a conventional 4-PAM receiver, is designed and implemented in 0.18-µm CMOS technology, and Section VI presents the experimental results for this chip.
II. PROPOSED CODING SCHEME FOR INTER-CHIP COMMUNICATION
Low-complexity coding schemes for high-speed inter-chip communications applications have been introduced in [12], which are based on the idea of coded modulation [15]. The signaling scheme proposed in this paper is based on a coding scheme described in [11, p. 680]. Although the idea is general and can be applied to different constellations, only a method for a four-point constellation is explained here.
Similar to coded modulation, as shown in Fig. 1, a four-point constellation can be partitioned into two subsets. The minimum distance between the points in each subset is twice that of the original four-point constellation. Consider transmitting two bits per symbol: one bit to select subset A or B and the other bit to select the point within the subset. Obviously, correctly decoding the bit that selects subset A or B is more important, because the minimum distance of the points inside each subset is twice the minimum distance of the original 4-PAM. Therefore, protecting the important bit with a coding scheme is desirable. A simple method is to use a simple convolutional coding scheme such as duobinary. Fig. 2 shows a block diagram of the transceiver based on this idea. As shown in this figure, a Viterbi detector can be used to decode the signal in the receiver.
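The doubling of the minimum distance within each subset can be illustrated numerically. The sketch below assumes normalized 4-PAM levels of -3, -1, +1, and +3 and assigns alternate levels to subsets A and B; the actual assignment in Fig. 1 is not reproduced here, so the labels are an assumption.

```python
from itertools import combinations

LEVELS = [-3, -1, 1, 3]    # assumed normalized 4-PAM levels
SUBSET_A = LEVELS[0::2]    # [-3, 1], assumed labeling
SUBSET_B = LEVELS[1::2]    # [-1, 3], assumed labeling


def min_distance(points):
    """Smallest distance between any two distinct constellation points."""
    return min(abs(p - q) for p, q in combinations(points, 2))


print(min_distance(LEVELS), min_distance(SUBSET_A), min_distance(SUBSET_B))
```

Under these assumptions, the minimum distance is 2 for the full constellation and 4 inside each subset.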
The trellis of this scheme is shown in Fig. 3(a) and a minimum-distance error event is illustrated in Fig. 3(b). In this event, the correct path is shown by the dashed line. If the minimum-distance error event in Fig. 3(b) were the only probable error event, the total gain would be about 3 dB. However, no coding gain should be expected, since two source bits are represented with an alphabet of size four and there is no redundancy [11]. Although the minimum distance of the code is increased, there is actually an infinite number of minimum-distance error events [see Fig. 3(c)]. This explains why this method does not provide any gain.
An interesting way to achieve an appreciable coding gain is to force the two-state trellis to return to state zero at every fourth symbol [11], as shown in Fig. 4. Note that in the fourth symbol interval the coder does not have a choice of set A or B. Thus, only one bit instead of two can be transmitted in the fourth symbol interval. Now, there are at most three error events with the minimum distance starting at any given time, and nearly the full 3-dB improvement can be realized at high SNRs [11, p. 682].
As shown in Fig. 2, the state of the trellis at each time step depends only on the second input bit, Bit2, while the first bit, Bit1, can be either zero or one in each branch. This means that there are actually two parallel branches between every two states of the trellis. Fig. 5 shows the actual trellis for one time step. Fig. 5(b) shows the branch metric corresponding to each branch. It can be shown that these branch metrics can be simplified to those in Fig. 5(c). In each branch metric, represents the received signal. The two branches between every two states can be merged by taking the minimum of the two branch metrics. Therefore, the branch metrics can be calculated by
The proposed method needs one Viterbi decoder for each line in the bus. This increases the complexity of the receiver. To alleviate this problem and to reduce the latency of the decoder, the sequence can be transmitted in space over a bus rather than sequentially in time. The Viterbi algorithm can also be applied in space. This idea leads to a new scheme for chip-to-chip communication. In this scheme, hereafter referred to as the 4LINE-PAM4 scheme, seven input bits are converted to eight bits by means of a convolutional encoder in space, as shown in Fig. 6(a). Therefore, the rate of this coding scheme is 7/8. These eight bits can be transmitted over four lines in a bus using 4-PAM modulation. A Viterbi decoder can be used for decoding the original seven bits from the received signal.
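The mapping of seven bits onto four 4-PAM lines can be sketched in a few lines of code. This is only an illustration under stated assumptions: the subset bit on each line is taken as the modulo-2 sum of the current and previous trellis-state bits (a 1 + D encoder), the levels and subset labels are normalized guesses, and the input bit ordering and the name `encode_4line` are hypothetical rather than taken from Fig. 6(a).

```python
from itertools import product

# Assumed labeling: subset A = 0 -> (-3, +1), subset B = 1 -> (-1, +3)
SUBSETS = {0: (-3, 1), 1: (-1, 3)}


def encode_4line(bits):
    """Map 7 bits to four 4-PAM symbols, one per line of the bus.

    bits = [p1, s1, p2, s2, p3, s3, p4] (assumed ordering):
    p* pick the point inside a subset, s* drive the two-state trellis.
    """
    if len(bits) != 7:
        raise ValueError("expected exactly 7 bits")
    point_bits = bits[0::2]   # p1, p2, p3, p4
    state_bits = bits[1::2]   # s1, s2, s3
    state = 0                 # the trellis starts in state zero
    symbols = []
    for line in range(4):
        # On the fourth line the trellis is forced back to state zero.
        next_state = state_bits[line] if line < 3 else 0
        subset = state ^ next_state          # assumed (1 + D) mod-2 selection
        symbols.append(SUBSETS[subset][point_bits[line]])
        state = next_state
    return symbols


codebook = {tuple(encode_4line(list(b))) for b in product((0, 1), repeat=7)}
print(len(codebook), "distinct codewords out of", 4 ** 4, "possible symbol vectors")
```

Because the fourth subset bit is determined by the third state bit, only half of the 256 possible four-symbol vectors are used, which is where the redundancy of the rate-7/8 code comes from.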
III. SIMULATION RESULTS
The performance improvement of the proposed 4LINE-PAM4 scheme compared to the conventional 4-PAM method has been verified by simulations with two different channel models.
A. Simulation Results With an AWGN Channel Model
With an additive white Gaussian noise (AWGN) model for the channel, the 4LINE-PAM4 scheme and the ordinary 4-PAM scheme were simulated in MATLAB. Simulation results show that the performance of the 4LINE-PAM4 scheme is roughly 2.6 dB better than that of 4-PAM at a symbol error rate (SER) of 10 (see Fig. 7). As shown in Fig. 6(b), the receiver can also be implemented digitally by using 3- or 4-bit analog-to-digital converters (ADCs). Fig. 7 also shows the performance of these digital implementations.
B. Simulation Results With a More Realistic Channel Model
The AWGN model is a good model for a channel with perfect termination and no ISI. However, the main sources of noise in practical inter-chip systems are usually ISI and residual reflection due to imperfect termination. Therefore, a more realistic model for the channel [16] is used to verify the performance of the proposed method in two practical situations.
1) Case I (0.3-m microstrip):
In this case, the source and load terminations are 10% larger than the 50-Ω characteristic impedance. A 0.3-m microstrip is considered as the medium for this channel. Fig. 8(a) shows the magnitude of the channel transfer function for this channel. As shown in this figure, the attenuation of the channel is roughly 4 dB at 2.5 GHz and, therefore, this channel introduces a moderate amount of ISI. However, there is only a small residual reflection, since the source and load impedances are close to 50 Ω (perfect termination). Fig. 8(b), which illustrates the SER variation with SNR for the 4LINE-PAM4 and 4-PAM schemes, shows roughly 5-dB gain for the 4LINE-PAM4 scheme.
2) Case II (0.20-m microstrip):
Here, a 0.20-m microstrip with a 55-Ω source termination and infinite load impedance is considered for the channel. Fig. 9(a) illustrates less than 3-dB attenuation for this channel at 2.5 GHz. Therefore, the amount of ISI in this case is less than that of the previous case. However, it introduces more reflection because of the poor return loss of the channel at the receiver. Fig. 9(b) shows the performance of both the 4LINE-PAM4 and 4-PAM schemes in this case. Again, the 4LINE-PAM4 scheme shows roughly 4.5-dB performance improvement over 4-PAM at SER around 10 .
C. Power Efficiency of the Proposed Method
As stated in Section III-B, the proposed method can reduce the transmitted power by 3-5 dB. Since a 20-mA current is required for a differential signal swing of 1 V p-p in each pair of wires of a typical 4-PAM driver with 50-Ω source and load termination, this method can save roughly 6 mA per pair of wires in the transmitter. As shown at the end of Section V, the overhead current for an implementation of a receiver based on the proposed scheme could be as low as 1.1 mA/bit and, therefore, not only the transmitted power but also the total power of the transceiver can be lowered by using the proposed scheme.
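The transmitter current saving quoted above can be reproduced with a short calculation. This is a minimal sketch, assuming that the driver current scales linearly with the signal swing for a fixed termination, so that a power reduction of G dB corresponds to a current scaling of 10^(-G/20); the function name is hypothetical.

```python
def drive_current_after_gain(current_ma, gain_db):
    """Driver current after lowering the transmitted power by gain_db.

    Assumes current is proportional to signal swing (fixed termination).
    """
    return current_ma * 10 ** (-gain_db / 20)


BASELINE_MA = 20.0   # per pair of wires, from the text
for gain_db in (3, 5):
    saved_ma = BASELINE_MA - drive_current_after_gain(BASELINE_MA, gain_db)
    print(f"{gain_db} dB coding gain saves about {saved_ma:.1f} mA per pair")
```

Under this assumption, the figure of roughly 6 mA per pair corresponds to the 3-dB end of the range, and the saving is larger at 5 dB.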
IV. DIFFERENT IMPLEMENTATIONS FOR THE PROPOSED METHOD
The complexity and latency of the Viterbi decoder restrict its use to moderate-speed links on the order of several hundred megahertz [17]. For high-speed applications, pipelining techniques can be used, which further increases the receiver complexity.
As an alternative, a tree representation can be used instead of the trellis. The trellis of this scheme has to go back to state zero at every fourth symbol (see Fig. 4). Consequently, there are only eight different paths in the trellis, and it can be represented by a tree, as shown in Fig. 10. Here, the problem is to find the minimum state metric among the leaves of the tree. Obtaining branch metrics would be straightforward in current-mode circuitry using (1) and (2). Fig. 11 shows a simple receiver architecture for this signaling scheme. Each compare-select unit (CSU) compares its two inputs and outputs the minimum of the two. Alternatively, finding the path with the minimum state metric among all the state metrics can be performed by winner-take-all (WTA) circuitry [18]. Unfortunately, these approaches are not practical for data rates larger than 1 Gb/s due to their speed limitation and low resolution, especially in an advanced technology with a small power supply. An improved method that can be implemented at higher data rates is introduced in Section V.
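The tree search described above can be mimicked in software by evaluating all eight paths and keeping the one with the smallest state metric. The sketch below is illustrative only: it assumes a squared-Euclidean branch metric (which may differ from the simplified metrics of (1) and (2)), normalized levels, a modulo-2 (1 + D) subset selection, and hypothetical names throughout.

```python
from itertools import product

LEVELS = (-3, -1, 1, 3)                 # assumed normalized 4-PAM levels
SUBSETS = {0: (-3, 1), 1: (-1, 3)}      # assumed labeling of subsets A and B


def branch_metric(r, subset):
    """Merge the two parallel branches: distance to the nearest subset point."""
    dists = [(r - x) ** 2 for x in SUBSETS[subset]]
    idx = dists.index(min(dists))
    return dists[idx], idx


def decode_tree(received):
    """Exhaustive search over the 8 leaves; returns (state_bits, point_bits)."""
    best = None
    for s in product((0, 1), repeat=3):
        states = (0,) + s + (0,)        # start and end in state zero
        total, points = 0.0, []
        for line, r in enumerate(received):
            subset = states[line] ^ states[line + 1]
            metric, idx = branch_metric(r, subset)
            total += metric
            points.append(idx)
        if best is None or total < best[0]:
            best = (total, s, tuple(points))
    return best[1], best[2]


def hard_decision(r):
    """Uncoded 4-PAM slicer, for comparison."""
    return min(LEVELS, key=lambda x: abs(r - x))


rx = [-2.2, -1.0, -3.0, -3.0]           # sent [-1, -1, -3, -3], first line disturbed
print(hard_decision(rx[0]), decode_tree(rx))
```

In this example a per-line slicer would misread the first sample as -3, whereas the path search still recovers the transmitted subset sequence, because any competing path must differ in at least two subset positions.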
V. ANALOG IMPLEMENTATION
If a flash-type method that compares every pair of state metrics is used as the minimum-finder block in Fig. 11, 28 comparators are needed. This is a large circuit overhead compared to the uncoded 4-PAM scheme. As mentioned earlier, the rate of the 4LINE-PAM4 scheme is 7/8. Interestingly, by forcing the trellis to go back to state zero at the third symbol instead of the fourth, the number of leaves in the tree reduces to four (see Fig. 12). Therefore, the flash-type method needs only six comparators for finding the minimum state metric among all leaves in the tree. This means that by reducing the rate from 7/8 to 5/6 (less than 5% reduction), the complexity of the receiver can be reduced significantly. This new method, 3LINE-PAM4, can also provide a better coding gain, especially at low SNRs. The reason is that in the new trellis [see Fig. 12(a)] there are only two minimum-distance error events and, therefore, the probability of error is roughly two-thirds of that of the 4LINE-PAM4 scheme (rate 7/8) [11, p. 680]. This has been verified by simulating both methods for the case of an AWGN channel (see Fig. 13). Fig. 14 shows the transceiver architecture of the 3LINE-PAM4 scheme, whose receiver is implemented in this work. Since the encoding in the transmitter is performed on bit2 and bit4 (see Fig. 2), 3LINE-PAM4 encoding needs only one XOR gate, and the transmitter is essentially the same as that of a conventional 4-PAM scheme. The receiver circuitry calculates the state metrics for each of the four leaves of the tree in Fig. 12 and finds the minimum state metric by using six comparators to retrieve the original 5 transmitted bits.
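The comparator counts and code rates of the two schemes follow from simple counting, which can be sketched as follows. The helper name `scheme_summary` is hypothetical; the sketch assumes that each line carries two coded bits and that the subset choice on the last line is forced, as described in the text.

```python
from fractions import Fraction
from math import comb


def scheme_summary(lines):
    """Leaves, pairwise comparators, and code rate for an n-line scheme."""
    free_state_bits = lines - 1         # the last line is forced back to state zero
    leaves = 2 ** free_state_bits
    info_bits = 2 * lines - 1           # one bit is lost on the last line
    return {
        "leaves": leaves,
        "comparators": comb(leaves, 2), # flash-type: one comparator per pair of leaves
        "rate": Fraction(info_bits, 2 * lines),
    }


four, three = scheme_summary(4), scheme_summary(3)
rate_drop = 1 - three["rate"] / four["rate"]
print(four, three, f"relative rate reduction: {float(rate_drop):.1%}")
```

This reproduces the 28 and 6 comparators quoted above, and the relative rate reduction evaluates to 1/21, which is just under 5%.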
Operational amplifiers (OP-AMPs) can be used for implementing the required multiplications and additions in the state-metric calculator unit. However, the power and area required for a high-speed implementation make this approach less desirable. Alternatively, it is straightforward to implement the branch-metric and state-metric units in current-mode circuitry. The details of the receiver architecture for this scheme are shown in Fig. 15. The state-metric calculator unit computes the costs for all four paths in the trellis or, equivalently, for all four leaves in the tree of Fig. 12. As shown in (1) and (2), calculating the two branch metrics corresponding to each input line needs two comparators. Therefore, there is a total of six comparators in the state-metric calculator unit. The details of this unit are shown in Fig. 15(b). As shown in this figure, the state-metric calculator unit consists of three parts. Each part calculates the branch metrics, based on (1) and (2), by converting the input signal from voltage to current with a transconductance block and performing the branch-metric calculation in current mode using the comparator outputs. Calculating the state metrics can be done by simply adding the branch metrics, which is straightforward in current-mode circuitry. For example, as shown in Fig. 15(b), the first state metric can be generated by simply adding A1, A2, and A3 (see also Fig. 12).
The outputs of the state-metric calculator unit are compared with six comparators to find the minimum state metric among the four leaves in the tree. The original 5 bits can be retrieved from the outputs of these comparators and the outputs of the comparators in the state-metric calculator unit. These 12 outputs are fed into a 12-by-5 digital decoder that decodes the original bits in the receiver. Fig. 16 shows the required circuitry for this decoder. As shown in this figure, the outputs of the first six comparators, which determine the minimum state metric, can be used to retrieve the second and the fourth transmitted bits, the bits that determine the path in the tree (see Fig. 14). I7 to I12 are the outputs of the comparators in the state-metric calculator unit that are used for retrieving the rest of the transmitted bits. Fig. 17 illustrates the circuit implementation of the branch-metric calculator unit, which is a direct implementation of (1) and (2) in current-mode circuitry. For example, based on (1) would be either zero or . The right portion of Fig. 17 shows the implementation for this. The outputs of the two transconductance blocks are connected to each other to create the term in (1), and cascode current mirrors provide the constant term. The two comparators and the switches in the schematic implement the conditions in (1) and (2). As shown in this figure, differential pairs are used instead of switches to increase the speed of this implementation. Also, four transconductance amplifiers, instead of one, are used for calculating each branch metric to eliminate the need for current mirrors and, thus, the additional delay in the signal path. Currents " " and " " in this simplified schematic are used for generating the constant terms in (1) and (2). The state metric for each leaf in the tree (see Fig. 12) can be calculated by adding the three branch metrics for each path. This is straightforward in current-mode circuitry and can be performed by just connecting the outputs of the branch-metric calculator units.
As discussed in [2], there are numerous architectures for high-speed transconductance amplifiers. Simulation results show that a transconductance with 5-bit linearity would be sufficient for this application. Therefore, bandwidth was the primary criterion for selecting the architecture of the transconductance block. As shown in Fig. 18, a simple architecture is selected for a high-speed implementation of the transconductance block [2]. Simulation results for this block show that its linearity is roughly 6 bits, which was sufficient. Another important unit in this design is the high-speed comparator block. The simplified schematic of the comparator is shown in Fig. 19. The comparator is a slight variation of the comparators in [2], [19]. This architecture is capable of performing offset cancellation by switching small capacitances in and out. The main criteria for choosing this architecture were speed and offset cancellation capability. The architecture comprises a preamplifier and a latch. During the track time, the preamplifier amplifies the input signal. The amplified voltage is then latched at the rising edge of the clock ("Latch" signal in the schematic). The preamplifier is used to reduce the input-referred offset of the comparator. Monte Carlo simulations show that the comparator offset is roughly 35 mV. In order to remove the precharge phase from the output and to present a constant (non-data-dependent) load to the comparator, the comparator is followed by an SR latch [20].
The 3LINE-PAM4 receiver and a conventional 4-PAM receiver are designed in a 0.18-µm standard CMOS technology. As shown in Fig. 20, the implemented conventional 4-PAM receiver consists basically of three comparators, which are identical to those in the 3LINE-PAM4 receiver. Therefore, a fair comparison between the two methods can be performed.
The functionality of the 3LINE-PAM4 scheme has been verified by HSPICE simulation. Simulation results show that the implemented 3LINE-PAM4 receiver can achieve a 5-Gb/s data rate. At 5 Gb/s, the 3LINE-PAM4 receiver draws 14.5 mA for decoding 5 bits (2.9 mA/bit), whereas the regular 4-PAM receiver draws 3.6 mA for decoding 2 bits (1.8 mA/bit). Therefore, the 3LINE-PAM4 method requires 1.1 mA more power-supply current per bit than the conventional 4-PAM method. However, as mentioned in Section III-C, the proposed method can also reduce the total power of the transceiver due to the significant power saving in the transmitter.
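The per-bit overhead quoted above is a direct ratio of the simulated supply currents to the number of decoded bits. A minimal sketch of this arithmetic, with a hypothetical helper name, is:

```python
def current_per_bit(total_ma, bits):
    """Supply current per decoded bit, in mA/bit."""
    return total_ma / bits


coded_ma_per_bit = current_per_bit(14.5, 5)     # 3LINE-PAM4 receiver at 5 Gb/s
uncoded_ma_per_bit = current_per_bit(3.6, 2)    # conventional 4-PAM receiver
overhead_ma_per_bit = coded_ma_per_bit - uncoded_ma_per_bit
print(f"{coded_ma_per_bit:.1f} - {uncoded_ma_per_bit:.1f} = {overhead_ma_per_bit:.1f} mA/bit")
```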
VI. EXPERIMENTAL RESULTS
The receiver for the 3LINE-PAM4 scheme, along with a regular 4-PAM receiver, was implemented and fabricated in a 0.18-µm standard CMOS technology. Fig. 21 shows the micrograph of the 1.88-mm × 1.18-mm chip. Careful layout techniques have been used to reduce the skew between different lines in the 3LINE-PAM4 receiver. The design was pad-limited; the active area of the regular 4-PAM receiver is 0.026 mm², whereas the active area of the 3LINE-PAM4 receiver is 0.2 mm². Fig. 22, which depicts the chip layout, can be used to compare the required area for each building block of the receiver. The 3LINE-PAM4 receiver decodes 5 bits simultaneously, and the per-bit area overhead of this method is 0.027 mm².
An 80-pin ceramic flat package (CFP80) is used for packaging this chip. A printed circuit board (PCB) test fixture was designed and fabricated for testing the chip. This board has two layers, and the high-speed signal traces are designed to have a 50-Ω characteristic impedance. PCB design methods are used to minimize the skew between different input lines of the 3LINE-PAM4 receiver. The communication channel that was used for the measurement consisted of a 0.8-m SMA cable, 80 mm of PCB traces, and a CFP80 package.
A LabVIEW virtual instrument (VI) was developed for low-frequency measurement [21]. This VI generates the random data and 4-PAM signals at a rate of 1 kS/s for each pair of wires. In addition, it adds AWGN to the generated signal of each line. This VI uses National Instruments (NI) data acquisition cards with 12-bit digital-to-analog converters (DACs) to generate the required signals for the measurement. Digital NI input cards are used to capture the outputs of the chip. This VI also compares the transmitted and received data to measure the BER for both methods. Fig. 23 shows the experimental results for BER-versus-SNR measurements using this VI. As shown in Fig. 23, the performance of the 3LINE-PAM4 scheme is roughly 2.3 dB better than that of the regular 4-PAM receiver, which is similar to the simulation results for an AWGN channel. This can be expected, since at low speed an AWGN model is a good approximation for the channel.
The functionality of the 3LINE-PAM4 scheme at higher data rates (up to 2.5 Gb/s) was verified by a parallel bit error ratio tester (ParBERT). The receiver's input signals were generated by combining the outputs of two ParBERT channels with a splitter/combiner. Fig. 24 shows a typical eye diagram of a single-ended receiver input (splitter/combiner output). As shown in this figure, the single-ended signal levels are 450, 550, 650, and 750 mV. Therefore, the differential signal levels are ±100 and ±300 mV. The retrieved outputs of the chip were fed back to the ParBERT input to measure the BER.
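The differential levels follow from the single-ended levels if the complementary wire is assumed to carry the level mirrored about the common mode. The sketch below makes that assumption explicit; the function name is hypothetical.

```python
def differential_levels(single_ended_mv):
    """Differential levels, assuming the negative wire carries the mirrored level."""
    levels = sorted(single_ended_mv)
    return [p - n for p, n in zip(levels, reversed(levels))]


print(differential_levels([450, 550, 650, 750]))
```

Under this assumption, the four differential levels have magnitudes of 100 and 300 mV, in agreement with the values read from the eye diagram.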
Ideally, a high-speed arbitrary waveform generator with four channels is needed in order to generate high-speed 4-PAM signals with superimposed additive white Gaussian noise for BER-versus-SNR measurement of this chip. Unfortunately, due to the lack of the required equipment, the measurement of noise performance was not feasible at high data rates. However, experimental results at 900 Mb/s show that the 3LINE-PAM4 scheme needs a smaller SNR for a given performance. For example, reducing the signal swing by 10% and 20% results in BERs of and , respectively, for 4-PAM and BERs of and for 3LINE-PAM4. Experimental results show that the 3LINE-PAM4 receiver can function with zero error rate up to 2.5 Gb/s. It should be mentioned that simulation results show that the 3LINE-PAM4 receiver could work error-free up to 5 Gb/s; the 2.5-Gb/s speed limitation is mainly due to the lack of proper equipment and skew compensation. Finally, the 3LINE-PAM4 receiver draws roughly 12 mA from a 1.8-V power supply at 2.5 Gb/s. This is in agreement with the simulated supply current of 11.5 mA at this speed. Simulation results show that the current consumption increases to only 14.5 mA at 5 Gb/s. Table I summarizes the simulation and measurement results. As shown in this table, there is good agreement between the simulation and measurement for the low-frequency performance improvement of the proposed method.
VII. CONCLUSION
Multilevel signaling can be used to reduce the number of signal paths or to increase the data rate. Two coding schemes, 4LINE-PAM4 and 3LINE-PAM4, that can reduce the transmitted power of multilevel signaling have been introduced. These coding schemes have roughly 3-5 dB coding gain over uncoded 4-PAM. Although more complex coding schemes can be used to achieve higher coding gain, the feasibility of high-speed implementation is an important factor that should be considered when developing coding schemes for high-speed interconnect. Moreover, analog implementations of the proposed methods have been presented. The proposed low-complexity analog implementation of 3LINE-PAM4 makes its high-speed implementation feasible. This 3LINE-PAM4 scheme is designed and implemented in a 0.18-µm standard CMOS technology. This coding scheme transmits 5 bits over 3 differential lines of 4-PAM, resulting in a code rate of 5/6. Experimental results not only verified the coding gain at low speed, but also showed that the chip is functional up to 2.5 Gb/s. The measured supply current for the proposed method was 12.5 mA at 2.5 Gb/s, and the active area of the implemented receiver based on this method was 0.2 mm².
