We present an area-efficient MAP-based turbo equalizer VLSI architecture by proposing a symbol-based soft-input soft-output (SISO) kernel which processes one multi-bit symbol in every clock cycle. The symbol-based SISO hardware can be shared by the equalizer and decoder, thereby reducing silicon area. Further, by introducing block-interleaved computation in the add-compare-select recursions, the critical path delay is reduced, thereby improving throughput.
INTRODUCTION
The turbo decoding technique has found numerous applications in decoding of turbo codes [1], turbo equalization [2], and low-density parity check (LDPC) codes [3]. The soft-input soft-output (SISO) module is a core computational kernel for turbo decoding. Thus, efficient implementation of SISO algorithms is of interest. Extensive research has already been done on implementations of turbo code decoders and turbo equalizers [4]-[12]. These include low-power design [4]-[7], memory optimization [8]-[9], and high-throughput implementation [10]-[12].
Suppose quadrature phase shift keying (QPSK) modulation and a recursive systematic convolutional (RSC) encoder are employed with block-based transmission (see Fig. 1). Then, in each iteration, the maximum a posteriori (MAP) SISO equalizer processes one QPSK symbol, but the SISO MAP decoder decodes one bit. Therefore, two different hardware platforms are required for the equalizer and decoder architectures, as they are designed for different trellis structures. Furthermore, hardware utilization is inefficient since block-based equalization and decoding are carried out alternately between the two SISO blocks. The SISO decoder is idle while the SISO equalizer is computing reliability values for each transmitted symbol in a block, and vice versa, as shown in Fig. 2. In this paper, we present a symbol-based SISO architecture which can be utilized by both the equalizer and the decoder, thereby reducing silicon area. However, the critical path delay is expected to increase since the symbol-based architecture processes multiple bits within a clock cycle. In order to reduce the critical path delay, we apply block-interleaved computation to the add-compare-select (ACS) recursion, thereby achieving a high-throughput architecture at the expense of reduced area savings. Thus, the block-interleaved symbol-based SISO decoder architectures enable us to trade off area with throughput. In all cases, the proposed architecture has significantly better area and throughput compared to existing turbo equalizer architectures. The rest of this paper is organized as follows. Section 2 gives a brief description of the MAP-based turbo equalizer and the SISO algorithm for equalization and decoding. Then, in Section 3, the proposed symbol-based SISO decoding scheme is derived and its VLSI architecture is described.
In Section 4, the architectural performance of the proposed method is evaluated in a 0.25 µm CMOS process when it is applied to turbo equalizer implementation.
REVIEW ON MAP-BASED TURBO EQUALIZER
In the MAP-based turbo equalizer [2], the intersymbol interference (ISI) channel and symbol mapping (modulation) block in Fig. 1

$$\alpha_k(s) = \max^*_{s'}\left[\alpha_{k-1}(s') + \gamma_k(s', s)\right] \quad (2)$$
where $s$ and $s'$ are trellis states and the $\max^*$ operation is implemented as

$$\max^*(x, y) = \max(x, y) + \ln\left(1 + e^{-|x - y|}\right)$$
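The recursion in (2) and the max* operation can be illustrated with a short software sketch. This is only a behavioural model, not the hardware data-path of the paper: the function names `max_star` and `forward_step`, the dense `gamma[s_prev][s]` matrix layout, and the use of `-inf` to mark a missing trellis branch are all assumptions made for the illustration.

```python
import math


def max_star(x, y):
    """Jacobian logarithm: log(exp(x) + exp(y)), computed stably.

    -inf is used (as an assumption of this sketch) to mark "no branch".
    """
    if x == -math.inf:
        return y
    if y == -math.inf:
        return x
    # max term plus a correction term that depends only on |x - y|
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def forward_step(alpha_prev, gamma):
    """One ACS step: alpha_k(s) = max*_{s'} [alpha_{k-1}(s') + gamma_k(s', s)].

    gamma[s_prev][s] holds the branch metric from s_prev to s,
    or -inf when the trellis has no such branch.
    """
    n_states = len(alpha_prev)
    alpha = []
    for s in range(n_states):
        acc = -math.inf
        for s_prev in range(n_states):
            acc = max_star(acc, alpha_prev[s_prev] + gamma[s_prev][s])
        alpha.append(acc)
    return alpha
```

In a hardware realization the correction term is typically read from a small look-up table indexed by |x - y|; the sketch evaluates it exactly instead.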
SYMBOL-BASED DECODING
In this section, we derive a symbol-based decoding scheme for binary RSC, describe the proposed area-efficient VLSI architecture, and introduce the block-interleaved computation to reduce the critical path delay. The predicted area saving and throughput improvement are also provided.
Derivation
In order to share a hardware platform for both SISO equalizer and decoder, bit-level operations in SISO decoding need to be transformed into symbol operations. In this paper, we apply a look-ahead transform to the trellis of the binary encoder so that ACS recursion may be carried out on multiple trellis sections [14]. For simplicity, 2-bit symbol decoding, where two trellis sections are grouped (see Fig. 3), is described. Extension to multi-bit symbol cases is straightforward. To compute the LLR $L(u_k)$ for 2-bit symbol decoding, we substitute $\alpha_{k-1}$ with $\alpha_{k-2}$ in (5) as follows
The first term of (10) can be expressed as
$$\max^*\left[\alpha_{k-2}(s'') + \gamma_j(s'', s) + \beta_k(s)\right], \quad j = k-n+1, \ldots, k.$$
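The look-ahead step can be checked numerically: collapsing two binary trellis sections into one 2-bit-symbol section and running a single ACS step should give the same forward metrics as two bit-level ACS steps. The sketch below is an illustration under stated assumptions, not the authors' derivation: the 4-state shift-register trellis, the random branch metrics, and the names `acs`, `random_section`, `merge_sections`, and `look_ahead_error` are all hypothetical.

```python
import math
import random


def max_star(x, y):
    """log(exp(x) + exp(y)), with -inf marking a missing branch."""
    if x == -math.inf:
        return y
    if y == -math.inf:
        return x
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def acs(alpha, gamma):
    """alpha_out(s) = max*_{s_prev} [alpha(s_prev) + gamma[s_prev][s]]."""
    n = len(alpha)
    out = []
    for s in range(n):
        acc = -math.inf
        for s_prev in range(n):
            acc = max_star(acc, alpha[s_prev] + gamma[s_prev][s])
        out.append(acc)
    return out


def random_section(rng, n_states=4):
    """Branch metrics of one binary section of a toy shift-register trellis."""
    gamma = [[-math.inf] * n_states for _ in range(n_states)]
    for s_prev in range(n_states):
        for u in (0, 1):
            s = ((s_prev << 1) | u) % n_states
            gamma[s_prev][s] = rng.uniform(-2.0, 2.0)
    return gamma


def merge_sections(g1, g2):
    """Look-ahead: collapse two 1-bit sections into one 2-bit-symbol section."""
    n = len(g1)
    merged = [[-math.inf] * n for _ in range(n)]
    for s2 in range(n):          # s'' (state at time k-2)
        for s1 in range(n):      # s'  (state at time k-1)
            for s in range(n):   # s   (state at time k)
                merged[s2][s] = max_star(merged[s2][s], g1[s2][s1] + g2[s1][s])
    return merged


def look_ahead_error(seed=0):
    """Largest difference between bit-level and symbol-level forward metrics."""
    rng = random.Random(seed)
    g1, g2 = random_section(rng), random_section(rng)
    alpha = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    bit_level = acs(acs(alpha, g1), g2)                 # two 1-bit ACS steps
    symbol_level = acs(alpha, merge_sections(g1, g2))   # one 2-bit ACS step
    return max(abs(a - b) for a, b in zip(bit_level, symbol_level))
```

In this toy trellis every pair (s'', s) is joined by exactly one two-step path, so each merged branch metric is simply the sum of two bit-level metrics. The price is that each state now has four incoming branches instead of two, which is why the ACS critical path grows.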
$$\Lambda_{11} = \max^*\left[\alpha_{k-2}(s'') + \gamma(s'', s) + \beta_k(s)\right],$$

we can compute $\beta_{k-2}(s'')$ and $L(u_{k-1})$ as shown below,

Area-Efficient VLSI Architecture
By employing the $n$-bit symbol-based SISO processing kernel derived previously, we can implement both the SISO equalizer and decoder on a single hardware platform, leading to area savings. The proposed area-efficient VLSI architecture is depicted in Fig. 4. The architecture is composed of four units: the γ-unit for computing branch metrics, the α-unit for computing the forward metrics ($\alpha_k$), the β-unit for computing the backward metrics ($\beta_k$), and the Λ-unit for computing LLRs ($L(u_k)$) [7], [9]. Note that two γ-units are needed, one for the equalizer and one for the decoder, to produce $2^n$ different metrics. The ACS data-paths of the α- and β-units and the Λ-unit for decoding $n$-bit symbols are described in Figs. 5 and 6 for $n = 2$. Since the four units (γ, α, β, and Λ) are implemented using 2-input adders, 2-input $\max^*$ units, latches, and internal buffers, we can analyze the hardware complexity in terms of precision, the number of bits in a symbol ($n$), the number of states in a trellis ($M_s$), and the processing window size ($L$) [9]. Table 1 summarizes the results. As $n$ is increased, the logic complexity increases exponentially. However, the backward metric buffer size will be decreased by a factor of $n$. Note the size of the delay line
High-Throughput Architecture
The critical path delay of the symbol-based decoding architecture may be increased, as shown in Fig. 5. In order to reduce the delay caused by the look-ahead transform, a block-interleaved computation method is exploited, leading to a pipelined ACS data-path at the cost of a marginal increase in buffer area. SISO block processing in turbo equalization satisfies three properties: 1) computations between blocks are independent, 2) computations between sub-blocks within a block are independent, and 3) computation within a sub-block is recursive. Hence, the recursive loop delay of the ACS can be reduced via interleaved computation, folding, and retiming transforms. Consider the recursive architecture in Fig. 7. Note that the architecture in Fig. 7 cannot be easily pipelined or processed in parallel due to the presence of the feedback loop. However, if the processing is block-independent, and the computations between sub-blocks within a block are independent, then one can parallelize the architecture as shown in Fig. 8. If we now fold the parallel architecture in Fig. 8 by a factor of 2, we obtain the folded block-interleaved architecture. Note that the folded block-interleaved architecture is inherently pipelined. Therefore, an application of retiming (see Fig. 9) results in reduction of the critical path delay by a factor of two over that of the original architecture in Fig. 7.
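The folding argument can be mimicked in software as a scheduling model. In the sketch below, one ACS unit serves several independent sub-blocks in round-robin order, and a queue stands in for the pipeline registers in the feedback loop, so each sub-block's metric re-enters the unit exactly as many cycles later as there are interleaved sub-blocks. The max-log `acs` function, the fully connected toy trellis, and the names `run_sequential` and `run_interleaved` are assumptions of this sketch, not the circuit of Figs. 7 to 9.

```python
from collections import deque


def acs(alpha, gamma):
    """Toy max-log ACS over a fully connected trellis (stand-in for the real kernel)."""
    n = len(alpha)
    return [max(alpha[sp] + gamma[sp][s] for sp in range(n)) for s in range(n)]


def run_sequential(alpha0, gammas):
    """Reference: the plain recursive loop over one sub-block."""
    alpha = alpha0
    for g in gammas:
        alpha = acs(alpha, g)
    return alpha


def run_interleaved(alpha0s, gamma_blocks):
    """Fold len(alpha0s) independent recursions onto one ACS unit.

    The deque models the feedback registers: a value written in cycle t
    is read again in cycle t + depth, which is what leaves room for
    `depth` pipeline stages inside the ACS data-path after retiming.
    """
    depth = len(alpha0s)
    length = len(gamma_blocks[0])
    pipeline = deque(alpha0s)              # one register slot per sub-block
    cycles = 0
    for t in range(depth * length):
        block, k = t % depth, t // depth   # round-robin over sub-blocks
        alpha_in = pipeline.popleft()      # metric issued `depth` cycles ago
        pipeline.append(acs(alpha_in, gamma_blocks[block][k]))
        cycles += 1
    return list(pipeline), cycles
```

With two sub-blocks the folded unit needs twice as many cycles as one recursion, but, under the idealized assumption that retiming splits the ACS delay evenly across two stages, each cycle can be about half as long, which is the factor-of-two critical path reduction described above.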
SISO decoders can exploit the property that the forward and backward metrics $\alpha_k$ and $\beta_k$ converge after a few constraint lengths ($K$) have been traversed in the trellis, independent of the initial conditions [9]. By using this property, one block can be segmented into several sub-blocks as shown in Fig. 8 [11]. Thus, the symbol-based decoding can reduce the critical path delay via interleaved computation, leading to high-throughput implementation. The extra hardware complexity due to pipelining is summarized in Table 2. Only buffer size and the number of latches are affected.
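The convergence property that justifies splitting a block into sub-blocks can be observed in a small experiment: two backward recursions that start from different initial conditions but see the same branch metrics drift toward the same normalized metrics. The sketch below is only a toy demonstration. It assumes a fully connected 4-state trellis with random branch metrics in [-1, 1], which is not the sparse convolutional trellis of the paper, and the names `backward_step` and `warmup_gap` are hypothetical.

```python
import math
import random


def max_star(x, y):
    """log(exp(x) + exp(y)) for finite inputs."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def backward_step(beta_next, gamma):
    """beta_{k-1}(s') = max*_s [gamma_k(s', s) + beta_k(s)], normalised so beta(0) = 0."""
    n = len(beta_next)
    beta = []
    for sp in range(n):
        acc = gamma[sp][0] + beta_next[0]
        for s in range(1, n):
            acc = max_star(acc, gamma[sp][s] + beta_next[s])
        beta.append(acc)
    return [b - beta[0] for b in beta]


def warmup_gap(n_states=4, steps=60, seed=1):
    """Largest metric difference between two initialisations after each step."""
    rng = random.Random(seed)
    beta_a = [0.0] * n_states                     # "end state unknown"
    beta_b = [0.0] + [-10.0] * (n_states - 1)     # "end state is state 0"
    gaps = []
    for _ in range(steps):
        gamma = [[rng.uniform(-1.0, 1.0) for _ in range(n_states)]
                 for _ in range(n_states)]
        beta_a = backward_step(beta_a, gamma)
        beta_b = backward_step(beta_b, gamma)
        gaps.append(max(abs(a - b) for a, b in zip(beta_a, beta_b)))
    return gaps
```

Plotting or printing `warmup_gap()` shows how quickly the influence of the initial condition fades in this toy setting. For an actual convolutional code the warm-up length is chosen as a few constraint lengths, as stated in [9].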
Area Saving and Throughput Improvement
Based on the analysis results in Table 1 where $\tau_E$ and $\tau_D$ are the critical path delays of the SISO equalizer and decoder, respectively, and $R$ is the code rate. Assuming $\tau_E = \tau_D$, the throughput gain is the ratio ($\eta$) of $T_a$ of the conventional approach over the proposed high-throughput architecture,
where $2(n-1)L$ extra cycles are necessary because of block-interleaved computation and $\tau_p$ is the critical path delay of the proposed high-throughput architecture. Note that the throughput gain, $\eta$, becomes larger as $\tau_p$ gets closer to $\tau_D$.
EXPERIMENTAL RESULTS AND DISCUSSION
We employ an RSC encoder at the transmitter with generator polynomial $(23, 33)_8$. The coded bit stream is mapped to 4-level pulse amplitude modulation signals. We considered a static channel model, $H(z) = 0.407 + 0.815z^{-1} + 0.407z^{-2}$, and hence 16 states exist in each of the SISO equalizer and decoder. The conventional and proposed architectures are designed in VHDL, synthesized via Synopsys Design Compiler, and placed and routed via Cadence Silicon Ensemble using a TSMC 0.25 µm CMOS standard cell library. The layouts are shown in Fig. 11, and the area and the critical path delay of each architecture are summarized in Table 3. We see that the area-efficient architecture results in 47% area savings, while the high-throughput architecture provides 25% area savings. These values are very close to those predicted by (15) and Tables 1 and 2, as can be seen in Fig. 10(a). Note further that the high-throughput architecture achieves a 79% improvement in throughput while the area-efficient architecture results in an 11% improvement in throughput. Thus, the block-interleaved symbol-based SISO decoder architectures enable us to trade off area with throughput.
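The statement that both trellises have 16 states can be verified with a few lines of arithmetic. The sketch assumes that the generator polynomials are given in octal and that the channel has a memory of two symbols (three taps); the helper names `decoder_states` and `equalizer_states` are hypothetical.

```python
def decoder_states(generators_octal):
    """Number of trellis states of a convolutional code from its octal generators."""
    # Constraint length K = number of bits in the longest generator polynomial.
    constraint_length = max(int(g, 8).bit_length() for g in generators_octal)
    return 2 ** (constraint_length - 1)


def equalizer_states(alphabet_size, channel_memory):
    """Number of ISI trellis states: one state per combination of past symbols."""
    return alphabet_size ** channel_memory


# Assumed setup: (23, 33) octal RSC code, 4-PAM, channel memory of 2 symbols.
n_decoder = decoder_states(["23", "33"])
n_equalizer = equalizer_states(alphabet_size=4, channel_memory=2)
```

Both counts come out equal, which is what allows a single symbol-based SISO kernel with a fixed number of states to serve the equalizer and the decoder in this experiment.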
