Abstract-We focus our attention on Gallager codes with parameters compatible with the IS-95 cellular radio standard. We discuss low-complexity software and hardware implementations of an iterative decoder for $\{N, -, 3\}$ Gallager codes. We estimate that by using Gallager codes, a factor of five improvement in the code-division multiple-access system capacity relative to an uncoded system can be achieved, equivalent to a factor of two improvement relative to state-of-the-art orthogonal convolutional codes. Simulation results demonstrate the good performance of short-frame Gallager codes in the additive white Gaussian noise and certain fading channels.
We will argue that the $\{N, -, 3\}$, rate 1/8 Gallager codes of [part I] are good candidates to provide an enhanced error-protection scheme in CDMA systems. Good performance in additive white Gaussian noise (AWGN) and fading channels makes them particularly suitable for mobile communications. Moreover, one can devise low-complexity decoders for these codes, an important consideration for hand-held radio units where power consumption is at a premium.
This paper is organized as follows. In Section II, we give a brief review of the sum-product decoding algorithm along with most of the relevant equations that are used extensively in later sections. In Sections III and IV, we discuss low-complexity implementations of the sum-product decoder both in software and in hardware. In Section V, we discuss the computational complexity of encoding and iterative decoding for the $\{N, -, 3\}$, rate 1/8 Gallager codes. In Section VI, we present simulation results for Gallager codes for the AWGN and fading channels. We use simulation results for the AWGN channel in a simplified analysis of the single-cell capacity of a CDMA system in Section VII, and show that Gallager codes provide a twofold increase in the system capacity relative to state-of-the-art orthogonal convolutional codes [4]. Finally, we discuss the results in Section VIII.
II. SUM-PRODUCT ALGORITHM
To decode Gallager codes, we use the sum-product algorithm [5], [6], a version of which was described in the original work of Gallager [7], and which is sometimes referred to as "belief propagation" [8], [9]. The sum-product algorithm operates by passing messages along the edges of a bipartite factor graph [10] that describes the conditional joint probability mass function of the codeword symbols given the received channel output. The sum-product algorithm is known by various names in special cases and in different application areas. For example, in the special case when the factor graph represents a trellis or hidden Markov model, the sum-product algorithm is known (in coding) as the BCJR algorithm [11] or (in signal processing) as the forward/backward algorithm [12]; in statistics, it is an integral part of the Baum-Welch algorithm. For a tutorial treatment of factor graphs, the sum-product algorithm and applications, see [10]. MacKay [13] gives an excellent description of the sum-product algorithm that is used to decode Gallager codes, and we adopt his notation here.
As in Section II, Part I, let $\mathbf{H}$ be a low-density parity-check matrix for a Gallager code $\mathcal{C}$, let $\mathbf{H}^T$ denote the transpose of $\mathbf{H}$, and let $\mathbf{c}$ be a transmitted codeword. The channel adds a noise sequence $\mathbf{w}$ to $\mathbf{c}$ to produce the channel output $\mathbf{y}$. Assuming $\pm 1$ antipodal signaling, the decoder starts by quantizing the channel output, forming the hard-decision vector $\mathbf{z}$ whose $n$th component $z_n \in \{0, 1\}$ is determined by the sign of $y_n$.
0090-6778/00$10.00 © 2000 IEEE
Next, using modulo-2 arithmetic, the syndrome $\mathbf{s} = \mathbf{z}\mathbf{H}^T$ is formed. If the syndrome evaluates to the zero vector, the decoder produces $\mathbf{z}$ as its output; otherwise, it tries to find an error pattern that has the same syndrome as $\mathbf{z}$. For this purpose, the decoder performs a number of algorithmically identical steps called iterations. This process is described next.
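The hard-decision and syndrome computations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 3 × 6 matrix `H` is a hypothetical toy example (not one of the codes considered in this paper), the function names are ours, and the mapping "positive sample gives bit 1" is an assumed sign convention.

```python
def hard_decision(y):
    """Quantize real channel outputs to bits (assumed convention: positive -> 1)."""
    return [1 if v > 0 else 0 for v in y]


def syndrome(H, z):
    """Return s = z H^T over GF(2); H is a list of rows with 0/1 entries."""
    return [sum(h * b for h, b in zip(row, z)) % 2 for row in H]


# Hypothetical toy parity-check matrix with three sites per check.
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]

y = [0.9, -1.1, 0.2, -0.8, 1.3, -0.4]   # made-up channel outputs
z = hard_decision(y)
s = syndrome(H, z)
# A nonzero syndrome means that z is not a codeword and iterations are needed.
needs_iterations = any(s)
```

A zero syndrome lets the decoder output the hard-decision vector immediately, so the iterative part is entered only for frames that actually contain errors.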
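With the parity-check matrix $\mathbf{H} = [h_{mn}]$ we associate a sequence of four real "message matrices": $\mathbf{Q}^0_l$, $\mathbf{Q}^1_l$, $\mathbf{R}^0_l$ and $\mathbf{R}^1_l$, where $l \ge 0$ is the iteration index (omitted below where no confusion arises). Matrices $\mathbf{Q}^0_{l+1}$ and $\mathbf{Q}^1_{l+1}$ are computed from $\mathbf{R}^0_l$ and $\mathbf{R}^1_l$. The elements $q^0_{mn}$, $q^1_{mn}$, $r^0_{mn}$ and $r^1_{mn}$ of these matrices represent the messages passed back and forth between the symbol vertices and the check vertices in the corresponding factor graph. Following [5] and [6], we refer to symbol vertices as sites and to check vertices as checks.
The algorithm itself consists of the following steps [13].
1) Initialization. For each site (code symbol) $n$ of the received sequence we initialize $p^1_n$ to the likelihood that the noise vector has a one in position $n$ (i.e., it "flips" the transmitted bit), and $p^0_n$ to the likelihood that the noise vector has a zero in position $n$. Elements $q^0_{mn}$, $q^1_{mn}$, $r^0_{mn}$, and $r^1_{mn}$ are initialized to 1 if $h_{mn} = 1$, otherwise, to 0.
2) Iteration.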
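• Site-to-check propagation (Vertical Step). The algorithm runs over the matrices $\mathbf{R}^0$ and $\mathbf{R}^1$ column-wise and computes elements of $\mathbf{Q}^0$ and $\mathbf{Q}^1$ (hence the name "vertical step"). Let $\mathcal{M}(n)$ be the set of row indices of nonzero elements in the $n$th column of $\mathbf{H}$. Then
$$q^0_{mn} = \alpha_{mn}\, p^0_n \prod_{m' \in \mathcal{M}(n)\setminus m} r^0_{m'n} \qquad (1)$$
and
$$q^1_{mn} = \alpha_{mn}\, p^1_n \prod_{m' \in \mathcal{M}(n)\setminus m} r^1_{m'n} \qquad (2)$$
where $\mathcal{M}(n)\setminus m$ denotes the set $\mathcal{M}(n)$ with the element $m$ excluded, and $\alpha_{mn}$ is a normalizing factor such that $q^0_{mn} + q^1_{mn} = 1$. In factor graph terminology, a site-to-check message passed on a given factor graph edge is computed at a site as the normalized product of messages arriving on all other edges at that site, in accordance with the general sum-product rule [10].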
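• Check-to-site propagation (Horizontal Step). The decoder runs over $\mathbf{Q}^0$ and $\mathbf{Q}^1$ row-wise and computes the entries of $\mathbf{R}^0$ and $\mathbf{R}^1$ (hence the name "horizontal step"). Let $\mathcal{N}(m)$ be the set of column indices of nonzero elements in the $m$th row of $\mathbf{H}$. Then
$$r^0_{mn} = \sum_{\{e_{n'} :\, n' \in \mathcal{N}(m)\setminus n\}} P\bigl(s_m \mid e_n = 0, \{e_{n'}\}\bigr) \prod_{n' \in \mathcal{N}(m)\setminus n} q^{e_{n'}}_{mn'} \qquad (3)$$
and
$$r^1_{mn} = \sum_{\{e_{n'} :\, n' \in \mathcal{N}(m)\setminus n\}} P\bigl(s_m \mid e_n = 1, \{e_{n'}\}\bigr) \prod_{n' \in \mathcal{N}(m)\setminus n} q^{e_{n'}}_{mn'} \qquad (4)$$
where $\mathcal{N}(m)\setminus n$ denotes the set $\mathcal{N}(m)$ with the element $n$ excluded and $P(s_m \mid e_n, \{e_{n'}\})$ is the probability of observing the $m$th component of the syndrome given the site values $e_n = 0/1$ and $\{e_{n'} : n' \in \mathcal{N}(m)\setminus n\}$ (this probability is either 0 or 1). In factor graph terminology, a check-to-site message passed on a given factor graph edge is computed at the check as the product of messages arriving on all other edges at that check, multiplied by a local indicator function $P(s_m \mid \cdot)$, and then marginalized for the variable associated with the given edge, all in accordance with the general sum-product rule [10].
3) Termination. After the vertical and horizontal steps, the posterior probabilities $q^0_n$ and $q^1_n$ are computed as follows:
$$q^0_n = \alpha_n\, p^0_n \prod_{m \in \mathcal{M}(n)} r^0_{mn} \qquad (5)$$
$$q^1_n = \alpha_n\, p^1_n \prod_{m \in \mathcal{M}(n)} r^1_{mn} \qquad (6)$$
where $\alpha_n$ is a normalizing factor such that $q^0_n + q^1_n = 1$.
If $q^0_n > q^1_n$ then $\hat{e}_n = 0$, otherwise $\hat{e}_n = 1$. Let $\hat{\mathbf{e}}$ be the tentative noise vector produced by the algorithm. If $\hat{\mathbf{e}}$ has the same syndrome as $\mathbf{z}$, the decoder stops, reveals $\hat{\mathbf{e}}$ and outputs $\mathbf{z} + \hat{\mathbf{e}}$ (modulo 2); otherwise it proceeds with the next iteration step. An upper limit is placed on the number of iterations, and a decoder failure is declared when this upper limit is reached without convergence to the given syndrome. An example of iterative decoding is shown in Fig. 1, where a block of 256 bits (an 8 × 32 binary image) was encoded with a rate 1/2 code (so the block length is 512) and transmitted over a noisy Gaussian channel with noise level of 1.5 dB. The iterations are arranged column-wise in Fig. 1, with only information bits shown. The first image in the top-left corner corresponds to the received codeword and the final image in the bottom-right corner corresponds to the correctly decoded codeword. It took 11 iterations to achieve correct decoding. (Reference [13, Figs. 12 and 13] gives another interesting decoding example.)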
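The three steps above (initialization, iteration, termination) can be put together in a short probability-domain sketch. It is written for readability rather than speed and rests on several assumptions of ours: the function name `sum_product_decode`, the dictionary storage of messages, the toy 3 × 6 matrix in the usage lines, and the brute-force enumeration in the horizontal step, which is only practical for small check degrees.

```python
from itertools import product


def sum_product_decode(H, s, p1, max_iter=50):
    """Search for a noise vector e with H e = s (mod 2); return None on failure.

    H  : parity-check matrix as a list of rows with 0/1 entries
    s  : syndrome of the hard-decision vector
    p1 : p1[n] is the prior likelihood that noise bit n equals 1
    """
    M, N = len(H), len(H[0])
    rows = [[n for n in range(N) if H[m][n]] for m in range(M)]  # sites of check m
    cols = [[m for m in range(M) if H[m][n]] for n in range(N)]  # checks of site n
    p = [(1.0 - a, a) for a in p1]
    # Initialization: check-to-site messages are 1 on the nonzero entries of H.
    r = {(m, n): (1.0, 1.0) for m in range(M) for n in rows[m]}
    q = {}
    for _ in range(max_iter):
        # Vertical step: normalized product over all other checks of the site.
        for n in range(N):
            for m in cols[n]:
                v0, v1 = p[n]
                for m2 in cols[n]:
                    if m2 != m:
                        v0 *= r[m2, n][0]
                        v1 *= r[m2, n][1]
                q[m, n] = (v0 / (v0 + v1), v1 / (v0 + v1))
        # Horizontal step: enumerate the values of the other sites of the check.
        for m in range(M):
            for n in rows[m]:
                others = [n2 for n2 in rows[m] if n2 != n]
                acc = [0.0, 0.0]
                for bits in product((0, 1), repeat=len(others)):
                    w = 1.0
                    for n2, b in zip(others, bits):
                        w *= q[m, n2][b]
                    # The indicator selects the value of e_n that satisfies s_m.
                    acc[(s[m] + sum(bits)) % 2] += w
                r[m, n] = (acc[0], acc[1])
        # Termination: tentative decision per site, then syndrome comparison.
        e = []
        for n in range(N):
            v0, v1 = p[n]
            for m in cols[n]:
                v0 *= r[m, n][0]
                v1 *= r[m, n][1]
            e.append(0 if v0 > v1 else 1)
        if [sum(e[n] for n in rows[m]) % 2 for m in range(M)] == list(s):
            return e
    return None


# Usage on a hypothetical toy matrix: one flipped bit in position 3.
H_toy = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]
e_hat = sum_product_decode(H_toy, [1, 0, 0], [0.1] * 6)
```

The normalization in the vertical step is kept here because it prevents the messages from shrinking toward zero over many iterations; the final decision only compares the two unnormalized posteriors.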
As MacKay [13] points out, the horizontal step (computation of check-to-site messages) can be simplified by operating with the differences between the elements of the matrices. Writing $\delta q_{mn} = q^0_{mn} - q^1_{mn}$ and $\delta r_{mn} = r^0_{mn} - r^1_{mn}$ yields a simple product form that is easily computed,
$$\delta r_{mn} = (-1)^{s_m} \prod_{n' \in \mathcal{N}(m)\setminus n} \delta q_{mn'},$$
from which the desired values are recovered as
$$r^0_{mn} = \tfrac{1}{2}(1 + \delta r_{mn}), \qquad r^1_{mn} = \tfrac{1}{2}(1 - \delta r_{mn}).$$
This simplification arises from the well-known fact [14] (see [15] for a recent generalization) that APP (a posteriori probability) computation can be performed by processing the dual of a given linear code, taking the Fourier transforms of the symbol probabilities used in decoding the primal code. In the binary case, the Fourier transform matrix is the 2 × 2 matrix that maps $(q^0, q^1)$ to $(q^0 + q^1,\; q^0 - q^1)$, the first component of which is unity when $(q^0, q^1)$ represents a probability mass function. Each check represents a simple parity-check code, the dual of which is a repetition code with a trivial trellis structure.
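The equivalence between the difference form and the direct marginalization can be checked numerically. The sketch below is illustrative only: both function names are ours, the inputs are the normalized site-to-check values $q^1$ of the sites in a single check, and the direct version enumerates all bit patterns, so it is meant as a reference rather than as a practical routine.

```python
from itertools import product
from math import isclose, prod


def check_messages_by_differences(q1, s_m):
    """Check-to-site messages (r0, r1) for one check via the difference trick."""
    dq = [1.0 - 2.0 * v for v in q1]          # q0 - q1 with q0 = 1 - q1
    sign = -1.0 if s_m else 1.0               # factor (-1)^{s_m}
    out = []
    for i in range(len(q1)):
        dr = sign * prod(d for j, d in enumerate(dq) if j != i)
        out.append(((1.0 + dr) / 2.0, (1.0 - dr) / 2.0))
    return out


def check_messages_direct(q1, s_m):
    """Same messages by explicit summation over the other sites' values."""
    out = []
    for i in range(len(q1)):
        others = [v for j, v in enumerate(q1) if j != i]
        acc = [0.0, 0.0]
        for bits in product((0, 1), repeat=len(others)):
            w = prod(v if b else 1.0 - v for v, b in zip(others, bits))
            acc[(s_m + sum(bits)) % 2] += w
        out.append((acc[0], acc[1]))
    return out


def agree(q1, s_m):
    """True if both computations give the same messages up to rounding."""
    a = check_messages_by_differences(q1, s_m)
    b = check_messages_direct(q1, s_m)
    return all(isclose(x0, y0, abs_tol=1e-12) and isclose(x1, y1, abs_tol=1e-12)
               for (x0, x1), (y0, y1) in zip(a, b))
```

For a check with $k$ sites the direct form visits $2^{k-1}$ patterns per message, whereas the difference form needs only a product of $k-1$ terms, which is why the latter is the one counted in the complexity analysis.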
III. SOFTWARE IMPLEMENTATION OF ITERATIVE DECODER
A software iterative decoder for Gallager codes can be implemented effectively if:
• it exploits the sparsity of the code's parity-check matrix;
• it pre-computes operations that are identical in every iteration and accomplishes them by a low-cost table look-up in each decoding step.
We represent the parity-check matrix $\mathbf{H}$ as two related lists; each list contains the locations of the nonzero elements of $\mathbf{H}$. One list describes $\mathbf{H}$ column-wise and the other row-wise. Example: The matrix [part I: (2)] can be represented by two such lists. Similarly, matrices $\mathbf{Q}^0$ and $\mathbf{Q}^1$ are represented by row-wise lists (the algorithm runs over these matrices in the horizontal step), and matrices $\mathbf{R}^0$ and $\mathbf{R}^1$ are represented by column-wise lists (these are used in the vertical step).
Since the algorithm alternates between the vertical and horizontal steps, we create two SORT lists that establish a correspondence between the positions of the same matrix element in the column-wise and row-wise orderings, one list for each direction of transfer. These lists are computed only once for a particular Gallager code.
In the vertical step, the sum-product algorithm runs through the lists that are the column-wise representations of $\mathbf{R}^0$ and $\mathbf{R}^1$. The results of the computations (1) and (2) are placed in appropriate positions in the lists representing $\mathbf{Q}^0$ and $\mathbf{Q}^1$ by using one SORT list. In the horizontal step, the algorithm runs over the lists representing $\mathbf{Q}^0$ and $\mathbf{Q}^1$. Expressions (3) and (4) yield values that are placed in appropriate positions in the lists representing $\mathbf{R}^0$ and $\mathbf{R}^1$ by using the other SORT list. Note that the products in (1)-(4) require computations over the sets $\mathcal{M}(n)\setminus m$ and $\mathcal{N}(m)\setminus n$. An efficient way to do this is to run forward-backward passes over the lists, as illustrated in Fig. 2 [16].
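A minimal sketch of the two ingredients described above, the sparse index lists with a precomputed reordering and the forward-backward pass for "all but one" products, is given below. The names `index_lists`, `col_to_row`, and `leave_one_out_products` are hypothetical and are not the list names used in this paper; the layout is one plausible choice, not necessarily the one behind Fig. 2.

```python
def index_lists(H):
    """Column-wise and row-wise lists of nonzero positions, plus a reordering.

    col_to_row[k] is the position in the row-wise list of the k-th entry of the
    column-wise list, so a value computed in one ordering can be stored
    directly where the other step will read it.
    """
    M, N = len(H), len(H[0])
    col_list = [(m, n) for n in range(N) for m in range(M) if H[m][n]]
    row_list = [(m, n) for m in range(M) for n in range(N) if H[m][n]]
    position_in_rows = {edge: k for k, edge in enumerate(row_list)}
    col_to_row = [position_in_rows[edge] for edge in col_list]
    return col_list, row_list, col_to_row


def leave_one_out_products(values):
    """For each i, the product of all entries of values except values[i].

    A forward pass stores prefix products and a backward pass stores suffix
    products; each result is then a single multiplication, and no division
    is needed (so zero entries are handled correctly).
    """
    n = len(values)
    forward = [1.0] * (n + 1)
    for i in range(n):
        forward[i + 1] = forward[i] * values[i]
    backward = [1.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        backward[i] = backward[i + 1] * values[i]
    return [forward[i] * backward[i + 1] for i in range(n)]


# Usage on a hypothetical toy matrix.
H_toy = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]
col_list, row_list, col_to_row = index_lists(H_toy)
```

For a list of length $k$ the two passes cost about $3k$ multiplications in total, instead of the $k(k-2)$ needed when each product is recomputed from scratch.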
Computations in the Logarithmic Domain: In the vertical and horizontal steps, we compute products of certain local costs (1)- (4) . Multiplications are computationally expensive compared with additions. It is possible to replace multiplications with additions if one operates in the logarithmic domain. An 
The function is tabulated for positive . Since decays quickly as increases, a small look-up table can provide high accuracy in evaluating the function [18] .
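The table look-up idea can be illustrated as follows. The tabulated function is not legible in this copy of the text, so the sketch assumes it is the usual correction term $f(x) = \ln(1 + e^{-x})$, which appears when a sum of two probabilities is evaluated in the logarithmic domain, $\ln(e^a + e^b) = \max(a, b) + f(|a - b|)$. The step size, table range, and function names are our choices for illustration.

```python
import math

STEP = 1.0 / 16.0      # assumed quantization step of the table
X_MAX = 8.0            # beyond this point the correction is treated as zero
TABLE = [math.log1p(math.exp(-k * STEP)) for k in range(int(X_MAX / STEP) + 1)]


def correction(x):
    """Look up f(x) = ln(1 + exp(-x)) for x >= 0 from the precomputed table."""
    k = int(round(x / STEP))
    return TABLE[k] if k < len(TABLE) else 0.0


def log_add(a, b):
    """Approximate ln(exp(a) + exp(b)) with one comparison, one addition, one look-up."""
    return max(a, b) + correction(abs(a - b))


def exact_log_add(a, b):
    """Reference value computed without the table."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))
```

With products replaced by additions of logarithms, the only place where the table is needed is the summation of probabilities, and because $f$ decays quickly a table of this size (129 entries under the assumptions above) already keeps the error small.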
IV. HARDWARE ARCHITECTURE OF ITERATIVE DECODER
In this section we shall discuss general hardware algorithms that could be used to implement the iterative decoder. We emphasize the algorithm's inherent parallelism that could lead to low-cost hardware implementations.
The decoder consists of four processor units and three interconnection networks that form a pipeline design. A memory module is used as a data buffer. The schematic configuration that illustrates parallelism and pipelineability [19] of the decoder is shown in Fig. 3 .
In Fig. 3 the processor unit PU computes the difference between matrices and rather than the matrices themselves, making use of the simplification explained in Section II.
Recall that the sum-product algorithm alternates between computing the elements of the matrices $\mathbf{Q}$ and $\mathbf{R}$. Hence, if the decoder is dedicated to decoding a single codeword at a time, only a fraction of the circuitry will perform decoding operations at any given moment of the decoding cycle (iteration step). To increase the circuit utilization, we suggest simultaneous decoding of several received codewords. This calls for use of a data buffer to store these codewords and enter them into the decoder when a free slot is available. The tradeoff in the proposed scheme is an additional decoding latency. (The decoder in Fig. 3 can contain five codewords in PUs and ICNs simultaneously, so the total decoder latency is 20 ms × 5 codewords = 100 ms.)
Iterative decoding is generally characterized by a variable decoding effort. Usually the number of iterations performed by a decoder is hard-limited in order to avoid these variations. In our scheme, the decoder may perform a larger number of iterations for particularly noisy received codewords since the data buffer will damp such computational peaks. This is illustrated in Fig. 4 .
The signal flow graph for the processor units PU is shown in Fig. 5 . Processor units PU execute the vertical step of the sum-product algorithm. Recall that the number of nonzero elements in a column of is either 2 or 3. Thus, the processor unit consists of the top part a) (weight 2 columns) and the bottom part b) (weight 3 columns). Appropriate routing for this arrangement is done by ICN and ICN . Fig. 6 shows the processor unit PU . It performs the syndrome computation of the tentative noise vector and the received vector and compares these syndromes. If they are identical, the control unit CU gives a command to output the decoded vector and load new and (i.e., start decoding of a new vector ). If the syndromes are not identical, PU computes the difference which is used in the next iteration for the same .
The signal flow graph for the processor array PU is shown in Fig. 7 . Note that in our case is always 3, so PU has a "uniform" structure (in contrast with processor units PU and PU ).
The function of interconnection network ICN is twofold. It switches signals from the column-wise structure of PU and PU [that corresponds to Fig. 5(a) and (b) ] to the row-wise structure of PU , and routes signals and to adjacent inputs of PU . Interconnection networks ICN and ICN perform routing from row-wise to column-wise arrangement of signals required for PU and PU . These networks are programmed only once for each Gallager code.
The networks ICN , ICN and ICN have global but static connections. One possible realization of these networks is a programmable cross-bar switch such that every input to the network can be connected to any output. This configuration is shown in Fig. 8 .
A better architecture for ICN , ICN and ICN is achieved by using Banyan networks [20] , [21] . An example of a three-stage Banyan network is given in Fig. 9 .
Let us denote the input (and output
The most immediate problem that arises in a Banyan network, but not in a cross-bar switch, is message blocking. If two messages try to arrive at the same output of a 2 × 2 cross-bar switch (the basic block in Fig. 9), then one message is rejected. One way to circumvent this difficulty is to introduce a small number of redundant paths into the Banyan network. An alternative is to introduce a buffer of a certain depth at the network's input. Since the arrival of messages at the network's input is not random but fixed for any given Gallager code, pre-computed routing of the messages through the network makes it possible to avoid message blocking altogether.
V. ENCODING AND DECODING COMPUTATIONAL COMPLEXITY
In this section we discuss one of the most important issues that determines the feasibility of a practical coding scheme. Whereas many block and convolutional codes are known to perform well, a restricting factor in their practical implementation is the high complexity of the software and hardware required to extract this performance.
As we recall, Gallager's motivation for studying low-density parity-check codes was precisely the existence of low-complexity decoders for such codes. Besides being amenable to low-complexity implementations, Gallager codes fortunately also turned out to have many good properties that result in good performance in a variety of communications channels.
The possibility of implementing a coding scheme with low complexity is particularly important in the context of this work. In mobile cellular telephony it is next to impossible to implement expensive software or hardware algorithms on the mobile side because of a variety of severely restrictive factors, e.g., the need to maximize time intervals between battery re-chargings, to maintain small data delays, etc. There are two major contributing factors to the complexity of the coding scheme based on Gallager codes, namely, the costs of encoding and decoding.
A. Encoding
Let us first discuss the encoding complexity for the rate 1/8, $\{1536, -, 3\}$ Gallager codes. The code's generator matrix has dimensions 192 × 1536. Unlike the corresponding parity-check matrix, the generator matrix is not sparse. Hence, the encoder has to store the entire generator matrix. The storage space required for the matrix is 192 × 1536/8 = 36 864 bytes. The size of the storage space can be slightly reduced if the generator matrix is in a systematic form, so that the encoder needs to store only the nonsystematic part. For these Gallager codes this will save 192 × 192/8 = 4608 bytes.
Each bit of an output codeword is obtained by multiplying in a bit-wise manner a length-192 binary vector of input data with a column of the generator matrix and computing the overall parity. The number of clock cycles required to accomplish such a multiplication depends on the length of the processor's registers. For example, if the processor's word length is 32 bits, then the operation will require 192/32 = 6 cycles, and the total number of clock cycles required for encoding is 6 × 1536 = 9216 (6 × 1344 = 8064 clock cycles if the matrix is in the systematic form).
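The word-wise encoding procedure and its cycle count can be mimicked in software. The following is a sketch under our own assumptions: the generator matrix is given as a dense list of rows, each 32-bit AND of a data word with a column word is counted as one cycle, and the parity of the accumulated word is taken to be free (as it would be with dedicated exclusive-OR circuitry). The function names are illustrative.

```python
WORD = 32  # assumed processor word length in bits


def pack_bits(bits, word=WORD):
    """Pack a list of 0/1 values into integers holding up to `word` bits each."""
    words = []
    for i in range(0, len(bits), word):
        w = 0
        for b in bits[i:i + word]:
            w = (w << 1) | b
        words.append(w)
    return words


def parity(x):
    """Parity (modulo-2 sum) of the bits of a nonnegative integer."""
    return bin(x).count("1") & 1


def encode(u, G):
    """Encode data bits u with a K x N generator matrix G (list of rows).

    Returns the codeword and the number of word-level AND operations used.
    """
    K, N = len(G), len(G[0])
    u_words = pack_bits(u)
    codeword, cycles = [], 0
    for j in range(N):
        column_words = pack_bits([G[i][j] for i in range(K)])
        acc = 0
        for uw, cw in zip(u_words, column_words):
            acc ^= uw & cw      # bit-wise multiply, accumulate modulo 2
            cycles += 1
        codeword.append(parity(acc))
    return codeword, cycles


# Storage needed for a dense 192 x 1536 binary matrix, in bytes.
storage_bytes = 192 * 1536 // 8
```

In a real encoder the columns would of course be packed once and stored, not repacked for every frame; the repacking here only keeps the sketch short.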
A multiplier in a digital signal processing (DSP) chip does not normally contain a circuit for the parity computation. However, if this operation is done often (as in our case) such a circuit can be easily implemented with exclusive-OR gates. Hence, the cost of the parity computation is negligible in any application specific hardware.
For comparison, encoding of a rate 1/2, constraint length 7 convolutional code will require (for an information block length of 192) 192 + 6 = 198 register shifts and 198 × 2 = 396 bitwise multiplications, with two multiplications for each register shift (since the code rate is 1/2). The total number of clock cycles required for encoding is 198 + 396 = 594. Hence, encoding of a Gallager code is slower than encoding of a convolutional code with the parameters discussed above. Similar analysis applies to turbo codes since in a turbo code component codes are recursive convolutional codes of rate 1.
However, boolean operations bear a lower cost compared with arithmetic operations required for decoding of the received codeword. (By an arithmetic operation we mean either an addition or multiplication of two floating or fixed point real numbers.) The overall encoding/decoding complexity is dominated by the decoding complexity since decoding requires thousands of arithmetic operations per frame for Gallager codes.
B. Decoding
Let us calculate the decoding complexity for the Gallager codes. The computational complexity of the sum-product algorithm is roughly proportional to the number of nonzero elements in the parity-check matrix $\mathbf{H}$. In what follows, we compute the number of arithmetic operations (OPs) required for decoding (the number format could be either floating or fixed point depending on the implementation), i.e., we consider operations of adding or multiplying two numbers to have equal cost.
Consider a $\{1536, -, 3\}$, rate 1/8 Gallager code, i.e., $N = 1536$, $K = 192$, and $M = N - K = 1344$ parity checks with three sites each. We start with the horizontal step of the algorithm.
To compute the total cost of the horizontal step we need to take into account all $M$ parity-checks. $3M = 4032$ additions are required to compute the differences of $q^0_{mn}$ and $q^1_{mn}$, which are necessary in the horizontal step [13]. Computation of the elements of $\mathbf{R}^0$ and $\mathbf{R}^1$ requires two multiplications and one addition per element, which amounts to $3 \times 4032 = 12\,096$ OPs. Thus, the cost of the horizontal step is $4032 + 12\,096 = 16\,128$ OPs.
The vertical step involves computations of particular elements $q^0_{mn}$ and $q^1_{mn}$ as well as computations of tentative posterior probabilities $q^0_n$ and $q^1_n$. Construction 2 produces $N - 3K$ columns of weight 3 and $3K$ columns of weight 2 (for code rates $R \le 1/3$). Substituting the numerical values, we obtain 960 columns of weight 3 and 576 columns of weight 2.
For each weight 3 column a total of 5 multiplications is required to compute the quantities $q^0_{mn}$ or $q^1_{mn}$. (Footnote 1: Suppose we need to compute the quantities $p\,r_1 r_2$, $p\,r_1 r_3$, and $p\,r_2 r_3$. We notice that $p\,r_1 r_2 = (p\,r_1)\,r_2$ and $p\,r_1 r_3 = (p\,r_1)\,r_3$ (a total of 3 multiplications); and $p\,r_2 r_3 = (p\,r_2)\,r_3$ (two additional multiplications).) The normalization factor $\alpha_{mn}$ is convenient but not essential, and presently we do not take it into account. Therefore, 2 × 5 × 960 = 9600 OPs are required. For each column with 2 nonzero elements only 1 multiplication is required for each element, so the complexity is 2 × 1 × 2 × 576 = 2304 OPs. The computation of the tentative posterior probabilities (5) and (6) will require only $2N = 3072$ OPs, as these can be obtained by multiplying (1) and (2) by an appropriate column element $r^0_{mn}$ and $r^1_{mn}$. Thus, the total complexity of the sum-product algorithm per iteration is 16 128 + 9600 + 2304 + 3072 = 31 104 OPs. The complexity per information bit per iteration is then 31 104/192 = 162 OPs. For comparison, iterative decoding of 16-state turbo codes has a cost of 192 OPs per information bit per iteration [22]. [For a turbo code the number of multiplications per information bit is $12S$, where $S$ is the state complexity of constituent codes. Note that this expression is independent of code rate. We will obtain the complexity of turbo-decoding at 192 OPs per information bit per iteration if we assume that the state complexity of constituent codes is 16 (as in [23]).] The overall decoding complexity of Gallager codes depends on the number of iterations required for decoding. The mean of this number depends in turn on the channel SNR and also on the definition of the mean, i.e., whether one computes the average over all decodings (in which case the average will also depend to a large degree on the maximum number of iterations the decoder is allowed to perform) or over only successful decodings.
For higher channel SNR both definitions will yield approximately the same results as the number of unsuccessful (nonconvergent) decodings becomes small (but the computed averages may diverge significantly for these two definitions as SNR decreases). We observed in simulations (where we computed the average over all decodings) that the decoder requires on the order of ten iterations for signal-to-noise ratios in the range of 2.5-3.5 dB.
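The operation counts of this section can be reproduced with a few lines of arithmetic. The function below is a bookkeeping sketch, not part of the decoder; its name and the way the column counts are expressed in terms of $N$ and $K$ (all checks of weight 3, columns of weight 2 or 3) are our assumptions, chosen to match the figures quoted above.

```python
def ops_per_iteration(N=1536, K=192):
    """Arithmetic operations per iteration for a code with row weight 3."""
    M = N - K                       # number of parity checks
    nonzeros = 3 * M                # every check involves three sites
    weight3_cols = N - 3 * K        # 960 for N = 1536, K = 192
    weight2_cols = 3 * K            # 576 for N = 1536, K = 192
    assert 3 * weight3_cols + 2 * weight2_cols == nonzeros

    differences = nonzeros                  # one subtraction per nonzero element
    check_messages = 3 * nonzeros           # two multiplications and one addition
    horizontal = differences + check_messages
    vertical_w3 = 2 * 5 * weight3_cols      # five multiplications, for q0 and q1
    vertical_w2 = 2 * 1 * 2 * weight2_cols  # one multiplication per element
    posteriors = 2 * N                      # one extra multiplication per value
    total = horizontal + vertical_w3 + vertical_w2 + posteriors
    return {
        "horizontal": horizontal,
        "vertical_w3": vertical_w3,
        "vertical_w2": vertical_w2,
        "posteriors": posteriors,
        "total": total,
        "per_info_bit": total / K,
    }


counts = ops_per_iteration()
turbo_ops_per_bit = 12 * 16   # 12 S with state complexity S = 16
```

Multiplying `counts["per_info_bit"]` by the average number of iterations (on the order of ten in the SNR range quoted above) gives a rough per-bit decoding cost for a complete frame.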
VI. SIMULATION RESULTS

A. Performance of Gallager Codes in AWGN Channels
In this set of simulations we modeled the performance of the decoder based on the sum-product algorithm. The site likelihoods for the AWGN channel with noise variance $\sigma^2$ and $\pm 1$ antipodal signaling are [13]
$$p^0_n = \frac{1}{1 + \exp(-2 y_n/\sigma^2)} \qquad (8)$$
and
$$p^1_n = \frac{1}{1 + \exp(2 y_n/\sigma^2)}. \qquad (9)$$
If the received signal component $y_n$ is negative, then the expressions (8) and (9) for the conditional probabilities $p^0_n$ and $p^1_n$ should be exchanged. In our simulations we ran the decoder on 5 10 frames for 2 dB and 10 frames for 2 dB. For both the AWGN and fading channels, we observed that the decoder produces no undetected errors, i.e., all decoding errors resulted from nonconvergent decoder behavior (we declare nonconvergence if the number of iterations reaches the upper limit, which in our simulations was set at 2000 iterations). A similar result was observed for Gallager codes with longer block lengths in [13]. Fig. 10 shows the performance of the $\{1536, -, 3\}$, rate 1/8 Gallager code generated by Construction 2 in the AWGN channel (while it is difficult to make a fully meaningful comparison between the result of Fig. 10 and the results reported in the literature for other classes of codes due to different code rates, decoding complexities, etc., the reader may be interested in [3], where certain results for short-frame turbo codes are reported).
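As an illustration of how the site likelihoods might be initialized from a received sample, the sketch below assumes the standard expressions for $\pm 1$ signaling in Gaussian noise of variance $\sigma^2$, written in terms of $|y_n|$ so that the exchange for negative samples is automatic. The function name and the parameter name `sigma2` are ours.

```python
import math


def site_likelihoods(y, sigma2):
    """Return (p0, p1): likelihoods that the hard decision on y is right / wrong.

    p1 is the likelihood that the noise bit equals 1, i.e., that the sign of y
    does not match the transmitted symbol. Using |y| handles both signs.
    """
    t = 2.0 * abs(y) / sigma2
    e = math.exp(-t)            # never overflows because t >= 0
    p1 = e / (1.0 + e)
    return 1.0 - p1, p1
```

A sample close to zero gives values near (0.5, 0.5), which tells the decoder that the hard decision for that site carries little information.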
B. Performance of Gallager Codes in Fully Interleaved Flat Rayleigh-Fading Channels
We simulated the performance of the sum-product iterative decoder for short-frame Gallager codes in fully interleaved flat Rayleigh-fading channels, i.e., we assumed that fading amplitudes at different time instants are independent Rayleigh random variables with a unit second moment. The assumption of a fully interleaved Rayleigh-fading channel may be somewhat unrealistic in IS-95 systems since only a single code frame is interleaved. However, good code performance in this type of channel may be an indication of good performance in channels with correlated fading (and certainly warrants further investigation in this direction).
We considered $\pm 1$ antipodal signaling. We assumed perfect synchronization at the receiver. Then the received signal is $y = a x + w$, where $x$ is the transmitted signal ($\pm 1$), $w$ is the white Gaussian noise with variance $\sigma^2$, and $a$ is a Rayleigh random variable such that $E[a^2] = 1$.
In the first set of simulations we assumed that channel state information (CSI) is available to the decoder; i.e., the decoder knows the fading amplitudes for each transmitted symbol. Fig. 11 shows the performance of the sum-product iterative decoder for Gallager codes in a flat Rayleigh fading channel (some results regarding the performance of turbo codes in fading channels can be found in [3] and, most recently, in [25] ).
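The channel model with known fading amplitudes can be sketched as follows. The code is an illustration under our own assumptions: the Rayleigh amplitude is generated from two independent Gaussian components of variance 1/2 (which gives a unit second moment), and the flip likelihood with CSI is the standard expression for a known gain, not necessarily the exact form used in the simulations reported here. All names are hypothetical.

```python
import math
import random


def rayleigh_amplitude(rng):
    """Rayleigh-distributed amplitude with unit second moment."""
    g1 = rng.gauss(0.0, math.sqrt(0.5))
    g2 = rng.gauss(0.0, math.sqrt(0.5))
    return math.hypot(g1, g2)


def fading_channel(x, sigma, rng):
    """Pass +/-1 symbols through an independent (fully interleaved) fading channel.

    Returns a list of (received sample, fading amplitude) pairs; the amplitude
    is the channel state information made available to the decoder.
    """
    out = []
    for symbol in x:
        a = rayleigh_amplitude(rng)
        out.append((a * symbol + rng.gauss(0.0, sigma), a))
    return out


def flip_likelihood_with_csi(y, a, sigma):
    """Likelihood that the hard decision on y is wrong when the gain a is known."""
    t = 2.0 * a * abs(y) / sigma ** 2
    e = math.exp(-t)
    return e / (1.0 + e)


# Usage: send the all-(+1) sequence and compute the decoder's initial values.
rng = random.Random(1)
received = fading_channel([1] * 8, 0.8, rng)
p1 = [flip_likelihood_with_csi(y, a, 0.8) for y, a in received]
```

A deep fade (small `a`) pushes the flip likelihood toward 0.5 even for a sample of moderate magnitude, which is exactly the information that is lost when the amplitudes are not known to the decoder.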
If the channel state information is not available to the decoder (Fig. 12), the performance deteriorates by about 1 dB (we marginalize over the fading amplitude to obtain an effective likelihood ratio [25], [26]).
VII. CAPACITY OF A CDMA CELL
It is known [27] that the system capacity of the reverse CDMA link, assuming a single cell and perfect receiver synchronization, can be approximated by expression (10) in terms of the number of users, the processing gain, and the signal-to-interference ratio. We now use expression (10) to evaluate the CDMA cell capacity, i.e., the number of admissible users for a given bit-error rate (BER). We assume coherent detection in both forward and reverse links. The system capacity can be computed by using bit-error probabilities for Gallager codes in the AWGN environment. From Fig. 10 and (10) we obtain the CDMA cell capacity versus signal-to-interference ratio. The result is shown in Fig. 13. Fig. 13 also compares the cell capacity of a CDMA system that uses Gallager codes with other CDMA systems built on similar principles. Ormondroyd and Maxey [4] consider a CDMA system that uses very low rate orthogonal convolutional codes. In particular, they consider rate 1/64 orthogonal convolutional codes. To make a fair comparison with their results we assumed that the processing gain is 64 in (10) (where 64 is formed as a product of the reciprocal of the code rate (1/8) and an additional spreading factor of 8). Note that the decoding complexity per information bit of rate 1/64 orthogonal convolutional codes is comparable [27] with the decoding complexity (per information bit per iteration) of $\{1536, -, 3\}$ Gallager codes, but Gallager codes will typically require several iterations for decoding.
In digital speech transmission, typical data BERs that the system can tolerate are on the order of 10 . From Fig. 13 , we observe that an uncoded CDMA system does not admit more than seven users without an increase in the data BER, i.e., without a reduction in quality of service. The CDMA system with orthogonal convolutional codes supports 16 users at BERs of 10 , and the CDMA system with Gallager codes supports 37 users. Hence, a CDMA system with Gallager codes achieves more than a twofold increase in the system capacity compared with [4] , and more than a fivefold increase in the system capacity compared with an uncoded CDMA system.
VIII. DISCUSSION
In the present work we searched for good short-frame codes that could be used in CDMA applications in digital speech transmission, where processing delay constraints are critical. We were particularly interested in codes for which very efficient (hardware or software) decoding algorithms exist. Iterative decoding algorithms are the focal point of most recent research effort in coding theory, and they were the primary objects of interest in our work.
We found a family of low-rate Gallager codes that exhibits good performance under the desired constraints. Theoretical results show that Gallager codes are good even for relatively short block lengths. CDMA systems that use Gallager codes could outperform other CDMA systems that use weaker error-protection codes.
Gallager codes admit highly parallel software and hardware implementations that guarantee high speed data processing. Moreover, low complexity iterative decoders are readily available for Gallager codes.
We note that Richardson et al. [9] , [28] have recently provided analysis and design guidelines for irregular Gallager codes (introduced by Luby et al. [29] ) that perform remarkably well for long block lengths, coming even closer to the Shannon limit on AWGN channels than turbo codes do. It is conceivable that irregular Gallager codes might also provide some performance advantage for the relatively short block lengths considered here, but we have not investigated this possibility.
Since Gallager codes exhibit good performance in AWGN channels and, perhaps, more importantly, in fading channels that arise frequently in mobile communications, we conclude that Gallager codes could, perhaps should, become the error-control scheme of choice in future generations of CDMA systems.
