I. INTRODUCTION

IN RECENT YEARS, there has been considerable interest in soft-output decoding algorithms, that is, algorithms that provide a measure of reliability for each bit that they decode. The most promising applications of soft-output decoding algorithms are probably turbo codes and related concatenated coding techniques [1]. Decoders for these codes consist of several concatenated soft-output decoders, each of which decodes part of the overall code and then passes "soft" reliability information to the other decoders. The component soft-output algorithm prescribed in the original turbo code paper [1] is usually known as the maximum a posteriori (MAP), forward-backward (FB), or Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm [2], [3]. This algorithm, originally described in the late 1960s, was generally overlooked in favor of the less complex Viterbi algorithm [4], [5]; moreover, applications taking advantage of soft-output information were not evident. In this paper, we describe techniques for implementing the MAP algorithm that are suitable for very large-scale integration (VLSI) implementation.
The main idea in this paper can be summarized as extending well-known techniques used in implementing the Viterbi algorithm to the MAP algorithm. The MAP algorithm can be thought of as two Viterbi-like algorithms running in opposite directions over the data, albeit with a slightly different computational kernel.
This paper is structured in the following way. Section II is a brief description of the MAP algorithm in the logarithmic domain. Section III studies the problem of internal representation of the state metrics for a fixed-point implementation. Section IV focuses on efficient architectures to realize a forward (or backward) recursion. The log-likelihood ratio (LLR) calculation is also briefly described. Section V proposes several schedules for the forward and backward recursions. As the computations of the forward and the backward recursions are symmetrical in time (i.e., identical in terms of hardware computation), only the forward recursion is described in Sections III and IV.
II. MAP ALGORITHM
A. Description of the Algorithm
The MAP algorithm is derived in [3] and [6], to which the reader is referred for a detailed description. The original derivation of the MAP algorithm was in the probability domain. The output of the algorithm is a sequence of decoded bits along with their reliabilities. This "soft" reliability information is generally described by the a posteriori probability (APP) $P(u_k \mid y)$. For an estimate of bit $u_k$ ($-1/+1$) having received symbol $y$, we define the optimum soft output as

$$L(u_k) = \ln \frac{P(u_k = +1 \mid y)}{P(u_k = -1 \mid y)} \qquad (1)$$

which is called the log-likelihood ratio (LLR). The LLR is a convenient measure, since it encapsulates both soft and hard bit information in one number. The sign of the number corresponds to the hard decision while the magnitude gives a reliability estimate. The original formulation of the MAP algorithm requires multiplication and exponentiation to calculate the required probabilities.
In this paper, we consider the MAP algorithm in the logarithmic domain as described in [7]. The MAP algorithm, in its native form, is challenging to implement because of the exponentiation and multiplication. If the algorithm is implemented in the logarithmic domain like the Viterbi algorithm, then the multiplications become additions and the exponentials disappear. Addition is transformed according to the rule described in [8]. Following [9], the additions are replaced using the Jacobi logarithm

$$\mathrm{MAX}^*(a, b) = \ln(e^{a} + e^{b}) = \max(a, b) + \ln(1 + e^{-|a - b|}) \qquad (2)$$

which is called the MAX* operation, to denote that it is essentially a maximum operator adjusted by a correction factor. The second term, a function of the single variable $|a - b|$, can be precalculated and stored in a small lookup table (LUT) [9]. The computational kernel of the MAP algorithm is the Add-MAX* operation, which is analogous, in terms of computation, to the Add-Compare-Select (ACS) operation in the Viterbi algorithm adjusted by an offset known as a correction factor. In what follows, we will refer to this kernel as ACSO (Add-Compare-Select-Offset).
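As an illustration, the Jacobi logarithm of (2), ln(e^a + e^b) = max(a, b) + ln(1 + e^{-|a-b|}), can be sketched in a few lines of Python. This is a floating-point sketch only, not a description of the hardware; the function names `max_star` and `max_log` are assumptions introduced here.

```python
import math


def max_star(a: float, b: float) -> float:
    """Jacobi logarithm: ln(e^a + e^b) written as a maximum plus a correction term."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_log(a: float, b: float) -> float:
    """The same operation with the correction term dropped (plain compare-select)."""
    return max(a, b)


if __name__ == "__main__":
    a, b = 1.0, 2.5
    print(max_star(a, b), math.log(math.exp(a) + math.exp(b)))
    print(max_star(a, b) - max_log(a, b))  # the correction factor for |a - b| = 1.5
```

The correction term depends only on |a - b| and never exceeds ln 2, which is what makes a small lookup table sufficient.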
The algorithm is based on the same trellis as the Viterbi algorithm. The algorithm is performed on a block of received symbols which corresponds to a trellis with a finite number of stages . We will choose the transmitted bit from the set of { }. Upon receiving the symbol from the additive white Gaussian noise (AWGN) channel with noise variance , we calculate the branch metrics of the transition from state to state as (3) where is the expected symbol along the branch from state to state . The multiplication by can be done with either a multiplier or an LUT. Note that in the case of a turbo decoder which uses several iterations of the MAP algorithm, the multiplication by need only be done at the input to the first MAP algorithm [6] .
The algorithm consists of three steps.
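• Forward Recursion. The forward state metrics $A_k(s)$ are recursively calculated and stored as

$$A_k(s) = \mathrm{MAX}^*_{s'} \left[ A_{k-1}(s') + \gamma_k(s', s) \right] \qquad (4)$$

where $\gamma_k(s', s)$ is the branch metric of (3) for the transition from state $s'$ to state $s$, and the MAX* operation is taken over all states $s'$ connected to $s$. The recursion is initialized by forcing the starting state to state 0 and setting

$$A_0(0) = 0, \qquad A_0(s) = -\infty \ \text{for } s \neq 0 \qquad (5)$$

• Backward Recursion. The backward state metrics $B_k(s)$ are recursively calculated and stored as

$$B_{k-1}(s') = \mathrm{MAX}^*_{s} \left[ B_k(s) + \gamma_k(s', s) \right] \qquad (6)$$

The recursion is initialized by forcing the ending state to state 0 and setting

$$B_N(0) = 0, \qquad B_N(s) = -\infty \ \text{for } s \neq 0 \qquad (7)$$

where $N$ is the number of trellis stages. The trellis termination condition requires the entire block to be received before the backward recursion can begin.

• Soft-Output Calculation. The soft output, which is called the LLR, for each symbol at time $k$ is calculated as

$$L(u_k) = \mathrm{MAX}^*_{(s', s):\, u_k = +1} \left[ A_{k-1}(s') + \gamma_k(s', s) + B_k(s) \right] - \mathrm{MAX}^*_{(s', s):\, u_k = -1} \left[ A_{k-1}(s') + \gamma_k(s', s) + B_k(s) \right] \qquad (8)$$

where the first term is over all branches with input label $+1$, and the second term is over all branches with input label $-1$.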
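The three steps above can be sketched in floating-point Python as follows. This is a minimal sketch under stated assumptions, not the architecture studied in this paper: the trellis is passed as a hypothetical list `branches` of `(previous_state, next_state, input_label)` tuples, `gammas[k][j]` is assumed to hold the already-computed metric of branch `j` at stage `k`, and the trellis is assumed to start and end in state 0.

```python
import math

NEG_INF = float("-inf")


def max_star(a, b):
    """Jacobi logarithm, with -inf treated as 'no path yet'."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def log_map(gammas, branches, num_states):
    """Return one LLR per trellis stage (stages are indexed from 0 here)."""
    n = len(gammas)

    # Forward recursion: A[k] holds the metrics at the input of stage k.
    A = [[NEG_INF] * num_states for _ in range(n + 1)]
    A[0][0] = 0.0  # starting state forced to state 0
    for k in range(n):
        for j, (sp, sn, _) in enumerate(branches):
            A[k + 1][sn] = max_star(A[k + 1][sn], A[k][sp] + gammas[k][j])

    # Backward recursion: B[k + 1] holds the metrics at the output of stage k.
    B = [[NEG_INF] * num_states for _ in range(n + 1)]
    B[n][0] = 0.0  # ending state forced to state 0
    for k in range(n - 1, -1, -1):
        for j, (sp, sn, _) in enumerate(branches):
            B[k][sp] = max_star(B[k][sp], B[k + 1][sn] + gammas[k][j])

    # Soft output: MAX* over +1 branches minus MAX* over -1 branches.
    llr = []
    for k in range(n):
        pos = neg = NEG_INF
        for j, (sp, sn, u) in enumerate(branches):
            metric = A[k][sp] + gammas[k][j] + B[k + 1][sn]
            if u > 0:
                pos = max_star(pos, metric)
            else:
                neg = max_star(neg, metric)
        llr.append(pos - neg)
    return llr


if __name__ == "__main__":
    # Hypothetical two-state trellis and arbitrary branch metrics.
    branches = [(0, 0, -1), (0, 1, +1), (1, 1, -1), (1, 0, +1)]
    gammas = [[0.3, -0.2, 0.1, 0.5], [-0.4, 0.7, 0.2, -0.1], [0.6, 0.0, -0.3, 0.4]]
    print(log_map(gammas, branches, num_states=2))
```

Both recursions and the soft-output step reuse the same add-then-MAX* kernel, which is why a single processing-unit design can serve all of them.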
The MAP algorithm, as described, requires the entire message to be stored before decoding can start. If the blocks of data are large, or the received stream continuous, this restriction can be too stringent; "on-the-fly" decoding using a sliding-window technique has to be used. Similar to the Viterbi algorithm, we can start the backward recursion from the "all-zero vector" (i.e., all the components of are equal to zero) with data { }, from down to . iterations of the backward recursion allows us to reach a very good approximation of (where is a positive additive factor) [10] , [11] . This additive coefficient does not affect the value of the LLR. In the following, we will consider that after cycles of backward recursion, the resulting state metric vector is the correct one. This property can be used in a hardware realization to start the effective decoding of the bits before the end of the message. The parameter is called the convergence length. For on-the-fly decoding of nonsystematic convolutional codes as discussed in [10] and [11] , five to ten times the constraint length was found to lead only to marginal signal-to-noise ratio (SNR) losses. For turbo decoders, due to the iterative structure of the computation, an increased value of might be required to avoid an error floor. A value of is reported in [12] for a recursive systematic code with a constraint length of five. In practice, the final value of has to be determined via system simulation and analysis of the particular decoding system at hand.
B. Upper Bounds for
All the following upper bounds are derived from the definition of in (2):
For practical implementation, one can notice that, due to the finite precision of the hardware implementation, the function gives a zero result as soon as is large enough. For example, if the values are coded in fixed precision with three binary places (a quantum of 0.125), then , thus it will be rounded to 0. In that case, the computation of the offset of the operator can be performed with two pieces of information: a Boolean (for zero) that indicates if is above or equal to the first power of two greater than 2.5, i.e., four. If is true, then the offset is equal to 0. If not, its exact value is computed with the five least significant bits of . The maximum number is , which will be quantized to 0.75, i.e., the width of the LUT is three bits for our example. An LUT is the most straightforward way to perform this operation [9] , [15] . In the general case, there is a positive value such that if (10) We also have (11) with equality if (note that quantization of the LUT is also discussed in [14] ).
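The LUT-based offset described above can be sketched as follows. This is a sketch under stated assumptions: the quantum of 0.125 and the threshold of four are taken from the example in the text, while rounding to the nearest quantum and the names `build_lut`, `offset`, and `max_star_fixed` are assumptions introduced here (the rounding mode of the original design is not stated).

```python
import math

QUANTUM = 0.125    # three binary places, as in the example above
THRESHOLD = 4.0    # differences at or above this value give a zero offset


def build_lut():
    """One entry per representable difference below THRESHOLD (5-bit address)."""
    size = int(THRESHOLD / QUANTUM)  # 32 entries
    lut = []
    for i in range(size):
        exact = math.log1p(math.exp(-i * QUANTUM))
        # Assumed rounding mode: round to the nearest multiple of QUANTUM.
        lut.append(math.floor(exact / QUANTUM + 0.5) * QUANTUM)
    return lut


LUT = build_lut()


def offset(delta: float) -> float:
    """Correction term for a quantized, non-negative difference delta = |a - b|."""
    if delta >= THRESHOLD:               # the Boolean "zero" flag of the text
        return 0.0
    return LUT[int(delta / QUANTUM)]     # addressed by the five low-order bits


def max_star_fixed(a: float, b: float) -> float:
    """Compare-select followed by the quantized offset."""
    return max(a, b) + offset(abs(a - b))


if __name__ == "__main__":
    print(LUT[0], max(LUT))
    print(max_star_fixed(1.0, 1.0))
```

Every stored value is a small multiple of the quantum, so each LUT word needs only three bits in this example.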
III. PRECISION OF STATE METRICS
The precision of the state metrics is an important issue for VLSI implementation. The number of bits used to code the state metrics determines both the hardware complexity and the speed of the hardware. This motivates the need for techniques which minimize the number of bits required without modifying the behavior of the algorithm.
The same problem has been intensively studied for the Viterbi algorithm ( [16] , [17] ) and solutions using rescaling or modulo 2 arithmetic are widely used [18] , [19] . These techniques are based on the fact that, at every instant , the dynamic range of a state metric (i.e., the difference between the state metrics with the highest and lowest values), is bounded by . The forward recursion in the MAP algorithm is slightly different than the Viterbi algorithm since:
1) the outputs of the recursion are the state metrics themselves and not the decisions of the ACS;
2) the Add-MAX* is an ACS operation with an added offset (ACSO).
These differences lead us to question whether the well-known implementation techniques for the Viterbi algorithm are also applicable to the MAP algorithm. The first part of this section shows that the LLR result is independent of a global shift of all of the state metrics of the forward and backward recursions. The bounds on the dynamic range of the state metric are then given.
A. Rescaling the State Metrics
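Let us first show that the MAX* operator is shift invariant, that is, it still produces a valid result if both of its arguments have a common constant added to them. Let $a$, $b$, and $c$ be real numbers. From the definition of MAX* in (2), $\mathrm{MAX}^*(a, b) = \max(a, b) + \ln(1 + e^{-|a - b|})$, it follows that

$$\mathrm{MAX}^*(a + c, b + c) = \max(a + c, b + c) + \ln(1 + e^{-|(a + c) - (b + c)|}) = \max(a, b) + c + \ln(1 + e^{-|a - b|}) \qquad (12)$$

Thus

$$\mathrm{MAX}^*(a + c, b + c) = \mathrm{MAX}^*(a, b) + c \qquad (13)$$

According to (13), a common additive constant passes through the MAX* operator unchanged. Thus, a global shift of $c$ for all forward state metrics (or all backward state metrics) would not change the value of the LLR, since the contribution of $c$, when put outside the two MAX* operators of (8), is cancelled. It is therefore the differences between the state metrics, and not their absolute values, that are important, and rescaling of the state metrics can be performed.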
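The cancellation argument can be checked numerically. The following sketch is illustrative only; `llr_like` is a hypothetical helper that mimics the structure of the soft-output step (a difference of two MAX* reductions), and the sample values are arbitrary.

```python
import math


def max_star(a, b):
    """Jacobi logarithm: max plus correction term."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def llr_like(pos, neg):
    """Difference of two MAX* reductions, as in the soft-output calculation."""
    p = pos[0]
    for x in pos[1:]:
        p = max_star(p, x)
    n = neg[0]
    for x in neg[1:]:
        n = max_star(n, x)
    return p - n


pos = [1.25, -0.5, 3.0]
neg = [0.75, 2.0, -1.0]
shift = 100.0

plain = llr_like(pos, neg)
shifted = llr_like([x + shift for x in pos], [x + shift for x in neg])

if __name__ == "__main__":
    print(plain, shifted)  # equal up to floating-point rounding
```

Because the shift appears once in each reduction, it drops out of the difference, so the state metrics may be rescaled at any stage without affecting the soft output.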
B. Approximate Bound on the Dynamic Range of the State Metrics
Let us define as the minimum number such that, for all , there is a path through the trellis between every state at time and . Let us define as the maximum absolute value of the branch metric. Then, for all , a rough bound on the dynamic range of the state metrics is (14) Proof: Let and be, respectively, the maximum and minimum value of the state metric at time . Then, according to the definition of , there is a path of length in the trellis between every state at time and the state with value at time . Since at every step, the maximum state metric (with, eventually, a positive correction factor) is taken, in the worst case, among the path between and , the state metric can decrease by at each step. Thus, is at least equal to or greater than . Similarly, the maximum increase at each stage of the state metric among this path is ( is the maximum value of the correction factor added at each stage). Thus, is lower than, or equal to, . Grouping the upper bound of and the lower bound of leads to (14) .
Note that in the case of a trellis corresponding to a shift register of length , is equal to .
C. Finer Bound on the Dynamic Range of the State Metrics for a Convolutional Decoder
A more precise bound can be obtained in the case of a convolutional decoder using the intrinsic properties of the encoder. The following developments build on previous work on the Viterbi algorithm [17]. Note that this problem has already been independently addressed by Montorsi et al. [20], who extend, through intensive simulation and without formal proof, the result obtained in [21] for the Viterbi algorithm.
1) Exact Bound on :
The lower bound of can be obtained using the Perron-Frobenius theorem [25] . Let us work in the probability domain and let us assume that the branch probabilities (15) are normalized so that (16) In a real system, the are bounded (by the analog-digital conversion) and the standard deviation of the noise is a nonzero value, thus, according to (15) and (16), we have the relation for all (17) Let us first assume that the all-zero path is sent in the channel and that all the received symbols have the highest possible reliability. The forward recursion is performed on the received symbols. Let us study, in this case, the ratio of state probabilities between the state with the highest probability and the state with the lower probability when the forward recursion is performed. Note that this ratio, in the log domain, is associated with the maximum difference between state metrics, i.e., the dynamic range of the state metric.
The initial state vector is the uniformly distributed vector of length , where is the number of states of the trellis.
Since by hypothesis all the branch metrics are independent of time, we can express the forward recursion in an algebraic form using transition matrix (18) By recursion, we have (19) By construction, is a positive irreducible matrix (the coefficients are positive, and only performs a modification of the probability distribution of the state metric vector). Thus, according to the Perron-Frobenius theorem [25] , can be expressed, in the basis of eigenvectors , by a diagonal matrix with the two properties: 1)
, the Perron eigenvector of , is the only eigenvector of that has all of its components positive; 2)
, the Perron eigenvalue associated with is positive and for all . Since, in the trellis, all the states at time are connected to the states at time , we deduce that all the coefficients of are strictly positive. Using the Perron-Frobenius theorem for gives an extra property: the Perron eigenvalue of is strictly greater than its other eigenvalues. From this property, we deduce that this property is also true for , i.e., for all . Let be the decomposition of in the basis . The vector can be expressed as (20) with , for . Let us call (respectively, ) the maximum (minimum) coordinate value of vector , and the ratio . Conjecture: For all , . Proof: First, , since . Second, using (20) , we have (21) and thus (22) Finally, we justify the monotonic increasing of (which achieves the proof) by an intuitive argument.
is the likelihood ratio between the state that has the highest probability (state 0, by construction) and the state with the lowest probability. Since every new incoming branch metric confirms state 0, is an increasing function of . Using the same type of argument, if one, or more, of the first received signals do not have the highest reliability, the resulting ratio will be smaller than . Since the code is linear, the result obtained for the all-zero sequence is true for all sequences of bits. Thus, the logarithm of the ratio gives the maximum differences of the state metric.
2) Exact Bound in Finite Precision: The exact maximum difference obtained with a fixed precision architecture is obtained from (19) starting from the all-zero vector until the system reaches stationarity, i.e., if all state metrics increase by the same constant value at each iteration, is then equal to .
Note that this algorithm is a generalization of the algorithm proposed in [22] for the case of the Viterbi decoder.
3) Simplification of the Computation of the Branch Metrics
is an -dimensional vector with elements { } (or { 1, 0, 1} in the case of a punctured code where 0 is used for a punctured bit). Using (13), the computation of the branch metrics can be extended and simplified (23) The first terms are common to all branch metrics, thus, they can be dropped. The last terms can be decomposed on the dimensions of the vector. Thus, the modified branch metrics are (24) where takes the value of zero for a punctured code symbol. This expression can be used to find the exact bound of .
4) Example:
As an example, let us consider a recursive systematic encoder with generator polynomials (7, 5). Moreover, let us assume that the modified branch metrics are coded using 128 levels, from −15.75 up to 15.75 (the inputs are coded between −7.875 and 7.875, with a step size of 0.125). We assume that the all-zero path is received with the maximum reliability. The resulting state transition diagram (with values of modified branch metrics) is given in Fig. 1. Table I shows the evolution of the state metrics for the first eight iterations of the forward recursion.
As shown in Table I , the value of does not increase after seven iterations. The limit value is the maximum value of the state metric dynamic range obtained for our example. The approximate bound of (14) gives, for this example, . The bound obtained by the above method is much more precise and can lead to more efficient hardware realizations, since the precision of the state metrics is reduced. Note that the initial state vector is important (the all-zero vector). In the case where the initial state is known (state 0, for example), using an initial state that gives the highest probability possible for state zero and the lowest probability for all the other states can lead to some transitory values greater than . The natural solution to avoid this problem is to use the obtained eigenvector (vector (47.250, 0, 15.750, 0) in this example). For turbo-decoder applications, the method can also be used, taking into account the extrinsic information as the initial state.
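The iteration behind this example can be sketched in Python. Several details are assumptions made for this sketch, since Fig. 1 and Table I are not reproduced here: 7 (1 + D + D^2) is taken as the feedback polynomial and 5 (1 + D^2) as the feedforward polynomial, bit 0 is mapped to −1, the state index is 2·s1 + s2, the modified branch metric is the sum of the products of expected and received symbols, and the offset is rounded to the nearest multiple of 0.125.

```python
import math

Q = 0.125  # quantum of the fixed-point representation


def offset(delta):
    """Quantized correction term (rounding to nearest is an assumption)."""
    return math.floor(math.log1p(math.exp(-delta)) / Q + 0.5) * Q


def max_star_q(a, b):
    return max(a, b) + offset(abs(a - b))


def build_trellis(y_sys, y_par):
    """Branches (prev_state, next_state, metric) of the assumed 4-state RSC (7, 5)."""
    branches = []
    for s1 in (0, 1):
        for s2 in (0, 1):
            for u in (0, 1):
                a = u ^ s1 ^ s2          # feedback bit (polynomial 7)
                p = a ^ s2               # parity bit (polynomial 5)
                metric = (2 * u - 1) * y_sys + (2 * p - 1) * y_par
                branches.append((2 * s1 + s2, 2 * a + s1, metric))
    return branches


def dynamic_ranges(branches, iterations):
    """Run the forward recursion from the all-zero vector; track max - min."""
    A = [0.0] * 4
    ranges = []
    for _ in range(iterations):
        new = [None] * 4
        for sp, sn, g in branches:
            cand = A[sp] + g
            new[sn] = cand if new[sn] is None else max_star_q(new[sn], cand)
        A = new
        ranges.append(max(A) - min(A))
    return ranges, [x - min(A) for x in A]


# All-zero path received with maximum reliability.
RANGES, FINAL = dynamic_ranges(build_trellis(-7.875, -7.875), iterations=8)

if __name__ == "__main__":
    for i, r in enumerate(RANGES, start=1):
        print(i, r)
    print(FINAL)
```

Under these assumptions the dynamic range stops growing at the seventh iteration, and the normalized state vector settles at (47.25, 0, 15.75, 0), which agrees with the eigenvector quoted above.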
IV. ARCHITECTURE FOR THE FORWARD AND BACKWARD RECURSIONS
This section is divided into two parts. The first part is a review of the architecture usually used to compute the forward state metrics [9] . The second part is an analysis of the position of the register for the recursion loop in order to increase the speed of the architecture.
A. Computation of the Forward State Metrics: ACSO Unit
The architecture of the processing unit that computes a new value of is shown in Fig. 2 . The structure consists of the well-known ACS unit used for the Viterbi algorithm (grey area in Fig. 2 ) and some extra hardware to generate the "offset" corresponding to the correction factor of (2). As said in Section II, the offset is generated directly with a LUT that contains the precalculated result of . Then, the offset is added to the result of the ACS operation to generate the final value of . In the following, we will call this processor unit an ACSO unit.
B. Architecture for the Forward State Metric Recursion
The natural way to perform the forward state metric recursion is to place a register at the output of the ACSO unit, in order to keep the value of the state metric for the next iteration. This architecture is the same as the one used for the Viterbi algorithm, and all the literature on the speed-area tradeoffs for the ACS recursion can be reused for the ACSO computation. Nevertheless, there is another position for the register which reduces the critical path of the recursion loop. Fig. 3 shows two steps of a two-state trellis. Three different positions of the recursion loop register are shown. The first position is the classical one. It leads to an ACSO unit. The second position leads to a compare-select-offset-add (CSOA) unit, while the third position leads to an offset-add-compare-select (OACS) unit. The last one, the OACS unit shown in Fig. 4, has a smaller critical path compared with the ACSO unit. Briefly, in the case of an ACSO unit, the critical path is composed of the propagation of the carry ($T_{carry}$) in the first adder, the propagation of one full adder ($T_{FA}$) for the comparison (as soon as a result of the sum is available, it can be used for the comparison), the time of the LUT access ($T_{LUT}$) and the multiplexer ($T_{mux}$), and then, once more, the time of the propagation of the carry in the offset addition. For the OACS unit, the critical path is only composed of the propagation of the carry in the first adder (the addition of the offset), the propagation of one full adder for the addition of the branch metric, another propagation of one full adder for the comparison, and then, the maximum of the LUT access and the multiplexer. Thus, the critical path is decreased from

$$T_{ACSO} = 2T_{carry} + T_{FA} + T_{LUT} + T_{mux} \qquad (25)$$

to

$$T_{OACS} = T_{carry} + 2T_{FA} + \max(T_{LUT}, T_{mux}) \qquad (26)$$

The decrease of the critical path is paid for by an additional register needed to store the offset value between two iterations. The area-speed tradeoff is determined by the specification of the application.
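The two critical paths described above can be compared with a small calculation. The delay values below are hypothetical placeholders chosen only for illustration (real values depend on the technology and word length), and the function and parameter names are assumptions introduced here.

```python
def acso_critical_path(t_carry, t_fa, t_lut, t_mux):
    """Carry of the first adder, one full adder for the comparison,
    LUT access, multiplexer, then the carry of the offset addition."""
    return t_carry + t_fa + t_lut + t_mux + t_carry


def oacs_critical_path(t_carry, t_fa, t_lut, t_mux):
    """Carry of the offset addition, one full adder for the branch-metric
    addition, one full adder for the comparison, then the slower of the
    LUT access and the multiplexer."""
    return t_carry + t_fa + t_fa + max(t_lut, t_mux)


if __name__ == "__main__":
    # Hypothetical delays in arbitrary units.
    delays = dict(t_carry=4, t_fa=1, t_lut=2, t_mux=1)
    print(acso_critical_path(**delays), oacs_critical_path(**delays))
```

The gain comes from traversing the carry chain once instead of twice and from overlapping the LUT access with the multiplexer.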
As mentioned by one of the paper's reviewers, a Carry-Save-Adder (CSA) architecture can also be efficiently used in this case [23] .
The last step of the MAP algorithm is the computation of the LLR value of the decoded bit. Parallel architectures for the LLR computation can be derived directly from (8) . The first stage is composed of 2 adders. The second stage is composed of two 2 operand operators. Finally, the last operation is the subtraction. A classical tree architecture can be used for the hardware realization of the operand operators.
V. GENERAL ARCHITECTURE
Each element of the MAP architecture has now been described. The last part of our survey on VLSI architectures for the MAP algorithm is the overall organization of the computation. Briefly speaking, the generation of the LLR values requires both and values, which are generated in chronologically reverse order. The first implication is that, somehow, memory is needed to store a given type of vector (say, ), until the corresponding vector ( ) is generated. Each state metric vector is composed of 2 state metrics (the size of the trellis), each one bits wide. The total number of bits for each vector is large ( ) and thus, the reduction of the number of state metrics is an important issue for minimizing the implementation area.
The first part of this section describes the architecture of a high-speed VLSI circuit for the forward algorithm. Then, through different steps, we propose several organizations of computation that reduce the number of vectors that need to be stored by up to a factor of eight. Note that several authors have separately achieved similar results. This point will be discussed in the last section.
A. Classical Solutions [( ) and ( )] Architecture
The first real-time MAP VLSI architectures in the literature are described in [11] , [13] , and [24] . The architecture of [11] and [13] is based on three recursion units (RUs), two used for the backward recursion ( and ), and one forward unit ( ). Each RU contains operators working in parallel so that one recursion can be performed in one clock cycle. The two backward RUs play a role similar to the two trace-back units in the Viterbi decoder of [26] .
Let us use the same graphical representation as in [11] , [27] , and [28] to explain the organization of the computation. In Fig. 5 , the horizontal axis represents time, with units of a symbol period. The vertical axis represents the received symbol. Thus, the curve ( ) shows that, at time , the symbol { } becomes available. Let us describe how the symbols are decoded (segment I of Fig. 5 ). From to , performs recursions, starting from down to (segment II of Fig. 5 ). This process is initialized with the all-zero state vector , but after iterations, as noted in [11] , the convergence is reached and is then obtained. During those same cycles, generates the vectors (segment III of Fig. 5 ). The vectors are stored in the state vector memory (SVM) until they are needed for the LLR computation (grey area of Fig. 5 ). Then, between and , starts from state to compute down to (segment IV of Fig. 5 ). At each cycle, the vector corresponding to the computed is extracted from the memory in order to compute . Finally, between and , the data are In the case where the MAP unit is being used in a turbo decoder, the reordering can be done implicitly by the interleaver. Moreover, the a priori information to be subtracted [1] can be reversed in time in order to be directly subtracted after generation of the LLR value (segment IV of Fig. 5 ). Note that the role of the memories is to reverse the order of the state vectors. Reordering of the state metrics can be done with a single RAM and an up/down counter during clock cycles. The incoming data are stored at addresses . In the next cycles, the counter counts down and the state vectors are retrieved from and at the same time, the new incoming state vectors are stored in the same RAM block (from addresses down to 0). Only one read/write access is done at the same location every clock cycle. This avoids the need for multiport memories.
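The single-RAM reordering with an up/down counter can be sketched as a behavioral model. This is an illustrative sketch only; the class name `ReversalRam` and its interface are assumptions, and each call to `clock` models one cycle with one read and one write at the same address.

```python
class ReversalRam:
    """Single-port RAM plus up/down counter that reverses blocks of vectors."""

    def __init__(self, length):
        self.length = length
        self.ram = [None] * length
        self.addr = 0
        self.step = +1  # +1 while counting up, -1 while counting down

    def clock(self, incoming):
        """Read the stored vector, then overwrite it with the incoming one."""
        outgoing = self.ram[self.addr]
        self.ram[self.addr] = incoming
        nxt = self.addr + self.step
        if nxt < 0 or nxt >= self.length:
            self.step = -self.step   # end of block: reverse the counting direction
        else:
            self.addr = nxt
        return outgoing


if __name__ == "__main__":
    ram = ReversalRam(3)
    stream = ["a0", "a1", "a2", "b0", "b1", "b2", "c0", "c1", "c2"]
    print([ram.clock(x) for x in stream])
```

Each block comes out in reverse order one block period after it was written, and no location is accessed more than once per cycle, which is why a multiport memory is not required.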
This graphical representation gives some useful information about the architecture. For example, the values of: 1) the decoding latency: (horizontal distance between the array "acquisition" and "decoded bit"); 2) the number of vectors to be stored:
(maximum vertical size of the grey area); 3) the "computational cost" of the architecture, i.e., the total number of forward and backward iterations performed for each received data: (the number of arrows of RU cut by a vertical line). Note that to perform the recursions, branch metrics have to be available. This can easily be done using three RAMs of size that contain the branch metrics of the three last received blocks of size . Note that the RAM can simply store the received symbols. In that case, branch metrics are computed on the fly every time they are needed. Since the amount of RAM needed to store branch metric information is small compared with the amount of RAM needed to store the state metric, evaluation of branch metric computation will be omitted in the rest of the paper.
In what follows, this architecture is referred to as ( ), where and are, respectively, the number of RUs used for the forward and backward recursions. (for the memory of state metric ) indicates that the vectors are stored for the LLR bit computation. Note that in this architecture, the forward recursion is performed 2 cycles after the initial reception of data.
With the ( ) architecture, state vectors have to be stored. The length of convergence is relatively small (a few times the constraint length ) but the size of the state vector is very large. In fact, a state vector is composed of 2 state metrics, each state metric is bits wide, i.e., bits per state metric vector. The resulting memory is very narrow, and thus, not well suited for a realization with a single RAM block, but it can be easily implemented by connecting several small RAM blocks in parallel.
The architecture ( ) is reported in [10] . It is equivalent to the former one, except that the forward recursion is performed 4 cycles after the reception of the data, instead of 2 cycles (segment V of Fig. 5 instead of segment III) . In this scheme, the vectors generated by are stored until the computation of the corresponding vectors by (light grey of Fig. 5) . Then, the LLR values are computed in the natural order.
Other architectures have been developed. Each presents different tradeoffs between computational power, memory size, and memory bandwidth. Their graphical representations are given below.
B. ( ) Architecture
In this architecture, the forward recursion is performed 3 cycles after the reception of the data (see Fig. 6 ). Thus, vectors and vectors have to be stored. The total number of state vectors to be stored is still . Moreover, with this solution, bits have to be decoded in the last clocks cycles of an iteration, thus, two APP units have to be used. This scheme becomes valuable when two independent MAP decoders work in parallel. Since two MAP algorithms are performed in parallel, it is possible to share the memory words between the two MAP algorithms by an appropriate interleaving of the two operations, as shown in Fig. 6 . In this figure, the second iteration is represented with dotted lines and the corresponding vector memory with a striped region. This scheme can be used in a pipeline of decoders to simultaneously decode two information streams. With this interleaving, the amount of memory for state metrics corresponding to each MAP is divided by two. Thus, the final area will be smaller than the simple juxtaposition of two "classical" MAP algorithms. With this solution, two read and two write accesses are needed at each symbol cycle. Those accesses can be shared harmoniously with two RAMs of size with a read and write access at the same address for each of the two RAMs.
The MAP architecture can use more than two RUs for the backward recursion and/or more than one RU for the forward recursion. The following sections describe some interesting solutions.
C. ( ) Architecture
An additional backward unit leads to the schedule of Fig. 7 . A new backward recursion is started every cycles on a length of symbols. The first steps are used to achieve convergence, then the last steps generate vectors . The new latency is now 3 , and the amount of memory needed to store the vectors is only . Two observations are worth noting: 1) the reduction of the latency and the memory size is paid for by a new backward unit; 2) a solution of type ( ) can also be used.
D. ( ) Architecture
The addition of an extra forward unit can also decrease the SVM by a factor of two, as shown in Fig. 8 . This scheme has the same number of processing units ( ) and the same state metric memory size as the ( ) architecture, but its latency is 4 compared with 3 for the architecture of the previous section. However, the second recursion unit can be simplified, since it only copies, with a time shift of cycles, the computation of . Thus, there exists a tradeoff between computational complexity and memory. By storing, during cycles, each decision and offset value generated by , the complexity of is almost divided by two (see Fig. 9 ).
This method is very similar to the method used for the soft-output Viterbi algorithm [29].
Note that once more, an ( ) method can be used.
E. ( ) Architecture
This type of architecture is a generalization of the idea described above: instead of memorizing a large number of (or ) vectors, they are recomputed when they are needed. For this, the context (i.e., the state metrics) of an iteration process is saved in a pointer. This pointer is used later to recompute, with a delay, the series state metric. Such a process is given in Fig. 10 .
In this scheme, the state metrics of are saved every cycles (small circles in Fig. 10 ). Those four state metrics are used as a seed, or pointer, to start the third backward process ( , in Fig. 10 ) of length . The third backward recursion is synchronized with the forward recursion in order to minimize the size of the vector to be stored. In practice, only three seeds are needed, since and process the same data during the last quarter of a segment of cycles. With this method, the latency is still 4 , but the number of state metrics to store is now . With such a small number of vectors, the use of registers instead of RAM can be used to store the state metrics. This avoids the use of a RAM with an unusual aspect ratio and a consequent negative impact on performance. This scheme becomes particularly attractive if two independent MAP algorithms are implemented in a single chip, since an ( ) architecture can be used to share the vectors of the two MAP algorithms (see Fig. 11 ). As with the ( ) architecture, this scheme can be used in a pipeline of decoders to simultaneously decode two information streams.
This scheme is particularly efficient because it avoids the use of a RAM for storing the state metrics.
F. Generalization of the Architecture
Many combinations of the above architectures can be realized, each one with its own advantages and disadvantages. In the above examples, the ratio between the hardware clock and the symbol clock is one. Other architectures can be uncovered by loosening this restriction. For example, if this ratio is two (i.e., two clock cycles for each received symbol), the speed of the RU is doubled. Thus, an architecture such as ( ) can be used (see Fig. 12 ) to obtain an SVM of size .
G. Summary of the Different Configurations
In Table II , different configurations are evaluated in order to help the designer of a system. Note that is a generalization factor and that 0.5 (in columns and ) denotes the simplified ACSO unit of Fig. 9 . We can see that in the case of two MAP algorithms implemented together in the same circuit, it is possible to decrease the number of vectors from to . This reduction allows the realization of this memory using only registers.
Note that the final choice of a solution among the different proposed alternatives will be made by the designer. The designer's objective is to optimize area and/or power dissipation of the design while respecting application requirements (decoding latency, performance). The complexity of the MAP algorithm depends on the application (continuous stream or small blocks, simple or duo-binary encoder [30] , [31] , number of encoder states, etc.). The consequence is that the merit of the proposed solution can vary with the application and no general rules can be found. In practice, a fast and quite accurate complexity estimation can be obtained in terms of gate count and memory cells by simply using a field-programmable gate array synthesis tool to compile a VHDL or Verilog algorithm description.
H. Similar Works in This Area
Since the first submission of this paper, much work has been independently published on this topic. In this final subsection, we give a brief overview of these fundamental works.
The architecture of Sections V-B-D has also been proposed by Schurgers et al. In [32] and [33] , the authors give a very detailed analysis of the tradeoffs between complexity, power dissipation, and throughput. Moreover, they propose a very interesting architecture of double flow structures, where for example, two processes of type ( ) and ( ) are performed in parallel on a data block of size , the first one, in natural order, from data 0 , the second, in reverse order, from data down to . Moreover, Worm et al. [34] extend the architecture of Sections V-A and -B for a massively parallel architecture where several processes are done in parallel. With this massive parallelism, very high throughput (up to 4 Gbit/s) can be achieved.
The pointer idea described in Section V-E has been proposed independently by Dingninou et al. in the case of a turbo decoder in [35] and [36] . In this "sliding window next iteration initialization" method, the pointer generated by the backward recursion at iteration is used to initialize the backward recursion at iteration . As a result, no further backward convergence process is needed and area and memory are saved at the cost of a slight degradation of the decoder performance. Note that Dielissen et al. have improved this method by an efficient encoding of the pointer [37] .
Finally, an example of an architecture using a ratio of two between clock frequency and symbol frequency (see Section V-F) is partially used in [38] .
VI. CONCLUSION
We have presented a survey of techniques for VLSI implementation of the MAP algorithm. As a general conclusion, the well-known results from the Viterbi algorithm literature can be applied to the MAP algorithm. The computational kernel of the MAP algorithm is very similar to that of the ACS of the Viterbi algorithm with an added offset. The analysis shows that it is better to add the offset first and then do the ACS operation in order to reduce the critical path of the circuit (OACS). A general architecture for the MAP algorithm was developed which exposes some interesting tradeoffs for VLSI implementation. Most importantly, we have presented architectures which eliminate the need for RAMs with a narrow aspect ratio and possibly allow the RAM to be replaced with registers. An architecture which shares a memory bank between two MAP decoders enables efficient implementation of turbo decoders.
