Abstract: An intuitive shortcut to understanding the maximum a posteriori (MAP) decoder is presented based on an approximation. This is shown to correspond to a dual-maxima computation combined with forward and backward recursions of Viterbi algorithm computations. The logarithmic version of the MAP algorithm can similarly be reduced to the same form by applying the same approximation. Conversely, if a correction term is added to the approximation, the exact MAP algorithm is recovered. It is also shown how the MAP decoder memory can be drastically reduced at the cost of a modest increase in the required processing speed.
I. INTRODUCTION
The maximum a posteriori (MAP) decoding algorithm for convolutional codes was proposed over two decades ago$^{1}$ by Bahl et al. [1], but initially received very little attention because of its increased complexity over alternative convolutional decoders for a minimal advantage in bit-error rate performance. Recently, however, the MAP decoder has enjoyed renewed and greatly increased attention as an iterative soft-output decoder for the class of "turbo" codes discovered by Berrou et al. [3], as well as the class of serial concatenated codes with random interleaving, more recently proposed, analyzed, and simulated by Benedetto et al. [4]. Other soft-output decoding algorithms have also been proposed and successfully demonstrated, notably the SOVA algorithm of Hagenauer and Hoeher [5]. Only the MAP algorithm, however, achieves acceptable performance at signal-to-noise levels within 1 dB of the value which corresponds to Shannon capacity.
In the next section, we derive intuitively an approximation to the MAP algorithm. We then proceed to justify this heuristic approach by demonstrating its structural equivalence to the formally derived MAP algorithm, particularly after the same approximation is applied to the basic function required to implement the latter. The connection is further strengthened by showing that when the approximation is augmented by a correction term, it becomes the same as the logarithmic form of the MAP algorithm. In the last section, we consider implementation complexity, and show that the MAP decoder can be implemented with no more than four times the complexity of a Viterbi decoder for the same code.
Manuscript received July 19, 1996; revised August 5, 1997. This paper was presented at the IEEE Information Theory Workshop, Haifa, Israel, June 1996. The author is with QUALCOMM Inc., San Diego, CA 92121 USA. Publisher Item Identifier S 0733-8716(98)00165-6.
$^{1}$ A similar forward-backward recursion algorithm for MAP demodulation for channels with intersymbol interference had previously been published by Chang and Hancock [2].
II. AN INTUITIVE APPROXIMATE APPROACH TO THE MAP ALGORITHM
Let us first consider briefly the MAP$^{2}$ soft-output decoder for a block code. If $\mathbf{u} = (u_1, \ldots, u_N)$ is the information bit sequence, with $u_k \in \{+1, -1\}$, and $\mathbf{y}$ is the sequence of received channel output symbols, then the a posteriori probability ratio for the $k$th bit is just
$$\lambda_k = \frac{P(u_k = +1 \mid \mathbf{y})}{P(u_k = -1 \mid \mathbf{y})} = \frac{\sum_{\mathbf{u}:\, u_k = +1} P(\mathbf{y} \mid \mathbf{u})\, P(\mathbf{u})}{\sum_{\mathbf{u}:\, u_k = -1} P(\mathbf{y} \mid \mathbf{u})\, P(\mathbf{u})} \tag{1}$$
In preference to using $\lambda_k$ as the soft-decision output, its logarithm has the advantage that, for a memoryless channel, the overall metric can be formed as sums, rather than products, of independent components or metrics. Thus defining
$$\Lambda_k = \ln \lambda_k \tag{2}$$
for the soft-output metric, and
$$M(\mathbf{u}) = \ln P(\mathbf{y} \mid \mathbf{u}) + \ln P(\mathbf{u}) \tag{3}$$
it follows from (1)-(3) that
$$\Lambda_k = \ln \sum_{\mathbf{u}:\, u_k = +1} e^{M(\mathbf{u})} - \ln \sum_{\mathbf{u}:\, u_k = -1} e^{M(\mathbf{u})} \tag{4}$$
Exponentiation enhances the differences between individual metrics $M(\mathbf{u})$. Hence, typically, one term will dominate each sum, which suggests the approximation
$$\ln \sum_i e^{x_i} \approx \max_i x_i \tag{5}$$
which is obviously also a lower bound. When applied to (4), this yields
$$\Lambda_k \approx \max_{\mathbf{u}:\, u_k = +1} M(\mathbf{u}) - \max_{\mathbf{u}:\, u_k = -1} M(\mathbf{u}) \tag{6}$$
a metric which has been termed the "dual-maxima" rule for block codes [6], [7].
We now proceed to apply the same approach to a block code generated as a convolutional code which is truncated by forcing the encoder to a known (e.g., $0$) state as shown in Fig. 1. While (4) and (6) still apply, they require an inordinately complex computation involving $2^N$ terms, where $N$ is the length of the information bit sequence. We proceed to simplify (6) by recognizing that the maximum metric for a given state of the convolutional code's trellis can be obtained from the conventional Viterbi algorithm (VA). Suppose then that we seek $\Lambda_k$ for the $k$th bit, when the channel is memoryless. For simplicity, let the code trellis be binary, so that one branch corresponds to a single bit (this can be easily generalized to treat multiple bits per branch). Then we may argue intuitively as follows, referring to Fig. 1. Let us generate all of the state metrics at the $(k-1)$th node by applying the VA from the initial node to this point. For each state, the metric corresponds to the maximum metric over all paths up to that node.
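As a numerical illustration of how (6) approximates (4), the following minimal Python sketch enumerates every information sequence of a toy block code. The single-parity-check code, the BPSK-over-AWGN metric scaled by a noise variance `sigma2`, the equiprobable bits, and all function names are assumptions chosen only for illustration; none of them comes from the text.

```python
import itertools
import math


def codeword(u):
    """Toy single-parity-check code on +/-1 bits: the parity symbol is the product."""
    return (*u, math.prod(u))


def metric(u, y, sigma2):
    """M(u) of (3), up to a constant common to all u (AWGN, equiprobable bits)."""
    return sum(c * r for c, r in zip(codeword(u), y)) / sigma2


def log_sum_exp(xs):
    """ln(sum(exp(x))), computed stably by factoring out the largest term."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_output(y, n_bits, k, sigma2=0.5):
    """Return (exact, dual_maxima) soft outputs for the bit with 0-based index k."""
    plus, minus = [], []
    for u in itertools.product((+1, -1), repeat=n_bits):
        (plus if u[k] == +1 else minus).append(metric(u, y, sigma2))
    exact = log_sum_exp(plus) - log_sum_exp(minus)   # eq. (4)
    approx = max(plus) - max(minus)                  # eq. (6)
    return exact, approx


exact, approx = soft_output(y=(0.9, -0.2, 0.4), n_bits=2, k=0)
print(f"exact = {exact:.3f}, dual-maxima = {approx:.3f}")
```

The enumeration visits all sequences, which is exactly the exponential cost that the trellis-based recursions are meant to avoid; the sketch is only useful for checking small cases.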
Let us denote the state metrics at the $(k-1)$th node $\alpha_{k-1}(s')$ and those at the $k$th node $\alpha_k(s)$, where $s'$ and $s$ are the generic states at the $(k-1)$th and $k$th nodes, respectively. These are generated by the Viterbi algorithm, whose definition is the recursion relation
$$\alpha_k(s) = \max_{s'} \left[ \alpha_{k-1}(s') + \gamma_k(s', s) \right] \tag{7}$$
where $\gamma_k(s', s)$ is the branch metric for the branch connecting state $s'$ at node $k-1$ to state $s$ at node $k$. The state metrics for the portion of the trellis beyond the $k$th node can similarly be computed recursively by a backward VA starting at the last node (Fig. 1). Thus denoting the generic states at nodes $k-1$ and $k$ by $s'$ and $s$, respectively, and the corresponding state metrics by $\beta_{k-1}(s')$ and $\beta_k(s)$, the recursion for the backward VA can be similarly stated as
$$\beta_{k-1}(s') = \max_{s} \left[ \beta_k(s) + \gamma_k(s', s) \right] \tag{8}$$
where $\gamma_k(s', s)$ is again the metric from the branch connecting $s'$ to $s$. This branch metric required by both recursions is just the log-likelihood function
$$\gamma_k(s', s) = \ln P(y_k, s \mid s') = \ln \left[ P(y_k \mid s', s)\, P(s \mid s') \right] \tag{9}$$
Thus, for two states $s'$ and $s$ for which a branch transition does not exist, the metric is negative infinity, as seen from (9). However, for all pairs $(s', s)$ for which a branch transition exists, we may combine the forward state metrics at node $k-1$, the backward state metrics at node $k$, and the metrics for branches connecting the two sets to obtain for the approximate metric of (6) the expression$^{3}$
$$\Lambda_k \approx \max_{(s', s):\, u_k = +1} \left[ \alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s) \right] - \max_{(s', s):\, u_k = -1} \left[ \alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s) \right] \tag{10}$$
Note that the first maximum is over all branch pairs, $s'$ at node $k-1$ and $s$ at node $k$, for which the connecting branch is shown as a solid line in Fig. 1, while the second maximum is over all pairs for which the connecting branch is shown as a dotted line.
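The forward VA, the backward VA, and the dual-maxima combination of (7), (8), and (10) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the dictionary layout `(s_prev, s_next, u) -> metric`, the names `alpha`, `beta`, and `max_log_map`, and the toy 2-state trellis with made-up branch metrics are all hypothetical. Missing branches are simply absent from the dictionaries, which plays the role of a branch metric of negative infinity.

```python
import math

NEG_INF = -math.inf


def max_log_map(gammas, n_states, start=0, end=0):
    """Approximate soft outputs for a terminated trellis.

    gammas[k-1] maps (s_prev, s_next, u) to the branch metric of the k-th
    branch, with u in {+1, -1}.
    """
    n = len(gammas)

    # Forward VA, eq. (7): best path metric into each state at each node.
    alpha = [[NEG_INF] * n_states for _ in range(n + 1)]
    alpha[0][start] = 0.0
    for k, stage in enumerate(gammas, start=1):
        for (sp, s, _u), g in stage.items():
            alpha[k][s] = max(alpha[k][s], alpha[k - 1][sp] + g)

    # Backward VA, eq. (8): best path metric from each state to the final node.
    beta = [[NEG_INF] * n_states for _ in range(n + 1)]
    beta[n][end] = 0.0
    for k in range(n, 0, -1):
        for (sp, s, _u), g in gammas[k - 1].items():
            beta[k - 1][sp] = max(beta[k - 1][sp], beta[k][s] + g)

    # Dual maxima, eq. (10): best path with u_k = +1 minus best with u_k = -1.
    soft = []
    for k, stage in enumerate(gammas, start=1):
        best = {+1: NEG_INF, -1: NEG_INF}
        for (sp, s, u), g in stage.items():
            best[u] = max(best[u], alpha[k - 1][sp] + g + beta[k][s])
        soft.append(best[+1] - best[-1])
    return soft


# Hypothetical 2-state trellis: two information branches, then one tail branch
# (u = +1 only) that forces the encoder back to state 0.
demo_gammas = [
    {(0, 0, +1): 0.5, (0, 1, -1): -0.5},
    {(0, 0, +1): 1.0, (0, 1, -1): -1.0, (1, 0, +1): -0.3, (1, 1, -1): 0.3},
    {(0, 0, +1): 0.2, (1, 0, +1): 0.7},
]
print(max_log_map(demo_gammas, n_states=2))
```

For the forced tail branch no competing path with $u_k = -1$ exists, so its soft output is infinite; only the information branches carry meaningful values.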
We next show that this result can be reached by applying the approximation (5) to the exact formulation of the MAP decoder; conversely, we shall also show that the exact result can be obtained by applying a correction term to the maximum functions of (7), (8) , and (10).
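III. RELATIONSHIPS OF THE APPROXIMATE TO THE EXACT FORMULATION
Hagenauer et al. [9] obtained an elegant derivation of the MAP decoder by partitioning the joint probability
$$P(s', s, \mathbf{y}) = P(s', \mathbf{y}_{j<k}) \cdot P(s, y_k \mid s') \cdot P(\mathbf{y}_{j>k} \mid s)$$
where $\mathbf{y}_{j<k}$ and $\mathbf{y}_{j>k}$ are the sequences of received symbols before and after the $k$th branch. Replacing the summations in the numerator and denominator of (1) by the summations over all state pairs $(s', s)$ for which $u_k$ is $+1$ and $-1$, respectively, one obtains for the logarithm of (1)
$$\Lambda_k = \ln \sum_{(s', s):\, u_k = +1} \exp\left[ \alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s) \right] - \ln \sum_{(s', s):\, u_k = -1} \exp\left[ \alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s) \right] \tag{10'}$$
with the recursions for $\alpha_k(s) = \ln P(s, \mathbf{y}_{j \le k})$ and $\beta_k(s) = \ln P(\mathbf{y}_{j>k} \mid s)$
$$\alpha_k(s) = \ln \sum_{s'} \exp\left[ \alpha_{k-1}(s') + \gamma_k(s', s) \right] \tag{7'}$$
$$\beta_{k-1}(s') = \ln \sum_{s} \exp\left[ \beta_k(s) + \gamma_k(s', s) \right] \tag{8'}$$
and where $\gamma_k(s', s)$ is again the branch metric given by (9). Clearly, the primed (7'), (8'), and (10') become the same as their unprimed counterparts, developed intuitively, if we use (5) to approximate the logarithm of the sum-of-exponentials by the maximum.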
More interesting, however, is to apply the reverse process to the approximate development of the previous section. Following several authors [10]-[13], we define the function
$$\max{}^*(x, y) = \ln\left( e^x + e^y \right) \tag{11}$$
It follows from the definition that
$$\max{}^*(x, y) = \max(x, y) + \ln\left( 1 + e^{-|x - y|} \right)$$
It also follows that just as $\max(x, y, z) = \max[\max(x, y), z]$, so also
$$\max{}^*(x, y, z) = \ln\left( e^x + e^y + e^z \right) = \max{}^*\left[ \max{}^*(x, y), z \right] \tag{12}$$
Thus, replacing $\max$ by $\max^*$ in the approximate expressions (7), (8), and (10), we obtain the exact expressions (7'), (8'), and (10'). This justifies our labeling in Fig. 1 the forward and backward recursions as "generalized Viterbi algorithms" and the computation of $\Lambda_k$ as a "generalized dual-maxima" computation.
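The two properties used here, that the maximum plus a correction term equals the logarithm of a sum of two exponentials, and that the operation can be applied pairwise to any number of arguments, are easy to check numerically. The Python sketch below is an illustration only; the function names `max_star` and `max_star_all` are assumptions, and the early returns for negative-infinity inputs (nonexistent branches) are a convenience added here rather than something stated in the text.

```python
import math
from functools import reduce


def max_star(x, y):
    """ln(e^x + e^y), written as max(x, y) plus a correction term."""
    if x == -math.inf:      # a nonexistent branch contributes nothing
        return y
    if y == -math.inf:
        return x
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def max_star_all(values):
    """Fold max_star pairwise over a sequence, as in (12)."""
    return reduce(max_star, values)


values = [0.3, -1.2, 2.5]
direct = math.log(sum(math.exp(v) for v in values))
print(max_star_all(values), direct)
```

Substituting such a function for the built-in `max` in a forward or backward VA recursion is the change that turns the approximate recursions into the exact ones.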
We note finally, as has been observed in [11] and [12], that the implementation of $\max^*$ is only slightly more complex than the implementation of $\max$. The latter requires a subtractor to form $(x - y)$ followed by a comparator with zero, while $\max^*$ requires additionally only a read-only memory (ROM) which outputs the correction term $\ln(1 + e^{-|x - y|})$ given the input $x - y$, which is the subtractor output. This also shows that, just as for a Viterbi decoder, a common term can be subtracted from all metrics at a given node to avoid overflows, with no consequence to performance.
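A software analogue of the subtractor, comparator, and ROM arrangement might look like the following Python sketch. The table depth of 8 entries and the quantization step of 0.5 are hypothetical values picked for illustration, not figures from the text, and the function assumes finite metrics.

```python
import math

STEP = 0.5        # hypothetical quantization step for |x - y|
TABLE_SIZE = 8    # hypothetical ROM depth; differences of 4.0 or more get no correction
ROM = [math.log1p(math.exp(-i * STEP)) for i in range(TABLE_SIZE)]


def max_star_rom(x, y):
    """Table-based version of max plus correction term (finite inputs assumed)."""
    diff = x - y                        # subtractor output
    larger = x if diff >= 0 else y      # comparator with zero selects the maximum
    index = int(abs(diff) / STEP)       # ROM address derived from |x - y|
    correction = ROM[index] if index < TABLE_SIZE else 0.0
    return larger + correction


def max_star_exact(x, y):
    """Reference value ln(e^x + e^y) for comparison."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


for x, y in [(1.5, 0.25), (0.0, 0.0), (-2.0, 3.0)]:
    print(x, y, max_star_rom(x, y), max_star_exact(x, y))
```

Because the correction depends only on the difference of the two inputs, subtracting the same constant from both inputs shifts the output by that constant and leaves the table address unchanged, which is why metric normalization costs nothing in accuracy.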
IV. IMPLEMENTATION FOR MEMORY REDUCTION
Now that the implementation has been reduced to a series of common decoder operations, the obvious remaining drawback of the MAP algorithm is the excessive memory required. As described above, the entire state metric history must be stored, out to the end of the trellis, at which point the backward algorithm begins and decisions can be output starting with the last branch, without the need to store any but the last set of state metrics computed backward. This storage requirement is obviously excessive; for a 16-state code, assuming 6-bit state metrics, it would require 96 bits of storage per branch, for a total of 96 000 bits for a 1000-bit block, judged to be minimal for turbo code performance.
We now describe a technique$^{4}$ which reduces the memory requirement for a 16-state code to just a few thousand bits, independent of the block length. It can best be described by referring to the timing diagram of Fig. 2, which indicates the bit processing times for one forward processor and two backward processors operating in synchronism with the received branch symbols, i.e., computing one set of state metrics during each received branch time (bit time for a binary trellis).
The basis for this approach is the fact that the VA can start cold in any state at any time; initially, the state metrics generated are nearly worthless, but after a few constraint lengths, the set of state metrics is as reliable as if the process had been started at the initial (or final) node. Let this "learning" period consist of $L$ branches. (For a 16-state code, $L = 32$ is more than sufficient, amounting to over six constraint lengths of the convolutional code.) This applies equally to the backward as well as the forward algorithm, and assumes that all state metrics are normalized by subtracting at every node an equal amount from each.
Let the received branch symbols be delayed by $2L$ branch times. Then the forward algorithm processor starts at the initial node at branch time $2L$, computing all state metrics for each node every branch time and storing these in memory. The first backward processor starts at the same time, but processes backward from the $2L$th node, setting every initial state metric to the same value, not storing anything until branch time $3L$, at which point it has built up reliable state metrics and it encounters the last of the first set of $L$ forward computed metrics. (In Fig. 2, the top line indicates the node indexes; the remaining lines are labeled according to the times at which the branches are processed. Also, unreliable metric branch computations are shown as dashed lines.) At this point, the generalized dual-maxima process is performed according to (10'), the $L$th branch soft decisions are output, and the backward processor proceeds until it reaches the initial node at time $4L$. Meanwhile, starting at time $3L$, the second backward processor begins processing with equal metrics at node $3L$, discarding all metrics until time $4L$, when it encounters the forward algorithm having computed the state metrics for the $2L$th node. The generalized dual-maxima process is then turned on until time $5L$, at which point all soft decision outputs from the $2L$th to the $L$th node will have been output. The two backward processors hop forward $4L$ branches every time they have generated $2L$ backward sets of state metrics, and they time share the output processor since one generates useless metrics while the other generates the useful metrics which are combined with those of the forward algorithm.
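The interleaving of the forward processor and the two backward processors can be tabulated with a short Python sketch. It assumes one particular reading of the schedule: a learning period of `L` branches, received symbols delayed by `2 * L` branch times, and backward passes that each train over `L` branches and then output over the next `L`. The function and key names are hypothetical, and the bookkeeping is at node level, ignoring off-by-one details at window boundaries.

```python
def forward_time(node, L):
    """Branch time at which the forward processor has computed metrics for `node`."""
    return 2 * L + node


def backward_windows(L, n_windows):
    """List the training and output phases of successive backward passes."""
    windows = []
    for j in range(n_windows):
        windows.append({
            "processor": j % 2,            # the two backward processors alternate
            "start_time": (j + 2) * L,     # cold start with equal state metrics
            "start_node": (j + 2) * L,
            "output_time": (j + 3) * L,    # metrics now reliable, outputs begin
            "output_node": (j + 1) * L,
            "end_time": (j + 4) * L,       # last soft decision of this pass
            "end_node": j * L,
        })
    return windows


L = 32
for w in backward_windows(L, 4):
    print(w)
```

Under these assumptions each backward pass meets the forward processor exactly when its learning period ends, each processor restarts `4 * L` nodes ahead of where it finished, and the oldest forward metric set waits `2 * L` branch times before it is consumed, which sets the depth of a first-in, first-out metric store.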
Note that nothing needs to be stored for the backward algorithms except for the metric set of the last node, and these only when reliable metrics are being generated. The forward algorithm only needs to store $2L$ sets of state metrics$^{5}$ since, after its first $2L$ computations (performed by time $4L$), its first set of metrics will be discarded, and the emptied storage can then be filled starting with the forward-computed metrics for the $(2L+1)$th node (at branch time $4L+1$). Thus, the storage requirement for a 16-state code using 6-bit state metrics is just $192L$ bits in all, which for $L = 32$ amounts to approximately 6 000 bits. (Note that a conventional $K = 7$ Viterbi decoder with 64 states and a 32-bit path memory requires about 2 000 bits of memory, while a $K = 9$ decoder requires at least a 40-bit path memory resulting in over 10 000 bits of storage.) We conclude that these storage requirements are no greater than those of a conventional VA for commonly used codes.
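The storage figures can be reproduced with simple arithmetic, sketched below in Python. The sketch assumes a learning period of `L = 32` branches, a forward store of `2 * L` metric sets, and Viterbi decoders of constraint length 7 and 9 (64 and 256 states) for the comparison; the function names are hypothetical.

```python
def state_metric_bits(n_states, metric_bits, n_sets):
    """Bits needed to hold n_sets complete sets of state metrics."""
    return n_states * metric_bits * n_sets


def path_memory_bits(constraint_length, path_length):
    """Path-memory bits of a binary Viterbi decoder: one bit per state per stage."""
    return 2 ** (constraint_length - 1) * path_length


L = 32  # assumed learning period, in branches

full_history = state_metric_bits(16, 6, 1000)    # whole 1000-bit block stored
windowed = state_metric_bits(16, 6, 2 * L)       # only 2L metric sets stored
viterbi_k7 = path_memory_bits(7, 32)
viterbi_k9 = path_memory_bits(9, 40)

print(full_history, windowed, viterbi_k7, viterbi_k9)
```

With these inputs the windowed store comes to 6144 bits, between the 2048 and 10 240 bits of path memory computed for the two Viterbi decoders.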
As for processing requirements, it would appear that the VA load is thus tripled; furthermore, the complexity of the generalized dual-maxima process is no greater than that of the forward or backward VA processor so that, overall, the complexity is not more than quadrupled. Also, the chain-back procedure is avoided. Further, since the constraint length is shorter, the number of states is much reduced relative to the $K = 7$ and $K = 9$ examples just given. Since the MAP decoder (with short constraint length) is only justified for iterative decoding of turbo or serially concatenated codes, we must also account for the required number of iterations, which is on the order of 4-8. Thus, a pair of 16-state concatenated decoders performing four iterations imposes double the processing load of a $K = 9$ Viterbi decoder; a pair of four-state concatenated decoders performing eight iterations imposes the same load as a $K = 9$ decoder.
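The load comparison reduces to counting state updates per decoded bit, as in the Python sketch below. It assumes that load scales with the number of trellis states, that a MAP decoder costs at most four times a Viterbi decoder for the same code, and that the reference Viterbi decoder has constraint length 9 (256 states); the function names are hypothetical.

```python
def viterbi_load(constraint_length):
    """State updates per decoded bit for a binary-trellis Viterbi decoder."""
    return 2 ** (constraint_length - 1)


def iterative_map_load(n_states, n_decoders, n_iterations, map_factor=4):
    """Upper bound on state updates per bit for iterated MAP component decoders."""
    return map_factor * n_states * n_decoders * n_iterations


reference = viterbi_load(9)
print(iterative_map_load(16, 2, 4) / reference)   # pair of 16-state decoders, 4 iterations
print(iterative_map_load(4, 2, 8) / reference)    # pair of 4-state decoders, 8 iterations
```

The ratios come out to 2 and 1, matching the "double" and "same load" statements under these assumptions.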
Minimum decoding delay is set by the length of the block or its corresponding interleaver. If the processors described above operate at just the speed of the received branches, it is necessary to pipeline the successive iterations, and hence multiply the minimum delay by the number of iterations. If, on the other hand, the processors can operate at a much higher speed, then additional delay can be much reduced.
V. CONCLUDING REMARKS
One purpose of this paper is to clarify and simplify the topic of MAP decoders of convolutional codes, which is often clouded by unintuitive presentations, and hence appears more complex than it actually is. By its inherent equivalence to a combination of forward and backward VA processors, coupled by a dual-maxima computation, the appearance of complexity is dispelled and quantitatively bounded. Another purpose is to assess implementation complexity. By applying memory management techniques similar to those used for ordinary convolutional decoding, we have bounded the processing load at no more than four times that of a conventional decoder for the same code, with moderate memory requirements. For turbo (parallel) and serially concatenated codes, employing iterative soft-output decoders, the component code constraint lengths are much shorter, which affords the possibility of performing several decoding iterations without exceeding the processing time of a single conventional decoder for the longer constraint lengths in common practice. All of this guarantees the feasibility of such decoders operating at multimegabit-per-second data rates.
