Abstract-At present, the Viterbi algorithm (VA) is widely used in communication systems for decoding and equalization. The achievable speed of conventional Viterbi decoders (VD's) is limited by the inherent nonlinear add-compare-select (ACS) recursion. The aim of this paper is to describe system design and VLSI implementation of a complex system of fabricated ASIC's for high speed Viterbi decoding using the "minimized method" (MM) parallelized VA. We particularly emphasize the interaction between system design, architecture and VLSI implementation as well as system partitioning issues and the resulting requirements for the system design flow. Our design objectives were 1) to achieve the same decoding performance as a conventional VD using the parallelized algorithm, 2) to achieve a speed of more than 1 Gb/s, and 3) to realize a system for this task using a single cascadable ASIC. With a minimum system configuration of four identical ASIC's produced using 1.0 µm CMOS technology, the design objective of a decoding speed of 1.2 Gb/s is achieved. This means that, compared to previous implementations of Viterbi decoders, the speed is increased by an order of magnitude.
decoding system for satellite communication systems. For communication systems, the achieved system performance, which is often measured as the bit error rate, is of crucial importance. In comparison to an ideal system with optimum algorithmic parameters and without quantization effects, the performance loss (the so-called implementation loss) of an actual implementation has to be as small as possible. Careful optimization of algorithmic parameters and quantization is required to arrive at solutions fulfilling this requirement. Therefore, throughout this paper, we also focus on the system design requirements and the system design flow used for the implementation. The implementation loss resulting from solutions which are equivalent in performance under ideal conditions is determined by the sensitivity of the achieved system performance to the implementation constraints. The MM, whose implementation we are dealing with in this paper, was previously only shown to be asymptotically equivalent to the VA [15]. In fact, the MM algorithm can only be derived from the VA assuming that a particular implementation parameter, the block size of the block processing MM architecture, is infinitely large. Of course, this ideal condition cannot be maintained for an implementation of the MM. The first goal of this paper is to show how the MM can be implemented with an implementation loss equal to that of a standard Viterbi decoder. Hence, it is proven that the MM can be considered to be equivalent to the VA for a practical implementation. Second, the feasibility of realizing Gb/s convolutional decoding systems is proven, achieving more than one order of magnitude higher speed than previous decoder implementations.
To summarize, the design goals for the realized system were as follows.
The Same Decoding Performance as a Conventional VD:
By analyzing the implementation parameters of the MM, it is shown that the same decoding performance as a conventional VD can be achieved. Some changes to the original MM formulation are introduced in order to achieve this performance with minimum implementation effort.
A Solution for the Gb/s Range: The MM algorithm can be realized by a parallel feedforward block processing architecture. In principle, using the MM, it is possible to design arbitrarily fast VD's. Our design goal was to develop a solution for the Gb/s range, i.e., to obtain a speed improvement of one order of magnitude over previous solutions.
A Single Cascadable Design: The necessary design procedures for achieving a single cascadable design realizing the MM are described. In terms of production cost, the possibility of extending the system with identical modules is of course very attractive. Additionally, even higher decoding speeds can be achieved by simply adding identical modules to the implementation.
II. CHANNEL DECODING USING THE VITERBI ALGORITHM
The VA [16], [17] is a well-known specific application of dynamic programming. Introduced in 1967, the VA is widely used today for various applications in digital communications, of which the decoding of convolutional codes is probably the most important. The trellis of convolutional codes is constructed by drawing the possible state transitions of the encoder (a feedforward shift register with parallel output) over time. Given a known start state of the shift register, any sequence of input bits corresponds to a unique path through the trellis.
For a rate 1/2 encoder, two output bits are created for every input bit according to the generator polynomials $G_1$ and $G_2$ (see Fig. 1). They describe modulo 2 additions of parts of the shift register contents and the input bit. Hence, a particular combination of coder output bits, which is mapped onto transmitted channel symbols according to a given symbol alphabet, belongs to each possible state transition. After transmission, the receiver observes the noisy channel symbols $e_k$. The VA is the dynamic programming formulation of the search for the best path through the trellis given the received channel symbols. From this path the corresponding source bits can simply be decoded. A trellis is described by the finite number of $N$ states $z_i$ ($i \in \{0, \ldots, N-1\}$) of the encoder shift register at every discrete time instant $k$ and by branches with associated channel symbols representing the state transitions of the time intervals $(k, k+1)$ that connect the states.
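The encoder described above can be sketched in a few lines. This is only an illustration: the generator polynomials (7 and 5 in octal) and the function name `conv_encode` are assumptions, since the text does not list $G_1$ and $G_2$ explicitly.

```python
def conv_encode(bits, g1=0b111, g2=0b101, K=3):
    """Rate-1/2 feedforward convolutional encoder.

    g1 and g2 are ASSUMED generator polynomials (7, 5 octal);
    the state holds the K-1 previous input bits, newest bit in the MSB.
    """
    state = 0
    out = []
    for b in bits:
        reg = (b << (K - 1)) | state          # input bit followed by register contents
        c1 = bin(reg & g1).count("1") & 1     # modulo-2 sum of the taps of g1
        c2 = bin(reg & g2).count("1") & 1     # modulo-2 sum of the taps of g2
        out.append((c1, c2))
        state = reg >> 1                      # shift: the oldest bit drops out
    return out
```

Each input bit yields one pair of coded bits, and the sequence of `state` values traces one path through the four-state trellis.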
Using the observed channel symbols $e_k$, a weight called the transition metric $\lambda_{ij,k}$ is derived for every possible state transition from state $z_j$ to state $z_i$ as a measure of probability for the corresponding state transition. Here, a large transition metric indicates a high probability. These transition metrics are accumulated as path metrics for the paths given by successive transitions. The path metric $\gamma_{i,k}$ associated with each state $z_i$ at time instant $k$ is updated according to the ACS recursion
$$\forall i:\quad \gamma_{i,k+1} = \max_{j}\left(\lambda_{ij,k} + \gamma_{j,k}\right). \tag{1}$$
This means that all paths merging in the same state are compared and the path metric for this state is updated with the largest metric corresponding to the most probable path. The decision for the best path is stored for each state and each time instant $k$. For a trellis with $N = 4$ states (Fig. 1) the ACS recursion is given by
$$\begin{aligned}
\gamma_{1,k+1} &= \max\left(\lambda_{11,k} + \gamma_{1,k},\ \lambda_{13,k} + \gamma_{3,k}\right)\\
\gamma_{2,k+1} &= \max\left(\lambda_{21,k} + \gamma_{1,k},\ \lambda_{23,k} + \gamma_{3,k}\right)\\
\gamma_{3,k+1} &= \max\left(\lambda_{32,k} + \gamma_{2,k},\ \lambda_{34,k} + \gamma_{4,k}\right)\\
\gamma_{4,k+1} &= \max\left(\lambda_{42,k} + \gamma_{2,k},\ \lambda_{44,k} + \gamma_{4,k}\right).
\end{aligned} \tag{2}$$
Due to maximum selection this is a system of nonlinear recurrence equations. Since the ACS operation is the only recursive part of the algorithm, the achievable data (and clock) rate of a VLSI implementation is determined by the computation time of the ACS recursion, which is a feedback bottleneck. The nonlinear data dependent nature of the recursion excludes the application of known parallelization strategies [4] . The overall best path of the N best paths associated with the N states (hence, the maximum likelihood path) merges with the correct path if traced back-using the stored ACS decisions-over time, leading to uniquely decoded source bits. The number of transitions after which this merging occurs with sufficient probability is called survivor depth D. Hence, the state at time instant k -D can be determined at time instant k with latency D.
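A single step of the four-state recursion (2) can be written out as follows. The dictionary layout and the name `acs_step` are illustrative assumptions; only the predecessor structure is taken from (2).

```python
# Predecessors of each state in the four-state recursion (2), states numbered 1..4.
PREDECESSORS = {1: (1, 3), 2: (1, 3), 3: (2, 4), 4: (2, 4)}

def acs_step(gamma, lam):
    """One add-compare-select update.

    gamma: dict state -> path metric at time k
    lam:   dict (i, j) -> transition metric from state j to state i
    Returns the new path metrics and, per state, the selected predecessor.
    """
    new_gamma, decisions = {}, {}
    for i, preds in PREDECESSORS.items():
        # add: extend both merging paths; compare/select: keep the larger one
        candidates = [(lam[(i, j)] + gamma[j], j) for j in preds]
        new_gamma[i], decisions[i] = max(candidates)
    return new_gamma, decisions
```

Because `new_gamma` at time k+1 depends on `gamma` at time k through a maximum, the loop over time cannot be unrolled by ordinary look-ahead techniques, which is the feedback bottleneck discussed above.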
An implementation of the VA, a Viterbi decoder, is usually divided into three functional units. In the transition metric unit (TMU), the branch metrics are computed from the noisy channel symbols received for every transition. These branch metrics are used in the add-compare-select unit (ACSU), where they are accumulated according to the ACS recursion. In the survivor memory unit (SMU), the best path is traced back and the decoded bits are determined.
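The division into TMU, ACSU and SMU can be illustrated with a small software decoder. This is a sketch under stated assumptions, not the hardware architecture: the generators (7, 5 octal), the correlation metric, and the mapping of coded bit 0 to +1 and coded bit 1 to -1 are all choices made here for illustration.

```python
def build_trellis(g1=0b111, g2=0b101, K=3):
    """Return {(state, bit): (next_state, (c1, c2))}; the generators are assumed."""
    trellis = {}
    for s in range(1 << (K - 1)):
        for b in (0, 1):
            reg = (b << (K - 1)) | s
            c = (bin(reg & g1).count("1") & 1, bin(reg & g2).count("1") & 1)
            trellis[(s, b)] = (reg >> 1, c)
    return trellis

def viterbi_decode(received, trellis, n_states=4):
    """received: list of soft pairs (r1, r2); the encoder is assumed to start in state 0."""
    neg_inf = float("-inf")
    gamma = [0.0] + [neg_inf] * (n_states - 1)
    history = []
    for r1, r2 in received:
        new_gamma = [neg_inf] * n_states
        decision = [None] * n_states
        for (s, b), (nxt, (c1, c2)) in trellis.items():
            # TMU: correlation metric, a larger value means a more probable branch
            lam = r1 * (1 - 2 * c1) + r2 * (1 - 2 * c2)
            # ACSU: add, compare, select
            if gamma[s] + lam > new_gamma[nxt]:
                new_gamma[nxt] = gamma[s] + lam
                decision[nxt] = (s, b)
        gamma = new_gamma
        history.append(decision)
    # SMU: trace back from the best final state using the stored decisions
    state = max(range(n_states), key=lambda s: gamma[s])
    bits = []
    for decision in reversed(history):
        state, b = decision[state]
        bits.append(b)
    return bits[::-1]
```

The loop over `received` is the sequential recursion whose speed limits a conventional decoder; only the metric computation inside it is free of feedback.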
III. LINEAR ALGEBRAIC FORMULATION
It was shown by Fettweis [18] that a linear algebraic formulation of the ACS recursion can be derived which, together with the use of asymptotic algorithmic properties, allows the purely feedforward MM formulation of the VA to be derived [18].
In this paper, we introduce an algorithmic modification of the MM which leads to considerable complexity reductions. As a necessary prerequisite for the derivation, we briefly review the algebraic formulation of the ACS recursion [18]. In this formulation, the algebraic multiplication $\otimes$ denotes addition and the algebraic addition $\oplus$ denotes maximum selection. The resulting algebraic structure of a semiring defined over the operations $\oplus$ and $\otimes$ contains the following neutral elements: $-\infty$ for $\oplus$ and $0$ for $\otimes$. Since the rules known from linear algebra are applicable to this formulation, it is possible to arrive at an M-step ACS recursion
$$\Gamma_{k+M} = {}^{M}\Lambda_k \otimes \Gamma_k$$
with an M-step transition matrix ${}^{M}\Lambda_k$, describing the $N \times N$ optimum transitions from every state at time instant $k$ to every state at time instant $k + M$. Note that this approach is just another formulation of the original ACS recursion, i.e., the results are exactly equivalent.
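The equivalence between the step-by-step ACS recursion and the M-step matrix formulation can be checked numerically. The sketch below uses plain Python lists; the helper names `mp_matmul` and `mp_matvec` are hypothetical and the small matrices are arbitrary example values.

```python
NEG_INF = float("-inf")  # neutral element of maximum selection (no branch)

def mp_matmul(A, B):
    """Max-plus matrix product: result[i][j] = max over l of (A[i][l] + B[l][j])."""
    return [[max(A[i][l] + B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def mp_matvec(A, v):
    """Max-plus matrix-vector product, i.e., one ACS step for all states."""
    return [max(a + x for a, x in zip(row, v)) for row in A]

# Two one-step transition matrices and a path metric vector (example values).
L0 = [[0, 3], [NEG_INF, 1]]
L1 = [[1, NEG_INF], [0, 2]]
gamma = [0, 5]

two_single_steps = mp_matvec(L1, mp_matvec(L0, gamma))
one_double_step = mp_matvec(mp_matmul(L1, L0), gamma)
```

Both results agree because maximum selection and addition obey the associative and distributive laws of a semiring.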
The key to parallelizing the VA is to make an independent processing of blocks of received symbols possible. The MM contains an acquisition procedure which determines the state in the middle of a symbol block from time instant $k - M$ to $k + M$ using only the $2M$ corresponding received symbols.
This acquisition iteration was originally
We will show below that the following new acquisition pro-

it turns out that truncation or acquisition errors will not significantly (exponentially) affect the overall error probability if the truncation length $t$ is such that $t\,E(R) \geq K\,E_c(R)$, which leads to the well known $D \approx 5K$ rule of thumb. We now apply these results to the matrix algebraic formulation. First, we review conventional Viterbi decoding with regard to the linear algebraic formulation: the M-step transition matrix contains the best paths from every state at time instant $k$ to every state at time instant $k + M$:
Each entry ${}^{M}\lambda_{ij}$ contains the metric of the best path from state $j$ at time instant $k$ to state $i$ at time instant $k + M$.
The conventional VA calculates recursively
which is equal to ${}^{D}\Lambda_k \otimes \Gamma_k$. Hence, the VA operation can also be interpreted as follows: the VA adds the path metrics at time instant $k$ to the corresponding matrix entries and then performs a rowwise (concerning ${}^{D}\Lambda_k$) maximum selection leading to metrics for the $N$ best paths at time instant $k + D$. If best state decoding [21] is applied, the VA finally selects the overall maximum likelihood survivor path with metric $\gamma_{i,k+D} = {}^{D}\lambda_{ij} + \gamma_{j,k}$, including the decoding of the best state $j$ at time instant $k$. Note that best state decoding is often considered to be too expensive for implementation, and therefore not the best path, but a path leading to a fixed state is traced back. In order to avoid performance losses due to this simplification, the actually implemented survivor depth then has to be increased [22], [21].

IV. OF ACQUISITION AND

The ensemble average probability that an incorrect path which is unmerged with the correct path for exactly $t$ branches has a higher metric than the correct path (called truncation error probability below) is known from Viterbi decoding. We use the terminology and results derived by Viterbi and Omura in [20]. The truncation error probability is used to determine an

The conventional VA with best state decoding for time instant $k$ can hence be represented as (6) since the multiplication with $(1, \ldots, 1)$ corresponds to the final overall maximum selection. It is obvious that the best state $j$ at time instant $k$ can, disregarding possible truncation errors, be immediately accessed via the indexes of the overall best entry $\gamma_{i,k+D} = {}^{D}\lambda_{ij} + \gamma_{j,k}$. The operation (6) on the whole is obviously algebraically equivalent to the VA. Hence it is also proven that this procedure leads to the same performance as the VA. Using a conventional Viterbi decoder, the D-step matrix is not directly accessible, and hence the best path has to be traced back using the decisions generated during the maximum selections. As a corollary, the procedure (6) can be considered to be an alternative to the usual trace back decoding for Viterbi decoders.
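The idea of reading the decoded state directly from the indexes of the overall best entry can be sketched as follows. The function name and the list-of-lists matrix layout are assumptions made for illustration; `d_lambda[i][j]` stands for the metric of the best D-step path from state j to state i.

```python
def best_state_from_matrix(d_lambda, gamma):
    """Select the overall best entry d_lambda[i][j] + gamma[j].

    Returns (j, i, metric): j is the decoded state at time k,
    i is the end state of the best path at time k + D.
    """
    candidates = ((d_lambda[i][j] + gamma[j], i, j)
                  for i in range(len(d_lambda))
                  for j in range(len(gamma)))
    metric, end_state, start_state = max(candidates)
    return start_state, end_state, metric
```

No stored decisions are needed here, which is why this selection can stand in for the usual trace back when the D-step matrix is available.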
After investigating the linear algebraic description for truncation, we also need a linear algebraic description for the acquisition in order to completely analyze the new procedure in (5). The independent block decoding implies that the path metric vector $\Gamma_k$ is not known, as it is for the conventional VA. We perform a complete acquisition, i.e., we start decoding in midstream at $k - D$ with all path metrics equal to zero. It was already mentioned during the explanation of the truncation and acquisition error probability that the probability that a path which is unmerged with the correct path has a higher metric than the best path is negligible for $M \geq D$. Hence, we set $M = D$ and calculate
The maximum entry in the resulting vector will, disregarding the low probability of acquisition errors, correspond to a path which is either equal to or at least merged with the correct path. Possible decoding errors concerning such a path would hence take place anyway, independent of the acquisition procedure. Therefore, on the whole, we calculate
V. NOVEL ACQUISITION METHOD
If implemented from right to left, the acquisition part of (5) involves only matrix vector products, while the truncation part involves matrix matrix products. Note that the brackets in (5) separating truncation and acquisition may not be removed since the best state at time instant $k$ is only accessible if the truncation and the acquisition part of the equation are calculated separately.
Because of the high computational complexity of matrix products, matrix vector products are much more advantageous for implementation. Using the algebraic formulation, we can simply reverse the direction of calculation and obtain a matrix vector product for the truncation as well (as for the conventional VA)
leading on the whole to
where the selection of row$_i$ and column$_j$ corresponds to fixed start and end states $i$ and $j$. According to the MM derivation, these states may be chosen arbitrarily. However, for an actual implementation, $M$ has to be fixed to a particular value. If $M = D$ is used (which will be shown to be sufficient for the new acquisition procedure to achieve VA performance, and as was also proposed for the MM), linear dependence of the matrix no longer holds, which leads to severe decoding performance losses. In fact, $M$ has to be chosen considerably larger than $D$ (see also system simulation results in Section VII-A). For what reason? As was mentioned above, the survivor depth $D$ is derived under the constraint that the probability that a path which is unmerged with the correct path for a period of $D$ transitions has a higher metric than the correct path is small compared to the VA bit error probability. This can be interpreted as follows: if we select the overall best (hence the maximum likelihood, ML) path, it will merge with the best path with very high probability and hence lead to correct decoding of the source bit $D$ transitions back. This, however, by no means implies that the rows and columns of a D-step transition matrix are linearly dependent. Of course, the original MM derivation can be applied for $M \gg D$. However, using $M = D$ and fixed start and end states as for the original MM will definitely be suboptimum compared to the new method derived here. This result is also verified by the system simulations presented in Section VII-A.
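One way to picture an acquisition that needs only matrix vector products is a forward pass over the first M transitions of a block and a backward pass over the last M transitions, both started with all path metrics equal to zero, followed by a maximum selection over the sum. The sketch below is one reading of that idea and not a transcription of the paper's equations; the generators (7, 5 octal), the correlation metric and all function names are assumptions.

```python
NEG_INF = float("-inf")

def transition_matrix(r, n_states=4, g1=0b111, g2=0b101, K=3):
    """lam[i][j]: metric of the branch from state j to state i (NEG_INF if none)."""
    lam = [[NEG_INF] * n_states for _ in range(n_states)]
    for j in range(n_states):
        for b in (0, 1):
            reg = (b << (K - 1)) | j
            c1 = bin(reg & g1).count("1") & 1
            c2 = bin(reg & g2).count("1") & 1
            lam[reg >> 1][j] = r[0] * (1 - 2 * c1) + r[1] * (1 - 2 * c2)
    return lam

def decode_middle_state(block, M):
    """block: 2*M received pairs around time k. Returns the estimated state at time k."""
    mats = [transition_matrix(r) for r in block]
    n = len(mats[0])
    fwd = [0.0] * n                      # start in midstream, all metrics zero
    for lam in mats[:M]:                 # matrix-vector products, forward in time
        fwd = [max(lam[i][j] + fwd[j] for j in range(n)) for i in range(n)]
    bwd = [0.0] * n                      # same idea, computed backward in time
    for lam in reversed(mats[M:]):       # vector-matrix products
        bwd = [max(bwd[i] + lam[i][j] for i in range(n)) for j in range(n)]
    # overall best path through the block passes through this state at time k
    return max(range(n), key=lambda s: fwd[s] + bwd[s])
```

Since each block is processed without reference to its neighbors, any number of such blocks can be handled in parallel.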
As was proved above, the acquisition and truncation procedure (10) can be performed independently and hence in parallel for an arbitrary number of blocks containing $2M$ symbols with $M \geq D$. It is most efficient to use nonoverlapping contiguous blocks of length $2M$ for this operation. The result is a number of uniquely decoded states with distance $2M$ transitions. Below, we call the calculation of row and column of the M-step transition matrices ACS acquisition iterations. The remaining states in between the known states resulting from the ACS acquisition iterations can be decoded using a second ACS iteration. The resulting architecture that processes one block at a time is shown in Fig. 4. It consumes one block of input symbols for every clock cycle. The latches are necessary to store values which are needed for the following block to be processed in the next clock cycle.

[Figure caption: Structure of the ACS block processing for the K = 3, rate 1/2, ...]
Starting with the known state $z_k$, a second forward and a second backward ACS iteration are computed. Because of the exact knowledge of the starting states, the decisions on the surviving paths generated during the ACS maximum selection are now valid. As for a conventional Viterbi decoder (Fig. 2), these decisions are used to finally trace back the decoded paths. When forward and backward ACS iterations meet, the overall best state is determined as for the acquisition iteration, which again results in a decoded state $z_{k-M}$. Starting from this state, the best path is traced back and the corresponding source bits are computed as a parallel output. A detailed description of the architecture is given in Section VIII. However, it is important that the parameter $M$ determines (see Fig. 4) the blocksize of the MM architecture, hence it is directly related to VLSI implementation complexity. Note that it is possible to extend the architecture given in Fig. 4 by identical modules on the left and right-hand side, leading to an even faster architecture that consumes a number of blocks at a time. Therefore, in principle, an arbitrary degree of parallelism can be achieved.
VI. SYSTEM DESIGN ISSUES
The most important performance measures for channel decoding are bandwidth efficiency and the achievable bit error rate (BER). Here, we summarize these measures using the term "communication performance." Bandwidth efficiency is completely determined by the chosen code and modulation scheme. Therefore, we concentrate on the BER, which strongly depends on the actual choice of implementation parameters. Performance measures related to VLSI implementation are throughput, area, and power consumption, which are summarized using the term "VLSI performance."
We first investigate the influence of algorithmic parameters on the BER. Since transmission channel conditions are statistically described and purely analytical investigations are not possible due to nonlinearities and/or complexity, the BER can only be determined by system simulation. Since the finally achieved BER includes all performance losses due to quantization and possibly limited accuracy, the simulations have to be carried out using bit true models of the digital hardware in order to achieve accurate results. The system specification includes the desired communication performance.
Hence, a maximum allowed implementation loss results for the implementation compared to an "ideal" implementation with floating point data and optimum algorithmic parameters. Our first design goal is to achieve the same communication performance and hence the same implementation loss as a conventional VD. All algorithmic parameters and the quantization and accuracy of all internal values are free parameters in terms of the specified communication performance. Thus, an important complexity/performance tradeoff can be exploited here, since these parameters heavily affect the complexity of a VLSI implementation.1 This implies that an interactive process should occur taking into account both system design and VLSI implementation in order to jointly optimize communication performance and VLSI performance. Therefore, system design and VLSI design as well as the corresponding tools should be closely coupled in a smooth design flow. The result of the system design process is a bit true model with fixed implementation parameters, quantization and accuracy (see Fig. 3), which serves as a reference for the following VLSI implementation. During the remaining stages of VLSI design, the main objectives are to achieve the specified throughput (our second design goal here: throughput > 1 Gb/s) with minimum area (exploit the well-known AT tradeoff), not to exceed a given maximum power consumption, implement a specified programmability, and introduce a suitable system partitioning (our third design goal: obtain a single cascadable design). It is particularly important that the MM Viterbi decoding algorithm is only asymptotically equivalent to the VA (i.e., for M >> 0). For practical implementation with finite blocklength $M$, there is no guarantee that the input/output behavior of the MM will be equivalent to the VA.
1A systematic approach to parameter optimization given a maximum specified implementation loss was published in [23].
However, it will be shown below that, choosing $M = D$, the MM is equivalent to the VA in terms of the achievable BER. The choice of $M = D$ simultaneously minimizes VLSI implementation complexity.
VII. SYSTEM DESIGN AND SIMULATION
As was mentioned before, communication performance can only be determined by extensive system level simulation. Additionally, the effects of implementation parameters and quantization have to be investigated by simulation. Hence, runtime efficient system level simulation and analysis CAD/CAE tools are of central importance.
System design and simulation of the minimized method Viterbi decoder were conducted using the system design tool COSSAP [24], [25]. This tool contains a graphical user interface, which allows the system to be configured as a block diagram (Fig. 6), a very runtime efficient dataflow-driven simulator and many facilities for analysis and visualisation of simulation results. The configured system contains a binary random source, a rate 1/2 convolutional encoder with constraint length K = 3 and four states, a mapper which maps the coded bits to Gray-coded complex QPSK symbols at the sender, an additive white Gaussian noise (AWGN) channel,2 and an A/D conversion at the receiver providing quantized inphase and quadrature components for the MM Viterbi decoder. Synchronization is assumed to be ideal.
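The mapper and channel stages of such a simulation chain can be sketched in software. The specific Gray mapping (each coded bit selects the sign of one component), the unit component amplitude and the function names are assumptions; the original system was configured in COSSAP, not in Python.

```python
import math
import random

def qpsk_map(c1, c2):
    """Gray-coded QPSK (assumed mapping): each coded bit sets the sign of one component."""
    return (1 - 2 * c1, 1 - 2 * c2)

def noise_sigma(ebn0_db, rate=0.5):
    """Noise standard deviation per component for unit-amplitude components.

    With energy 1 per coded bit, Eb = 1/rate and N0 = 2*sigma^2.
    """
    ebn0 = 10 ** (ebn0_db / 10)
    return math.sqrt(1.0 / (2.0 * rate * ebn0))

def awgn(symbols, ebn0_db, rng):
    """Add independent Gaussian noise to the inphase and quadrature components."""
    sigma = noise_sigma(ebn0_db)
    return [(i + rng.gauss(0.0, sigma), q + rng.gauss(0.0, sigma))
            for i, q in symbols]

rng = random.Random(0)                       # fixed seed for a reproducible run
tx = [qpsk_map(0, 0), qpsk_map(1, 0), qpsk_map(1, 1)]
rx = awgn(tx, ebn0_db=3.0, rng=rng)
```

Neighboring constellation points differ in exactly one coded bit, which is the defining property of a Gray mapping.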
Bit true modules of the resulting minimized method Viterbi decoder were developed in C with the following configurable parameters: input quantization, blocksize $M$ and start of acquisition. These implementation parameters will be investigated in further detail below. Note that in the block diagram, the block "VITERBI" represents the bit true C module, while the block "HiSL" contains a C module interfacing a board with the real hardware during simulation as described in Section XII. Using this design environment, algorithmic parameters and quantization effects can comfortably be investigated.
2Note that a simple additive white Gaussian noise channel model without intersymbol interference is adequate for our application.
The achieved bit error rate (BER) over channel signal-to-noise ratio (SNR) as shown in the following figures is the single communication performance criterion which is important here. For coded transmission, the coding gain is defined as the decrease in the SNR needed to achieve a certain BER using coded compared to uncoded transmission. Thus the coding gain at any BER is found by subtracting the coded SNR $E_b/N_0$ from the uncoded $E_b/N_0$.
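The coding gain definition can be turned into a small calculation. The uncoded reference below uses the standard BER expression for coherent QPSK on an AWGN channel; the coded $E_b/N_0$ passed to `coding_gain_db` is a hypothetical input that would come from simulation, and the function names are illustrative.

```python
import math

def uncoded_ber(ebn0_db):
    """BER of uncoded coherent QPSK on an AWGN channel."""
    ebn0 = 10 ** (ebn0_db / 10)
    return 0.5 * math.erfc(math.sqrt(ebn0))

def required_uncoded_ebn0_db(target_ber, lo=-5.0, hi=20.0):
    """Bisection on the monotonically decreasing uncoded BER curve."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if uncoded_ber(mid) > target_ber:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def coding_gain_db(target_ber, coded_ebn0_db):
    """Uncoded Eb/N0 minus coded Eb/N0, both at the same target BER."""
    return required_uncoded_ebn0_db(target_ber) - coded_ebn0_db
```

For example, a decoder that reaches a given BER at a lower $E_b/N_0$ than uncoded transmission has a positive coding gain equal to the horizontal distance between the two curves.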
A. Algorithmic Parameters and Quantization
Typically, the dependence of communication performance on algorithmic parameters and quantization exhibits saturation behavior. Exceeding a certain limit, no further performance improvement is gained, while below this threshold, performance is significantly parameter dependent. On the other hand these parameters directly affect VLSI implementation complexity.
In this section, the choice of the blocksize M , of an acquisition initialization scheme, and the influence of the quantization of the input samples are discussed, as well as implications from the employed modulation scheme.
1) Blocksize:
The parameter $M$ determines the block size of the minimized method architecture (see Fig. 4). Hence it is a very important parameter determining both complexity and latency of the implementation. Additionally, communication performance (here BER) depends strongly on $M$. In Section IV it was proven that for the newly derived acquisition procedure (10) $M = D$ may be chosen while achieving exactly the same performance as the conventional VA. If we employ the original MM acquisition (4), a performance loss of more than 0.5 dB is observed for $M = D = 12$ (see Fig. 7). $M$ has to be increased considerably in order to achieve the same performance as our new method, leading to a 33% increase of the complexity of a VLSI implementation due to the increased blocksize. Therefore, the newly derived method (10) was used for the actual implementation (see also [1]).
Then (see Fig. 7) the blocksize for the Gb/s decoder design was fixed to be $M = D = 12$.

[Figure caption: Performance of the Gb/s system compared to a conventional Viterbi ...]
2) Quantization: Viterbi decoding can be performed using either soft or hard quantized input channel symbols. The advantage of using soft quantized inputs is that 2 dB additional coding gain can be achieved compared to hard quantized inputs [26], [22]. It is known from the available literature that three or four bit input symbol quantization is sufficient for Viterbi decoders. System simulations show that four bit quantization is sufficient for the MM as well.
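A soft input quantizer of the kind discussed here can be sketched as a uniform quantizer with saturation. The full-scale range, the mid-rise characteristic and the function name are assumptions; the text only fixes the word length of four bits.

```python
import math

def quantize(x, bits=4, full_scale=2.0):
    """Uniform quantizer with saturation; returns a signed integer level.

    full_scale is an ASSUMED clipping amplitude; bits=4 gives levels -8..7.
    """
    levels = 1 << bits
    step = 2 * full_scale / levels
    q = math.floor(x / step)
    return max(-(levels // 2), min(levels // 2 - 1, q))

def hard_decision(x):
    """One-bit quantization keeps only the sign of the received component."""
    return 0 if x >= 0 else 1
```

The soft levels preserve how reliable each received component is, which is the information a hard decision discards.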
It was mentioned before that the MM algorithm cannot be derived analytically from the VA. Therefore, the performance of the resulting bit true MM model (see Fig. 8 , curve "bit true model") is compared to the performance of a standard recursive Viterbi decoder (curve "conv. Viterbi") with the same algorithmic parameters (D = 12) and input quantization (four bit). Our first design goal is achieved since, as shown in Fig. 8 , no performance difference can be found.3
The implementation loss involved with the actual implementation compared to an "ideal" implementation provides insight into the quality of the implementation. Note that the usage of bit true models allows the implementation loss to be determined before the actual hardware implementation is available. The curve "ideal" in Fig. 8 is the result of an ideal decoder implementation without quantization, i.e., using floating point data during the simulation and ideal arithmetic without any rounding or overflow effects, and with quasi unlimited blocksize M . The implementation loss is the difference between this "ideal" curve and the curve resulting from the actual implementation. As becomes clear, the loss is very small (about 0.15 dB at BER). Since there are no rounding or overflow effects, the implementation loss is completely due to the quantization of the input samples and the limited blocksize M .
3Note that in Fig. 8 , the curve "Gb/s Decoder" represents the performance of the fabricated hardware. It results from a simulation with the hardware "in the simulation loop" as described later on. Since the behavior of the fabricated hardware and the bit true model is exactly the same, the performance curves "bit true model" and "Gb/s Decoder" are also identical.
B. Modulation Scheme
The modulation scheme employed is Gray-coded quaternary PSK (QPSK) with four complex symbols a, b, c, and d. For the chosen rate 1/2 code, one complex QPSK symbol, given by inphase and quadrature components, is associated with each trellis transition. The transmitted symbols are two dimensional phasors at 90° angles. Since the sequence of transmitted symbols is not known a priori to the receiver, a 90° phase ambiguity remains after phase synchronization. Since the employed code is not 90° rotationally invariant, a means has to be provided in order to correct this phase ambiguity. If the incoming phasors are all rotated by multiples of 90°, the resulting symbol sequence does not represent a valid sequence. Therefore, the path metrics accumulated during ACS operation, in this case, do not exceed a certain limit indicating a low probability even for the best path that could be found. Therefore, if a means is provided to observe the path metrics, 90° phase errors can be detected. On the other hand, the path metrics are obviously dependent on the SNR, thus a low path metric could also indicate a bad channel condition.
Due to the random nature of the noise in contrast to the deterministic error introduced by the wrong phase condition, the observed path metrics have to be averaged. If the average does not exceed a certain limit, it is assumed that a phase error has occurred and all received channel symbols are rotated by 90°. Such a phase rotator is provided on chip for all incoming symbols. At most three rotations are necessary to obtain a 270° phase error correction. Both the time interval for averaging and the limit are programmable in order to allow the adaptation of the scheme to the actual channel conditions. Hence a path metric accumulator as well as a comparator with programmable threshold and a controller are provided on the chip.
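The averaging and rotation scheme can be illustrated with a small controller model. This is a behavioral sketch only: the class name, the way metric observations are fed in, and the example window and threshold values are assumptions and do not describe the on-chip logic.

```python
def rotate90(symbol):
    """Rotate an (inphase, quadrature) pair by +90 degrees."""
    i, q = symbol
    return (-q, i)

class PhaseController:
    """Averages observed path metric values and steps the rotator if they stay low."""

    def __init__(self, window, threshold):
        self.window = window          # programmable averaging interval
        self.threshold = threshold    # programmable limit
        self.acc = 0.0
        self.count = 0
        self.rotations = 0            # current correction in multiples of 90 degrees

    def update(self, metric_observation):
        self.acc += metric_observation
        self.count += 1
        if self.count == self.window:
            if self.acc / self.window < self.threshold:
                self.rotations = (self.rotations + 1) % 4
            self.acc, self.count = 0.0, 0
        return self.rotations

    def apply(self, symbol):
        for _ in range(self.rotations):
            symbol = rotate90(symbol)
        return symbol
```

Averaging over a window separates the persistent metric drop caused by a wrong phase from short noise bursts, at the cost of a detection delay of one window.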
VIII. ARCHITECTURE AND SYSTEM PARTITIONING
With algorithmic parameters and quantization known, communication performance is fixed. The resulting bit true models are now refined stepwise without changing the input/output behavior. It was shown in the previous section that the first goal of this project, a communication performance like that of a conventional Viterbi decoder, has been achieved with this bit true model. The additional design goals, Gb/s decoding speed and a single cascadable design, have to be achieved during system partitioning and architecture development.
In order to achieve Gb/s speed, a fully parallel and pipelined implementation of the MM as shown in Fig. 9 was implemented. One dedicated ACS unit, with a dedicated ACS processing element (PE) for every state, is implemented for each trellis transition. Bit level pipelining is implemented for the ACS PE's, which is possible since the MM architecture is purely feedforward. This leads to the skewing storage which is indicated by the shaded areas in Fig. 9 (cf. Fig. 4 for nonpipelined implementation). At this design stage, complexity estimations for the VLSI implementation were possible with high accuracy, indicating that it was not possible to integrate the processing of a whole block of length $2M = 24$ (which would have been the most "natural" partitioning) on one chip. In order to develop a feasible partitioning, yield and pin count limitations as well as power consumption had to be taken into account. Additionally, we had to achieve our third design goal: a single cascadable design.
1) Yield Limitations:
Because there are two ACS operations per state transition given by the first and second ACS iteration, $24 \cdot 4 \cdot 2 = 192$ ACS PE's are needed for a complete block of length 24. In addition, the skewing storage indicated by the shaded areas in Fig. 9 and the trace back elements have to be considered. Taking into account yield limitations, we estimated that on one chip it is possible to integrate only one quarter of the overall number of PE's together with the necessary skewing storage. Hence a suitable partitioning of the architecture had to be derived. It turned out to be advantageous to integrate one block of 12 successive trellis transitions, leading to the partitioning as shown in Fig. 9.
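The PE count above follows directly from the block length, the number of states and the two ACS iterations. The helper name is hypothetical; the numbers are those given in the text.

```python
def acs_pe_count(block_length, n_states, iterations=2):
    """One ACS PE per state, per trellis transition, per ACS iteration."""
    return block_length * n_states * iterations

total_pes = acs_pe_count(block_length=24, n_states=4)   # whole block of length 2M = 24
pes_per_chip = total_pes // 4                           # one quarter fits on one chip
```

The per-chip figure of 48 PE's matches 12 successive trellis transitions with four states each.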
2) Pin Count Limitations: There was only a limited set of standard packages available for our design, with a maximum pin count of 224. For the partitioning explained above, the number of input bits is reduced by half, i.e., to 96 b, because the input block is only of length $M = 12$. In addition to the input symbols necessary for every chip, only a set of four path metrics and the best state have to be transmitted from chip to chip as shown in Fig. 9. Using this partitioning, it is possible to integrate the chip using a standard ceramic 176 PGA package. Note that the choice of the package is crucial because of the different cavities given by the selection of a package. In order to keep the bond wires sufficiently short, a certain minimum die size has to be provided for every particular package. This die size usually increases with the number of pins. Even if the used silicon area is smaller, it can happen that the pad ring determines the chip die size. The customer, however, has to pay for the die size, not the used silicon. In our case, selecting the next larger package (208 PGA) would have resulted in an increase of production cost by 25%. Hence we had to use multiplexed bidirectional pads whenever possible in order to keep the pad count below 176 (see also next section).
3) Power Consumption: The choice of a particular package together with the package heat resistance and the maximum allowed junction temperature already limit the maximum power consumption of the chip. As a result, the necessity to simulate or at least estimate power consumption arises. Since our VLSI design tools did not include any power estimation or simulation tools, we performed an "activity analysis" in order to determine the average toggle count per gate and clock cycle. It turned out that the toggle count is 66%, which is surprisingly high (for standard designs, where not all parts of the chip are active all the time, a toggle count of 25% is usually assumed).
Together with the average power consumption per gate and per MHz of clock frequency, it was possible to coarsely estimate the power consumption at 5 W, which leads to tolerable values of the junction temperature using the selected package.
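A coarse estimate of this kind can be sketched as follows. This is only a simplified model under stated assumptions: the gate count, the per-gate power figure, the thermal resistance, and the ambient temperature are not given in the text, so they appear only as function parameters, and the function names are hypothetical.

```python
def estimate_power_w(num_gates, toggle_rate, uw_per_gate_mhz, f_clk_mhz):
    """Coarse dynamic power estimate in watts.

    uw_per_gate_mhz: assumed average power per toggling gate and MHz, in microwatts.
    """
    return num_gates * toggle_rate * uw_per_gate_mhz * f_clk_mhz * 1e-6


def junction_temp_c(power_w, theta_ja_c_per_w, ambient_c):
    """Junction temperature from package thermal resistance (simple linear model)."""
    return ambient_c + power_w * theta_ja_c_per_w


# The only figures taken from the text are the two toggle rates.
MEASURED_TOGGLE = 0.66   # result of the activity analysis
TYPICAL_TOGGLE = 0.25    # value usually assumed for standard designs

# With all other parameters equal, power scales linearly with the toggle rate.
ratio = MEASURED_TOGGLE / TYPICAL_TOGGLE
print(f"power relative to the usual 25% assumption: {ratio:.2f}x")
```

Under this linear model, using the usual 25% assumption instead of the measured 66% would underestimate the power by a factor of about 2.6.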
In Fig. 9 the minimum resulting configuration of four chips is shown. This configuration can easily be expanded by extending the input serial to parallel conversion and providing more chips. Hence, the general setup is described by an array of chips with two rows and an arbitrary number of columns. For the different array positions, slightly different functions have to be implemented, which had to be integrated into one cascadable design. It is necessary to distinguish whether the chip is in the upper or lower row of the array, whether it is placed in an even or odd column, and whether it is the rightmost or leftmost chip in the array. Then it can be decided if the ACS operation and the trace back have to be performed in forward or backward direction, and if the latches for interchip communication between the leftmost and rightmost chip (see Fig. 9) have to be turned on or off. It was possible to achieve this flexibility with relatively little additional effort, as described below.
4) Forward/Backward ACS Operation and Trace Back Decoding:
Depending on the array position of the chip, the ACS as well as the trace back operation have to be performed in forward or backward direction.
In Fig. 10, one trellis transition for a forward and backward ACS iteration is shown. As becomes clear, the interconnection network for the two possibilities is different. But if only the assignment of the binary labeled states to the PE's is interpreted in reversed direction (from left to right instead of from right to left, Fig. 10), the wiring of the PE's as well as the connections of the four transition metrics for the symbols a, b, c, and d to the PE's do not need to be changed. No multiplexers are necessary and the identical topology can be used for both cases (see the Appendix). A different consideration is necessary for the trace back operation. Remember that for every state and every decision during the ACS operation, one bit was stored indicating which of the two merging paths was chosen for a given state. Using these decision bits, the sequence of states can be traced back starting with a known state. From every known successive state the corresponding source bit can be decoded using simple logic equations.
In our application, the ACS iteration proceeds in time reversed direction as well as in forward direction. Therefore, a means has to be provided to trace the paths in forward direction. The logic equations for the trace back and trace forward are shown in Table I. Please refer to Fig. 10 for the corresponding trellis transitions: the transition labeled "forward transition" for the trace back and the transition labeled "backward transition with changed labeling..." for the trace forward. The decision bit (db) is set to 0 if, for a given state, the upper incoming branch is chosen, and to 1 otherwise. It is easy to see from Table I that for the trace forward, $Z_{k+1}$ is calculated by a left shift of the binary representation of state $Z_k$ and inserting the decision bit (db) at the rightmost bit position, while for the trace back, $Z_k$ is calculated by a right shift of the binary representation of state $Z_{k+1}$ and inserting db at the leftmost position. The decoded bit $(\mathrm{decbit})_{k+1}$ for time instant $k+1$ is equal to the rightmost bit of $Z_{k+1}$ for both cases. The different combinational logic for trace back and trace forward can be realized simply by multiplexing the shifted binary representation of the state and the decision bit, thus with a small amount of additional logic.
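The two shift operations described above can be sketched in a few lines. This is a minimal illustration that follows the prose description only (Table I is not reproduced here); the function names and the integer encoding of the states are assumptions.

```python
STATE_BITS = 2                    # four states, as in the design described here
MASK = (1 << STATE_BITS) - 1


def trace_forward(z_k, db):
    """Left shift Z_k and insert db at the rightmost position to obtain Z_{k+1}."""
    z_next = ((z_k << 1) | db) & MASK
    return z_next, z_next & 1     # decoded bit: rightmost bit of Z_{k+1}


def trace_back(z_next, db):
    """Right shift Z_{k+1} and insert db at the leftmost position to obtain Z_k."""
    z_k = (z_next >> 1) | (db << (STATE_BITS - 1))
    return z_k, z_next & 1        # decoded bit: rightmost bit of Z_{k+1}


print(trace_forward(0b10, 1))     # (1, 1)
print(trace_back(0b10, 1))        # (3, 0)
```

Both functions share the same shift-and-insert structure, which is why a multiplexer on the shifted state and the decision bit is sufficient to switch between them.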
IX. THE PARTITIONED CASCADABLE CHIP
In this section, the final partitioned architecture of the realized chip is described. There are 12 input symbols which are fed in parallel into the decoder chip. Remember that these 12 symbols are represented by inphase and quadrature components quantized with four bits each. The input symbols are first stored in a preskewing FIFO consisting of dual ported RAM's and registers. This FIFO is necessary because the ACS iteration is implemented as a pipeline consisting of 12 ACS units each with N = 4 ACS PE's. The ACS PE's are implemented using the solution described in [11], which employs partly binary and partly redundant carry-save arithmetic in most significant bit (MSB) first mode (see Fig. 11). Note that the "plus" circles represent full adders, the M-boxes binary maximum selection and the CSM-boxes carry save maximum selection. Each of the ACS PE's contains two pipeline stages (see Fig. 11), therefore the preskew is necessary.
Depending on the position of the chip in the array (upper or lower row, Fig. 9), a controller steers the following functions.
The input symbols have to be skewed differently. In the upper row of the array, the ACS operation can immediately start with the first transition, while in the lower row, the input symbols have to be delayed until the best state resulting from the ACS acquisition iteration is available. The timing in terms of clock cycles is shown in Fig. 12. Note that for a position in the upper row, the input is fed directly into the preskewing registers without entering the 32 × 32 FIFO. The two 64 × 32 FIFO's implement a delay of 8 and 16 clock cycles, respectively. For a position in the lower row, the 32 × 32 FIFO is fully used, while the 64 × 32 FIFO's realize 40 and 48 cycles delay, respectively.

Forward/backward processing has to be enabled for the ACS and the survivor memory. As was described in the previous section, the difference for the ACS is only given by the fact that the binary representation of the states has to be interpreted differently. For the trace back, multiplexers are steered which select between trace forward and trace back.

Additionally, the start section of the ACS pipeline has to be steered. Either the ACS calculation starts with path metrics equal to zero (upper row), or the ACS calculation starts with the best state already known (lower row). The fixing of a particular state is achieved by forcing the path metrics for all states to zero and setting the branch metrics for all paths not emerging from the best state to zero. The advantage of this procedure compared to simply setting the initial path metric of the best state to a sufficiently high value is that the internal wordlength of the ACS datapath can be kept small (path metric accumulation starts with zero).

A number of bidirectional pads have to be steered. Note that in Fig. 12, the path metrics leaving the ACS pipeline are first converted from carry save to binary representation (CS to binary). The following calculation of the best state involves the path metrics from two neighboring chips, hence it could be implemented only on every second chip. Since we implemented a single cascadable design, however, the corresponding logic has to be provided on every chip. Calculating the best state on every chip requires that every chip sends its path metrics to, and receives the path metrics from, its neighbor. This would require two times 4 × 8 pads (see Fig. 12) for the incoming and outgoing path metrics. In order to be able to use a 176 pin PGA, we had to reduce the number of pads. Since it is sufficient that the best state is calculated only on every second chip, we provided bidirectional pads for the path metrics. One chip only sends the path metrics, and finally receives the best state from its neighbor, while this neighbor only receives the path metrics. Of course, two additional bidirectional pads are then necessary for the best state, but 4 × 8 pads are saved; thus, on the whole, 30 pads are saved.

The different functional sections of the chip are described below.
1) ACS: Because of the repeated accumulation of the transition metrics during the ACS operation, the wordlength increases from five to 11 with the length of the pipeline. Therefore, a number of different PE's were designed, each with the minimum wordlength necessary (see Fig. 12). On the other hand, no metric normalization schemes are necessary.

2) Trace Back: The decisions computed during the ACS operation are then stored in the SMU FIFO. It was implemented using flip-flops, because of the relatively small overall amount of storage required. Note that there is only one decision bit generated per ACS decision. The simple combinational logic realizing the trace back was already described in the previous section.
3) Phase Rotation: Note that the path metric accumulation and threshold comparison are programmed using a bit serial interface due to pin count limitations. If an out-of-sync condition is observed, the out-of-sync signal (see Fig. 12) is set to 1. The out-of-sync pad can be connected externally with the sync input pad. This sync signal triggers a phase rotation (see Fig. 12).
4) Test Controller:
The test controller is programmed bit-serially via the test pin teio (a bidirectional pad which is later steered to represent an output) and a separate test clock tdk during chip initialization.
A. Design Flow
In order to verify the design, an interface has to be provided between the system design tool and the tools used for VLSI implementation. In the early phase of this project we used a different approach than used in current design projects, which is briefly summarized at the end of the section.
The design was conducted using traditional "bottom-up" schematic entry with the Cadence Edge tools. In order to test the gate level schematic descriptions, behavioral descriptions of the subcircuits were developed in STL [29] (Simulation and Test Language) for the Cadence Edge [30] SILOS gate level simulator and compared to the simulated schematic using the "simdifl" command. The largest debugging effort was necessary for the ACS pipeline. First of all, the mixed carry-save/binary signal representation (see Fig. 11) makes it difficult to interpret intermediate waveforms. Note that for the STL description, we used high level behavioral functions ("plus" and "max"), leading to a conventional number representation of the results. Second, all intermediate path metrics in the ACS pipeline occur in a skewed form (Fig. 11), which presents a further difficulty in interpreting the results. In order to compare the two descriptions, the path metric values had to be first deskewed and then converted to binary representation. This was a very time consuming and error prone task during debugging.
Another important problem (as seems to be usual with bottom-up design) occurred when the verified subcircuits were put together. In order to verify the overall circuit, we developed a high level STL model of the chip. This STL description contains a random pattern generator to generate the source bits and the convolutional coder and mapper. Overall functionality for noise free transmission could be tested by feeding the generated channel symbols directly into the decoder, delaying the source bits and comparing the results. Of course, this description covers only noise-free transmission, but it is sufficient to detect many functional errors. Note that the overall design can be considered to be a large pipeline with 70 stages and, additionally, control and interchip communication. The detection of very simple faults, e.g., signals arriving at wrong times anywhere within the 70 stage pipeline, was very difficult (consider the skewed signal representation in the ACS pipeline). The high-level STL model provided error detection, but did not allow the error sources to be traced, because only the final outputs could be compared. Therefore, it was necessary to refine the high-level model in order to provide intermediate test signals.
After this debugging phase, we finally had to prove functionality and system performance for other than noise free channel conditions. The noisy input symbols generated during a COSSAP system simulation were recorded in ASCII files as well as the outputs of the COSSAP decoder software module. The SILOS simulation of the circuit was then stimulated with these input symbols, and the circuit response was compared to the response of the high level system model.
Today, timed VHDL descriptions can be generated automatically from the COSSAP data flow graph description using the ADEN design environment [31], [32]. It is also possible to co-simulate a VHDL model included in the COSSAP data flow graph together with the remaining system by simulation coupling [33]. As a result, system level performance can be determined using exactly the model which serves as a reference for further refinement throughout logic synthesis and layout. Hence, a truly smooth transition is possible today with easily achievable model consistency.
X. BUILT-IN SELF TEST
The chip contains a pseudo random built-in self test (BIST). In addition, an interface test is possible as well as a system level test and a functional test. Compared to state of the art test methods such as scan path or partial scan path, this method exhibits several advantages. The overhead can be kept very low at a few percent of the overall silicon area [34]. Additionally, the test is dynamic, i.e., it runs in real time at the operating frequency. The concept for the self test was published in [34], [35]. However, we will describe here the design and implementation of the BIST, which had an important impact on the chip architecture.

1) Self-Test Mode: All inputs of the chip are fed into BILBO (built-in logic block observer) registers immediately after the input pads. These BILBO cells are flip-flops with some logic at the input side (three gates) enabling the cell to be switched into different operational modes. During self test mode, these BILBO's represent an LFSR (linear feedback shift register) generating pseudorandom numbers. On the other hand, all outputs are fed into BILBO cells as well, which represent a signature register compacting the circuit response to the test vectors generated by the LFSR. It should be noted that, using our self test concept, the length of the self test is not critical, because the test vectors do not have to be stored but are simply generated by the input LFSR. In addition, the self test is dynamic; it runs at the operational frequency. With a 50 MHz clock rate, long tests can be run in a matter of milliseconds. The final signature after self test completion is compared to a presimulated signature on chip. Thereby, a static binary chipok signal is generated which can be observed at the test pin teio. Note that in (normal) functional mode, all BILBO's simply represent pipeline registers.
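The principle of generating patterns with an LFSR and compacting the responses into a signature can be illustrated with a small software model. This is only a sketch under stated assumptions: the 16-bit register width, the feedback polynomial (taps 16, 14, 13, 11), the seed, and the toy circuit are all hypothetical and do not describe the BILBO registers of the chip.

```python
def lfsr_step(state):
    """One step of a 16-bit Fibonacci LFSR with taps 16, 14, 13, 11 (assumed)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def run_self_test(circuit, cycles, seed=0xACE1):
    """Feed pseudorandom patterns into the circuit and compact the responses."""
    pattern, signature = seed, 0
    for _ in range(cycles):
        response = circuit(pattern) & 0xFFFF
        signature = lfsr_step(signature) ^ response  # signature register update
        pattern = lfsr_step(pattern)                 # next pseudorandom pattern
    return signature


def good_circuit(x):
    """Hypothetical stand-in for the fault-free logic under test."""
    return (x + 1) & 0xFFFF


def faulty_circuit(x):
    """Same logic with a fault that is excited by a single input pattern."""
    y = good_circuit(x)
    return y ^ 1 if x == 0xACE1 else y


reference = run_self_test(good_circuit, 10_000)      # "presimulated" signature
chipok = run_self_test(faulty_circuit, 10_000) == reference
print("chipok:", chipok)
```

No test vectors are stored in this model; only the reference signature has to be known in advance, which mirrors the reason given above why the test length is not critical.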
The first fault simulations showed that the fault coverage was only about 70%. This percentage was reached after 10000 clock cycles and could not be improved by simply running longer tests, indicating problems in terms of controllability and observability of the internal signals, which seems to be quite natural given a circuit representing a 70 stage pipeline. A number of facilities were implemented iteratively using the results of extensive fault simulations in order to improve controllability and observability and, thereby, to improve fault coverage. Because of the tremendous computational task given by fault simulations for circuits of that size, the fault simulations could only be performed making use of a ZYCAD hardware simulation accelerator.⁴ It should be noted that a fault simulation taking into account only 1% of all possible stuck-at faults took weeks of computing time on our fastest workstation. But in particular, since these faults are chosen at random and the testability information obtained is spread over the whole gate level circuit, the determination of the reasons for low fault coverage is very difficult, or even impossible. On the other hand, with a full 100% fault simulation, the regions of low fault coverage can not only be determined, but the complete information on the testability also allows the reasons for the bad coverage to be traced. Most testability problems were located in the ACS pipeline, which contains about 70% of all the chip gates. This is because the final path metric is only output at the very end of the pipeline, while all the other ACS units output single decision bits only. Fault simulations showed that, especially in the front part of the pipeline, observability was very bad. Here, pipeline registers were replaced by cells augmenting the output signature register, ensuring observability of the signals.
Due to the successive accumulation of the path metrics during the ACS calculation, values exceeding a certain limit occur with very low probability, indicating bad controllability. Therefore, the pipeline has to be broken and random numbers have to be fed in at these places via multiplexers. The final structure of the system self test facility is shown in Fig. 13. The coverage finally obtained is 99% for a test duration of 10000 clock cycles with a test overhead of only about 5% of the overall chip area. All self test facilities are designed in a way that the critical paths are the same for functional as well as self test mode. Therefore, the self test allows the maximum possible clock frequency of the chip to be easily determined.
XI. PHYSICAL DESIGN AND CHIP TEST
The huge power consumption was an important problem during physical design of the standard cell chip. Since it is a large 50 MHz 70-stage pipeline with 66% of all gates toggling for every clock cycle (see Section IX), power consumption is quite high. The width of the power supply rails in the standard cells of our library was not large enough to allow rows of standard cells to span over the whole width of the chip, since the high current in these rails would lead to electromigration effects. Splitting the standard cell regions into two parts was sufficient to safely prevent electromigration. The placement of the dual port RAM's of the input FIFO was optimized taking into account the placement of the standard cells. In contrast, simply placing the RAM's on the bottom or top of the chip led to much larger area consumption. From the die photo in Fig. 14 it becomes clear that the wiring is quite dense, despite the tremendous complexity. The wiring channel in the middle is due to the necessity to split the standard cell rows into two parts. The width of the global power supply rails was dimensioned in order to avoid electromigration problems. The overall chip area including pads is 95 mm². The chip is packaged in a standard ceramic 176 PGA.

Fig. 13. Structure of the system self test facility.

⁴ A 100% fault simulation needs only one session of a few hours computation time on our ZYCAD machine (ZYCAD XP 140).

1) Measured Results: Using the dynamic on chip pseudo random self test, the maximum clock frequency was determined by simply running the self test at different clock frequencies, using a steerable clock generator included in our test rack, and observing the chipok signal, which we connected to an LED. Besides some stimuli initializing the circuit to self test mode, no further test vectors are necessary, since these are generated by the on chip LFSR.
The physical duration of the self test (10 000 clock cycles) is well below one millisecond. Thus, the clock frequency can comfortably be iterated toward the maximum value at which the self test is still successful. Of 10 fabricated prototype chips, six run with a maximum measured clock frequency of 50 MHz. The measured power consumption at 50 MHz is 3.6 W.
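The measurement procedure amounts to a simple sweep over the clock frequency. A minimal sketch, assuming a hypothetical callback `self_test_passes(f_mhz)` that stands for running the self test at a given clock frequency and reading the chipok signal; the sweep range and step size are placeholders:

```python
def self_test_duration_ms(cycles, f_clk_mhz):
    """Duration of the self test in milliseconds."""
    return cycles / (f_clk_mhz * 1e3)


def max_passing_frequency(self_test_passes, f_start_mhz, f_stop_mhz, step_mhz):
    """Raise the clock step by step; return the highest frequency that still passes."""
    best = None
    f = f_start_mhz
    while f <= f_stop_mhz:
        if not self_test_passes(f):
            break
        best = f
        f += step_mhz
    return best


# 10 000 cycles at 50 MHz take 0.2 ms, i.e., well below one millisecond.
print(self_test_duration_ms(10_000, 50))

# Hypothetical chip model that passes the self test up to 50 MHz.
print(max_passing_frequency(lambda f: f <= 50, 10, 80, 5))  # 50
```

Because one run is so short, a fine step size costs practically no measurement time.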
XII. SYSTEM VERIFICATION: SIMULATION WITH A PHYSICAL DEVICE MODEL
The COSSAP system design tool allows simulations to be run with a physical device model integrated into the simulation loop, called "Hardware in the Simulation Loop" (HiSL) [36]. As soon as VLSI implementations of components (blocks) of the block diagram are available, the software modules formerly representing them can be replaced by the hardware. A board containing the hardware and some interfacing (the "device under test") is simply plugged into a rack ("IS2 Testsystem") which is linked to the workstation via VME bus (see Fig. 15), and there it is connected to a running COSSAP simulation. Using this feature, the fabricated hardware can be tested in the system design environment, which is of great benefit for system design. A board containing the minimum system configuration of four chips was developed for this task. The system configuration as shown in Fig. 6 includes a block "HiSL" which is the graphical representation of the interface module performing the connection to the VME bus board. The input signals generated during the system simulation (here the received and quantized channel symbols) are sent immediately to the hardware board. For every complete set of received input signals, one clock cycle is steered, and the board inputs [...]

The COSSAP tool simulating this configuration for a signal to noise ratio of $E_b/N_0 = 3.5$ dB is shown in Fig. 16 together with the rack and the board. In Fig. 17, a screendump of the workstation screen is shown. In the upper left part of the screen, the COSSAP configurator window is displayed (cf. Fig. 6). The waveform in the lower left part of the screen is an on line display of the remaining bit errors after decoding for a period of 10000 bits (every peak indicates a bit error). The scatter diagram in the upper right part of the screen displays the received QPSK channel symbols, which would be four points at a 90° angle for noise free transmission.
The chip set was functionally verified by running long "in the loop" simulations.
For a given $E_b/N_0$, the HiSL simulation was stopped as soon as one thousand decoding errors were observed. The performance curve for the realized system as shown in Fig. 8 (curve "Gb/s Decoder") results from these simulations with the decoding system "in the loop" as shown in Fig. 16. The HiSL method is very well suited for functional verification. However, the average data rate achieved by transmitting the board inputs and outputs while running the simulation is only some kbit/s. But the maximum system speed can easily be determined by running the self test as was described in the previous section.
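The stopping rule used for these measurements can be sketched as a simple loop. In the following illustration, `toy_link` is a hypothetical stand-in for the hardware in the loop that merely produces independent bit errors with an assumed probability; the block size is a placeholder as well.

```python
import random


def estimate_ber(errors_in_block, block_bits, max_errors=1000):
    """Process blocks until max_errors decoding errors have been observed."""
    errors = bits = 0
    while errors < max_errors:
        errors += errors_in_block(block_bits)
        bits += block_bits
    return errors / bits, errors, bits


rng = random.Random(1)


def toy_link(n, p=0.01):
    """Hypothetical stand-in for the decoder in the loop: bit errors in n bits."""
    return sum(rng.random() < p for _ in range(n))


ber, errors, bits = estimate_ber(toy_link, block_bits=10_000)
print(f"{errors} errors in {bits} bits, estimated BER = {ber:.4f}")
```

Stopping at a fixed error count rather than a fixed bit count keeps the relative accuracy of the estimate roughly constant over the whole range of $E_b/N_0$; for independent errors it is on the order of $1/\sqrt{1000}$, i.e., about 3%, although decoding errors of a Viterbi decoder typically occur in bursts, so this figure is optimistic.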
XIII. PROJECT DURATION/MANPOWER
The whole system was developed during two years including system simulation model development, architecture design and verification, BIST development, layout, chip prototyping, chip verification and demonstrator development.
The design team consisted of one full time designer (the first author) and nine part time (10 h/week) students, one working on system simulation, four on architecture development, one on BIST implementation, and three on board design and demonstrator development.
The work load for the different tasks is summarized in Table III.
XIV. SUMMARY AND CONCLUSIONS
In this paper, the first implementation of the minimized method (MM) parallel Viterbi decoding algorithm is presented.
The MM was previously shown only to be asymptotically equivalent to the standard Viterbi algorithm. It is proven here for the first time that, by choosing suitable implementation parameters, the same system performance as a standard Viterbi decoder can be achieved. The importance of a close interaction between system design and VLSI implementation is particularly emphasized. This involves an analysis of the impact of the choice of algorithm, algorithmic parameters, quantization, and architecture both on system performance criteria (here the achieved bit error rate) and on VLSI performance criteria (throughput, silicon area). As a result, the original MM algorithm was modified, leading to a large reduction in terms of VLSI implementation complexity. The second half of the paper, dealing with the implementation of a Gb/s Viterbi decoding system, includes a detailed architecture description, a discussion of system partitioning issues, and the BIST implementation. The verification of the fabricated system using "in the loop" simulations with the COSSAP system design tool is also described.
With the minimum configuration of four chips a decoding speed of 1.2 Gb/s is achieved, which is already an order of magnitude faster than the fastest previous implementation. Additionally, the overall decoding speed can be linearly increased by simply adding identical modules to the implementation.
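The linear scaling can be made explicit with a small calculation. This sketch rests on assumptions inferred from the description of the partitioning rather than stated as a formula in the text: each column of the two-row chip array processes 12 trellis transitions per clock cycle, one bit is decoded per transition, and all chips run at the measured 50 MHz.

```python
def decoding_speed_gbps(num_columns, f_clk_mhz=50, transitions_per_column=12):
    """Decoding speed of a two-row chip array with the given number of columns."""
    return num_columns * transitions_per_column * f_clk_mhz / 1000


for columns in (2, 3, 4):
    chips = 2 * columns
    print(f"{chips} chips: {decoding_speed_gbps(columns):.1f} Gb/s")
```

For the minimum configuration of four chips (two columns) this reproduces the 1.2 Gb/s stated above; every additional column of two chips would add 0.6 Gb/s under the same assumptions.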
It should be mentioned that a similar high speed architecture for soft output decoding applications was derived in [2], [3].
APPENDIX
FORWARD/BACKWARD ACS OPERATION
Below it is proven that the same trellis topology and the same assignment of branch metrics to PE's can be used for forward and backward ACS operation. Hence, the same data-path can be used for either operation, without any multiplexers or control.
Consider a convolutional coder as shown in Fig. 1. It is assumed that for the binary representation of a state $S = (s_L, \ldots, s_0)$ the rightmost bit is associated with the rightmost register in the coder, and the other bits, read from right to left, are associated with the other registers from right to left. The state transitions of the encoder can be described in the following way using binary notation for the states $Z_k = (z_{L,k}, \ldots, z_{0,k})$ at time instant $k$ with $L = K - 2$. A forward state transition is described by a right shift of the bits in $Z_k$ and insertion of the input bit at the leftmost position. Any realization of the necessary connections given by the state transitions requires a particular assignment of nodes (PE's) to states. If this assignment is simply interpreted from right to left instead of from left to right, the same topology results, because the right shift becomes a left shift and the new input bit is inserted at the rightmost instead of the leftmost position.
Thus, the PE labeled with $N_k = (z_{0,k}, \ldots, z_{L,k})$ emulates the node $Z_k = (z_{L,k}, \ldots, z_{0,k})$.
A more difficult problem arises from the assignment of branch metrics to PE's. Each branch metric corresponds to a particular sequence of coded bits which is output by the coder for the corresponding transition. Even if the topology of the PE's remains the same for forward and backward computation as was shown before, it can be expected that the assignment of coded bits, and therefore branch metrics, too, will be different for backward computation.
Consider the forward computation of a given backward trellis transition [...]. It is important to consider how this affects the generation of the coded bits, which is described by the generator polynomials (see Fig. 1). In our case, $K = 3$, $L = 1$, hence $Z_k = (z_{1,k}, z_{0,k})$ [...]. Since here $N_k = (z_{0,k}, z_{1,k})$, it follows that $s_0 = z_{L,k}$ and $s_1 = z_{0,k}$. Since $B$ is inserted at the leftmost position, it is taken to be the input bit $I$. Therefore, [...] follows. Comparing the two generator polynomials given in (15) and (16), it becomes clear that, because modulo addition is a commutative operation, the coded bits and therefore the metrics associated with each transition are independent of the direction of operation. This is not true in general for other codes. It is the result of a certain symmetry in the generator polynomials which is not given for all convolutional codes. However, using the derivation given above, it can easily be tested whether forward and backward metric generation are the same by using the corresponding polynomials in the derivation. The important corollary here is that the branch metric input to the PE's does not need to be changed in accordance with the direction of operation, i.e., no control overhead is necessary. Only the assignment of states to PE's has to be interpreted taking into account the reversed binary labeling.
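The test mentioned above can also be carried out by exhaustive enumeration. The following sketch is one possible reading of the derivation, not the authors' procedure: it compares, for every transition, the coded bits of the forward step with those seen by a PE that uses the reversed state labeling and takes the bit dropped by the forward shift as its input. The generator pair (7, 5) in octal is an assumption for the $K = 3$ code, since Fig. 1 is not reproduced here, and all function names are hypothetical.

```python
from itertools import product


def coded_bits(input_bit, state, generators, K):
    """Coded bits for one transition; state = (s_L, ..., s_0)."""
    register = (input_bit,) + tuple(state)   # K bits: input bit, then state bits
    out = []
    for g in generators:
        taps = [(g >> (K - 1 - i)) & 1 for i in range(K)]  # MSB taps the input
        out.append(sum(t & r for t, r in zip(taps, register)) % 2)
    return tuple(out)


def direction_independent(generators, K):
    """True if forward and backward operation yield the same coded bits."""
    for bits in product((0, 1), repeat=K):
        i_bit, z = bits[0], bits[1:]          # input I and state Z_k
        forward = coded_bits(i_bit, z, generators, K)
        z_next = (i_bit,) + z[:-1]            # Z_{k+1}: right shift, I at the left
        n_next = z_next[::-1]                 # reversed labeling N_{k+1}
        b = z[-1]                             # bit dropped by the forward shift
        backward = coded_bits(b, n_next, generators, K)
        if forward != backward:
            return False
    return True


print(direction_independent((0b111, 0b101), 3))  # True
print(direction_independent((0b111, 0b110), 3))  # False
```

In this model the check succeeds exactly when every generator polynomial reads the same from left to right and from right to left, which is the kind of symmetry referred to above.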
