Abstract: A family of multiprocessor architectures implementing the Viterbi algorithm is presented. The family of architectures is shown to be capable of achieving an increase in throughput that is directly proportional to the number of processors when the number of processors is smaller than the constraint length $v$ of the code. The hardware utilization and the depth of the pipelining available inside each processor are also
I. INTRODUCTION
With the growing use of digital communications, there has been an increased interest in advanced coding techniques that yield higher coding gain and permit transmission of larger quantities of data over the same channel. As throughput requirements increase, decoders must be able to operate at higher clock speeds. An alternative to higher clock speeds is to utilize the parallelism inherent in the decoding algorithm by splitting the computations between multiple processors, with each processor performing a part of the ensemble of computations. Much research effort has been dedicated to developing array processors for digital signal processing (DSP) applications, including systolic arrays, wavefront arrays, and data-flow arrays.
Recently there has been increased interest in multiprocessor implementations of decoders for the class of error-correction codes known as convolutional codes, and in particular, in a decoding technique known as the Viterbi algorithm [1]. An introduction to the Viterbi algorithm can be found in [2]. The Viterbi algorithm has found widespread acceptance in deep-space communication networks [3]-[5], magnetic disk memory [6], and adaptive channel equalization [7].
A number of multiprocessor architectures have been proposed for Viterbi decoders, including one- and two-dimensional systolic meshes [8], [9] and the perfect shuffle layout [10]. All of these architectures provide a throughput increase that is sublinear (i.e., to increase the throughput by a constant factor, more than a constant growth in area is required [19]). An architecture that is closely related to the perfect shuffle, known as the crenellated-FFT [11], is currently being implemented at the Jet Propulsion Laboratory. Yet another family of architectures for Viterbi decoding has been introduced recently in [12]-[14], and these architectures are capable of providing a throughput increase that is linear in the number of processors. Unfortunately, the throughput increase is also inversely proportional to the number of states in the trellis [14]. This architecture is of benefit only in the case of short-constraint-length codes, since the number of states in the trellis grows exponentially with the constraint length of the code.
Recent work in [9] has been able to take advantage of the relationship between the Viterbi algorithm and the fast Fourier transform (FFT) [7] by proposing a novel "cascade" Viterbi decoder that is closely related to the "cascade" design for the pipelined FFT computation [15], but requires an additional "recirculation" connection from the output of the last processor to the inputs of the first processor. While the cascade Viterbi decoder architecture has a topology that is regular and requires only local interprocessor communications, making it suitable for VLSI implementation, there is one major drawback: the utilization of the processors is only 50%. This paper proposes an architecture that resolves this shortcoming and provides additional benefits, including a simpler switching structure and increased pipelining availability inside the processor units.
Manuscript received December 3, 1990; revised October 12, 1992. The associate editor coordinating the review of this paper and approving it for publication was Dr. John Eldon. This work was supported by the Natural Sciences and Engineering Research Council of Canada and by the Information Technology Research Center of Ontario.
The authors are with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada M5S 1A4.
IEEE Log Number 9210108.
II. ORGANIZATION OF THIS PAPER
We begin in Section III with a discussion of the mapping and scheduling of the operations in the dependence graph of the Viterbi algorithm (trellis diagram) that are characteristic of all members of the generalized cascade Viterbi decoders (GCVD) family of architectures. In Section IV we analyze the flow of computations in the "cascade" Viterbi decoder of [9] (we will use the name canonic cascade Viterbi decoder (CCVD) for this architecture).
In Section V we introduce an alternative architecture, a folded cascade Viterbi decoder (FCVD) architecture that uses half as many processors, but achieves a throughput equal to that of CCVD. This is followed, in Section VI, by a formal algebraic proof of the existence of the GCVD family of architectures, which leads to an interesting result about the availability of pipelining in the GCVD family of architectures.
In Section VII we derive an expression for the utilization of the processors in a GCVD as a function of the number of processors $k$.
Section VIII contains an analysis of the cross-point switches and interstage registers that are required to ensure correct spatial and temporal alignment of the path metrics at the inputs of each processor. Section IX discusses survivor sequence memory management for GCVD. We also compare our proposed architecture with uniprocessor and fully parallel Viterbi decoders. Finally, Section X summarizes the main results of this paper.
III. THE TRELLIS DIAGRAM: MAPPING AND SCHEDULING OF OPERATIONS
Consider a portion of the trellis diagram for the (2, 1, 4) convolutional code illustrated in Fig. 1. The nodes in the trellis diagram correspond to the add-compare-select (ACS) computations and the edges indicate dependencies between the ACS computations of the successive stages of the trellis. Given a number of individual ACS units we can derive a large number of possible assignments of the nodes of the trellis diagram to the physical ACS units. In this paper we will concentrate on a particular assignment: all computations in a given stage of the trellis are performed on a single ACS unit. We must also select a schedule of computations in each ACS unit that does not violate causality. If more than one such schedule can be found, we then choose one schedule that maximizes the utilization of the hardware.
The solid lines in Fig. 1 give an indication of the scheduling constraints imposed by the causality requirement. For instance, to perform the first ACS computation of the last (rightmost) stage, two ACS computations must be performed in the previous stage, four in the one before that; a total of $N = 16$ computations in the leftmost stage must be performed. Note that Fig. 1 indicates two ACS computations being performed simultaneously in any stage of the trellis, in a manner reminiscent of the FFT butterfly computation. Since two path metrics (say $pm_{2i}$ and $pm_{2i+1}$) are used to compute two path metrics ($pm_i$ and $pm_{i+N/2}$) of the next stage and no other path metrics, it is advantageous to perform ACS operations two at a time.
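The pairing of ACS operations described above can be sketched in a few lines of Python. This is only an illustrative model: the function names `acs_butterfly` and `acs_stage`, the convention that the smaller metric survives, and the layout of the branch metric tuples are assumptions made for the sketch and are not taken from the paper.

```python
def acs_butterfly(pm_even, pm_odd, bm):
    """Two add-compare-select operations that share the same two inputs.

    pm_even, pm_odd: path metrics of states 2i and 2i+1.
    bm: four branch metrics ordered (even->i, odd->i, even->i+N/2, odd->i+N/2);
        this ordering is a convention chosen for the sketch.
    Returns (metric, decision) for state i and for state i + N/2.
    """
    cand_low = (pm_even + bm[0], pm_odd + bm[1])
    cand_high = (pm_even + bm[2], pm_odd + bm[3])
    dec_low = 0 if cand_low[0] <= cand_low[1] else 1
    dec_high = 0 if cand_high[0] <= cand_high[1] else 1
    return (cand_low[dec_low], dec_low), (cand_high[dec_high], dec_high)


def acs_stage(pm, branch_metrics):
    """Update all N path metrics of one trellis stage, one butterfly at a time."""
    n = len(pm)
    half = n // 2
    new_pm = [0] * n
    decisions = [0] * n
    for i in range(half):
        low, high = acs_butterfly(pm[2 * i], pm[2 * i + 1], branch_metrics[i])
        new_pm[i], decisions[i] = low                 # state i
        new_pm[i + half], decisions[i + half] = high  # state i + N/2
    return new_pm, decisions
```

Each call to `acs_butterfly` reads exactly two metrics and writes exactly two, which is why a processor built from two ACS circuits can consume one input pair per clock cycle.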
IV. THE CANONIC CASCADE ARCHITECTURE
The canonic cascade architecture of the Viterbi decoder was originally introduced in [9]. In the CCVD architecture $\log_q N$ processors are arranged in a ring, together with local memory and switches. An illustration of the CCVD decoder for $q = 2$ (binary input alphabet), $v = 4$, $N = q^v = 16$ is given in Fig. 2. With the exception of the feedback path connection and the input branch generation circuitry, the architecture of the CCVD is identical to that used for bitonic sorting [16]. The CCVD has only local (nearest neighbor) communication, so that partitioning the decoder into multiple chips (or boards) is straightforward. The CCVD layout can be easily extended to accommodate any problem size by inserting more processors, memory, and switches in the ring structure.
There are four types of circuits required to implement the state information update section of the CCVD, as illustrated in Fig. 2. In the binary input alphabet case each processor (PE$_j$, $0 \le j \le v-1$) consists of two add-compare-select (ACS) circuits in a butterfly configuration. With the exception of the last processor in the ring, PE$_{v-1}$, each processor PE$_j$ is followed by two shift registers of length $2^j$ and a cross-point switch, SW$_j$. The last processor, PE$_{v-1}$, is followed by a special switch SW$_{v-1}$, which can be thought of as a special dual-port memory with a controller dedicated to reordering the state information flow from PE$_{v-1}$ to PE$_0$. The efficient implementation of this reordering dual-port memory is discussed in Section VIII.
In addition, branch generation circuitry consists of two parts: a received symbol FIFO buffer (FIFO in Fig. 2) and a set of branch metric generators (BMG$_j$), one for each processor (PE$_j$). Every trellis-stage-time the FIFO buffer receives an $n$-tuple of quantized channel output symbols, $r_i$. The output of the FIFO is routed in a rotating fashion to one of the branch metric generators by a synchronous sampler at the time corresponding to the beginning of a new cycle of state transition evaluations in the corresponding processor. Upon receiving a channel output symbol, the branch metric generator will proceed to compute branch metrics for every state transition present in the state transition diagram (trellis). Since shift register sizes between processors are not identical, branch metric generators will fetch symbols from the FIFO queue at intervals that are not evenly distributed in time; thus a FIFO queue (i.e., an elastic buffer), and not just a shift register, is required.
[Figure: An example of timing in the path metric update circuitry of the canonic cascade Viterbi decoder with binary alphabet and constraint length 4. The latency of each processor is assumed equal to one clock cycle.]
Finally, survivor sequence storage and retrieval circuitry is required to produce the decoded sequence. Two methods are available: the register exchange method and the traceback method. The register exchange method is conceptually simpler, while the traceback method consumes less area and wiring bandwidth. Either one of the two survivor sequence management schemes can be used in a canonic cascade Viterbi decoder. Survivor sequence management techniques are further discussed in [17], [19].
A minor modification of the CCVD is possible, where shift register sizes between processors are placed in order of decreasing length; this is analogous to switching between decimation-in-time and decimation-in-frequency algorithms for the fast Fourier transform. The scheme depends only on a definition of the most significant bit in the convolutional encoder's shift register. Throughout this paper, an assumption will be made that the data in the convolutional encoder is shifted from the most significant bit toward the least significant bit, but the alternative definition would work just as well.
Each successive processor begins computations before the previous processor has completed its computations, so that states associated with the next stage of the trellis can be processed even before all states associated with the previous stage have been computed. Note that computations associated with different stages of the trellis are staged unevenly: whereas PE$_1$ starts computing two clock cycles after PE$_0$, PE$_2$ has to be delayed by three more clock cycles, and PE$_3$ by five more. This explains the necessity of using FIFO buffering in the sampled data input section.
The switching control algorithm for SW$_0$ through SW$_{v-2}$ is straightforward: each switch, SW$_j$, alternates between straight-through and criss-cross configurations with a period of $2^j$, while switch SW$_{v-1}$ is significantly more complex, and can best be described as a dual-port memory with a controller. Writing and reading is done according to a predetermined algorithm to perform reordering of the state information. This allows input to PE$_0$ to arrive in proper order [9]. The special SW$_{v-1}$ will be analyzed in greater detail in Section VIII.
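The uneven staging of the processors follows directly from the shift register lengths. The sketch below assumes that the offset between the start of PE$_j$ and the start of PE$_{j+1}$ equals the register length $2^j$ plus the processor latency; the function name `ccvd_start_times` and the `latency` parameter are hypothetical names introduced only for this illustration.

```python
def ccvd_start_times(v, latency=1):
    """Clock cycle at which each of the v processors starts its first butterfly.

    Assumes PE_{j+1} starts 2**j + latency cycles after PE_j, i.e. the
    shift register length following PE_j plus the latency of PE_j.
    """
    starts = [0]
    for j in range(v - 1):
        starts.append(starts[-1] + (1 << j) + latency)
    return starts


# Constraint length 4, one-cycle latency: offsets of 2, 3 and 5 cycles,
# as quoted in the text for PE_1, PE_2 and PE_3.
print(ccvd_start_times(4))
```

Because the offsets between consecutive processors differ, the branch metric generators request received symbols at irregular intervals, which is the reason given in the text for using an elastic FIFO rather than a plain shift register.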
When operating in the manner described, the cascade decoder decodes $v$ bits every $2^v$ clock cycles. It is apparent from Fig. 3 that each of the $v$ processors in the system will be idle for one half of the time, leading to a speedup of $v/2$ compared to a uniprocessor Viterbi decoder. It may be possible to increase the utilization of the CCVD by using it to decode two interleaved streams of data. One can also ascertain that each storage location within the CCVD is only utilized one half of the time, and, since the total amount of memory storage required is equal to the number of states, $2^v$, times the number of bits required to store one path metric, $w_{pm}$, the total size of the path metric memory inside the CCVD is $2 \times 2^v \times w_{pm}$.¹
An extension of the CCVD to arbitrary, nonbinary ($q$-ary) input symbol alphabets is straightforward and closely related to the radix-$q$ extension of the fast Fourier transform [15]. Utilization of the processors and path metric storage locations in the case of a $q$-ary alphabet will be $1/q$.
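The figures quoted above can be checked with a few lines of exact arithmetic. The sketch assumes that the reference uniprocessor is a single butterfly processor that needs $2^{v-1}$ clock cycles per decoded bit; that baseline, the function name `ccvd_figures`, and the dictionary keys are assumptions made for the illustration.

```python
from fractions import Fraction


def ccvd_figures(v, w_pm):
    """Throughput, utilization and path metric memory of a binary CCVD."""
    butterflies_per_stage = 2 ** (v - 1)
    period = 2 ** v                                 # clock cycles per v decoded bits
    ccvd_rate = Fraction(v, period)                 # bits per clock cycle
    uni_rate = Fraction(1, butterflies_per_stage)   # assumed uniprocessor baseline
    return {
        "speedup": ccvd_rate / uni_rate,
        "utilization": Fraction(butterflies_per_stage, period),
        "memory_bits": 2 * (2 ** v) * w_pm,
    }


print(ccvd_figures(4, 8))
```

For $v = 4$ this gives a speedup of 2, that is $v/2$, with every processor busy half of the time.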
V. THE FOLDED CASCADE ARCHITECTURE
In the previous section we have pointed out that the CCVD implementation can be efficiently laid out by exploiting local wiring and replication of the processors, switches, and shift registers. However, utilization of the storage elements and the path metric computation circuitry is low ($1/2$ for the binary alphabet CCVD and $1/q$ for the $q$-ary alphabets). Let us consider the following question: is it possible to modify the CCVD so that each processor is utilized 100% of the time? One potential solution is to employ half as many processors, each one fully utilized. This is the basis for the folded cascade Viterbi decoder (FCVD). Rigorous proof that causality is nowhere violated, and that in every stage every state will be computed before it is used in the next stage, will be supplied in the next section. Fig. 4 demonstrates how computations for the 16-state Viterbi decoder can be performed on two processors instead of four, as illustrated in Fig. 3. The new design utilizes all processors and memory locations 100% of the time, while keeping all the advantages of the CCVD (regular structure, modularity, local communications).
Having discovered the CCVD and FCVD architectures (with $v$ and $v/2$ processors respectively), it is natural to ask whether other similar architectures exist with a number of processors that is not equal to $v$ or $v/2$. In the next section we will demonstrate that such architectures do indeed exist and derive their general properties, including the hardware utilization and speedup available.
VI. FORMAL DERIVATION OF THE GENERALIZED CASCADE ARCHITECTURE FOR VITERBI DECODERS

A. Interstage Delays in Cascade Viterbi Decoders
Before performing the formal derivation of the GCVD, it is helpful to examine the interstage delays in a cascade Viterbi decoder. We have already referred to the not-inplace nature of the computations. We will now elaborate further on this property to derive the general expression for the interstage delays. This expression will be exploited later in this section.
¹Here we ignore the fact that some memory locations inside the special switch SW$_{v-1}$ are empty for more than half of the time. There exists an alternative implementation of the special switch that, while simpler conceptually (dual-port memory with a special sequencer), requires a much more complex illustration. It is sufficient to note that an equivalent design exists, and that in the dual-port memory design all memory locations will be used $1/2$ of the time.
[Figure caption fragment: ..., where $k$ is the stage number; the trellis is periodic with a period $v - 1$.]
numbers, as shown in Fig. 5 . The first evaluation utilizes path metrics for states {0000} and {OOOl} to produce path metrics for states { 0000} and { lOOO} . Note that while we are manipulating the path metrics, their ordering depends on the state number and is independent of the actual value of the path metric. Ordering of the path metrics after zero, one, two, and three stages of the decoding is illustrated in Fig. 5 . Suppose we wish to begin performing computations of stage one as soon as possible after the computed path metrics begin to arrive from the zeroth stage. It is apparent that the processor for stage one cannot begin computations, immediately after the first two metrics for states {0000} and { lOOO} have arrived, since the path metrics for the states {OOOl} and { lOOl} have not yet been computed; it is necessary to wait for one more clock cycle until these path metrics become available. This wait time increases to t y o clock cycles for the second stage and four clock cycles for the third stage (plus the latency of the ACS).
In the zeroth stage the path metrics of the states are used in the natural order. The path metrics leave the zeroth stage in the order derived from their original order by circularly shifting the state number by one bit to the right. This is a direct consequence of the way the convolutional encoder operates. Recall that the contents of the shift register in the convolutional encoder are shifted to the right every clock cycle, with the LSB (a 1 or a 0) shifted out and a new MSB shifted in. Since all stages of the Viterbi decoder are identical, we can derive the ordering of the path metrics generated for any number of stages by performing the appropriate number of circular rotations on the state number. Consider the general case of the Viterbi decoder with a binary alphabet ($q = 2$) and $N = 2^v$ states. We can give the following definition of the schedule for the cascade Viterbi decoder:
Definition 2: A schedule for any Viterbi decoder is defined by the time when the inputs (state path metrics) for a particular butterfly computation are required.
For a generalized cascade Viterbi decoder we choose a particular schedule such that the local time at which the inputs of a particular butterfly are required is a circular rotation to the left of the bits representing the butterfly number. The number of bit positions by which the bit representation of a butterfly number must be rotated is equal to $j' = j \bmod (v-1)$.
In particular, computation of the ACS butterfly $\{a_{v-1} a_{v-2} \cdots a_2 a_1\}$ in the $j$th stage will begin at local time $lt = \{a_{v-1-j'} \cdots a_2 a_1 a_{v-1} \cdots a_{v-j'}\}$.
The schedule that defines the cascade Viterbi decoder is just one of many equivalent (in terms of performance and resultant implementation) possible schedules.
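A small Python sketch makes the rotation schedule concrete. It assumes that butterfly number $b$ (a $(v-1)$-bit integer) consumes the path metrics of states $2b$ and $2b+1$, in line with the pairing introduced in Section III; the helper names `rotl`, `local_time` and `stage_schedule` are introduced here for illustration only.

```python
def rotl(x, r, width):
    """Rotate the width-bit integer x to the left by r bit positions."""
    r %= width
    mask = (1 << width) - 1
    return ((x << r) | (x >> (width - r))) & mask


def local_time(butterfly, j, v):
    """Local time at which the inputs of a butterfly are required in stage j."""
    width = v - 1                      # a butterfly number has v-1 bits
    return rotl(butterfly, j % width, width)


def stage_schedule(j, v):
    """Input state pairs of stage j, listed in order of increasing local time."""
    half = 1 << (v - 1)                # N/2 butterflies per stage
    order = sorted(range(half), key=lambda b: local_time(b, j, v))
    return [(2 * b, 2 * b + 1) for b in order]


print(stage_schedule(1, 4))
```

For $v = 4$, stage one processes the pairs (0, 1), (8, 9), (2, 3), (10, 11), and so on, which agrees with the observation that stage one must wait for states {0001} and {1001} before it can proceed.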
We have briefly mentioned the fact that computations in stage $j + 1$ cannot start immediately after the first pair of state metrics have become available from stage $j$. Definitions 1 through 3 allow us to derive the minimum necessary delay between stages $j$ and $j + 1$ as a function of the stage number $j$.
Definition 4: The interstage delay ($IS_j$) is the minimum number of clock cycles between the time the first pair of path metrics is available from the $j$th stage and the time the first computation in the $(j+1)$st stage is able to begin.
Theorem 1: For the GCVD with $k$ processors, the interstage delay, $IS_j$, $0 \le j \le k-1$, is equal to $2^{j \bmod (v-1)}$ and guarantees that the computations in the $(j+1)$st stage can proceed in a pipelined fashion, with a new set of inputs available every clock cycle until all computations of the $(j+1)$st stage are completed.
For proof of this theorem refer to [17].
From our discussion of the causality condition in Section III and Theorem 1 it follows that the GCVD schedule is the optimum (or perhaps one of the many equivalent optimum schedules), since computations in every stage begin as soon as permitted by the causality condition and computations of every stage of the trellis proceed without interruptions until all ACS operations of a given stage of the trellis have been performed.
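Theorem 1 can be checked by brute force for small constraint lengths. The sketch below is not the proof of [17]; it assumes a simplified model in which butterfly $b$ of any stage consumes states $2b$ and $2b+1$ and produces states $b$ and $b + N/2$, the local time of a butterfly in stage $j$ is its number rotated left by $j \bmod (v-1)$ bits, and processor latency is ignored. The function names are hypothetical.

```python
def rotl(x, r, width):
    """Rotate the width-bit integer x to the left by r bit positions."""
    r %= width
    mask = (1 << width) - 1
    return ((x << r) | (x >> (width - r))) & mask


def min_interstage_delay(v, j):
    """Smallest delay that lets stage j+1 run without stalls (model above)."""
    width = v - 1                          # bits in a butterfly number
    half = 1 << width                      # N/2 butterflies per stage
    worst = 0
    for b_next in range(half):             # every butterfly of stage j+1
        consume = rotl(b_next, (j + 1) % width, width)
        for c in (0, 1):                   # its two input states
            state = 2 * b_next + c
            producer = state % half        # stage-j butterfly that outputs it
            produce = rotl(producer, j % width, width)
            worst = max(worst, produce - consume)
    return worst


for j in range(6):
    print(j, min_interstage_delay(4, j))
```

Under these assumptions the minimum delay found by exhaustive search equals $2^{j \bmod (v-1)}$ for every stage tried, matching the statement of the theorem.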
B. Timing of the Generalized Cascade Architecture
In this section we proceed to investigate the ordering of operations in the complete ACS datapath, and consequently move from consideration of local times ($lt$) to a global perspective. To differentiate from local time of the previous subsection, we use the designation global time ($gt$) to refer to the overall time index of the system. For instance, the global time ($gt$) is entirely equivalent to the time index in Figs. 3 and 4. We can now proceed to evaluate the global time ($gt$) at which a given state path metric will be available from stage $j$, and to show the delay required between PE$_{k-1}$ and PE$_0$ so that the path metrics can be recirculated. The global time will consist of three parts: the combined interstage delay of all the stages from 0 to $j-1$; the combined latencies of the stages 0 through $j-1$; and the local time at which the path metric of a given state is evaluated in the $j$th stage.
The sum of the interstage delays is
The sum of the latencies for stages 0 through $j-1$ is
Finally, the local time of the $j$th stage at which a given state path metric will become available is
Thus the global time at which path metrics of the states $\{0 a_{v-1} a_{v-2} \cdots a_2 a_1\}$ and $\{1 a_{v-1} a_{v-2} \cdots a_2 a_1\}$ will be available from stage $j-1$ is
(1)
These path metrics will now be forwarded to PE$_0$ for the purpose of computing the path metrics of stage $j$. PE$_0$ is busy with the path metrics of stage 0 up to $gt = 2^{v-1}$. It can then begin to accept the inputs for stage v in the same order as that used for stage 0 (i.e., in the natural increasing order of state numbers). Let $gt_2$ be the time at which PE$_0$ can accept path metrics of state $\{\psi a_{v-1} a_{v-2}$ . . .
To maintain causality we require that the information be produced before it is consumed, therefore $gt_2 \ge gt_1$ for any state. The worst case state can be found by using the following rule: if a bit $a_i$ occupies a higher bit position in the $gt_1$ expression than in the $gt_2$ expression, then set it to a 1, otherwise set it to a 0. Finally set $\psi$ to a 0, since this choice of $\psi$ makes meeting the criterion $gt_2 \ge gt_1$ more difficult (i.e., it minimizes the expression on the left-hand side of the inequality; if the inequality is true when $\psi = 0$ then it is also guaranteed to be true when $\psi = 1$, but the opposite is not always true).
We will compute the criteria for inequality $gt_2 \ge gt_1$ to be true by first considering the case of $1 \le k < v$, followed by the more general case of $k$ any integer greater than or equal to one.
Let us substitute $j$ . . . gives the following value of the global time at which the path metrics of the state $\{\psi a_{v-1} a_{v-2} \cdots a_2 a_1\}$ (where $\psi$ can be a 0 or a 1) will become available from processor . . .
The global time at which the path metric of the state $\{\psi a_{v-1} a_{v-2} \cdots a_2 a_1\}$ will be required at the inputs of processor 0 will be the same as in (2), or
The worst case state will again be where $\psi = 0$; set a bit $a_i$ to a 1 if it occupies a higher bit position in the $gt_1$ expression than in the $gt_2$ expression, otherwise set bit $a_i$ to a 0. From comparison of (3) with (4) it is apparent that bits $a_{v-k}, a_{v-k-1}, \ldots, a_2, a_1$ must be set to a 1 and the rest of the bits to zero.
Substituting these values into inequality $gt_2 \ge gt_1$ gives
The left-hand side of this inequality will become
The right-hand side of this inequality will become 
We may conclude now that for a constraint length $v$ cascade Viterbi decoder an implementation with $k$ processors ($1 \le k \le v-1$) exists providing $\lfloor 2^{v-k-1}/k \rfloor \ge 1$, and this implementation utilizes every one of the processors 100% of the time. In addition, the ACS circuitry inside every processor can be pipelined with up to $\lfloor 2^{v-k-1}/k \rfloor$ pipeline stages available in the data path. This is extremely important, since pipelining allows us to increase the clock rate (and throughput) of the Viterbi decoder. This increase in throughput due to pipelining is multiplied by the throughput increase that is made possible by utilizing $k$ processors running in parallel, with each processor utilized 100% of the time.
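The bound on the pipelining depth can be cross-checked numerically. The sketch below rebuilds the timing argument under explicitly assumed conventions, since the intermediate equations are not reproduced here: the interstage delays of stages 0 through $k-2$ sum to $2^{k-1}-1$, each of the $k$ stages adds a latency of $d$ cycles, the local time in stage $k-1$ is the butterfly number rotated left by $k-1$ bits, and PE$_0$ accepts state $s$ of the recirculated stage at time $2^{v-1} + \lfloor s/2 \rfloor$. The function names are hypothetical.

```python
def rotl(x, r, width):
    """Rotate the width-bit integer x to the left by r bit positions."""
    r %= width
    mask = (1 << width) - 1
    return ((x << r) | (x >> (width - r))) & mask


def worst_case_slack(v, k, d):
    """Minimum over all states of gt2 - gt1 for 1 <= k <= v-1 (model above)."""
    width = v - 1
    half = 1 << width
    delays = sum(1 << i for i in range(k - 1))   # IS_0 + ... + IS_{k-2}
    slack = None
    for b in range(half):                        # butterfly in stage k-1
        gt1 = delays + k * d + rotl(b, k - 1, width)
        for psi in (0, 1):                       # the two states it produces
            state = (psi << width) | b
            gt2 = half + (state >> 1)            # natural order at PE_0
            if slack is None or gt2 - gt1 < slack:
                slack = gt2 - gt1
    return slack


def max_pipeline_depth(v, k):
    """Largest d for which causality holds with no added delay."""
    return (1 << (v - k - 1)) // k


print(worst_case_slack(4, 2, 1), max_pipeline_depth(4, 2))
```

In this model the worst-case slack equals $2^{v-k-1} - kd$, so it is non-negative exactly when $d \le \lfloor 2^{v-k-1}/k \rfloor$; for $v = 4$ and $k = 2$ the slack is zero at $d = 1$, consistent with a fully utilized two-processor decoder.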
If the number of processors ($k$) selected leads to Inequality 8 being violated, then an extra delay, $\Delta$, must be added between the beginning of the last computation as- . . . Equation (4) for $gt_2$ remains unchanged.
Once again, causality requires that $gt_2 \ge gt_1$. The worst case state will again be one where $\psi = 0$, with bit $a_i$ set to a 1 if it occupies a higher bit position in the $gt_1$ expression than in the $gt_2$ expression; otherwise bit $a_i$ is set to a 0. From a comparison of (10) and (4) it is apparent that bits $a_{v-k'}, a_{v-k'-1}, \ldots, a_2, a_1$ must be set to a 1 and the rest of the bits to zero.
We can now restate Inequality 5 using the results of (6) . . . Substituting these values into the inequality $gt_2 \ge gt_1$ . . .
The right-hand side of this inequality will become
We can now restate Inequality 11 using the results of (12) and (13) and solve for $d$ . . . big $\Delta$ must be made to decode any number of independent streams of data in an interleaved fashion. To decode $n$ interleaved streams it is necessary to set $\Delta \ge (n-1) \times T_u$, with $\Delta = (n-1) \times T_u$ being a condition for being able to decode $n$ data streams with 100% utilization. Of course some values of $n$ may be impossible to achieve, since the condition must be satisfied at all times to ensure causality is not violated.
Note that the first term in the expression for $\Delta$ dominates; thus increasing the value of $k$ by 1 changes the value of $\Delta$ by a small amount only, unless increasing $k$ by 1 will cause an increase in $\lfloor k/(v-1) \rfloor$, which increases the value of $\Delta$ by a large amount. The most obvious example of this occurs when $k$ is increased from $v-1$ to $v$. Consider Fig. 3, where a large delay is required if path metrics are forwarded to PE$_0$ from PE$_3$; yet if we were to remove PE$_3$, and the path metrics were forwarded to PE$_0$ from PE$_2$, the net result would be a much smaller number of clock cycles during which a given processor is idle (shorter delay), resulting in a better processor utilization.
We can now use the expression for A to derive a general equation for utilization possible when decoding a single stream of data for all values of k:
It is apparent that no positive value of $d$ will satisfy Inequality 14 when $k \ge v$, therefore extra delay must be added between the completion of the zeroth stage evaluation in the zeroth processor and the beginning of the evaluation of stage $k$ in the zeroth processor. Once again, it is possible to increase the delay, simultaneously increase the number of pipeline stages available, and process multiple streams of data in parallel, thus bringing utilization back to 100%. It is possible to determine how . . . Equation (15) is true for all values of $k > 0$, and for $k \le v-1$ it matches (9), as expected.
An extension of the results of this section for a Viterbi decoder with a nonbinary input alphabet is presented in [17].
VII. SPEEDUP AND PROCESSOR UTILIZATION IN GCVD
In Fig. 6 the maximum speedup (product of the number of processors $k$ and the utilization $U(k, d)$) available is plotted against the number of processors $k$ for two particular cases: $v = 14$ with pipelining depth $d = 17$, and $v = 7$ with pipelining depth $d = 4$. It is apparent that when $v = 14$, even for the large value of $d = 17$, the speedup is nearly linear with the number of processors $k$ up to and including $k = v-1$ ($k = 13$ in this particular case), where the speedup has its maximum value of 12.7. Note that in addition to the primary linear speedup region, which runs from $k = 1$ to $k = v-1$, there are also secondary, tertiary, and so forth linear speedup regions. As is apparent from (15), these linear regions will have a slope that is . . . $-1)\rfloor$) can be decoded with utilization that is very close to 100%. As the constraint length $v$ increases, the speedup curve will preserve its general shape; deviation of the speedup curve from the straight lines indicating primary, secondary, and so forth linear speedup regions will actually decrease.
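The primary linear region can be reproduced with a short calculation. The utilization expression used below is a reconstruction under stated assumptions, not a transcription of (9) or (15): a processor is taken to be busy for $2^{v-1}$ cycles per stage and, when $kd$ exceeds $2^{v-k-1}$, to idle for the difference. The sketch covers only $1 \le k \le v-1$, and the function names are hypothetical.

```python
def utilization(v, k, d):
    """Assumed processor utilization of a k-processor decoder, 1 <= k <= v-1."""
    if not 1 <= k <= v - 1:
        raise ValueError("this sketch only covers 1 <= k <= v-1")
    busy = 1 << (v - 1)                               # butterflies per trellis stage
    extra = max(0, k * d - (1 << (v - k - 1)))        # assumed idle cycles per stage
    return busy / (busy + extra)


def speedup(v, k, d):
    """Speedup over a uniprocessor: number of processors times utilization."""
    return k * utilization(v, k, d)


for k in (1, 4, 8, 13):
    print(k, round(speedup(14, k, 17), 2))
```

With $v = 14$, $d = 17$ and $k = 13$ this model gives a speedup of about 12.66, which rounds to the maximum of 12.7 quoted above.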
VIII. INTERSTAGE SHIFT REGISTERS AND CROSS-POINT SWITCHES
Recall from our discussion in Section VI that a $k$-processor GCVD contains $k-1$ cross-point switches and that two shift registers are associated with every cross-point switch; the length of each shift register following the $j$th processor is $2^j$, $0 \le j \le k-2$. The design of these modules is straightforward. One circuit deserves special attention: a special switch, SW$_{k-1}$, that is used to reorder the path metrics before recirculating them back to the inputs of the zeroth stage. We have stated in Section IV that conceptually we can consider this reordering to be performed in a dual-port memory with a special sequencer that ensures proper retrieval of path metrics.
Designing a dual-port memory with an appropriate sequencer is possible, but not trivial. Yet we already use a number of circuits that perform reordering of path metrics based on our knowledge of scheduling in a given stage of a GCVD. These circuits are the other cross-point switches with their attendant shift registers. Admittedly their job is somewhat simpler, since they do not perform reordering like that required at the inputs to the zeroth stage. Yet it seems plausible that one or more cross-point switches may be able to accomplish the reordering required of the dual-port memory. In this section we will demonstrate that a combination of one or more cross-point switches with shift registers can always implement the necessary reordering of the path metrics. For any combination of the constraint length $v$ and the number of processors $k$ we will compute the number of cross-point switches required and the size of the shift registers associated with each cross-point switch.
Finally, we will demonstrate that, through elimination of the equal sections of the shift registers on the upper and lower paths between the switches, it is possible to eliminate the delay on the critical path. Alternatively, we can keep at least one shift register between any two switches; the reduction in the processor utilization in this case is equal to the reduction that would be caused by increasing the pipelining depth of each processor by a few (typically one) stages. A GCVD with k = v -1 processors requires only a single cross-point switch to implement the reordering of the path metrics.
Full analysis of the construction of the recirculation network is quite lengthy and can be found in [17] . We will instead give a summary of the method of construction.
Three distinct cases may be encountered in designing a concatenation of switches:
1) Number of processors $k = v-1$. A single cross-point switch with registers of length 2"' is required.
2) Number of processors, $k$, and $v-1$ are relatively prime. In this case, exactly $v-1$ switches are required. The length of the shift registers associated with each switch can be computed from the following formula: each register associated with the $i$th switch is of length 2'"-* -' ')'.
3) Number of processors, $k$, and $v-1$ are not relatively prime, with $\gcd(k, v-1) = m$. In this case $v-1+m-1$ switches are required. The method for computing lengths of shift registers associated with each switch is discussed in [17].
Consider three concatenated switches with shift registers illustrated in Fig. 7. It appears that this combination of switches will delay the path metric of any state. Yet in our analysis of GCVD timing in Section VI we have assumed that the path metric on a critical path (the one with the worst, possibly negative, difference between the local time of its generation in the $j$th stage and the local time of its consumption in the zeroth stage after recirculation) is sent to the inputs of the zeroth stage with no delay. Fortunately there is a simple modification to our multistage switch network that allows for a zero-delay passing of the state path metric on a critical path.
Consider a pair of shift registers located between the first and second cross-point switches in Fig. 7. In general, the lengths of the two registers will not be equal. Suppose, without loss of generality, that the length of the upper shift register is $2^l$, the length of the lower shift register is $2^m$, and $m < l$. We may shorten each shift register by $2^m$, leaving a shift register of length $(2^l - 2^m)$ in the upper path and a shift register of length $(2^m - 2^m) = 0$ (or simply a wire) in the lower path. Since removal of equal delays does not affect relative timing of the state path metrics taking the upper and the lower paths, the overall effect of reordering remains in place. At the same time, a delay-free path becomes available for the state path metric on a critical path, as shown in Fig. 7. In a real circuit, the nonideal switches may cause a problem when the path metric on a critical path is routed by multiple cross-point switches without being latched in between. This problem can be avoided by removing fewer delay registers between any two switches.
This will have exactly the same effect on the overall timing as would an increase in a pipelining depth of the processors. One delay register between every pair of switches on the route taken by the path metric on a critical path will be sufficient to restore signal levels following each switch, yet because the total delay will be equal to the number of switches O ( v ) , the result will be equivalent to that of increasing the pipelining depth of each processor by at most v (in the case of the uniprocessor GCVD). No increase in the apparent pipelining depth of a GCVD with v -1 stages will occur, because only a single switch is required to perform the reordering of the state path metrics in a GCVD with v -1 stages. Simplicity of recirculation network design is an added bonus to the previously discussed advantages of a (v -1)-stage GCVD.
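The switch counts in cases 2) and 3) and the delay-removal argument above can be illustrated with a minimal sketch. The function names `switch_count` and `trim_common_delay` and the parameter `keep` are hypothetical and introduced only for illustration. The sketch assumes 1 <= k <= v - 1 and does not model the register-length formulas, which are given in [17].

```python
from math import gcd
from typing import Tuple


def switch_count(k: int, v: int) -> int:
    """Cross-point switches for k processors and constraint length v.

    Assumption of this sketch: only 1 <= k <= v - 1 is handled.
    """
    if v < 2 or not 1 <= k <= v - 1:
        raise ValueError("sketch covers only 1 <= k <= v - 1")
    if k == v - 1:
        return 1  # a (v - 1)-stage GCVD needs a single switch
    m = gcd(k, v - 1)
    # m == 1: v - 1 switches (case 2); m > 1: v - 1 + m - 1 switches (case 3).
    return (v - 1) + (m - 1)


def trim_common_delay(upper: int, lower: int, keep: int = 0) -> Tuple[int, int]:
    """Shorten two parallel shift registers by the same amount.

    Removing an equal delay from both paths preserves their difference,
    so the relative timing of the two path metrics is unchanged.
    With keep=0 the shorter register becomes a wire (length 0);
    keep=1 leaves one latch in the shorter path to restore signal levels.
    """
    if upper < 0 or lower < 0 or keep < 0:
        raise ValueError("lengths and keep must be non-negative")
    removed = max(min(upper, lower) - keep, 0)
    return upper - removed, lower - removed
```

For example, with v = 5 the sketch gives four switches for k = 3 and five for k = 2, and registers of lengths 8 and 2 are trimmed to 6 and 0, or to 7 and 1 when one latch is kept.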
IX. DISCUSSION
In our discussion of the GCVD architecture we have ignored the design of the survivor sequence memory management. The best possible design of the sequence memory management for the GCVD is the one-pointer scheme [18], [19] with distributed memory: the total decision memory is split into v - 1 memory banks, one associated with each ACS unit. The distributed memory structure allows decision writing to proceed over local wires, while the slower traceback/decode read operation travels around the ring of ACS units (in the direction opposite to that of the path metric). An added benefit of the distributed memory is that smaller memory modules can be operated at higher speed.
Finally, it is instructive to compare the GCVD architecture with other architectures that have been used for Viterbi decoding. One of the authors has shown [17] that the GCVD architecture is capable of achieving a throughput increase (vis-à-vis a uniprocessor) that grows linearly with both the number of processors and the VLSI area required. This compares favorably with other architectures [9] (square mesh, shuffle-exchange), for which throughput grows as the square root of the VLSI area required.
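As a rough illustration of what the two scaling laws imply, let S denote the throughput increase and A the VLSI area. These symbols and the proportionality constants c_1 and c_2 are introduced here for illustration only and are not taken from [17] or [9].

```latex
\begin{align*}
S_{\mathrm{GCVD}} &= c_1 A
  &&\Rightarrow\quad A = \frac{S_{\mathrm{GCVD}}}{c_1}, \\
S_{\mathrm{mesh}} &= c_2 \sqrt{A}
  &&\Rightarrow\quad A = \left( \frac{S_{\mathrm{mesh}}}{c_2} \right)^{2}.
\end{align*}
```

Under these assumed forms, doubling the throughput increase requires twice the area for the GCVD but four times the area for an architecture whose throughput grows as the square root of the area.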
X. SUMMARY
In this paper we have demonstrated the existence of a family of generalized cascade Viterbi decoder architectures that can be implemented as a ring structure with unidirectional local communications for a binary alphabet. An extension to the case of any q-ary input alphabet is also possible [17]. The proposed family of architectures is well suited for VLSI implementation.
We have demonstrated that the GCVD architecture is capable of efficiently utilizing from one to v - 1 processors in decoding a single stream of data. We have shown that for generalized cascade Viterbi decoders with a small number of stages, it is possible to achieve full utilization of all processors and simultaneously pipeline the path metric update circuitry of the processors. This makes the GCVD architecture attractive for implementing the Viterbi algorithm where a large constraint length, v, and a high throughput rate are required. For larger numbers of processors it is impossible to achieve full utilization in decoding a single stream of data, but full utilization may still be achieved in decoding multiple interleaved streams of data. A GCVD with k = v - 1 processors is especially attractive because it achieves the highest speedup possible in decoding a single stream of data, with nearly 100% utilization. Furthermore, a recirculation network for a GCVD with k = v - 1 stages is guaranteed to be simpler than a recirculation network for a GCVD with any number of stages that is not an integer multiple of v - 1. Similarly, all GCVDs with k equal to an integer multiple of v - 1 will provide the largest speedup in decoding multiple interleaved streams of data.
