In this paper the problem of obtaining efficient hardware for Viterbi decoders for high rate convolutional encoders is addressed. It is first shown that the graphs describing the interconnection of the add-compare-select units required may be classified in terms of structures of which there are only a small number for a given code constraint length. They correspond to the assignment of individual register lengths in the ensemble of shift registers in the feedforward encoder. The structures relate to the partitioning of the states such that common successors are grouped together and successive partitioning leads to a hierarchical, modular VLSI layout method. Example symbolic grid layouts are given for 16 and 64 state codes. It is noted that within a given structure, the parity check matrices map into local wiring patterns implying a method for implementing class-universal programmable or adaptive decoders.
Introduction
The importance of application specific VLSI design in realizing the performance requirements of present and future high speed digital communication systems is readily apparent. In the case of error control, the utility of the Viterbi algorithm for convolutional codes is well known. In the VLSI design of Viterbi decoders, the major issues are handling the interconnection of the required add-compare-select processors and achieving high throughput in the face of recursive computation. To address the first problem, systematic, hierarchical methods for VLSI placement and memory management are required. This paper concentrates on the former, extending known results for rate 1/n codes to the general b/n case, with rate 2/n considered as a working example. Higher rate codes are prevalent in the many communication systems that employ coded modulation [1].
Definitions: Viterbi Decoder Design
It is not intended here to review convolutional codes and Viterbi decoding in detail, but just to recall those concepts relevant for this paper. In a digital communication system employing convolutional codes an encoder E receives a stream of b bit input symbols and produces a stream of n bit output symbols. These are applied to a system consisting of a modulator, channel and receiver which generates a set of 2^n branch metrics that measure the closeness of the received symbols to each possible encoder output symbol. The Viterbi Decoder V performs a maximum likelihood reconstruction of, typically, the encoder input symbols. Further system level details are not of central importance here; we are concerned with the problem of the design of V given E, and the essence of this is already well known. E is a (Mealy) finite state machine of N states with state transitions labeled by input and output symbol pairs, and V is an interconnection of 2^b-input add-compare-select (ACS) units together with some memory. There is one ACS operation per state per system cycle, which serves to update the accumulated path metric (PM) and survivor sequence (SS) information. This information is transmitted between ACS units, which are interconnected with a pattern matching the state graph of E. E may be realized in either feedforward or feedback form. Since feedback encoders have equivalent feedforward encoders [2], feedforward encoders will be considered for the most part here. However, some comments will be made in section 5 of this paper about the systematic feedback encoder used in Trellis Coded Modulation [3].
The feedforward encoder in the rate 2/n, constraint length ν (N = 2^ν) case consists of two shift registers, called A and B here, of lengths ν_A and ν_B such that ν = ν_A + ν_B, and a parity check matrix block H that computes n modulo-2 sums of the inputs i_A (fed to A) and i_B (fed to B) and the contents of A and B. This is depicted in figure 1.
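The encoder just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the encoding of H as a list of bit masks over the word [i_A, i_B, A, B], and the shift direction (new bits entering at the least significant end) are all hypothetical choices, not taken from figure 1.

```python
def encoder_step(a, b, i_a, i_b, nu_a, nu_b, h_rows):
    """One transition of a rate 2/n feedforward encoder (illustrative only).

    a, b     : current contents of registers A and B, as integers
    i_a, i_b : the two input bits
    h_rows   : n bit masks over the word [i_a, i_b, A, B] (assumed encoding of H)
    """
    # Concatenate the inputs and register contents into one word, i_a at the top.
    word = (((((i_a << 1) | i_b) << nu_a) | a) << nu_b) | b
    # Each output bit is the modulo-2 sum of the bits selected by one row of H.
    out = [bin(word & row).count("1") & 1 for row in h_rows]
    # Assumed shift direction: new bits enter at the least significant end.
    a_next = ((a << 1) | i_a) & ((1 << nu_a) - 1)
    b_next = ((b << 1) | i_b) & ((1 << nu_b) - 1)
    return (a_next, b_next), out


# Arbitrary example masks for a 16 state (nu_a = nu_b = 2) encoder with n = 3.
state, symbol = encoder_step(0b10, 0b01, 1, 0, 2, 2, [0b100000, 0b010010, 0b111111])
print(state, symbol)
```

The next state depends only on the register lengths and the inputs, while the output symbol depends only on H; this is the separation between global and local wiring that the following paragraphs exploit.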
The aspect of Viterbi Decoder implementation considered here is the dependence of its internal wiring on H and on ν_A and ν_B.
The dependence on H can be dealt with readily. Figure 2(a) shows the ACS unit associated with a state of E that processes 2^b = 4 pairs of branch and path metrics.
The state graph labeling, determined by H, is an association of branch metrics (α, β, γ, δ in figure 2(a)) with input symbols (w, x, y, z in figure 2(a)). Since the ACS unit outputs the largest sum of the pairs and the input symbol associated with that sum, it can be replaced by the combination of a generic ACS unit (GACS) and a unit making the required connections at its input, which in general will be a switchbox [4] routing cell in VLSI. This is shown in figure 2(b). The GACS units are identical and output the maximum input pair sum and its position. Thus H maps into patterns of local wiring at the ACS unit inputs; that is to say, it specifies N switchboxes S_j that connect path and branch metric inputs to generic ACS units. Figure 2(c) shows the merged four-ACS unit group processor switchboxes S_k that are discussed in section 2. The switchboxes are considered in further detail in other papers: for trellis coded modulation [5] and, in their time dependent form, for sequential Viterbi Decoder architectures [6] [7].
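The split between a generic ACS unit and a switchbox can be illustrated as follows. This is a minimal sketch, assuming that the switchbox is modeled as a permutation saying which branch metric is wired to each GACS input; the function names and the permutation model are hypothetical and are not the cell designs of [4] to [7].

```python
def gacs(path_metrics, branch_metrics):
    """Generic ACS: return the largest pair sum and the input position that won."""
    sums = [p + m for p, m in zip(path_metrics, branch_metrics)]
    position = max(range(len(sums)), key=sums.__getitem__)
    return sums[position], position


def acs_with_switchbox(path_metrics, branch_metrics, wiring):
    """ACS unit for one state: a switchbox (assumed to be a permutation) plus a GACS.

    wiring[j] is the index of the branch metric routed to GACS input j.
    """
    routed = [branch_metrics[wiring[j]] for j in range(len(wiring))]
    return gacs(path_metrics, routed)


# Four predecessor path metrics, four branch metrics, one example wiring pattern.
best_sum, best_input = acs_with_switchbox([1, 2, 3, 4], [10, 0, 5, 1], [1, 0, 3, 2])
print(best_sum, best_input)
```

Only the `wiring` argument changes from state to state or from code to code; the `gacs` function is the same everywhere, which mirrors the identical GACS units in the text.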
The dependence of the wiring of the Viterbi Decoder on ν_A and ν_B is the prime subject of the remaining sections of this paper.
Relevant Previous Work
Previous work on hardware implementation of Viterbi decoders can be classified into three categories, according to the number of physical ACS units k in relation to the number of states N of the underlying trellis.
Parallel implementations have k = N; sequential implementations have k < N; a third class, which might be called massively parallel, has k > N. In the first case the symbol rate of the Viterbi decoder is given by the time for one ACS operation. In the second case time is traded for area efficiency by the use of multiplexed ACS processors and switching networks [6] [7] or by the use of systolic arrays [8] [9]. In the third case [10] symbol rates are increased above the single ACS rate by unfolding the limiting recursion by a method akin to pole-zero cancellation in IIR digital filter implementation. Because of the large area this requires, the technique is limited to decoders with small N.
This paper is concerned primarily with parallel implementations of rate b/n decoders for arbitrary N, with b > 1. As in [11] we restrict our attention to the path metric updating mechanism. The survivor sequence updating may be by register exchange or traceback [12]. Traceback is a sequential method suited to k < N architectures; an example VLSI design is found in [13]. When the survivor sequence updating is by the parallel, register exchange mechanism, the topological problem is similar to that of path metric updating.
Previous work on the layout of parallel Viterbi Decoders has focused mostly on the rate 1/n case where the problem is to interconnect ACS units according to the topology of the single shift register state graph, the DeBruijn graph [14] [15] . In fact we shall see that this topology is also present sometimes for the higher rate codes considered here.
Previous work on graphs of higher co-ordination is to be found in [16], where the Shuffle-Exchange network was generalized to the case of a ternary alphabet for a small example. The difficulty of a more general analysis was noted in [17]. In that paper the authors developed the idea of using Cartesian products of DeBruijn graphs. That approach, taken together with the one presented here, creates an interesting viewpoint on the problem. This is discussed further in section 6.
An alternative approach to decoders for higher rate codes is the technique of puncturing [18] in which rate 1/n decoders can be used to decode rate b/n codes at the expense of extra operations per cycle. This technique is applicable to situations where high throughput of the decoder hardware is not required.
Structural Classification of Encoders
Although the structural properties of convolutional codes, in terms of tree, trellis and state diagrams, are discussed in many textbooks (for example [19]), there is no explicitly stated methodology for obtaining the state graphs of rate 2/n codes of arbitrary complexity such that the requirements of VLSI are taken into account. These requirements are served here by developing a hierarchical approach that may be readily mapped to a modular layout methodology. Some examples of this are given with the derivation of symbolic VLSI grid layouts.
Following the comments in section 1.2 about the mapping of the parity check matrices to local wiring, we are left with the specification of the global wiring and placement of the ACS units. This requires us to seek a constrained multilayer planar embedding of the state graph of two shift registers of lengths ν_A and ν_B. By this we mean an embedding with edge crossings arranged to match the typical availability of connecting and contacting layers in VLSI technology.
The following observations can be made:
(i) For a given constraint length, there are as many state graph topologies as there are ways of choosing ν_A and ν_B such that ν = ν_A + ν_B. Each choice is identified here as a structural classification; the term structure will be used as shorthand for this phrase. Thus there is one possible structure for 4 and 8 state rate 2/n codes, two structures for 16 and 32 state codes, three for 64 state codes, and so on.
(ii) For a given structure we need to know how to draw its state graph.
(iii) For a given state graph we need to know how to do the embedding discussed above.
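The counts quoted in observation (i) can be checked with a short enumeration. The sketch assumes, as the examples in the text suggest, that each register holds at least one bit and that the registers are labeled so that A is never shorter than B; the function name is illustrative only.

```python
def structures(nu):
    """List the (nu_a, nu_b) register length pairs for constraint length nu.

    Assumes nu_a >= nu_b >= 1, so each unordered split is counted once.
    """
    return [(nu_a, nu - nu_a) for nu_a in range(nu - 1, 0, -1) if nu_a >= nu - nu_a]


for nu in range(2, 7):
    print(2 ** nu, "states:", structures(nu))
```

For 64 states this lists the pairs (5, 1), (4, 2) and (3, 3), that is, three structures.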
A hierarchical approach that simultaneously addresses (ii) and (iii) is given below. Although it is not provably optimal in terms of the VLSI layouts that may be inferred, it at least yields a modular design methodology, which is desirable.
In what follows we will see that there are two types of structures.
Structures with ν_A = ν_B. These are called fully connected (FC hereafter) because they will typically involve maximal wiring effort. They require nested layouts of hierarchically self-similar connectivity. Section 3 below deals with these structures in detail.
Structures with ν_A > ν_B. These are called structures of reduced connectivity; they will in general require less wiring effort than fully connected structures. They have the property that their state graph is developed from an underlying DeBruijn graph of some order. Their layout can then, at least partially, be obtained by applying known methods for this problem (references cited in section 1). Section 4 below deals with these structures in detail.
The graph construction and layout methodology is based on a hierarchical successive grouping of states into common successor partitions (CSP hereafter). Initially the states of the machine are grouped into common successor partitions at level 1 (CSP(1) hereafter). The CSP(1)s are grouped into CSP(2)s and so on, and we may refer to groupings at some hierarchy level k as common successor partitions at level k (CSP(k) hereafter).
To see how the CSP(1)s are formed by grouping the individual states, consider figure 3(a) which represents an example shift register pair. The figure depicts how their contents evolve under a state transition. It can be seen that as the shaded bits are shifted along in the state transition, and are hence common to states and their successors, they label partitions of the states into common successors to the states, that is to say, the CSP(1). The successors to each state all lie in one and only one CSP(1). This is a well known property of shift register sequences.
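The formation of the CSP(1)s can be reproduced by collecting the distinct successor sets of all states. The sketch below assumes the state is the pair of register contents and that new bits are shifted in at the least significant end; both conventions and the function names are illustrative and need not match figure 3(a).

```python
from itertools import product


def successors(state, nu_a, nu_b):
    """Set of the four successors of a state (a, b) of the register pair."""
    a, b = state
    mask_a, mask_b = (1 << nu_a) - 1, (1 << nu_b) - 1
    return frozenset(
        (((a << 1) | i_a) & mask_a, ((b << 1) | i_b) & mask_b)
        for i_a, i_b in product((0, 1), repeat=2)
    )


def csp1(nu_a, nu_b):
    """Distinct successor sets; each one is a common successor partition CSP(1)."""
    states = product(range(1 << nu_a), range(1 << nu_b))
    return {successors(s, nu_a, nu_b) for s in states}


parts = csp1(2, 2)
for part in sorted(sorted(p) for p in parts):
    print(part)
```

For the 16 state example the result is four groups of four states, and the groups are disjoint, which is the property that allows the four ACS units of a group to be placed together.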
Since the state successor relationship is mapped to wiring in the Viterbi Decoder, it is advantageous to place the ACS units of the states in a given CSP(1) together because of the fan out of the state metrics. This leads to the formation of a configurable ACS group unit (CGU hereafter), shown in figure 2(c), where the individual ACS switchboxes discussed in section 1 have been merged.
As will be seen, the CSP(k)s containing successors to another are themselves grouped by the CSP(k+1) partitioning. Unlike the case of the CSP(1) set, there is no electrical fan out; the reason for physically grouping the partitions at higher levels is based on the requirement for hierarchical, modular layouts.
The remaining parts of figure 3 depict the formation of the CSP(k)s and the evolution of the state machine among them. An elucidation of this, the construction of graphs for particular structures and their VLSI implications are the subjects of the sections that follow.
Fully connected structures
The formation and interconnection of the common successor partitions for the FC case is depicted, for a working example, in figure 3(b). As observed above, states may be grouped together according to the contents of the shaded register cells. The evolution of the state machine through these CSP(1)s is the matter of how the contents of the shaded cells in figure 3(b)(i) evolve. By inspection they can be seen to evolve according to the states of a pair of shift registers each shortened by one bit, as shown in figure 3(b)(ii). The state graph of this register pair is called the 1-reduced graph. After one level of hierarchical partitioning we arrive at the requirement to wire the CGUs according to a reduced graph similar to the original state graph, in having vertices with four incoming and four outgoing edges, but with one quarter of the number of vertices.
We can now apply similar reasoning to the registers in figure 3(b)(ii), in which the shaded cells now label groupings of CSP(1)s, that is to say the CSP(2)s, and the evolution of the state machine among these is given by the 2-reduced graph for the registers in figure 3(b)(iii). In the example shown this graph is a primitive graph, as no other non-trivial reduction is possible; it is well known as a 4-vertex directed clique.
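The reduction can be checked mechanically by comparing the quotient of a state graph with the state graph of the shortened registers. The sketch assumes new bits enter each register at the least significant end, so that a CSP(1) is labeled by the register contents with the newest bit dropped; the function names are illustrative.

```python
from itertools import product


def state_graph(nu_a, nu_b):
    """Edge set of the state graph of two shift registers of the given lengths."""
    mask_a, mask_b = (1 << nu_a) - 1, (1 << nu_b) - 1
    edges = set()
    for a, b in product(range(1 << nu_a), range(1 << nu_b)):
        for i_a, i_b in product((0, 1), repeat=2):
            edges.add(((a, b), (((a << 1) | i_a) & mask_a, ((b << 1) | i_b) & mask_b)))
    return edges


def reduce_once(edges):
    """Quotient graph over CSP labels: drop the newest bit of each register.

    Parallel edges collapse into one, so each reduced edge stands for several
    edges of the graph it was derived from.
    """
    return {((a >> 1, b >> 1), (c >> 1, d >> 1)) for (a, b), (c, d) in edges}


one_reduced = reduce_once(state_graph(3, 3))
two_reduced = reduce_once(one_reduced)
print(len(one_reduced), len(two_reduced))
```

For the 64 state FC example the 1-reduced graph coincides with the 16 state FC state graph, and the 2-reduced graph has four vertices with every vertex connected to every vertex, self loops included.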
At each level k there is a graph describing the interconnection of the CSP(k)s; each vertex is a CSP and each edge represents a number of edges of the original state graph. The common successor relation between a level k and a level k-1 graph can be used to draw graphs of arbitrary complexity by successive substitution of one level k vertex by four level k-1 vertices, with each level k edge replaced by four level k-1 edges that fan out to the level k-1 vertices. Figure 4 depicts the substitution of a vertex with a self loop; this becomes four edges that connect one vertex to each of the four vertices within the group, as shown. After one substitution in the primitive graph, vertices with no self loop appear; these are treated similarly, except that there are no intra-group edges.
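The substitution can be sketched as a function that expands an edge set. The rule for choosing which of the four new vertices carries the outgoing edges is an assumption derived from a least-significant-bit-first shift convention, not read off figure 4; the names are illustrative.

```python
from itertools import product


def expand(edges):
    """Replace every vertex by four vertices and every edge by four edges (FC case)."""
    out = set()
    for (p, q), (p2, q2) in edges:
        # Source: the one vertex of group (p, q) whose newest bits select group (p2, q2).
        src = ((p << 1) | (p2 & 1), (q << 1) | (q2 & 1))
        # Destinations: all four vertices inside group (p2, q2).
        for x, y in product((0, 1), repeat=2):
            out.add((src, ((p2 << 1) | x, (q2 << 1) | y)))
    return out


# Primitive graph: four vertices, every vertex connected to every vertex.
corners = list(product((0, 1), repeat=2))
primitive = {(u, v) for u in corners for v in corners}

graph16 = expand(primitive)   # 16 state FC graph
graph64 = expand(graph16)     # 64 state FC graph
print(len(graph16), len(graph64))
```

Each application multiplies the number of vertices by four and keeps four incoming and four outgoing edges per vertex, which is the self-similarity exploited by the nested layouts.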
From this method of hierarchically constructing the graphs, symbolic VLSI grid layouts can be obtained. For the 16 state FC decoder one possibility is shown in figure 5; it is built from four identical CGU layouts. Following an inter-group line we see the fan out from each ACS to the four ACS units in a group and that each ACS unit within a group connects to a distinct destination group. In figure 6 a possible 64 state FC decoder symbolic layout is shown. This shows how the layout hierarchy develops. In figure 6(a) the decoder is seen to be constructed from four identical (but rotated) 16 ACS CSP(2) units, whose internal structure in terms of four identical CGU units is exposed in figure 6(b). The relation of figure 6 to figure 5 can be seen to embody the vertex expansion shown in figure 4.
Reduced connectivity structures
For encoders with ν_A > ν_B, reduced graphs are formed, as above, until a level is reached at which one register of the machine inducing the reduced graph vanishes. This happens at level k = ν_B. These structures are called reduced connectivity structures at this level (RCν_B for short). Figures 3(c) and 3(d) show the reduction in example RC1 and RC2 encoders.
At this level the reduced graph relates to a single shift register and is therefore the well known DeBruijn graph. This necessarily implies a qualitative change to the self-similarity of the CSP(k) connectivity found in the FC structures. The vertices in the DeBruijn graph have two incoming and two outgoing edges only. As the lower level graphs are of higher co-ordination, this implies inter-partition connections in pairs. The vertex substitution required to construct the lower level graphs from the DeBruijn graph is developed similarly to the FC case; it is depicted for a vertex with a self loop in figure 7. Note the eight fold edge expansion because of the pairing effect.
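The pairing effect can be seen by counting how many state graph edges each reduced edge stands for. This is a minimal sketch, assuming new bits enter each register at the least significant end; the function name is illustrative.

```python
from collections import Counter
from itertools import product


def reduced_edge_counts(nu_a, nu_b):
    """Count the state graph edges represented by each edge of the 1-reduced graph."""
    mask_a, mask_b = (1 << nu_a) - 1, (1 << nu_b) - 1
    counts = Counter()
    for a, b in product(range(1 << nu_a), range(1 << nu_b)):
        for i_a, i_b in product((0, 1), repeat=2):
            c, d = ((a << 1) | i_a) & mask_a, ((b << 1) | i_b) & mask_b
            # Group labels are the register contents with the newest bit dropped.
            counts[((a >> 1, b >> 1), (c >> 1, d >> 1))] += 1
    return counts


fc = reduced_edge_counts(2, 2)    # 16 state FC structure
rc1 = reduced_edge_counts(3, 1)   # 16 state RC1 structure
print(len(fc), set(fc.values()))
print(len(rc1), set(rc1.values()))
```

In the 16 state FC case there are 16 reduced edges of four state edges each, whereas in the 16 state RC1 case there are 8 reduced edges of eight state edges each, with two outgoing reduced edges per group as in a DeBruijn graph.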
In the case of the RC1 structure the DeBruijn graph specifies the inter-CGU connectivity directly, with PM wire pairing, and no new layout techniques are needed, as the literature cited previously provides methods for laying out DeBruijn graphs. As a simple example, a symbolic grid layout for the 16 state RC1 structure is shown in figure 8. As there are no crossings in the routing channels, compaction with a multilayer technology will use a much smaller area than in the FC case in figure 5.
To construct graphs for other RC structures requires two or more expansions beyond the underlying DeBruijn graph. After substitution of the form of figure 7, those of the figure 4 type need be applied. It is felt that current practical interest is not served by elucidating the details of structures other than FC and RC1. Consequently, further VLSI symbolic grid layouts will not be derived here, although the methodology for doing so is considered to be established.
Structures of Systematic feedback encoders
Since, as stated, feedback encoders have equivalent feedforward encoders, they also have varying structures, and since feedback encoders are an important class of encoders, it is worthwhile to uncover their structural properties. In the original development of trellis coded modulation [3] the best codes for error correcting properties were found by searching among the feedback encoders directly. In this case the parity check coefficient sets (or "codes" for short) induce both a structure and a state graph labeling; that is to say, both a structure and a definition of the switchboxes in figure 2. It is not clear here how to obtain analytically the equivalent feedforward form for a given code, and hence determine its structure. However, it is relatively straightforward to construct a computer program to successively group states into common successor partitions. This has been done, and table I shows the multiplicity of various structures for some constraint lengths. The column marked "degenerate" refers to cases where the code induces disconnected submachines, corresponding to non-maximal length LFSR sequences. The utility of table I is in indicating the relative ease of finding a particular structure in a code search.
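A successive grouping of this kind might be sketched as follows. This is not the program used to produce table I: the function names, the stopping rule and the idea of reporting a (vertex count, out-degree) profile per level are assumptions made for illustration, and the feedforward register pair is used as input only because the source does not spell out the feedback encoder equations.

```python
from itertools import product


def csp_profile(succ):
    """Successively group states by common successors.

    succ maps each state to a frozenset of its successors. Returns a list of
    (number of vertices, largest out-degree) pairs, one per hierarchy level.
    """
    profile = []
    while True:
        profile.append((len(succ), max(len(t) for t in succ.values())))
        if len(succ) == 1:
            break
        groups = set(succ.values())
        owner = {s: g for g in groups for s in g}
        # Stop if the successor sets fail to partition the states.
        if sum(len(g) for g in groups) != len(succ) or len(owner) != len(succ):
            break
        succ = {g: frozenset(owner[t] for s in g for t in succ[s]) for g in groups}
    return profile


def pair_successors(nu_a, nu_b):
    """Example input: successor map of a feedforward register pair."""
    mask_a, mask_b = (1 << nu_a) - 1, (1 << nu_b) - 1
    return {
        (a, b): frozenset(
            (((a << 1) | i_a) & mask_a, ((b << 1) | i_b) & mask_b)
            for i_a, i_b in product((0, 1), repeat=2)
        )
        for a, b in product(range(1 << nu_a), range(1 << nu_b))
    }


print(csp_profile(pair_successors(2, 2)))
print(csp_profile(pair_successors(3, 1)))
```

In this sketch the level at which the out-degree falls from four to two marks the register that has vanished, so the profile separates the FC structure from the RC1 structure for the 16 state examples.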
Summary and Discussion
It has been shown how to derive the state graph for a pair of shift registers and how such graphs can be classified according to their relationship to both DeBruijn graphs and cliques. The method of successive expansion by graphical substitution has been derived informally. For applications, it will suffice that the method is precise and may be verified a posteriori for any graph of practical interest.
The method is hierarchical, where one level of hierarchy is associated with the grouping of the states into common successor partitions (CSP(1)s). For fully connected structures these groups are successively grouped in a similar fashion until a four vertex clique is formed. The wiring of the decoder is obtained by starting with a four vertex clique and applying the substitution rules accordingly. For RC1 structures the group units are wired according to a DeBruijn graph. For other RC structures there is a more complicated hierarchy.
In section 1.2 reference was made to previous work involving Cartesian product graphs [17], which is also a hierarchical approach. It is stated there that the Cartesian product graph, in the rate 2/n case, may be interpreted as the DeBruijn graph for one shift register in the encoder with each node replaced by copies of the graph of the other shift register. The procedure is to be successively applied for more than two registers. Thus the hierarchy in this has b levels for a rate b/n encoder and the DeBruijn graph topology is always present. It is interesting to consider this in relation to the hierarchy and connectivity inherent in the present work. However, since in [17] actual specification of the hardware units and their placement and wiring are not given for any examples, it is difficult to make a more detailed comparison.
Reduced connectivity structures are easier to wire than fully connected structures. For example the DeBruijn graph of degree four is planar whereas the degree four clique is not (figures 5 and 8 may be referred to). In an RC1 structure implementation the path metric connections that run in pairs may possibly be multiplexed to a single set of physical wires to further reduce interconnect area.
In the case of a rate b/n encoder with b > 2, the same general approach can be used; there will of course be a much greater number of structures. Wiring Viterbi decoders in such cases appears a formidable problem; the reduced connectivity structures may be the only practical ones. For example a rate 3/n encoder with two single bit registers would involve wiring 8-unit groups according to the DeBruijn graph, with connections running in four-tuples. Practical applications of such coding are currently limited.
In conclusion, this paper shows some possibilities for parallel Viterbi decoder layout based on a systematic methodology. The layout techniques have a practical application in exploring system level tradeoffs in communication systems. The efficiency of a code for error correction can be related to its ease of implementation through its structure.