Abstract-Many communications applications require similar processing functionality but are implemented independently. In particular, a number of applications (including trellis coding, encryption, and speech recognition) use techniques based on shortest path search algorithms. In this paper, we propose a high-throughput architecture that can search for the shortest path within a graph. The architecture can decode any data encoded with a finite state machine (FSM) or data encrypted in a dynamic trellis code and also serve as a specialized processor for other searching and matching applications. Balance between flexibility and hardware efficiency is achieved by an integrated design of architecture, in-place scheduling, and concurrent algorithms.
I. INTRODUCTION
VARIOUS decoding, speech recognition, pattern matching, and tree searching applications can often be modeled as finding a min-cost path within a (structured) graph using dynamic programming. The graph can be a structured trellis of a finite state machine (FSM), such as in decoding convolutional or trellis codes with a dynamic programming algorithm like the Viterbi algorithm, the stack algorithm, or the M-algorithm [1]. Stereo vision or dynamic-time-warp speech recognition uses similar dynamic programming algorithms to search through a less structured graph. Some previous processors, such as [2] and [3], provide retargetable functionality, but their applications are more limited because of their throughput, input/output (I/O) bandwidth, and flexibility. We will describe a new multipurpose engine that uses a hybrid pipelined architecture (a mix of serial and parallel pipelined architectures) and specific scheduling to optimize performance and flexibility in decoding and other applications.
II. ARCHITECTURE
As shown in Fig. 1 
METRIC VALUES READ OR WRITTEN PER CLOCK CYCLE
... a variable length queue (VLQ), which is simply a group of controllable pipeline latches. The queue length of the VLQ is specified when configuring (programming) the architecture and is usually fixed during processing. The VLQ serves two purposes. First, it provides a convenient swap space for concurrent processing on the PPE. Second, it offsets the routing delay by matching the number of pipeline stages in the pipelined routing network. For example, suppose the pipelined routing network has J pipeline stages. To unskew the routing delay, we put the PPE output to the pipelined routing network at the (J + 1) position from the top of the queue, as illustrated in Fig. 3. The hashed pipeline latches within the VLQ's mark the metric output positions of the PPE's. VLQ's help remove architectural dependency, since the amount of concurrency of the algorithms, the number of pipeline latches within PPE's, and the (pipelined) routing networks can be designed independently and then integrated with the help of VLQ's. Note that the length of the VLQ affects only the processing delay, not the processing throughput, of the PPE.
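The delay-matching role of the VLQ can be illustrated with a minimal software sketch. It assumes that the VLQ and the routing network both behave as simple shift registers, and it expresses the tap position as a delay from the insertion point (`vlq_length - routing_stages`), which is only one reading of the (J + 1) position described above. The names `DelayLine` and `simulate` are hypothetical and do not come from the paper.

```python
from collections import deque


class DelayLine:
    """Shift register: a value clocked in now comes out `depth` clocks later."""

    def __init__(self, depth):
        self.depth = depth
        self.latches = deque([None] * depth)

    def clock(self, value):
        if self.depth == 0:
            return value  # zero latches: combinational pass-through
        self.latches.appendleft(value)
        return self.latches.pop()  # oldest value leaves the queue


def simulate(vlq_length, routing_stages, n_cycles):
    """Return (cycle, local_feedback, routed) for each clock cycle.

    Assumption: the routing network is tapped `vlq_length - routing_stages`
    latches after the VLQ insertion point, so that the local path (through
    the whole VLQ) and the routed path (tap + routing network) have equal
    total delay.
    """
    tap_delay = vlq_length - routing_stages
    if tap_delay < 0:
        raise ValueError("VLQ must be at least as long as the routing delay")
    before_tap = DelayLine(tap_delay)      # VLQ latches ahead of the tap
    after_tap = DelayLine(routing_stages)  # remaining VLQ latches
    network = DelayLine(routing_stages)    # pipelined routing network
    trace = []
    for t in range(n_cycles):
        tapped = before_tap.clock(t)       # the PPE writes token t at cycle t
        trace.append((t, after_tap.clock(tapped), network.clock(tapped)))
    return trace


if __name__ == "__main__":
    for row in simulate(vlq_length=5, routing_stages=3, n_cycles=8):
        print(row)
```

In this model the locally fed-back token and the routed token always arrive in the same cycle, which is the unskewing property the text describes; lengthening the VLQ only shifts when the first token appears.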
The pipelined shifting buffer in Fig. 1 records the decisions of the max/min units of the PPE's and updates the path records. It is optional because the host processor can process the PPE outputs directly.
III. DERIVING PIPELINED IN-PLACE SCHEDULE FOR APPLICATIONS
To run different applications on the multipurpose architecture, we need to derive a control sequence for the PPE's based on the application and the hardware properties. The local feedback through the VLQ allows the PPE to use in-place scheduling [4], [5] to reduce the metric I/O traffic. In-place scheduling is a way of assigning path selections to (parallel) processors such that each processor always uses locally generated path metrics in the next stage of path selections. General in-place scheduling for rate-k/n convolutional codes, Ungerboeck's codes, and general trellis codes has been solved in a companion paper [5]. Here, we only restate from [5] that in-place scheduling minimizes the intercommunications between PPE's, and thus allows our architecture to use VLQ's to simplify routing and to reduce communication bandwidth between PPE's.
For example, assume that we use a two-PPE hybrid pipelined processor to run the Viterbi algorithm on the four-state trellis of a convolutional code. The PPE has three pipeline stages in the (adder)-(max/min)-(normalize) data path. The pipelined routing network is a switching network or memory with three pipeline stages. Thus, the PPE takes three clock cycles to compute data; routing data from one PPE to another takes three clock cycles; routing data back to the same PPE (bypassing the routing network) is instantaneous.
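The (adder)-(max/min)-(normalize) data path corresponds to the add-compare-select step of the Viterbi algorithm. The following is a minimal software sketch of that step on a four-state trellis, not a model of the hardware. The paper does not name the convolutional code, so the sketch assumes the common rate-1/2 code with octal generators (7, 5), hard-decision Hamming branch metrics, and a start in state 0; all function names are hypothetical.

```python
INF = float("inf")


def branch(state, u):
    """Assumed (7, 5) code: return (next_state, output pair) for input bit u."""
    s1, s2 = state >> 1, state & 1          # s1 is the most recent past bit
    return (u << 1) | s1, (u ^ s1 ^ s2, u ^ s2)


def encode(bits):
    state, out = 0, []
    for u in bits:
        state, symbol = branch(state, u)
        out.append(symbol)
    return out


def acs_step(metrics, received):
    """One trellis stage: add branch metrics, select the minimum, normalize."""
    new = [INF] * 4
    decisions = [None] * 4
    for state in range(4):
        if metrics[state] == INF:
            continue                         # state not reachable yet
        for u in (0, 1):
            nxt, out = branch(state, u)
            cand = metrics[state] + (out[0] != received[0]) + (out[1] != received[1])
            if cand < new[nxt]:              # min unit keeps the survivor
                new[nxt] = cand
                decisions[nxt] = (state, u)
    floor = min(new)
    return [m - floor for m in new], decisions  # normalization keeps metrics bounded


def viterbi(received):
    metrics, history = [0, INF, INF, INF], []
    for pair in received:
        metrics, decisions = acs_step(metrics, pair)
        history.append(decisions)
    state = metrics.index(min(metrics))      # best final state
    bits = []
    for decisions in reversed(history):      # trace back through the survivors
        state, u = decisions[state]
        bits.append(u)
    return bits[::-1]


if __name__ == "__main__":
    message = [1, 0, 1, 1, 0, 0]
    channel = encode(message)
    channel[1] = (channel[1][0] ^ 1, channel[1][1])  # inject one bit error
    print(viterbi(channel))
```

The `decisions` list plays the role of the path records kept by the shifting buffer or the host processor; subtracting the smallest metric after each stage is one simple normalization choice among several.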
The first step is to apply in-place scheduling to reorganize the trellis from the conventional form in Fig. 4(a) into the cyclic form in Fig. 4(b). The resulting PPE in-place schedule, also shown in Fig. 4(b), does not yet account for concurrency, timing, and hardware constraints. A complete cyclic PPE schedule can be derived by expanding Fig. 4(b) according to data dependency and hardware constraints. The cyclic PPE schedule in Fig. 4(c) satisfies both the routing and the PPE clocking constraints: the delay from the PPE's data path inputs ["local VLQ" and "read port" in Fig. 4(c)] to the data path's output ("write port") is three clock cycles, and the delay in sending data from a PPE's "write port" to its own or another PPE's "read port" is zero or three cycles, respectively. The VLQ length is three because the delay from a PPE's "write port" to its own "local VLQ" is three cycles. To match the three-cycle routing delay, the PPE's output to the routing network is at the insertion point of the VLQ (the "write port"). Clearly, we can overlay another two identical cyclic PPE schedules on the one in Fig. 4(c), meaning that we can decode three blocks of codewords concurrently using various concurrent methods (low-overhead methods, such as [6]-[9], are more desirable). A similar design works if we scale all hardware parameters accordingly (for example, from a three-cycle delay to a four-cycle delay with four-block concurrent processing), or if we change the VLQ's "read port" or "write port" to accommodate a different routing delay or pipeline depth.
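The benefit of overlaying schedules can be checked with a small counting sketch. It makes a strong simplifying assumption: each block's next trellis stage can be issued only after the result of its previous stage has emerged from a pipeline of `depth` cycles, and one operation can be issued per cycle. The function name `issued_operations` and this dependency model are illustrative and are not taken from Fig. 4.

```python
def issued_operations(depth, blocks, cycles):
    """Count the operations one PPE issues in `cycles` clock cycles.

    Assumption: a block may issue its next stage only `depth` cycles after
    its previous stage was issued (the recursion waits for the pipeline),
    and at most one operation enters the pipeline per cycle.
    """
    ready_at = [0] * blocks          # earliest cycle each block may issue again
    issued = 0
    for t in range(cycles):
        for b in range(blocks):
            if ready_at[b] <= t:     # first block whose previous result is back
                ready_at[b] = t + depth
                issued += 1
                break                # one issue slot per cycle
    return issued


if __name__ == "__main__":
    for blocks in (1, 2, 3):
        print(blocks, issued_operations(depth=3, blocks=blocks, cycles=12))
```

Under these assumptions a single block keeps a three-stage pipeline busy one cycle in three, while three interleaved blocks fill every cycle, which is consistent with decoding three blocks concurrently on the three-cycle data path.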
The multipurpose architecture can process a set of codes called dynamic trellis codes, which provide a certain degree of communication security. Dynamic trellis codes are generated by nonstationary finite state machines. Nonstationarity implies that the number of states, the state transition function, and the output function can all vary with time. This makes data interception difficult or infeasible. Our architecture can decode such dynamic trellis codes concurrently with maximum-likelihood or suboptimal algorithms.
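As an illustration only, a nonstationary finite state machine can be written as a transition table that depends on the time step, and the same dynamic-programming search then runs over the resulting time-varying trellis. The table below is hypothetical: it alternates between two made-up transition and output functions and keeps four states throughout, although the table form would also allow the number of states to change. A receiver needs the same `step_table` to decode.

```python
INF = float("inf")


def step_table(t):
    """Hypothetical time-varying FSM: {(state, bit): (next_state, output)}."""
    table = {}
    for s in range(4):
        s1, s2 = s >> 1, s & 1
        for u in (0, 1):
            if t % 2 == 0:   # even steps: one transition/output function
                table[(s, u)] = ((u << 1) | s1, (u ^ s1 ^ s2, u ^ s2))
            else:            # odd steps: a different function
                table[(s, u)] = ((u << 1) | (s1 ^ s2), (u ^ s1, u ^ s1 ^ s2))
    return table


def encode_dynamic(bits):
    state, out = 0, []
    for t, u in enumerate(bits):
        state, symbol = step_table(t)[(state, u)]
        out.append(symbol)
    return out


def decode_dynamic(received):
    """Minimum Hamming-distance path search over the time-varying trellis."""
    metrics = {0: 0}                       # the encoder starts in state 0
    history = []
    for t, r in enumerate(received):
        new, back = {}, {}
        for (s, u), (nxt, symbol) in step_table(t).items():
            if s not in metrics:
                continue                   # state not reachable at this step
            cand = metrics[s] + sum(a != b for a, b in zip(symbol, r))
            if cand < new.get(nxt, INF):
                new[nxt] = cand
                back[nxt] = (s, u)
        metrics = new
        history.append(back)
    state = min(metrics, key=metrics.get)  # best final state
    bits = []
    for back in reversed(history):         # trace back the surviving path
        state, u = back[state]
        bits.append(u)
    return bits[::-1]


if __name__ == "__main__":
    message = [1, 0, 1, 1, 0, 1, 0, 0]
    print(decode_dynamic(encode_dynamic(message)) == message)
```

The sketch makes no claim about the security or error-correcting strength of this particular table; it only shows that the decoder is unchanged except that it looks up a different table at each step.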
The proposed architecture can, in general, decode any code that is encoded with a finite state machine. The architecture adapts to different algorithms, codes, and data rates by modifying the control map, the VLQ's, and the metric tables. By merging and splicing control schedules, the multipurpose architecture can simultaneously decode multiple codes. It can also be used in generic search problems, such as finding the minimum-cost traveling path between any two cities if a map and the traveling costs of individual routes are provided.
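The city-to-city search mentioned above uses the same add-compare-select operation as trellis decoding. A minimal sketch follows, using Bellman-Ford-style relaxation as one standard dynamic-programming formulation; the paper does not state which formulation it has in mind. The city names, route costs, and the function name `min_cost_path` are hypothetical, routes are treated as one-way, the source is assumed to appear in the map, and costs are assumed non-negative.

```python
INF = float("inf")


def min_cost_path(routes, source, target):
    """routes: {(from_city, to_city): cost}. Returns (cost, list of cities)."""
    cities = {city for edge in routes for city in edge}
    cost = {city: INF for city in cities}
    previous = {}
    cost[source] = 0
    for _ in range(len(cities) - 1):       # at most |cities| - 1 relaxation passes
        changed = False
        for (a, b), w in routes.items():
            cand = cost[a] + w             # add
            if cand < cost[b]:             # compare
                cost[b] = cand             # select
                previous[b] = a
                changed = True
        if not changed:
            break                          # metrics have converged
    if cost[target] == INF:
        return INF, []                     # target not reachable
    path = [target]
    while path[-1] != source:              # trace the selected decisions back
        path.append(previous[path[-1]])
    return cost[target], path[::-1]


if __name__ == "__main__":
    routes = {("A", "B"): 4, ("A", "C"): 1, ("C", "B"): 2,
              ("B", "D"): 5, ("C", "D"): 8}
    print(min_cost_path(routes, "A", "D"))
```

Here the `cost` dictionary corresponds to the path metrics and `previous` to the path records; only the graph structure differs from the trellis case.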
IV. CONCLUSION
A flexible architecture is proposed for high-throughput decoding of different encoding sources, deciphering encrypted data, voice command recognition, and other path searching applications. In-place scheduling and concurrent path searching provide hardware advantages. The architecture can be reused efficiently, saving the cost of duplicated hardware for different services.
