This paper presents a new VLSI solution for high-speed digital communications based on long-constraint convolutional codes. The proposed VLSI architecture implements a modified sorter-based sequential decoding algorithm with an achievable maximum decoding rate of 25 Mbits/s. Unlike its previous version, which was based on shiftable content addressable memory (SCAM) only, path recording is now implemented on an embedded SRAM module whose size is determined by the path depth (d) and the number of survived nodes (S). That is, both SCAM and SRAM are exploited to implement the sorter kernel. Results show that, for a (2,1,7) code, both power consumption and silicon area can be reduced by 50% on average, making the new proposal very suitable for high-speed convolutional code applications.
INTRODUCTION
Convolutional coding with Viterbi decoding [1] is widely accepted as an efficient method to achieve a significant power gain on digital communication channels with low to moderate signal-to-noise ratios. Generally speaking, for information bit rates above 5 Mbits/s, the decoder must be based on a fully parallel implementation of the Viterbi algorithm, which often demands one complete cycle in each clock interval, so that parallel ACS modules and path memory are required. On the other hand, sequential decoding [1,2] has low computational complexity in its search for a correct path. However, the decoding rate becomes lower because a correct path can only be identified after several trials on incorrect paths due to error bits. The principal sequential decoding algorithms are the Fano [2] and the stack-based [3] algorithms. Although these algorithms are useful for longer constraint-length codes, they suffer from "buffer overflow", which derives from an inability to maintain uniform decoding rates while searching message sequences. Previously, we presented a dedicated memory structure [4] to speed up the decoding rate, where the buffer overflow was later solved by shiftable content addressable memory (SCAM) [5]. Based on such a newly developed high-speed data sorter [6], we can obtain a sorted sequence right after the input samples are given. This technique provides a very powerful hardware solution for applications that require massive sorting operations. However, power consumption and silicon area are the two constraints on implementing long-constraint codes.
In this paper, we propose a new VLSI architecture to solve these two problems. In section 2, we first briefly describe the sorter-based sequential decoding algorithm, where features and complexity of the new algorithm will be highlighted. In section 3, we present a new VLSI solution to implement a long-constraint sequential decoder. A comparison between the new solution and its previous version will also be given to show the improvements in both power consumption and silicon area. Finally, some concluding remarks are drawn on the cost-effective solution.
THE FAST SEQUENTIAL DECODING ALGORITHM
In principle, the sequential decoding algorithm can be regarded as a set of sorting operations working on the survived nodes to identify a candidate node. From this candidate node, further tracking can be performed until all input sequences are decoded. Figure 1 shows the decoding process for two cases. It can be seen that the search for the correct path is now formulated as a sorting procedure. To find such a path, we assign different costs to each node so that a local minimum can be located. However, due to the finite-precision problem, we have to perform renormalization during the decoding process to ensure that each survived node has the correct weight. In addition, a path recording technique was developed to avoid back-tracking of correct bit sequences and to improve the throughput rate. This is done by introducing two parameters: one is the decoding status parameter, which records the decoding depth from the root; the other is the path sequence parameter, which records the decoded bits up to the survived node. This modified sequential decoding algorithm, named the fast sequential decoding (FSD) algorithm, was presented in [5].
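The idea of decoding by repeatedly sorting survived nodes can be illustrated with a minimal software sketch. It is not the FSD algorithm of [5]: it assumes a small (2,1,3) code with generators 7 and 5 (octal) rather than the (2,1,7) code, uses plain Hamming distances as costs, and omits renormalization. The names `encode_step`, `encode`, `decode` and `max_nodes` are hypothetical.

```python
# Assumed generators for a small (2,1,3) code; chosen only for illustration.
G = (0b111, 0b101)

def encode_step(state, bit):
    """Advance the 2-bit encoder state by one input bit; return (next_state, 2 output bits)."""
    reg = (bit << 2) | state
    out = tuple(bin(reg & g).count("1") & 1 for g in G)
    return reg >> 1, out

def encode(bits):
    state, out = 0, []
    for b in bits:
        state, sym = encode_step(state, b)
        out.extend(sym)
    return out

def decode(received, max_nodes=16):
    """Sorted-list sequential decoding: always extend the node with minimum weight."""
    n = len(received) // 2
    # Each survived node is (weight, depth, encoder_state, decoded_path).
    nodes = [(0, 0, 0, ())]
    while True:
        weight, depth, state, path = nodes.pop(0)      # candidate node (minimum weight)
        if depth == n:
            return list(path)
        sym = received[2 * depth: 2 * depth + 2]
        for bit in (0, 1):
            nxt, out = encode_step(state, bit)
            cost = sum(a != b for a, b in zip(out, sym))   # Hamming cost (assumption)
            nodes.append((weight + cost, depth + 1, nxt, path + (bit,)))
        # Keep survivors in ascending weight order; deeper nodes win ties.
        nodes.sort(key=lambda t: (t[0], -t[1]))
        del nodes[max_nodes:]                           # at most max_nodes survivors

msg = [1, 0, 1, 1, 0, 0, 1, 0]
rx = encode(msg)
rx[3] ^= 1                 # inject a single channel error
print(decode(rx) == msg)
```

Because each node carries its own depth and decoded path, no back-tracking is needed once the minimum-weight node reaches the final depth, which mirrors the role of the two recording parameters described above.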
The modified FSD algorithm mainly consists of sorting and path recording kernels. The sorting kernel provides storage space for the survived nodes, which are sorted in ascending order. To obtain sorting results at high throughput, we use a delete-and-insert sort algorithm which has been mapped onto a parallel comparison structure [6]. This implies that each storage node needs a comparator. As for path recording, a depth of up to d is allocated for each survived node. This method keeps the decoded path and sends out the decoded bits once the depth of the decoded path meets the specified value. In total it needs $(d + \log_2 d) \times S$ bits of storage, where $\log_2 d$ is the width of the depth status and S is the number of survived nodes. The depth status parameter is used to track the decoding process and to output a decoded bit whenever an upper bound is encountered. The value of S is determined by the constraint length so as to meet the coding rate under a given signal-to-noise ratio. Thus it should be noted that both d and S are highly dependent on the selected convolutional code. More detailed discussions about these two parameters and their effects on the coding gain of different codes can be found in [5, 7].
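A delete-and-insert sort on a fixed number of cells can be sketched in software as follows. This is only a behavioural model under stated assumptions, not the circuit of [6]: the per-cell comparisons that hardware would perform in parallel are written as a list comprehension, and the function names and the `capacity` parameter are hypothetical.

```python
def insert_sorted(cells, value, capacity):
    """Insert value into an ascending list, modelling one comparator per cell."""
    # Every cell compares its stored weight with the new value "in parallel".
    flags = [stored <= value for stored in cells]
    pos = sum(flags)                      # number of cells that stay to the left
    cells = cells[:pos] + [value] + cells[pos:]
    return cells[:capacity]               # the largest weight drops off the right end

def delete_and_insert(cells, new_values, capacity):
    """Delete the minimum (leftmost) cell, then insert the newly generated weights."""
    rest = cells[1:]
    for v in new_values:
        rest = insert_sorted(rest, v, capacity)
    return rest

print(delete_and_insert([1, 4, 6, 9], [3, 7], capacity=4))   # [3, 4, 6, 7]
```

In this model the list is sorted again after every update without a full re-sort, which is the property that makes one comparator per storage node sufficient.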
THE VLSI ARCHITECTURE
In this section, we present an efficient VLSI architecture for single-chip implementation of the developed algorithm. The constraint length is not limited by the proposed architecture, and hence the proposed algorithm and architecture are very suitable for long constraint-length applications.
The proposed algorithm can be mapped onto the architecture shown in Figure 2, which mainly consists of six units. The input buffer unit (IBU) contains the encoded sequences, which have been arranged according to the coding rate defined in the corresponding encoder. It can be accessed by the depth status parameter when error bits are detected. The weight calculation unit (WCU) uses the node with minimum weight as a basis and then calculates the new weights according to the costs generated by the cost generation unit (CGU). The costs are obtained by pattern matching, which produces a set of costs. The weight sorting unit (WSU) is used to store the weights of the survived nodes. The path recording unit (PRU) produces the correct bit sequence when the target depth is reached and, in the meantime, gives a label indicating the order of the decoded bits. The detailed structure of this unit is given in Figure 3; it mainly consists of a path memory, a shifter, an incrementer and a detector.
Implementation results [8] show that the SCAM realization of the PRU and WSU accounts for most of the chip area and power dissipation because of simultaneous shift operations. To solve these two problems, we have recently combined SCAM and SRAM as shown in Figure 4. Here we use an address as an indicator to link to the path sequences, which are now stored in SRAM. These addresses and the weights of the survived nodes are stored in the SCAM. During the sorting process, only the weight of the first node is read out and added to the two costs. Then the two generated weights are simultaneously compared with the other weights. Once this step is done, two addresses from the SCAM are derived and connected to the SRAM. Both addresses are used to update the path information stored in the SRAM, such as the path sequence and the decoding status. These paths are the "0" path and the "1" path, which can be stored in the SRAM simultaneously. However, the correct path from the first address is always activated to provide an address to the input buffer, from which new input sequences can be decoded or error bits can be recovered. Note that in a practical realization, this is the critical path. Therefore a pre-fetching strategy is exploited here to speed up the decoding process. This is done by selecting the contents of the first node as the next address to the input buffer. This method applies to most input patterns, since error bits are only a very small part of the input sequences. If the address cannot be obtained directly from the first node, it should be taken from the second node. This implies that, due to error bits, both new weights are larger than that of the second node. In other words, the address for the input buffer can be obtained either from the first node or from the second node, depending on the input patterns. Only when error bits are detected does the decoding rate become slower, because more time is needed to perform read/write operations on the SRAM.
Currently the critical path is limited to 20 ns for a 0.8 μm CMOS double-metal process.
Operations between SRAM and SCAM are described as follows. Initially, addresses and weights are assigned to each storage element of the SCAM. The leftmost and rightmost elements provide two addresses to the SRAM so that path information can be updated. This strategy is adopted because the left element always stores the candidate node with minimum weight, which has to be traced further, and the right element always stores the candidate node with maximum weight, implying that it is no longer a possible candidate and can be overwritten by newly generated nodes. Each time a new node is created, its address for the path information should be updated in the SCAM. As shown in Figure 5(a), both addresses from the SCAM are connected to the SRAM to update path information. However, a pre-fetched path register is inserted to speed up the updating process. That is, when no error bits are detected, only one write operation is required on the SRAM; when error bits are encountered, two write operations and one read operation are required on the SRAM. Figure 5(b) shows the basic cell of the multi-port SRAM, which can be written from two different address buses.
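The division of labour between the two memories can be modelled with a short sketch. It is a simplified behavioural model under stated assumptions, not the circuit of Figure 5: the SCAM is a sorted Python list of (weight, address) pairs, the SRAM is a list of (depth, path) records, the rightmost entry is always overwritten, and the pre-fetched path register is not modelled. The function name `scam_sram_step` and the argument `costs` are hypothetical.

```python
def scam_sram_step(scam, sram, costs):
    """One update: extend the minimum-weight node, overwrite the maximum-weight node.

    scam  : list of (weight, address), ascending by weight
    sram  : list of (depth, path) records, indexed by address
    costs : (cost of the "0" branch, cost of the "1" branch)
    """
    w_min, a_left = scam[0]       # leftmost element: candidate to be traced further
    _, a_right = scam[-1]         # rightmost element: may be overwritten
    depth, path = sram[a_left]    # one SRAM read
    # Two SRAM writes: the "1" path reuses the discarded slot, the "0" path stays in place.
    sram[a_right] = (depth + 1, path + (1,))
    sram[a_left] = (depth + 1, path + (0,))
    # Only short (weight, address) pairs move inside the sorter.
    new_entries = [(w_min + costs[0], a_left), (w_min + costs[1], a_right)]
    return sorted(scam[1:-1] + new_entries, key=lambda e: e[0])

scam = [(0, 0), (2, 1), (5, 2), (9, 3)]
sram = [(1, (1,)), (1, (0,)), (2, (0, 1)), (2, (0, 0))]
scam = scam_sram_step(scam, sram, costs=(1, 2))
print(scam)       # [(1, 0), (2, 1), (2, 3), (5, 2)]
print(sram[3])    # (2, (1, 1))
```

The path records never move; only the address that points to them is re-sorted, which is why shift activity in the sorter no longer grows with the path depth.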
Thus for each survived node, only $(\log_2 W + \log_2 S)$ bits are needed in the SCAM, instead of $(\log_2 W + \log_2 d + d)$. For each survived node, we need $(\log_2 d + d)$ bits of storage in the SRAM. At first sight, more storage space, i.e. $\log_2 S$, is needed in the proposed architecture. However, physical design shows that both area and power can be reduced by 50% on average. For codes with longer constraint lengths, the gain in power dissipation and area will become more evident because many shift operations in the SCAM can be replaced by read/write operations in the SRAM. In addition, the basic SRAM cell is only about half the size of the SCAM cell. Table 1 lists the implementation data compared to the previous SCAM version for a (2,1,7) code.
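The per-node bit counts can be checked with a small calculation. The sketch below is illustrative only: it assumes W is the number of distinct weight values (the text does not define it), rounds each logarithm up to a whole number of bits, and uses made-up parameter values that are not the design figures of Table 1.

```python
def bits(n):
    """Number of bits needed to index n distinct values, i.e. ceil(log2(n))."""
    return (n - 1).bit_length()

def per_node_storage(W, d, S):
    """Per-node storage in bits for the SCAM-only and the SCAM+SRAM organisations."""
    old_scam = bits(W) + bits(d) + d     # weight + depth status + path, all shifted
    new_scam = bits(W) + bits(S)         # weight + address into the SRAM
    new_sram = bits(d) + d               # depth status + path, stored in place
    return old_scam, new_scam, new_sram

# Hypothetical parameters chosen only to make the arithmetic easy to follow.
old_scam, new_scam, new_sram = per_node_storage(W=256, d=64, S=32)
print(old_scam, new_scam, new_sram)              # 78 13 70
print(new_scam + new_sram - old_scam)            # 5, the extra log2(S) address bits
```

With these assumed values the total grows by only the 5 address bits, while the number of bits that take part in shift operations falls from 78 to 13 per node.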
To run this architecture, it is necessary to initialize the SCAM so that addresses and weights are placed appropriately. Once this procedure is done, the decoding process can be activated. Our simulation results show that the performance of this architecture is competitive with that of a maximum likelihood decoder. Thus with this architecture, not only can high speed and programmability be achieved, but power consumption and silicon area can also be reduced for long-constraint codes, making it very suitable for practical applications.
CONCLUSION
In this paper, we have presented a new VLSI architecture for the sorter-based sequential decoding algorithm. The use of both sorting and path recording techniques not only solves the low-throughput problem but also provides a solution for long-constraint convolutional code designs when mapped onto a SCAM and SRAM architecture. Thus a cost-effective solution for high-speed convolutional codes can be achieved.
We are currently developing a radix-4 FSD algorithm and architecture in order to enhance the decoding rate for high-speed networking applications.
