Abstract. With the growth of the wireless internet and the increasing number of internet users, transmitting and receiving error-free data in real time is one of the most important ways to guarantee the QoS (quality of service) of the internet. Convolutional encoding and Viterbi decoding are widely used to improve the BER (bit error rate) in application areas such as satellite communications systems. To enhance internet QoS, this paper introduces a new DSP architecture that efficiently implements the Viterbi algorithm, one of the algorithms that can correct errors during data transfer. A new architecture and a new instruction set are defined that execute the Viterbi algorithm faster and simplify the Euclidean distance calculation. The performance assessment shows that the proposed DSP executes the Viterbi algorithm faster than the other DSPs compared. Implemented in 0.18 µm CMOS technology, the new DSP operates at 100 MHz and consumes 218 µA/MHz.
Introduction
With the growth of the wireless internet and the increasing number of internet users, transmitting and receiving error-free data in real time is one of the ways to guarantee the QoS (quality of service) of the internet. SATCOM (satellite communications) systems are used to support worldwide interconnected internet service. Convolutional encoding and Viterbi decoding are widely used to improve the BER in application areas such as satellite communications systems. ATM (asynchronous transfer mode) switching and IP (internet protocol), which use convolutional encoding and Viterbi decoding, can guarantee high-quality channels with 10
or of higher BER [1]. Since the year 2000, demand for higher-level internet service has increased. To accommodate the resulting demand for faster data processing, research on SOC (system on a chip) technology as hardware infrastructure has been increasing [2] [3] [4] [5]. Therefore, DSPs (digital signal processors) that handle the Viterbi algorithm efficiently and can easily be integrated into an SOC can contribute considerably to improving internet QoS. To enhance internet QoS, this paper introduces a new DSP architecture that efficiently implements the Viterbi algorithm, one of the algorithms that can correct errors during data transfer, and that is flexible enough to form an SOC together with an MCU and other peripherals.
Existing DSPs are designed mainly to process SOPs (sums of products) efficiently; therefore, they perform well in applications that use many SOPs, such as filter calculation. In some other calculations, however, they lose performance because of low ILP (instruction-level parallelism). Power may also be wasted when the multiply instruction is used to calculate the Hamming distance in Viterbi decoding [6] [7]. This paper proposes a new instruction set for implementing the Viterbi decoding algorithm and a new architecture that processes those instructions efficiently. Performance evaluation shows that, among the DSPs compared, the proposed DSP has the most effective architecture for Viterbi decoding.
Instruction Set for Viterbi Decoding Algorithm
Table 1 shows a programming example of a typical Viterbi decoding process. With hard-decision inputs, the local distance used is the Hamming distance. This is calculated by summing the individual bit differences between the received and the expected data. With soft-decision inputs, the Euclidean distance is typically used. This is defined (for rate 1/C) by:
$$\mathrm{local\_distance}(j) = \sum_{n=0}^{C-1} \left[ SD_n - G_n(j) \right]^2$$
where $SD_n$ are the soft-decision inputs, $G_n(j)$ are the expected inputs for each path state, $j$ is an indicator of the path, and $C$ is the inverse of the coding rate. This distance measure is the squared length of the C-dimensional vector from the received data to the expected data. Expanding the square gives:
$$\mathrm{local\_distance}(j) = \sum_{n=0}^{C-1} \left[ SD_n^2 - 2\, SD_n G_n(j) + G_n^2(j) \right]$$
To minimize the accumulated distance, we are concerned with the portions of the equation that are different for each path. The terms $\sum_{n=0}^{C-1} SD_n^2$ and $\sum_{n=0}^{C-1} G_n^2(j)$ are the same for all paths, thus they can be eliminated, reducing the equation to:
$$\mathrm{local\_distance}(j) = -2 \sum_{n=0}^{C-1} SD_n G_n(j)$$
Because of the leading -2 factor, the local distance is smallest when the sum of products of the soft-decision inputs and the expected inputs is largest. The leading -2 scalar is therefore removed, and maximums are searched for in the metric update procedure. For this kind of local distance calculation, the proposed architecture has powerful ADD/SUB instructions, which enable users to add or subtract signed numbers efficiently.
The proposed DSP also supports the ACS (add-compare-select) part, which selects the shortest path in the Viterbi decoding process: it provides instructions that search for either the maximum or the minimum of two values and store the path decisions into the Viterbi shift registers in sequence. The MAX/MIN instructions set a flag according to the true/false result of comparing two data values.
Viterbi decoding is a computation that must run fast and in which several instructions are executed repeatedly in a loop. Each function forms a loop, and the Viterbi function itself is typically executed repeatedly in a loop as well, so branch instructions are constantly used to move from one loop to another. To exploit this characteristic, the proposed DSP provides loop instructions that can be nested up to three levels deep.
To minimize the performance loss from frequent branch instructions, all instructions are processed as conditional instructions so that they can be placed in branch delay slots without problems. The T bit is set by a compare instruction and can also be used by a branch instruction. Placing conditional instructions in delay slots reduces the penalty of untaken branches. In addition, because this structure prevents fetched instructions from being issued, power consumption is lowered.
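The local distance reduction described above can be checked numerically. The following Python sketch is only an illustration and is not the DSP's instruction semantics: it assumes rate 1/2 (C = 2) and antipodal expected inputs of +1 and -1, and all function names and input values are hypothetical. It shows that searching for the maximum of the reduced metric picks the same path as searching for the minimum Euclidean distance, and that the four metrics for rate 1/2 follow from one signed addition and one signed subtraction, with the other two obtained by a sign change.

```python
from itertools import product


def euclidean_distance(sd, g):
    """Squared Euclidean distance between soft inputs and expected symbols."""
    return sum((s - e) ** 2 for s, e in zip(sd, g))


def correlation(sd, g):
    """Reduced local distance with the leading -2 scalar removed."""
    return sum(s * e for s, e in zip(sd, g))


def branch_metrics_rate_half(sd0, sd1):
    """All four reduced metrics for rate 1/2 from one ADD and one SUB.

    Assumes the expected inputs are antipodal (+1 or -1).
    """
    total = sd0 + sd1  # metric for expected pair (+1, +1)
    diff = sd0 - sd1   # metric for expected pair (+1, -1)
    return {(+1, +1): total, (+1, -1): diff, (-1, +1): -diff, (-1, -1): -total}


if __name__ == "__main__":
    sd = (3, -5)  # hypothetical soft-decision inputs
    paths = list(product((+1, -1), repeat=2))
    nearest = min(paths, key=lambda g: euclidean_distance(sd, g))
    strongest = max(paths, key=lambda g: correlation(sd, g))
    print("same path selected:", nearest == strongest)
    print("branch metrics:", branch_metrics_rate_half(*sd))
```

The equivalence holds here because the squared terms are identical for every candidate path, which is the elimination argument made in the text.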
In the EREP/EREPS instructions, the type tells whether the instructions in the loop are filled into the loop buffer or not. The following example is the case of filling the loop buffer, so that the instructions can be fetched from the buffer on the next turn of the loop.
EREP (typec or typed), LABEL, #16
For Viterbi decoding, the EMIN or EMAX instructions can be used. The DALU contains two Viterbi shift registers, VTSR0 and VTSR1. AC1, the DALU control register, contains the EVS bit and the ESP bit. If EVS is set, the carry of an EMIN/EMAX instruction is entered into VTSR0[15] or VTSR1[15], and the register is shifted. If ESP is set, the pointer to the minimum or to the maximum is saved. The following example shows the conditional instructions for Viterbi decoding.
Fig. 3. Viterbi decoding paths of DALU (blocks shown: 10:4 selector, compare operations using ALU0 and ALU1, VTSR0 and VTSR1, Cflag0 and Cflag1)
Registers shown: A0 to A7 and P0 to P1, each with G, H, and L parts.
The new architecture is composed of four blocks, as shown in Fig. 2. Since the DSP was designed as a coprocessor, it can be used to implement a system together with program memories, data memories, a microprocessor, and peripherals. The LBU (loop buffer unit) fetches the current instruction, behaves as an instruction cache, and calculates branch addresses. This block helps Viterbi decoding loops to be executed rapidly. The RPU (RAM pointer unit) calculates two data addresses
for data memory access, and is divided into two regions: the X region and the Y region. The DALU (data arithmetic and logical unit) performs the actual data calculations, and contains two MACs, a BMU (bit-manipulation unit), and an internal register file. Finally, the CU (control unit) controls the pipeline flow and bus arbitration. Since the DSP has dual MACs, three 32-bit data buses are provided for efficient memory access. The program bus is 32 bits wide, and the four data address buses are divided into two regions of 16 bits.
Fig. 3 shows the behavior of the VTRSHR instruction, which performs the Viterbi shifting. The DALU shown in Fig. 3 is the unit that performs arithmetic and logical operations on data. Since two EMAX/EMIN instructions can be executed simultaneously, the ACS operations of two butterflies can be performed at once, and the VTSR0/VTSR1 registers can save two path metrics independently. In addition, the LBU supports efficient processing of loop instructions. Conditional instructions predicated on the T flag in the status register reduce the penalty of mispredicted branches.
Fig. 4 shows the overall structure of the DALU. As shown in Fig. 4, the register file contains the registers used by the multipliers as well as those used as accumulators, and some accumulators share their position with multiplier registers. This register file performs well in successive multiplications and in additions that follow a multiplication. Since the input register and the output register of a multiplier are the same, successive multiplications can be executed without an additional MOVE instruction, as shown in Fig. 5. This structure enables Viterbi decoding with hard-decision inputs to be executed efficiently.
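The paired compare-select operations and the Viterbi shift registers described above can be modeled in a few lines of Python. This is a behavioral sketch under stated assumptions, not the documented semantics of EMAX or VTRSHR: the function names are hypothetical, the register is assumed to shift right by one with the decision bit entering at bit 15, the decision bit is assumed to be 1 when the second operand is selected, and the two branches of a butterfly are assumed to use metrics of opposite sign.

```python
MASK16 = 0xFFFF  # VTSR0 and VTSR1 are modeled as 16-bit registers


def emax(a, b, vtsr):
    """Select the larger metric and record the decision in a shift register.

    Assumed behaviour: the register shifts right by one position and the
    decision bit (1 when b is selected) enters at bit 15.
    """
    decision = 1 if b > a else 0
    vtsr = ((vtsr >> 1) | (decision << 15)) & MASK16
    return max(a, b), vtsr


def acs_butterfly(pm_even, pm_odd, bm, vtsr0, vtsr1):
    """Add-compare-select for one butterfly using a maximum search.

    pm_even and pm_odd are the old path metrics of the two source states;
    bm is the branch metric, and the opposite branch is assumed to use -bm.
    The two emax calls stand in for two compare-selects issued together.
    """
    new_lo, vtsr0 = emax(pm_even + bm, pm_odd - bm, vtsr0)
    new_hi, vtsr1 = emax(pm_even - bm, pm_odd + bm, vtsr1)
    return new_lo, new_hi, vtsr0, vtsr1


if __name__ == "__main__":
    lo, hi, v0, v1 = acs_butterfly(10, 12, 3, 0x0000, 0x0000)
    print(lo, hi, hex(v0), hex(v1))
```

Keeping one shift register per compare-select means the two decisions of a butterfly never have to be merged into a single word before the traceback, which is one plausible reading of why two registers are provided.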
To evaluate the performance of the proposed DSP in Viterbi decoding, it was benchmarked against three other dual-MAC DSPs. The results are shown in Table 2. Although all four are dual-MAC DSPs, they differ considerably in detail. For example, the Teak™ DSP has neither an instruction cache nor a buffer, whereas the TMS320C55x has both. The number of accumulators also differs among the DSPs. However, the other three DSPs all have a serial MAC structure to perform SOP operations efficiently.
As explained previously, the MAC structure of the proposed DSP differs from that of the other DSPs. This MAC structure, which reduces the number of additional negating instructions in BMG (branch metric generation), increases performance. The two independent Viterbi shift registers, each with its own path-metric calculation capability, also contribute to the DSP's performance. In addition, powerful ADD/SUB instructions for signed data, latency-free loop instructions, various MAX/MIN instructions, and the efficient ILP structure enhance the performance of the DSP. Finally, conditional instructions placed in the delay slots of conditional branches reduce the penalty of not-taken cases by executing these instructions. Table 3 shows that the proposed DSP has the best performance of all for a Viterbi algorithm port using code rate 1/2 and 8 delay registers.
The new DSP is implemented with a 0.18 µm CMOS technology library. The total number of gates is 59327, and the DALU occupies 29.5% of these gates. The results are summarized in Table 4.
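For reference, the benchmark workload named above, a Viterbi decoder for code rate 1/2 with 8 delay registers, can be sketched in software. The following Python code is a plain hard-decision reference model, not the benchmarked DSP program. The generator polynomials 561 and 753 (octal) are an assumption, since the paper does not state which polynomials were used, and the function names are hypothetical.

```python
def parity(x):
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def encode(bits, k=9, polys=(0o561, 0o753)):
    """Rate-1/2 convolutional encoder with k-1 delay registers."""
    state = 0  # the k-1 most recent input bits, newest in the MSB
    out = []
    for b in bits:
        reg = (b << (k - 1)) | state
        out.extend(parity(reg & p) for p in polys)
        state = reg >> 1
    return out


def viterbi_decode(received, k=9, polys=(0o561, 0o753)):
    """Hard-decision Viterbi decoder using the Hamming distance."""
    n_states = 1 << (k - 1)
    inf = float("inf")
    metrics = [0] + [inf] * (n_states - 1)  # the encoder starts in state 0
    history = []
    for i in range(0, len(received), 2):
        r = received[i:i + 2]
        new_metrics = [inf] * n_states
        prev = [0] * n_states
        for state in range(n_states):
            if metrics[state] == inf:
                continue  # state not reachable yet
            for b in (0, 1):
                reg = (b << (k - 1)) | state
                expected = [parity(reg & p) for p in polys]
                # Branch metric: number of differing bits (Hamming distance).
                dist = sum(x != y for x, y in zip(r, expected))
                nxt = reg >> 1
                cand = metrics[state] + dist  # add
                if cand < new_metrics[nxt]:   # compare
                    new_metrics[nxt] = cand   # select
                    prev[nxt] = state
        metrics = new_metrics
        history.append(prev)
    # Traceback from the best final state; the input bit of each step is
    # the MSB of the state that the step arrived in.
    state = min(range(n_states), key=metrics.__getitem__)
    bits = []
    for prev in reversed(history):
        bits.append(state >> (k - 2))
        state = prev[state]
    return bits[::-1]


if __name__ == "__main__":
    message = [1, 0, 1, 1, 0, 0, 1, 0] + [0] * 8  # 8 tail zeros flush the encoder
    noisy = encode(message)
    noisy[3] ^= 1  # inject a single bit error
    print(viterbi_decode(noisy) == message)
```

The add, compare, and select steps marked in the inner loop are the operations that the ADD/SUB and MAX/MIN instructions discussed in the paper are meant to accelerate; with 8 delay registers the inner loop visits 256 states per received symbol pair.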
Conclusion
The new DSP architecture is a 32-bit dual-MAC DSP that behaves as a coprocessor of a microcontroller. Regardless of whether the input is hard decision or soft decision, the following improvements are made to achieve high performance in the Viterbi algorithm. First, powerful ADD/SUB instructions for signed numbers are defined. Viterbi decoding involves various cases arising from the sign combinations of two or three numbers. In these cases, the DSP can execute BMG operations efficiently without additional negation instructions.
Second, an efficient 3-way super-scalar structure is used. Since each instruction can perform two operations simultaneously, it can effectively be regarded as a 6-way super-scalar architecture. The higher ILP of the proposed DSP helps Viterbi decoding at higher code rates to be performed more rapidly.
Third, LBU-supported loop instructions accelerate the processing of Viterbi subroutines. Latency-free loop operations enable Viterbi subroutines to be executed rapidly.
Finally, conditional instructions reduce the misprediction penalty of delayed branches. In case of a misprediction, issued instructions can be flushed by the T bit. This scheme is meaningful because a Viterbi decoding program has many branch instructions.
The benchmark results show that the new DSP has the best performance in Viterbi decoding among the dual-MAC DSPs compared. In addition, because it is a super-scalar architecture, it has an advantage in program memory usage. Therefore, the proposed DSP is suitable for mobile technology, and since the DSP behaves as a coprocessor, it can be used in the form of an SOC.
