Introduction
In competitive markets such as consumer electronics and telecommunications, ASICs often lack flexibility and programmability, while programmable DSPs struggle to meet cost, size, and power consumption demands. An ASIP (application-specific instruction-set processor) offers a compromise between custom ASIC chips and general-purpose DSP chips: it combines the high performance and low power of an ASIC with the flexibility of a DSP [1].
The common method for designing an ASIP is to explore the design space around a specific architecture, which includes developing an application-specific instruction library and a function unit library [2]. Adding key new instructions to the processor for the target application is more efficient and makes the design easier to update.
Multimedia signal processing technology has developed along with the growing demand for consumer electronics. Progress in multimedia research has led to an increasing number of standards, such as MPEG2, MPEG4, H.263, and H.264/AVC. More recent standards introduce new encoding tools that improve picture quality at a given bitrate. However, this does not mean that older standards become obsolete. New standards enter the market gradually, so the ISA of an ASIP design must be updated continuously to address the computing requirements of new video standards.
Existing SIMD multimedia extensions and DSPs support various instructions that execute packed operations between two registers. These operations are used in video signal processing tasks such as motion estimation and compensation and DCT/IDCT. The key idea in multimedia extensions such as Intel's MMX, SSE1, and SSE2, Sun's VIS, HP's MAX, Compaq's MVI, MIPS's MDMX, and Motorola's AltiVec [3] is the exploitation of subword parallelism in SIMD fashion. The TMS320C64xx of Texas Instruments supports special instructions for multimedia signal processing, such as SUBABS4 and AVGx [4]. The Trimedia CPU64 introduced two-slot operations, collapsed load instructions with interpolation, and other extensions [5].
This paper presents a novel ISA named VS ISA (Video Specific Instruction Set Architecture) for the THUASDSP2004 architecture, a video-specific DSP architecture for ASIP design that is scalable for ISA updates [6]. We quantify the performance improvement on H.263 encoding. We also show how the new instructions can be used to optimize modules of other video standards, such as MPEG4 and H.264/AVC.
Architecture Overview
THUASDSP2004 consists of two parts: a VLIW DSP part and a hardware coprocessor part. Fig. 1 shows an overview of the whole architecture. The VLIW DSP core contains one global register file and four local register files. Each local register file corresponds to one function unit cluster, and a cluster can contain two function units. Every function unit in a cluster has its own access port to the corresponding local register file, and all function units in the same cluster share a set of access ports to the global register file. Every register file contains sixteen 32-bit registers [6]. The hardware coprocessor part has a VLC processor for encoding, a VLD processor for decoding, an image coprocessor for converting between YUV and RGB, and a video-specific DMA for video data transfer.
The THUASDSP2004 basic ISA has 74 instructions. The processor runs at 150 MHz, with an instruction throughput of 1200 MIPS.
THUASDSP2004 has a VLIW-like architecture. Table 1 shows the performance comparison of THUASDSP2004 and TMS320C6000 on encoding a QCIF video sequence with the standard ITU-T H.263.
Video Application Characteristics
As seen in Fig. 2, standard video codecs such as MPEG-4, H.263, and H.264/AVC consist of standard modules, such as motion estimation and the discrete cosine transform (DCT).
Within video applications, the high degree of parallelism and the high memory bandwidth requirement have been well researched [7]. The datapath width is another factor that has yet to be determined for video processing. Integer data for multimedia is typically believed to require only 8 or 16 bits. New instructions typically add function logic to the existing datapath. Any increase in the critical path caused by the additional logic should be avoided or kept to a minimum.
Video Specific Instruction Set Architecture
This section describes VS ISA and its enhancement of the THUASDSP2004 basic ISA for video applications. All module implementations are self-contained functions: function call and return overhead is included, function inputs are read from memory, and outputs are written to memory.
As mentioned in Section III, video application data exhibits a high degree of parallelism, and integer data for video typically requires only 8 or 16 bits. For instance, the data size is 8 bits for motion estimation in almost every video codec standard, such as H.263 and H.264/AVC, while it is 16 bits for the integer transform. Because THUASDSP2004 has a 32-bit datapath, we developed SIMD instructions that operate on four 8-bit subwords for motion estimation and on two 16-bit subwords for integer transform, quantization, and inverse quantization.
Proposed instructions for Motion Estimation
A significant part of the computational complexity of video applications lies in motion estimation. Motion estimation exploits the inherent temporal redundancy of a video image. In a typical motion estimation process, each frame is divided into 16x16 macroblocks. Most of the computational complexity of the algorithm is in the "macroblock matching" module, which determines the similarity of a macroblock in the current image to a motion-displaced macroblock in the reference image. We implemented a five-step motion estimation algorithm whose search steps are 8, 4, 2, 1, and 1/2 pixel, and whose search range is 3x3 macroblocks.
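The five-step search can be sketched as follows. This is a minimal illustration rather than the implementation used on THUASDSP2004: it assumes that each step evaluates the eight neighbours of the current best position on a 3x3 grid, and the function name step_search and the cost callback are hypothetical. Positions are expressed in half-pixel units so that the 1/2-pixel step stays an integer. Under these assumptions the search makes 1 + 5 x 8 = 41 cost evaluations, which is consistent with the 41 SAD invocations per macroblock reported for H.263 encoding.

```python
def step_search(cost, start=(0, 0), steps=(16, 8, 4, 2, 1)):
    """Coarse-to-fine motion search (illustrative sketch).

    Positions and steps are in half-pixel units, so the default steps
    correspond to 8, 4, 2, 1 and 1/2 pixel. `cost` is a hypothetical
    callable that maps a candidate vector (dx, dy) to a matching cost
    such as SAD. Returns (best vector, best cost, number of evaluations).
    """
    best = start
    best_cost = cost(best)            # centre point, evaluated once
    evaluations = 1
    for step in steps:
        cx, cy = best                 # centre of this step's 3x3 pattern
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                if dx == 0 and dy == 0:
                    continue          # centre cost is already known
                candidate = (cx + dx, cy + dy)
                c = cost(candidate)
                evaluations += 1
                if c < best_cost:
                    best, best_cost = candidate, c
    return best, best_cost, evaluations


def toy_cost(vector):
    """Toy cost with its minimum at (7, -5) half-pixels, for demonstration."""
    return abs(vector[0] - 7) + abs(vector[1] + 5)
```

Calling step_search(toy_cost) walks from (0, 0) toward the minimum of the toy cost while refining the step size. In a real encoder the callback would compute the SAD between the current macroblock and the displaced reference block.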
Typically, the matching criterion is the Sum of Absolute Differences (SAD) function. To complete the matching of one macroblock, the SAD function is invoked 41 times in H.263 encoding. In equation (1), (a, b) and (m, n) are the upper-left corner positions of the current macroblock and of the 16x16 region in the search range of the previous reconstructed frame, respectively, and Y1 and Y0 are the pixel luminance values of the current frame and the reconstructed frame. Equation (1) is composed of sixty-four instances of equation (2). With the THUASDSP2004 basic ISA, equation (2) needs eight 8-bit loads, four subtractions, four absolute-value operations, and four additions, 20 operations in total. As shown in Fig. 3, with the SUB4ADD and LDDWNA (load non-aligned doubleword) instructions in VS ISA, equation (2) needs only one LDDWNA, one SUB4ADD, and one addition, three operations in total. As seen in Fig. 4, with software pipelining one loop iteration processes a doubleword (64 bits, eight 8-bit pixels), so the execution time of SAD drops from 263 cycles to 37 cycles, a performance improvement of 7.1x.
Whereas the H.263 and MPEG2 standards allow fractional motion vectors at half-pixel granularity, the MPEG4 and H.264/AVC standards support quarter-pixel granularity. Calculating reference data at fractional positions is more involved. Because the reference data position is not word-aligned, LDDWNA is used so that the processor can load 64 bits (eight 8-bit pixels) every cycle. For H.263, as seen in Fig. 5, interpolation can be classified into three modes of operation: horizontal interpolation, vertical interpolation, and rectangle interpolation. With ITPL8PH, ITPL8PV, and ITPL8PR, shown in Fig. 6, the performance improves by 3.4x.
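The subword-parallel SAD computation described above can be emulated in software as a sanity check. The sketch below assumes that SUB4ADD returns the sum of the absolute differences of the four 8-bit subwords of its two 32-bit operands; the exact semantics are defined by Fig. 3 and are not reproduced here, so this is an assumption. The helper names sub4add, pack4, sad_scalar, and sad_packed are hypothetical. For a 16x16 macroblock the packed version performs 256 / 4 = 64 emulated SUB4ADD operations.

```python
def sub4add(a, b):
    """Assumed SUB4ADD semantics: sum of |a_i - b_i| over four 8-bit subwords."""
    total = 0
    for shift in (0, 8, 16, 24):
        total += abs(((a >> shift) & 0xFF) - ((b >> shift) & 0xFF))
    return total


def pack4(pixels):
    """Pack four 8-bit pixels into one 32-bit word (first pixel in the low byte)."""
    p0, p1, p2, p3 = pixels
    return p0 | (p1 << 8) | (p2 << 16) | (p3 << 24)


def sad_scalar(cur, ref):
    """Reference SAD, one pixel at a time, over two blocks given as lists of rows."""
    return sum(abs(c - r)
               for cur_row, ref_row in zip(cur, ref)
               for c, r in zip(cur_row, ref_row))


def sad_packed(cur, ref):
    """SAD computed four pixels at a time with the emulated SUB4ADD.

    The row width is assumed to be a multiple of four pixels.
    """
    acc = 0
    for cur_row, ref_row in zip(cur, ref):
        for x in range(0, len(cur_row), 4):
            acc += sub4add(pack4(cur_row[x:x + 4]), pack4(ref_row[x:x + 4]))
    return acc
```

Both functions return the same value for any pair of equally sized 8-bit blocks; the packed form simply mirrors how one SUB4ADD plus one accumulation replaces four subtractions, four absolute values, and four additions.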
Proposed instructions for Integer Transform
The 8x8 2D-DCT and 2D-IDCT are row-column separated into 8-point 1D transforms on 16-bit data. We use the Chen algorithm for the DCT and the Loeffler algorithm for the IDCT. Both algorithms make frequent use of the conjugated and rotate operators, defined by formulas (5) and (6). Each operator takes two inputs and produces two outputs. VS ISA includes specific instructions for the conjugated and rotate operators. As seen in Fig. 7, CJG2 (two 16-bit subwords conjugated) implements two 16-bit conjugated operators in one clock cycle, and RTT2SRMx (two 16-bit subwords rotate with signed rounding in mode x) computes two 16-bit rotate operators in parallel. Signed rounding has three clip modes; each mode has a given R.
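Formulas (5) and (6) are not reproduced in this text, so the following sketch uses the textbook forms of the two operators found in fast DCT algorithms: a butterfly (sum and difference) for the conjugated operator and a fixed-point plane rotation for the rotate operator. These definitions, the 12-bit fraction, the round-half-up offset, and all function names are assumptions made for illustration; they are not the semantics of CJG2 or RTT2SRMx, whose rounding and clip modes are defined by the paper.

```python
import math

FRAC_BITS = 12  # hypothetical fixed-point precision of the rotation constants


def conjugated(x0, x1):
    """Assumed conjugated (butterfly) operator: two inputs, two outputs."""
    return x0 + x1, x0 - x1


def rotate(x0, x1, angle, frac_bits=FRAC_BITS):
    """Assumed rotate operator: plane rotation in fixed-point arithmetic.

    The products are rounded by adding half an LSB before the arithmetic
    right shift. This is a simple stand-in for the signed rounding modes.
    """
    c = round(math.cos(angle) * (1 << frac_bits))
    s = round(math.sin(angle) * (1 << frac_bits))
    half = 1 << (frac_bits - 1)
    y0 = (x0 * c + x1 * s + half) >> frac_bits
    y1 = (x1 * c - x0 * s + half) >> frac_bits
    return y0, y1


def conjugated2(pair_a, pair_b):
    """Two independent conjugated operators, mirroring a two-subword instruction."""
    return conjugated(*pair_a), conjugated(*pair_b)
```

Each helper takes two inputs and produces two outputs, matching the operator shape described above. A two-subword instruction would apply the same operator to two such 16-bit pairs at once, which conjugated2 imitates for the butterfly case.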
The DCT code is compiled and scheduled manually into 35 VLIW instructions, including function call/return overhead, loading the 8x8 pixels, performing the 2-dimensional DCT, and storing the 8x8 coefficients back in memory. 
Performance Evaluation
We evaluate the performance of VS ISA using hand-written assembly code. The processor operates at 150 MHz, synthesized with the SMIC 0.18 library under worst-case conditions, with 128 Kbytes of on-chip memory. We encoded the "Foreman" sequence at QCIF resolution with H.263 at a target bitrate of 32 kbps, reaching 451 frames per second, as shown in Table 1. Frames are represented in 4:2:0 format, resulting in six blocks per macroblock. Table 2 gives a performance comparison between the THUASDSP2004 basic ISA and VS ISA. As seen, VS ISA improves processor performance by 5.77x for I frames and 4.70x for P frames.
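The reported frame rate can be translated into an average cycle budget with simple arithmetic. The sketch below assumes that the 451 frames per second were measured at the 150 MHz clock and that the figure is an average over all frame types; QCIF is 176x144 pixels, which gives 11 x 9 = 99 macroblocks per frame. The function name cycle_budget and its parameters are hypothetical.

```python
def cycle_budget(clock_hz=150_000_000, frames_per_second=451,
                 width=176, height=144, mb_size=16, blocks_per_mb=6):
    """Average cycles available per frame, macroblock and block (rough estimate)."""
    mbs_per_frame = (width // mb_size) * (height // mb_size)
    cycles_per_frame = clock_hz / frames_per_second
    cycles_per_mb = cycles_per_frame / mbs_per_frame
    return {
        "macroblocks_per_frame": mbs_per_frame,
        "blocks_per_frame": mbs_per_frame * blocks_per_mb,
        "cycles_per_frame": cycles_per_frame,
        "cycles_per_macroblock": cycles_per_mb,
        "cycles_per_block": cycles_per_mb / blocks_per_mb,
    }


if __name__ == "__main__":
    for name, value in cycle_budget().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the encoder spends on the order of a few thousand cycles per macroblock on average, which gives a sense of scale for the per-module cycle counts quoted earlier.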
The IDCT computation is an interesting benchmark from the targeted application domain because it illustrates the power of combining VLIW, subword parallelism, and an extensive set of operations. Fig. 9 shows the number of cycles needed to compute the IDCT of an 8x8 block on different architectures. As seen, VS ISA outperforms both DSPs and SIMD multimedia extension architectures by 1.6x to 8.57x in computing the IDCT [4] [5].
Conclusion
This paper proposes an efficient instruction set named VS ISA, together with its hardware architecture, for video applications. A processor with VS ISA based on the THUASDSP2004 architecture is implemented to quantify the performance improvement on H.263 encoding, and satisfactory results are achieved.
