Abstract - A programmable and scalable parallel architecture is proposed for the real-time encoding/decoding of HDTV images and for nonlinear editing of the compressed video data. It uses only intra-mode compression/decompression so that nonlinear editing can be performed easily and high-quality images can be recovered. Spatially partitioned image data are processed concurrently by multiple parallel processing units (PU's). Each PU consists of a programmable parallel digital signal processor, called the multimedia video processor (MVP; TMS320C80), and reconfigurable field programmable logic devices (FPLD's). The performance of the proposed real-time encoding/decoding system (REDS) is described in terms of the MVP cycles required for transform coding and the FPLD throughput for entropy coding. Robust RD-optimized quantization matrices for HDTV images are presented.
INTRODUCTION
Multimedia processing is now expected to be the driving force in the evolution of computing, communication, and broadcasting technologies. Digital video and audio processing methods are the most powerful means of extending the capability of television services in both studio and broadcasting systems. As the resolution of digital TV increases, very fast data handling and high computational power are required. The data compression technology of the MPEG-2 standard will be applied to HDTV broadcasting. However, other aspects must be considered for nonlinear HDTV editing in the studio. The compression technique used to store and retrieve HDTV data should be computationally efficient, because the computational requirements are proportional to the amount of data, which is vast. Also, the perceptual image quality should be nearly the same for every picture throughout the compression and decompression process. Therefore, JPEG-like encoding/decoding, in which only intra-mode compression is used, is suitable for HDTV editing. If motion-compensated compression such as the MPEG-2 MP@HL standard is applied to nonlinear HDTV editing, it results in degraded image quality and increased computational complexity.
The REDS consists of three functional units, as shown in Fig. 1: the video unit (VU), the processing unit (PU), and the host unit (HU). It is a multiprocessor architecture based on spatial partitioning of the image among processing elements (PE's). In the VU, the digitized HDTV data [1] are first aligned to 64 bits according to their color components (formatting) and then divided into multiple strips (slicing), each of which includes multiple macroblocks and is the unit of a data packet. Each PU executes the real-time image encoding/decoding for the strips assigned to it. A PU is composed of a programmable general-purpose digital signal processor, called the multimedia video processor (MVP; TMS320C80), and reconfigurable field programmable logic devices (FPLD's). The programmability and scalability of the PU's provide flexibility for a variety of applications and for future programmable coding, such as traditional video coding or image representation using the description of moving scenes. Various processing algorithms can be implemented in the PU's by developing software only, without changing the system hardware [2]. For nonlinear editing, intra-mode compression can be implemented in each PU. It consists mainly of two functional modules: the discrete cosine transform (DCT), which reduces the spatial redundancy, and the entropy coding, which reduces the statistical redundancy. The MVP performs the DCT, whereas the entropy coding (variable length coding) is implemented in the FPLD's. The compressed strips from all PU's are collected at the HU and stored on the parallel disk system. Decompression is the reverse of the compression process.
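The slicing step can be illustrated with a minimal sketch. It assumes a 16-line macroblock height, one macroblock row per strip, a 1080-line frame, and round-robin assignment of strips to PU's; none of these details is given above, and the function name `partition_into_strips` is hypothetical.

```python
MB_LINES = 16  # assumed macroblock height in lines


def partition_into_strips(frame_height, num_pus, mb_rows_per_strip=1):
    """Return, for each PU, a list of (first_line, end_line_exclusive) strips."""
    # Round up so that a partial macroblock row at the bottom is still covered.
    mb_rows = (frame_height + MB_LINES - 1) // MB_LINES

    strips = []
    for start in range(0, mb_rows, mb_rows_per_strip):
        end = min(start + mb_rows_per_strip, mb_rows)
        strips.append((start * MB_LINES, min(end * MB_LINES, frame_height)))

    # Round-robin assignment (an assumption; contiguous bands would also work).
    assignment = [[] for _ in range(num_pus)]
    for index, strip in enumerate(strips):
        assignment[index % num_pus].append(strip)
    return assignment


if __name__ == "__main__":
    for pu, strips in enumerate(partition_into_strips(1080, 8)):
        print("PU", pu, "handles", len(strips), "strips, first:", strips[0])
```

Because every strip is intra-coded independently, the mapping from strips to PU's affects only load balance, not the decoded result.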
TRANSFORM CODING
The MVP consists of four fixed-point advanced DSP's (ADSP's), one floating-point master processor (MP), and interconnection networks. The four ADSP's perform the DCT in a SIMD-like mode. In order to achieve real-time processing of a whole HDTV image, a fast DCT algorithm is proposed that exploits a characteristic of the ADSP: it can perform one multiplication and two additions simultaneously in one cycle. Some fast algorithms [3, 4] exploit the properties of the DCT multiplicative constants. We modified Chen's DCT algorithm so that the numbers of multiplications and additions are balanced for the ADSP. The remaining multiplicative constants are collected into a scaling matrix $M = \{m_{i,j}\}$, with $m_{0,0} = c_0 c_0$, $m_{0,j} = c_0 c_j$, $m_{i,0} = c_i c_0$, and $m_{i,j} = c_i c_j$ for $0 < i, j < 8$. The quantization matrix can also be absorbed into $M$ by scaling each of these multiplicative constants, $M = \{m_{i,j}/q_{i,j}\}$, where $q_{i,j}$ is the corresponding quantization step. Table 1 shows the number of operations required by various 2-D DCT algorithms for an 8 × 8 block. According to these figures, eight MVP's are required for the real-time implementation of the proposed DCT and inverse DCT (IDCT) for HDTV images. The IDCT requires more ADSP cycles because the multiplications at the first stage of the proposed flow-graph must be computed. Figure 3 shows the PSNR differences from the floating-point 2-D DCT for the proposed and other fixed-point implementations. The proposed algorithm shows better performance.
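The idea of absorbing the quantization matrix into M can be checked numerically. The sketch below is not the modified Chen flow-graph: it uses a direct (slow) DCT, floating-point arithmetic, the orthonormal scale factors as an assumed definition of the constants c, and a hypothetical quantization matrix `Q` and test block `BLOCK`. It only shows that an unscaled transform followed by one multiplication per coefficient gives the same result as a normalized DCT followed by an explicit division.

```python
import math

N = 8


def c(k):
    """Orthonormal DCT-II scale factor (assumed normalisation)."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def unscaled_dct_1d(x):
    """1-D DCT-II with the c(k) factors left out (deferred to M)."""
    return [sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for n in range(N)) for k in range(N)]


def unscaled_dct_2d(block):
    """Row-column 2-D transform built from the unscaled 1-D DCT."""
    rows = [unscaled_dct_1d(r) for r in block]
    cols = [unscaled_dct_1d([rows[i][v] for i in range(N)]) for v in range(N)]
    return [[cols[v][u] for v in range(N)] for u in range(N)]


def absorbed_matrix(q):
    """M = {m_ij / q_ij} with m_ij = c(i) * c(j)."""
    return [[c(i) * c(j) / q[i][j] for j in range(N)] for i in range(N)]


def scaled_coeffs(block, m):
    """One multiply per coefficient does DCT scaling and quantizer division."""
    y = unscaled_dct_2d(block)
    return [[y[i][j] * m[i][j] for j in range(N)] for i in range(N)]


def reference_coeffs(block, q):
    """Direct orthonormal 2-D DCT followed by an explicit division by q."""
    out = []
    for u in range(N):
        row = []
        for v in range(N):
            s = sum(block[i][j]
                    * math.cos((2 * i + 1) * u * math.pi / (2 * N))
                    * math.cos((2 * j + 1) * v * math.pi / (2 * N))
                    for i in range(N) for j in range(N))
            row.append(c(u) * c(v) * s / q[u][v])
        out.append(row)
    return out


# Hypothetical inputs, for illustration only.
Q = [[8 + 2 * (i + j) for j in range(N)] for i in range(N)]
BLOCK = [[(7 * i + 3 * j) % 32 + 100 for j in range(N)] for i in range(N)]

# Quantized levels: rounding is the only step left after the multiply.
levels = [[round(v) for v in row]
          for row in scaled_coeffs(BLOCK, absorbed_matrix(Q))]
```

In a fixed-point implementation the entries of M would be stored as scaled integers, which is where the PSNR differences from the floating-point DCT arise.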
R-D OPTIMIZED QUANTIZATION
Scaled JPEG or MPEG quantization matrices are normally used to adjust the required quality and the compression rate. HDTV images have different characteristics from conventional JPEG images: a different aspect ratio and higher spatial correlation. Ratnakar and Livny [5] proposed the rate-distortion optimized (RD-OPT) algorithm, which considers the optimal RD tradeoffs for a given image. The algorithm uses the DCT coefficient distribution over a wide range of rates and distortions, and an optimal quantization table is obtained by dynamic programming. However, a quantization matrix optimized for one specific image may not be optimal for other images.
Nearly optimal Q matrices were obtained by statistically averaging the RD-OPT matrices obtained from various HDTV images. The simulation results show that the reconstructed quality using the proposed Q matrices is better than that obtained with the JPEG quantization matrix. The input bit-rates to the variable length coder (VLC) were analyzed using the three MPEG-2 test image sequences "Football", "Cheer Leaders", and "Mobile and Calendar". "Mobile and Calendar" has the largest average number of nonzero DCT coefficients, 11.66 coefficients/block. The FPLD for the VLC is fully pipelined with five stages. From the run-level coded data, a code symbol and a code length are obtained by table look-up. The average throughput of the VLC-FPLD is 20 million codewords/s. Therefore, eight PU's are required for the real-time implementation of the VLC for HDTV images.
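The run-level coding and table look-up steps can be sketched in software as follows. The table `VLC_TABLE` is a hypothetical toy prefix code covering only four (run, level) combinations, and the appended sign bit and end-of-block marker are assumptions made for the example. The actual code tables and the five pipeline stages of the FPLD are not reproduced here.

```python
def run_level_pairs(scan_coeffs):
    """Convert coefficients in scan order to (zero_run, level) pairs."""
    pairs, run = [], 0
    for coeff in scan_coeffs:
        if coeff == 0:
            run += 1
        else:
            pairs.append((run, coeff))
            run = 0
    return pairs  # trailing zeros are covered by the end-of-block code


# Hypothetical toy table: (run, |level|) -> (code symbol, code length).
VLC_TABLE = {
    (0, 1): (0b11, 2),
    (1, 1): (0b011, 3),
    (0, 2): (0b0100, 4),
    (2, 1): (0b0101, 4),
}
EOB = (0b10, 2)  # assumed end-of-block code


def encode_block(scan_coeffs):
    """Look up each pair and append a sign bit (0 = positive, 1 = negative)."""
    codewords = []
    for run, level in run_level_pairs(scan_coeffs):
        code, length = VLC_TABLE[(run, abs(level))]  # KeyError if not covered
        sign = 1 if level < 0 else 0
        codewords.append(((code << 1) | sign, length + 1))
    codewords.append(EOB)
    return codewords


def to_bitstring(codewords):
    """Concatenate (code, length) pairs into a string of '0'/'1' characters."""
    return "".join(format(code, "0{}b".format(length))
                   for code, length in codewords)


if __name__ == "__main__":
    block = [2, 0, -1, 1, 0, 0, 1, 0, 0, 0]
    print(run_level_pairs(block))
    print(to_bitstring(encode_block(block)))
```

Each look-up returns both the code symbol and its length, which is the information a hardware packer needs to concatenate variable-length codewords into fixed-width output words.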
The variable length of the Huffman code is optimized for coding efficiency, but it limits the decoding throughput because decoding is a recursive, data-dependent procedure: VLC decoding must be performed sequentially. The VLC decoder (VLD) generally includes a feedback path, which is the critical path for code length-decoding (LD) and symbol-decoding (SD). In the parallel PLA-based architecture proposed by Sun and Lei [6], the feedback path consists of three or four sequential processes. In our FPLD implementation of the VLD, two shift processes, the input plane barrel shifter (IPBS) for the LD and the SD and the output plane barrel shifter (OPBS) for next-input alignment, can be performed concurrently, as shown in Fig. 6. In addition, the decision whether new data must be read can be made in parallel with these two shift processes. In a simulation using sample frames from the MPEG-2 video sequences "Mobile and Calendar", "Football", and "Cheer Leaders", the processing time required by the proposed method for intra-mode pictures was about 70% of that required by [6].
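The data dependence that creates the feedback path can be seen in a small sequential software model. This is only a sketch with a hypothetical toy codebook `CODEBOOK`; it does not model the concurrent IPBS/OPBS hardware, and the comments that mention those blocks are loose analogies rather than a description of the FPLD design.

```python
# Hypothetical toy prefix code: bit pattern -> symbol.
CODEBOOK = {"10": "EOB", "11": "A", "011": "B", "0100": "C", "0101": "D"}
MAX_LEN = max(len(pattern) for pattern in CODEBOOK)

# Expand the codebook into a table indexed by the next MAX_LEN bits, so that
# length-decoding (LD) and symbol-decoding (SD) come from a single look-up.
LUT = {}
for pattern, symbol in CODEBOOK.items():
    pad = MAX_LEN - len(pattern)
    for tail in range(1 << pad):
        suffix = format(tail, "0{}b".format(pad)) if pad else ""
        LUT[pattern + suffix] = (symbol, len(pattern))


def decode(bits):
    """Decode a string of '0'/'1' characters until EOB or end of input."""
    symbols, pos = [], 0
    while pos < len(bits):
        # Align a fixed-width window at the current position (IPBS-like role).
        window = bits[pos:pos + MAX_LEN].ljust(MAX_LEN, "0")
        symbol, length = LUT[window]  # KeyError on an invalid pattern
        symbols.append(symbol)
        # Feedback: the next window position depends on the decoded length,
        # so codeword k+1 cannot be located before codeword k is decoded.
        pos += length
        if symbol == "EOB":
            break
    return symbols


if __name__ == "__main__":
    stream = "11" + "011" + "0100" + "0101" + "10"
    print(decode(stream))
```

The update `pos += length` is the software counterpart of the feedback path: shortening the work done between two consecutive updates is what raises the decoding throughput.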
CONCLUSIONS
A real-time encoding/decoding system, called REDS, for nonlinear editing of HDTV images was presented. It is a multiprocessor architecture that uses the MVP as its PE. Equally sized, spatially partitioned data are transferred to each PE and compressed by transform coding and entropy coding to reduce the spatial redundancy and the statistical redundancy, respectively. The signal flow-graph of the Chen DCT algorithm was modified to suit implementation on the MVP, so that the required ADSP cycles are reduced and the PSNR performance is better than that of the other methods. By averaging Q matrices obtained with the RD-OPT algorithm, two scalable Q matrices were obtained. The proposed Q matrices outperform the JPEG Q matrices. The FPLD implementations of the VLC and VLD were simulated to evaluate their real-time performance.
