A Multi-standards HDTV Video Decoder for Blu-ray Disc Standard
Noriyuki Minegishi, Hidenori Sato, Fumitaka Izuhara, Masayuki Koyama, Anthony Vetro, Senior Member, IEEE
I. INTRODUCTION
Several video compression standards, e.g., MPEG-2, H.264/MPEG-4 AVC and Windows Media Video (VC-1), have been established and are used in practical applications such as recent terrestrial broadcast and high-compression optical disc. Semiconductor devices that implement these standards for multimedia applications are required to achieve both high performance and cost effectiveness. Several solutions have been introduced [1]-[3]; however, none of them supports high-compression optical disc standards such as Blu-ray. To develop a chip for practical use, hardware size, memory usage and memory access bandwidth must be considered. We propose a multi-standard video decoder core that adopts dynamic and static re-configurable techniques and a data compression method suitable for all video standards. High compression of video streams is required for many types of consumer electronics products such as DVD, DTV, digital cameras and set-top boxes. Since most standard compression methods include transform and quantization techniques, block noise and ringing artifacts tend to appear at high compression ratios.
The rest of this paper is organized as follows. Section II gives an overview of the video core architecture. Section III describes the proposed dynamic re-configurable variable-length coding (VLC) table. Section IV describes the data compression method and the corresponding syntax. Implementation results are presented in Section V, and conclusions in Section VI.
II. OVERVIEW OF THE CORE ARCHITECTURE
This section describes the requirements and design issues for video decoders and the architecture that addresses them. To satisfy the requirements of multimedia services and consumer products, the video decoder should be capable of decoding multiple standard formats while achieving both high performance and low cost. To meet these demands, several design issues must be considered. For the entropy decoding function, the design must accommodate different VLC tables, and its performance must meet the bit-rate limits specified by each standard. The data bandwidth and capacity of external memory should also be minimized to realize a low-cost product. Cost effectiveness must be considered for every decoder function block to achieve a low-cost chip.
To solve the above issues, we propose the video decoder architecture shown in Figure 1. Given the requirements of the Advanced profile of VC-1 and the High profile of AVC, real-time entropy decoding cannot be realized in a single pass at a practical clock frequency. Therefore, the overall decode operation is divided into two parts: the VLC decode section and the pixel operation section. The VLC decode section handles the maximum bit rate (40 Mbit/s), and the pixel operation section handles the maximum frame size and frame rate (1080i at 30 frames/s). This partitioned architecture achieves the maximum performance required by the Blu-ray specification at a clock frequency of 162 MHz.
A hybrid architecture is adopted for VLC decoding to realize both flexibility and high performance. For VLC decoding, a dynamic re-configurable VLC table is introduced to minimize the hardware needed for the quite different VLC tables specified by each video standard. Moreover, a data compression method based on Exp-Golomb codes is applied and implemented in the data buffer blocks. To prevent the buffer from running empty, VLC decoding must be performed fast enough. The data compression function reduces external memory usage and the access bandwidth between the core and external memory to satisfy this requirement.
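To make the performance targets above concrete, the following sketch derives the per-macroblock cycle budget and the average bit throughput implied by the quoted figures (162 MHz, 40 Mbit/s, 1080i at 30 frames/s). The 16x16 macroblock size and the coded frame size of 1920x1088 are assumptions of this sketch, not figures stated in the text.

```python
# Back-of-the-envelope budget from the figures quoted in the text.
CLOCK_HZ = 162_000_000      # core clock (from the text)
MAX_BITRATE = 40_000_000    # maximum stream bit rate in bit/s (from the text)
FRAME_RATE = 30             # frames per second (from the text)

# Assumptions of this sketch: 16x16 macroblocks, height padded to 1088.
WIDTH, HEIGHT = 1920, 1088
MB_SIZE = 16


def budget():
    mbs_per_frame = (WIDTH // MB_SIZE) * (HEIGHT // MB_SIZE)
    mbs_per_second = mbs_per_frame * FRAME_RATE
    cycles_per_mb = CLOCK_HZ / mbs_per_second
    bits_per_cycle = MAX_BITRATE / CLOCK_HZ
    return mbs_per_frame, mbs_per_second, cycles_per_mb, bits_per_cycle


if __name__ == "__main__":
    per_frame, per_second, cycles, bits = budget()
    print(f"macroblocks per frame : {per_frame}")
    print(f"macroblocks per second: {per_second}")
    print(f"cycles per macroblock : {cycles:.1f}")
    print(f"average bits per cycle: {bits:.3f}")
```

Under these assumptions the pixel operation section has roughly 660 cycles per macroblock, while the VLC decode section must sustain about a quarter of a bit per cycle on average. Peak demands within a picture can be much higher than the average, which is consistent with decoupling the two sections through a buffer.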
III. A DYNAMIC RE-CONFIGURABLE VLC TABLE
A key design issue for entropy decoding in our architecture is to realize different VLC tables with a low-cost, high-performance implementation. Therefore, we adopted dynamic re-configurable hardware. It does not need registers to hold the values to be compared, which reduces cost. The comparison bit width can be set from 1 bit to 4 bits. This grouping is a good compromise between low cost and high performance. Each cell outputs its own number if the input value is matched; if not, the cell outputs a "0" value. In this way, the "matched PE number" does not need a selector; it simply consists of an "OR" tree. This also helps to reduce cost. Mapping information and coefficients are stored in the memory. This design provides both flexibility and low cost.
Figure 3 shows an example of how a comparison is mapped onto the PE array. An entropy coding table can be regarded as a tree search structure. Shorter codes are assigned to symbols with higher probability and are placed in the upper nodes of the tree. According to our MPEG-2 video sequence simulations, about 40% of the codes are covered within 4 bits. Hence, 4-bit comparisons are chosen as a fair trade-off between hardware and performance. The variable length decoding process is described below.
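The selector-free matching described above can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the actual circuit: the patterns in `PE_CONFIG` are hypothetical, and PE numbers start at 1 here so that "0" can unambiguously mean "no match".

```python
WINDOW_BITS = 4  # the text compares between 1 and 4 bits at a time

# Hypothetical mapping: (PE number, bit pattern, pattern width in bits).
# The patterns form a prefix-free set, so at most one PE can match.
PE_CONFIG = [
    (1, 0b00, 2),
    (2, 0b01, 2),
    (3, 0b10, 2),
    (4, 0b1100, 4),
]


def pe_output(pe_id, pattern, width, window):
    """One PE cell: emit its own number on a match, otherwise 0."""
    top_bits = window >> (WINDOW_BITS - width)  # leading `width` bits
    return pe_id if top_bits == pattern else 0


def matched_pe(window):
    """OR all PE outputs together; no selector (multiplexer) is needed."""
    result = 0
    for pe_id, pattern, width in PE_CONFIG:
        result |= pe_output(pe_id, pattern, width, window)
    return result


if __name__ == "__main__":
    for window in (0b0000, 0b0111, 0b1011, 0b1100, 0b1110):
        print(f"{window:04b} -> PE {matched_pe(window)}")
```

Because non-matching cells contribute only zeros, the OR of all outputs equals the number of the single matching cell. This relies on the mapped patterns being prefix-free, which holds for any valid VLC table.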
B. Decoding Example
At the beginning, the PE group identifiers "R0" and "R2" in Figure 2 are activated, and PE0, PE1 and PE6 to PE13 are selected to compare nodes "n0" to "n4" as shown in Figure 3. If PE9, which is assigned to the branch node "n3", is matched, the mapping information read from the on-chip memory changes. The table hardware is thereby dynamically re-configured and continues the search in the second row of the VLC table tree. For the second-row comparison, the control logic disables "R0" and "R2" and activates "R1" and "R4" as shown in Figure 2. Then PE2 to PE5 and PE22 to PE29 in Figure 3 are selected, and nodes "n5" to "n13" are compared. If PE2, which is assigned to a terminal node of the VLC table, is matched, the on-chip memory outputs a coefficient value. The VLC table hardware then returns to its initial state and begins to search for the next code.
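The two-step search in this example can be modeled in software as a table that is switched whenever a branch node matches. The sketch below is illustrative only: the group names, code patterns and symbols in `TABLE` are hypothetical and do not reproduce the tables of Figures 2 and 3.

```python
# Each group lists (pattern, kind, payload). A "leaf" outputs a symbol and
# returns to the root group; a "branch" re-configures to another group.
# Patterns are at most 4 bits long, as in the text.
TABLE = {
    "root": [
        ("00", "leaf", "A"),
        ("01", "leaf", "B"),
        ("10", "leaf", "C"),
        ("1100", "leaf", "D"),
        ("1101", "branch", "row2"),
        ("111", "leaf", "E"),
    ],
    "row2": [
        ("0", "leaf", "F"),
        ("10", "leaf", "G"),
        ("11", "leaf", "H"),
    ],
}


def decode(bits):
    """Decode a string of '0'/'1' characters into a list of symbols."""
    symbols = []
    pos = 0
    group = "root"
    while pos < len(bits):
        for pattern, kind, payload in TABLE[group]:
            if bits.startswith(pattern, pos):
                pos += len(pattern)
                if kind == "leaf":
                    symbols.append(payload)  # terminal node: output a value
                    group = "root"           # return to the initial state
                else:
                    group = payload          # branch node: re-configure
                break
        else:
            raise ValueError(f"no matching code at bit {pos}")
    return symbols


if __name__ == "__main__":
    stream = "00" + "11010" + "111" + "110111" + "1100"
    print(decode(stream))
```

Short, frequent codes are resolved in a single comparison step, while longer codes cost one additional step per branch node, which mirrors the trade-off behind the 4-bit comparison width.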
IV. AN INTERMEDIATE DATA COMPRESSION METHOD
To meet the required performance of the Advanced profile of VC-1 and the High profile of AVC at a practical clock frequency, the decode operation is divided into the VLC decode section and the pixel operation section. This architecture needs to store intermediate data, which has a high data volume. Therefore, a data compression method is introduced to minimize memory usage.
For the data compression method, two key aspects should be considered: the workload of compression and the memory cost. The proposed approach attempts to balance the two. Table 1 shows a sample compression syntax for CABAC coefficients. Run-level coding and Exponential-Golomb (Exp-Golomb) codes are applied to minimize the workload; the Exp-Golomb algorithm was chosen considering both compression rate and encode-decode performance. However, Exp-Golomb compression is not efficient for large values. Hence, we set 14 bits as the length limit for LEVEL data compression, a value that was determined empirically.
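The inefficiency for large values can be seen from the code length of a standard Exp-Golomb code, which is about twice the bit length of the value. The sketch below implements the conventional unsigned code and the signed mapping used in H.264; it is assumed, not confirmed by the text, that the proposed syntax follows the same conventions. How values beyond the 14-bit limit are signaled is not specified here, so no escape mechanism is modeled.

```python
def ue_encode(value):
    """Unsigned Exp-Golomb: n zeros, then the (n+1)-bit binary of value+1."""
    x = value + 1
    n = x.bit_length() - 1
    return "0" * n + format(x, "b")


def ue_decode(bits, pos=0):
    """Return (value, next position) for the code starting at `pos`."""
    n = 0
    while bits[pos + n] == "0":       # count the zeros before the "1" marker
        n += 1
    x = int(bits[pos + n:pos + 2 * n + 1], 2)
    return x - 1, pos + 2 * n + 1


def signed_to_unsigned(v):
    """Map 0, 1, -1, 2, -2, ... to 0, 1, 2, 3, 4, ... (H.264 convention)."""
    return 2 * v - 1 if v > 0 else -2 * v


def unsigned_to_signed(k):
    return (k + 1) // 2 if k % 2 else -(k // 2)


def se_encode(v):
    return ue_encode(signed_to_unsigned(v))


def se_decode(bits, pos=0):
    k, nxt = ue_decode(bits, pos)
    return unsigned_to_signed(k), nxt


if __name__ == "__main__":
    for v in (0, 1, 3, 255, 8191):
        print(f"value {v:5d}: {len(ue_encode(v)):2d} bits")
```

In this model a small value such as 0 costs a single bit, whereas the value 8191 costs 27 bits, nearly twice the 14 bits of a plain fixed-length representation. This is the behavior that motivates bounding the code length for LEVEL data.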
A. Data Compression Method Syntax
The Exp-Golomb table with fixed length code is shown in Table 2. The value "1" is an indicator, and the number of "0"s preceding the "1" denotes the bit length to be decoded. This compression method is applied to transform coefficient and motion vector data. Since coefficient values are signed, the signed Exp-Golomb code is adopted.
B. Run Count Example
Figure 4 shows a run-count example in which the proposed method is applied. In many cases, non-zero coefficients appear early in the scan order. Hence, a forward scan is applied first to indicate the position of the last non-zero coefficient. This typically gives a smaller first RUN value than scanning backward. For the rest of the coefficients, however, a backward scan is used. In the example, the first RUN is "7", since seven coefficients precede the last non-zero coefficient in forward scan order, and the LEVEL is "1". The "0" coefficients are then counted backward: the next RUN is "2" and the LEVEL is "5". The syntax continues until the first coefficient is reached.
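The mixed scan order of the run-count example can be sketched as follows. The coefficient block used here is hypothetical: only its first two (RUN, LEVEL) pairs, (7, 1) and (2, 5), follow the example in the text, and the remaining values are made up for illustration. The sketch also assumes that the decoder knows the number of pairs and the block size, which the text does not specify.

```python
def encode_run_level(coeffs):
    """Return (RUN, LEVEL) pairs for a block given in forward scan order."""
    nonzero = [i for i, c in enumerate(coeffs) if c != 0]
    if not nonzero:
        return []
    last = nonzero[-1]
    # First pair: forward scan, RUN = number of coefficients before the
    # last non-zero coefficient.
    pairs = [(last, coeffs[last])]
    # Remaining pairs: backward scan, RUN = zeros skipped going backward.
    prev = last
    for i in reversed(nonzero[:-1]):
        pairs.append((prev - i - 1, coeffs[i]))
        prev = i
    return pairs


def decode_run_level(pairs, size):
    """Rebuild the coefficient block from (RUN, LEVEL) pairs."""
    coeffs = [0] * size
    pos = None
    for run, level in pairs:
        pos = run if pos is None else pos - run - 1
        coeffs[pos] = level
    return coeffs


if __name__ == "__main__":
    # Hypothetical 16-coefficient block in forward scan order.
    block = [3, 0, -2, 0, 5, 0, 0, 1] + [0] * 8
    pairs = encode_run_level(block)
    print(pairs)
    print(decode_run_level(pairs, len(block)) == block)
```

The trailing zeros after the last non-zero coefficient never appear in the syntax, and the first RUN stays small whenever the non-zero coefficients cluster near the start of the scan.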

V. IMPLEMENTATION AND RESULT
The core was implemented in HDL using a top-down approach and synthesized with a 90 nm CMOS ASIC library. The gate count of the core is 1.5M gates and the maximum operating clock frequency is 162 MHz. With the proposed design, a Blu-ray Disc video decoder that supports full HDTV resolution and bit rates up to 40 Mbps is realized. By applying our dynamic re-configurable VLC table, the circuit size of the table is reduced by 60% compared with a conventional hard-wired logic implementation. We evaluated the proposed data compression method on over 300 video sequences: memory usage is reduced by 50% and access bandwidth is improved by 12%. In practice, two 512 Mbit DDR2 SDRAM devices operating at 324 MHz can be used.
VI. CONCLUSION
A multi-standards video decoder for the Blu-ray Disc standard was introduced. The decoder meets the requirements of the Blu-ray Disc standard, namely a maximum bit rate of 40 Mbps and 1080i resolution at 30 frames/s. The supported video standards include MPEG-2 Main profile at High level, H.264 High profile at Level 4.1, and VC-1 Advanced profile at Level 3. The decoder realizes a low-cost LSI implementation: the gate count is 1.5M gates and the operating clock frequency is 162 MHz for all video standards at HDTV resolutions. A novel circuit methodology for dynamic re-configurable VLC tables was introduced, which has been shown to reduce circuit size by 60% compared with a conventional hard-wired logic implementation at a bit rate of 40 Mbps for entropy decoding. An original data compression method is also applied to realize both low cost and real-time performance. The proposed method uses a RUN-LEVEL syntax and a signed Exp-Golomb code table to reduce memory usage by 50% and improve access bandwidth by 12% with negligible workload. Table 3 provides a summary of this work.
