I. INTRODUCTION

MANY popular multimedia applications, such as digital TV, 3-D graphics, and MPEG encoding/decoding, require high computing power. However, there is a large performance gap between processors and DRAM in terms of data throughput. Many efforts have been made to modify the DRAM architecture to achieve higher bandwidth. One trend is the family of synchronous DRAMs, in which additional circuitry is added for synchronization to obtain higher bandwidth. Another is the family of special-purpose memories, such as Video DRAM, Window DRAM, and SGRAM, in which special logic circuits are added to reach the performance level required for video or graphics applications. However, these DRAM types and modified DRAM architectures incur large power dissipation or large latency to obtain high bandwidth [1]. Thus, merging high-density DRAM with high-performance logic in a single chip is considered a feasible way to obtain higher bandwidth and lower power dissipation. It also overcomes the performance limitations of previous DRAM architectures [2]. Previous works take two approaches to designing merged DRAM logic (MDL). One is to merge a general-purpose processor with DRAM. This approach provides programmability for many applications, but it dissipates considerable power and requires considerable hardware resources [3], [4]. The other approach is to merge a dedicated signal processor with DRAM [5]-[7], which enables efficient use of hardware resources and low power dissipation, but it lacks programmability, so its applications are restricted. As yet another approach, programmable application-specific hardware for multimedia applications has been attempted in [8] and [9]. Those designs show low power dissipation, but their performance is not high enough for high-quality video signal processing.
B.-S. Kim is with Samsung Electronics, Kiheung Yong-in, Kyungki-Do 440-600, Korea.
Publisher Item Identifier S 1051-8215(00)07565-0.
In this paper, we propose a programmable high-performance MDL architecture for video signal processing that combines efficient use of hardware resources, low power dissipation, and programmability. Video signal processing applications are analyzed using the proposed MDL model, and the performance requirements of the MDL are examined. Total required clock cycles (TRCC) is defined as the number of clock cycles required by an application, and DRAM access rate (DAR) is defined to indicate the synergy effect of merging DRAM with logic circuitry in a single chip. The performance of IDCT and MC in the MDL is evaluated, and an effective datapath for the MDL architecture is designed.
This paper is organized as follows. Section II presents the basic model and analysis of the MDL architecture. Section III explains the design guidelines and the datapath of the MDL. Section IV gives performance evaluations and experimental comparisons, and Section V concludes the paper.
II. MODELING AND ANALYSIS
The basic model of the MDL architecture consists of processing units (PUs), a merged DRAM, a temporal storage (TS) unit, and control units. The key parameters used to analyze the performance of the MDL architecture are defined as follows.
• Bus width between DRAM and temporal storage unit (WDB): this is one of the key parameters of the MDL and directly affects the performance improvement.
• Number of processing units (NP): this parameter affects the number of clock cycles needed to execute the required operations in the MDL system.
• Size of temporal storage (STS): this parameter affects the number of accesses to DRAM. It depends on the size of the basic data block in video signal processing.
• Latency of DRAM ($\mathrm{LATENCY}_{\mathrm{DRAM}}$): this parameter can be changed by modifying the interconnect bus circuits, which drive the load capacitance between DRAM and logic.
• Latency of temporal storage unit ($\mathrm{LATENCY}_{\mathrm{TS}}$): this parameter is the number of clock cycles between address inputs and data outputs in the temporal storage unit.
• Latency of processing unit ($\mathrm{LATENCY}_{\mathrm{PU}}$): the effect of this parameter decreases as the amount of data to be processed increases.
A. A Qualitative Analysis
The proposed programmable MDL architecture executes the required operations according to a sequence of instructions for video signal processing. The instructions can be categorized into three stages, as in (1). The number of clock cycles (CC) required for each stage can be formulated using the parameters defined above. First, if STS is greater than the size of the basic data block (SBD), the number of clock cycles for Data Read can be expressed as in (2), which includes a LATENCY term, where $\lceil x \rceil$ denotes the smallest integer not less than a fractional number $x$, and SBD is the size of the basic data block for video signal processing. If STS is less than SBD, the data to be processed must be reduced to a smaller block size, defined as RSBD, to meet the data-correlation condition $\mathrm{STS} \ge \mathrm{RSBD} + \mathrm{STCR}$, i.e., all data in the same flow of data processing must be stored in the temporal storage unit, where STCR is the size of the temporal computation result. Then (2) is modified accordingly, and the remaining clock-cycle expressions follow through (7), on the assumption that CC for Data Write is the same as CC for Data Read. In addition, DAR is defined as the ratio of the clock cycles for data transfer to the clock cycles for computation, to show the effect of merging DRAM and logic; it is given in (8).
TRCC and DAR are used to provide design guidelines for the MDL system.
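As a minimal sketch of how TRCC and DAR could be evaluated from the parameters above, the following Python fragment assumes simple forms for the clock-cycle expressions: a transfer costs the ceiling of the block size over WDB plus the DRAM latency, computation costs the ceiling of the operation count over NP plus the PU latency, and Data Write costs the same as Data Read. These forms, the class and function names, and all numeric values are illustrative assumptions; they are not the equations (1)-(8) of this paper.

```python
import math
from dataclasses import dataclass


@dataclass
class MdlParams:
    """Hypothetical container for the model parameters (names are assumptions)."""
    wdb: int           # bus width between DRAM and temporal storage, in bits
    np_units: int      # number of processing units (NP)
    latency_dram: int  # DRAM latency, in clock cycles
    latency_pu: int    # processing-unit latency, in clock cycles


def transfer_cycles(sbd_bits: int, p: MdlParams) -> int:
    # Assumed form: number of bus transfers for one block plus DRAM latency.
    return math.ceil(sbd_bits / p.wdb) + p.latency_dram


def compute_cycles(ops: int, p: MdlParams) -> int:
    # Assumed form: operations spread evenly over NP units plus PU latency.
    return math.ceil(ops / p.np_units) + p.latency_pu


def trcc(sbd_bits: int, ops: int, p: MdlParams) -> int:
    # Read + compute + write, with write assumed equal to read.
    read = transfer_cycles(sbd_bits, p)
    return read + compute_cycles(ops, p) + read


def dar(sbd_bits: int, ops: int, p: MdlParams) -> float:
    # Ratio of data-transfer cycles (read + write) to computation cycles.
    return 2 * transfer_cycles(sbd_bits, p) / compute_cycles(ops, p)


if __name__ == "__main__":
    params = MdlParams(wdb=64, np_units=4, latency_dram=2, latency_pu=1)
    block_bits = 8 * 8 * 8   # an 8 x 8 block of 8-bit samples (example only)
    operations = 1024        # placeholder operation count
    print(trcc(block_bits, operations, params))
    print(dar(block_bits, operations, params))
```

Under these assumptions, widening `wdb` shrinks only the transfer term, while raising `np_units` shrinks only the computation term, which is why the two measures respond differently to the two parameters.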
B. A Quantitative Analysis of Applications
The MPEG decoding algorithm, which consists of variable length decoding (VLD), inverse quantization (IQ), inverse discrete cosine transform (IDCT), and motion compensation (MC) [10], is analyzed to design the datapath of the MDL. VLD is not data-intensive compared with IDCT or MC, so it is neglected in designing the datapath. The clock cycles required by the MDL architecture are calculated for video signal processing. Assuming that the resolution of the video frame is NP × NP and the size of a basic data block for computation is NB × NB, the performance metric for processing one frame can be written in terms of the block counts NP/NB along each dimension. The resulting TRCC is summarized in Table I. Table I shows that TRCC depends strongly on the number of PUs. From (8), DAR can be written as in (13), where DAR decreases rapidly as the number of PUs and WDB increase.
MC: The MC operation is applied to the full frame (or field) of the digital video signal. The MC of MPEG decoding is formulated as a linear matrix with rank 4. Since there is no data correlation between pixels of the same frame in MC, the size of the temporal storage does not affect the MC sequence. The maximum amount of data to be transferred in field mode, in bits, is determined by the number of 4-pixel blocks, the number of bits per pixel, the motion vectors (MV), and the number of bits per data block. Four motion vectors and four data blocks are required to calculate a pixel in field mode, and the computation is performed with MAC operations. From (6), the total clock cycles required to execute the MC of a basic data block can be written with the 4MV term divided by two, because the Write operation is not necessary for the four motion vectors. The TRCC for a basic block is summarized in Table II.
A. Design Guidelines for MDL
Design guidelines for the datapath of the MDL are derived from TRCC and DAR and from the analysis of the MPEG2 decoding algorithm. There are two major considerations: 1) TRCC should be minimized to reduce power dissipation; 2) DAR should be minimized to maximize the synergy effect of the MDL. The design guidelines for the datapath are then as follows.
1) An optimized bus width between the DRAM and the DSP core is necessary for low power dissipation and efficient use of silicon area.
2) A wide data bus architecture is necessary in the DSP core: all PUs should operate at the same time to reduce the clock frequency. Then, the inputs and outputs of the PUs can be handled without a bottleneck.
The total area of a VLSI chip for an application is limited by three major factors: yield, cost, and power dissipation. Fig. 1 shows the range of DAR and ARD versus the number of PUs in the MDL chip. The number of PUs should be determined carefully. Three values are assumed for the area of a PU. For the selected parameter values, with ARD set to 200%, the number of PUs is determined to be four (32-bit format). However, the effect of merging DRAM and logic is small when WDB is expanded from 64 bits to 1024 bits: the change in DAR (about 7%, from 10% to 3%) is small when NP = 4. As the number of PUs increases, the change in DAR obtained by widening WDB becomes larger, as shown in Fig. 1. Therefore, it is important to increase the effective number of PUs without increasing the chip area, so that widening WDB yields a large change in DAR.
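The dependence of the DAR reduction on the number of PUs can be illustrated with a small sweep. The sketch below assumes a simplified DAR model (transfer cycles are a ceiling of block size over WDB plus a DRAM latency, counted for read and write; computation cycles are a ceiling of operation count over the number of PUs plus a PU latency). The model form, function names, and default values are assumptions for illustration and do not reproduce the data of Fig. 1.

```python
import math


def dar(wdb_bits, num_pus, block_bits=512, ops=1024,
        latency_dram=2, latency_pu=1):
    """Assumed DAR model: (read + write transfer cycles) / computation cycles."""
    transfer = 2 * (math.ceil(block_bits / wdb_bits) + latency_dram)
    compute = math.ceil(ops / num_pus) + latency_pu
    return transfer / compute


def dar_reduction(num_pus, narrow=64, wide=1024):
    """Drop in DAR when the bus is widened from `narrow` to `wide` bits."""
    return dar(narrow, num_pus) - dar(wide, num_pus)


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(f"PUs={n:2d}  DAR reduction={dar_reduction(n):.3f}")
```

In this toy model the reduction grows with the number of PUs because the computation term in the denominator shrinks, so the same saving in transfer cycles accounts for a larger share of the total.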
B. Design of Datapath
The MDL architecture is designed based on the preceding design guidelines. The design targets are to implement a high-performance MDL system and to limit the maximum clock frequency of the DSP core to 200 MHz, for low power dissipation, when decoding MP@HL in MPEG2. The area of the VLSI chip is also limited for high yield. Considering constant hardware resources, three prerequisite conditions, extracted from the preceding analysis and design guidelines, are assumed. The resulting design is shown in Fig. 2. The MAC, ALU, and barrel shifter have bit-splittable capabilities, and the multi-port SRAM provides 256-bit simultaneous data access; these are designed to process video signals more effectively. This enables the clock frequency of the proposed MDL to stay below 200 MHz when decoding MP@HL in MPEG2.
IV. PERFORMANCE OF THE PROPOSED MDL
The performance of the MDL is evaluated for MPEG2 decoding based on the TRCC. The simulation results of datapath are compared to other dedicated hardware chips.
A. Performance Evaluation
CF, the clock frequency required to decode MPEG2 in the MDL (excluding VLD), can be calculated as in (18) from the TRCC terms of the decoding stages, where $\mathrm{TRCC}_{\mathrm{IQ}}$ is the TRCC for IQ, which requires one multiplication per pixel. The required clock frequencies for decoding MPEG2 at four levels are shown in [12]. The proposed MDL can decode an MPEG2 signal in MP@ML at the same clock frequency as the hardwired logic. Therefore, the proposed MDL has higher computing capability than general-purpose processors and also offers versatile programmability for video signal processing.
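A rough way to see how a required clock frequency follows from per-block cycle counts is sketched below. It assumes that the frequency equals the number of basic blocks per frame, times the frame rate, times the sum of the per-block TRCC values for IDCT, MC, and IQ. This product form, the function name, the luminance-only block count, and every numeric value are assumptions for illustration, not the paper's equation (18) or its measured results.

```python
import math


def required_clock_hz(width, height, block, fps,
                      trcc_idct, trcc_mc, trcc_iq):
    """Assumed model: cycles per second = blocks/frame * frames/s * cycles/block."""
    blocks_per_frame = math.ceil(width / block) * math.ceil(height / block)
    cycles_per_block = trcc_idct + trcc_mc + trcc_iq
    return blocks_per_frame * fps * cycles_per_block


if __name__ == "__main__":
    # 720 x 480 at 30 frames/s is used only as an example frame size;
    # the per-block TRCC values are hypothetical placeholders.
    hz = required_clock_hz(720, 480, 8, 30,
                           trcc_idct=100, trcc_mc=50, trcc_iq=16)
    print(f"{hz / 1e6:.2f} MHz")
```

Because the block count scales with the frame area, moving to a larger frame at the same per-block TRCC raises the required frequency proportionally, which is why reducing TRCC per block matters for the higher levels.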
B. Simulation Results
The performance of the proposed architecture is verified with an architecture simulator for four integrated PUs and a coefficient ROM connected through eight 32-bit data buses. Single instruction multiple data (SIMD) type 4-depth vector instructions are generated to simulate a programmable environment, with the IDCT operation as the benchmark for verifying the performance of the MDL. The instruction sequences for the IDCT are shown in Fig. 3. The proposed MDL is compared with two conventional architectures [13], [14] that perform the IDCT, and the performance comparison is shown in Fig. 4. The numbers in the figure are the clock cycles required to perform the IDCT on an 8 × 8 basic block; the numbers in parentheses are the clock cycles relative to those of the MDL, which is normalized to 1.0. The proposed MDL architecture shows 2.1 to 4.8 times higher performance than the conventional architectures.
V. CONCLUSION
A programmable MDL architecture is proposed for video signal processing. The proposed MDL can decode MPEG2 at the high level and achieves 3.2 GOPS of processing power for 8-bit video signal processing at a 200-MHz clock. For the IDCT, it delivers 2.1 to 4.8 times higher performance than two previous conventional architectures, which are not programmable. The datapath is optimized according to the proposed model and design guidelines for the MDL. Two measures, TRCC and DAR, are defined to take into account the effect of various parameters on the performance of the MDL architecture. The methodology of MDL modeling and analysis can also be used to implement high-performance merged DRAM logic for other multimedia applications such as digital TV, 3-D graphics, and MPEG2 encoding.
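The 3.2 GOPS figure is consistent with a simple peak-throughput estimate, sketched below. The sketch assumes that each of the four 32-bit PUs is split into four 8-bit lanes and that every lane completes one operation per clock cycle; this accounting and the function name are assumptions, since the breakdown is not spelled out in the text.

```python
def peak_gops(num_pus, datapath_bits, element_bits, clock_hz):
    """Peak operations per second in GOPS, assuming one op per lane per cycle."""
    lanes_per_pu = datapath_bits // element_bits  # bit-split lanes per PU
    return num_pus * lanes_per_pu * clock_hz / 1e9


if __name__ == "__main__":
    # Four 32-bit PUs, 8-bit video samples, 200-MHz clock.
    print(peak_gops(num_pus=4, datapath_bits=32, element_bits=8,
                    clock_hz=200e6))
```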
