Abstract
I. INTRODUCTION
In many applications such as portable wireless devices and multimedia systems, increased system complexity, time-to-market pressure, cost constraints, and diverse functionality requirements have made system-on-a-chip (SoC) design indispensable [1] [2] [3]. In general, SoC devices are connected to off-chip memories that feed instructions and data to the programmable processors and temporarily store data to be transferred between functional blocks. As an SoC integrates more functional blocks and needs higher performance to carry out an ever-increasing number of tasks, high data bandwidth is required to meet a given system specification.
High-definition television (HDTV) decoders have been integrated into a single chip to exploit the merits of SoC, as illustrated in Fig. 1 [4] [5] [6] [7]. The HDTV decoder SoC consists of a system parser, a video decoder, an audio decoder, a display controller, and peripheral interfaces. The HDTV decoder SoC uses off-chip memory to buffer the MPEG-2 bitstream and temporarily store data to be decoded and displayed. Since high memory bandwidth is required to process a large amount of video data within a given time budget, synchronous memories such as Synchronous DRAM (SDRAM) and Rambus DRAM are widely used to increase data transfer speed, reduce clock cycle time, and ease synchronous design [5] [8] [9].
Several architectural features developed to alleviate memory latency enable the synchronous memories to meet the bandwidth requirement [10] [11]. The features are based on the fact that all the cells along a word line are latched to sense amplifiers when the row is selected and activated, and can be reused without additional row-activation and precharge as long as the row addresses of successive accesses in the corresponding bank are identical. The row-active state can be used to reduce the latency and the power consumption of memory operations if the successive memory access refers to the same row in the same bank (a page hit). However, if the row address differs from the previous one (a page miss), additional cycles that cannot be hidden are needed for a precharge and a row-activation, resulting in performance degradation. Therefore, to increase memory performance, the memory controller has to control the operation mode by efficiently predicting whether the next memory reference will be a page hit or not.
1 This work was supported in part by the IDEC and the MICROS centers.
Several optimizations have been proposed to reduce page misses by statically scheduling the address sequence in memory and controlling the memory operation mode [9] [12] [13]. These techniques have been applied successfully to image and video processing applications, in which memory access patterns are regular enough to be known in advance. In the HDTV decoder system, however, several functional blocks and processors are connected to the external memory through a shared bus, and as a result, memory access patterns become irregular. The irregularity is caused by motion compensation and by the interleaved memory accesses of several functional blocks.
A dynamic memory mode control scheme [14] has been proposed to manage the memory operation mode according to runtime behavior of memory access patterns. The state of SDRAM is changed from idle to row-active state if a memory access leads to a page hit and sustains the row-active state until the number of the successive page misses exceeds a threshold value. The dynamic scheme is effective if in-row accesses are dominant. However, if in-row accesses are not dominant and the pattern of memory accesses is irregular, frequent mode transitions lead to many overhead cycles needed for precharges and row-activations.
In this paper, we propose a new dynamic memory mode control scheme to reduce memory latency by predicting the next operation mode. The prediction is based on the history of memory references. SDRAM is used to show the effectiveness of the proposed control scheme.
The rest of the paper is organized as follows. Section II gives a brief background on the architecture, the bank states, and the operations of SDRAM. We describe the proposed dynamic memory mode control scheme in Section III. In Section IV, experimental methodology and results are presented. VLSI architecture and implementation are presented in Section V. Finally, conclusions are drawn in Section VI.
II. BACKGROUND ON SDRAM
Fig. 2 shows a simplified block diagram of SDRAM, which consists of four independent banks. The four banks share address buffers and I/O buffers, while each bank has its own row decoders, column decoders, sense amplifiers, and a memory array. The state of the resources of each bank is maintained independently.
Each bank has two stable states, idle and row-active, as shown in Fig. 3. The idle state is entered by the precharge operation. The state transition from idle to row-active is made by the row-activation operation. Column access operations do not change the state of the bank. Thus the bank stays in row-active state as long as the precharge operation is not performed.
The operation mode of SDRAM is controlled by a memory controller that translates a read/write request into a sequence of memory commands. Three major operations of SDRAM are as follows:
Row-activation: The bank and the row where the data are accessed are selected and activated. Then, all the cells along a word (row) line of the bank are latched to the corresponding sense amplifiers. The bank is in row-active state after completing the operation.
Column access: The column access operation selects and accesses a column of the activated row. A number of words equal to the burst length are read out from the sense amplifiers to the I/O buffers, one word per clock.
Precharge: By the precharge operation, the sense amplifiers are precharged and the bank of the SDRAM is made to stay in idle state. A row-activation command can be issued when the state of the corresponding bank is idle.
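The two bank states and three operations described above can be sketched as a small state machine. This is a minimal illustration only; the class and method names (`Bank`, `activate`, `column_access`, `precharge`) are hypothetical and do not come from the paper or from any SDRAM library.

```python
from enum import Enum


class BankState(Enum):
    IDLE = "idle"
    ROW_ACTIVE = "row-active"


class Bank:
    """One SDRAM bank with the two stable states described in the text."""

    def __init__(self):
        self.state = BankState.IDLE
        self.open_row = None  # row currently latched in the sense amplifiers

    def activate(self, row):
        # Row-activation is issued while the bank is idle.
        if self.state is not BankState.IDLE:
            raise RuntimeError("bank must be precharged before activation")
        self.state = BankState.ROW_ACTIVE
        self.open_row = row

    def column_access(self, row):
        # A column access needs the target row to be open
        # and does not change the bank state.
        if self.state is not BankState.ROW_ACTIVE or self.open_row != row:
            raise RuntimeError("target row is not open")

    def precharge(self):
        # Precharge returns the bank to idle from either state.
        self.state = BankState.IDLE
        self.open_row = None
```

A page hit corresponds to calling `column_access` again on the open row; a page miss requires `precharge` followed by `activate` before the next column access.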
III. HISTORY-BASED MEMORY MODE PREDICTION
The operation mode is controlled by commands generated by the memory controller. The read/write commands with the auto-precharge option change the memory to idle state after completing the corresponding operations, as depicted in Fig. 4. As the precharge time (t_RP) can be overlapped with burst accesses or data transfer between the memory controller and the processor, the effective latency is the sum of the row-activation time (t_RCD) and the column select latency (t_CL). The read/write commands without the auto-precharge option maintain the memory in row-active state. If the successive access brings a page hit, the precharge and row-activation operations are not necessary. In this case, the effective latency can be reduced to t_CL. If the successive access leads to a page miss, a precharge, a row-activation, and a column select operation have to be performed, increasing the effective latency to t_RP + t_RCD + t_CL. Therefore, the memory mode must be controlled to stay in row-active state as long as possible and to minimize the number of page misses.
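The three latency cases above can be written out as a small function. This is a sketch for illustration; the function name and its string-valued `bank_state` argument are assumptions, and the default cycle counts are the read timings quoted in Section IV (three, three, and two cycles).

```python
# Default timings in clock cycles (read accesses), as quoted in Section IV.
T_RP, T_RCD, T_CL = 3, 3, 2


def effective_latency(bank_state, page_hit, t_rp=T_RP, t_rcd=T_RCD, t_cl=T_CL):
    """Return the effective latency of one access in cycles.

    bank_state: "idle" or "row-active" (state before the access).
    page_hit:   True if the access targets the currently open row;
                ignored when the bank is idle.
    """
    if bank_state == "idle":
        # Precharge was overlapped earlier, so only activation + column select.
        return t_rcd + t_cl
    if page_hit:
        # The row is already open: column select only.
        return t_cl
    # Page miss in row-active state: precharge, activation, column select.
    return t_rp + t_rcd + t_cl
```

With the default timings, an access in idle state costs 5 cycles, a page hit 2 cycles, and a page miss 8 cycles, so keeping a row open only pays off when the next access is likely to hit.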
Although the address requested by the processor is random and unknown in advance, the principle of locality of memory reference [15] makes it possible to predict whether the successive access refers to the same row or not. Using the past history of memory references, we predict whether the next access will cause a page hit and control the memory mode according to the prediction. If the history predicts that the successive access will refer to the same row, the memory controller makes the bank remain in row-active state. Otherwise, the bank is changed to idle state. To store the past history of memory accesses, a state machine that can be built with a two-bit saturating up/down counter is used for each row (per-row counter), as shown in Fig. 5. After the row and the bank addresses are compared with those of the previous access, the corresponding state machine is incremented on a page hit and decremented on a page miss.
For a pending memory access, the memory controller issues a command without the auto-precharge option if the state machine selected by the row and the bank addresses is in strongly hit (SH) or weakly hit (WH) state. If the state machine is in strongly miss (SM) or weakly miss (WM) state, a command with the auto-precharge option is issued. The state transition utilizing the past reference history is depicted in Fig. 6 .
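The per-row predictor described above can be sketched as follows. This is an illustrative model, not the authors' implementation: the class names, the lazily created dictionary of counters, the initial weakly-miss state, and the choice to update the counter before reading the prediction are all assumptions that the text does not specify.

```python
# Two-bit counter states: strongly miss, weakly miss, weakly hit, strongly hit.
SM, WM, WH, SH = range(4)


class TwoBitCounter:
    def __init__(self, state=WM):  # initial state is an assumption
        self.state = state

    def update(self, page_hit):
        # Saturating increment on a page hit, saturating decrement on a miss.
        if page_hit:
            self.state = min(self.state + 1, SH)
        else:
            self.state = max(self.state - 1, SM)

    def predicts_hit(self):
        return self.state >= WH


class PerRowPredictor:
    def __init__(self):
        self.counters = {}  # (bank, row) -> TwoBitCounter, created on first use
        self.last_row = {}  # bank -> row of the previous access to that bank

    def access(self, bank, row):
        """Record one access and return True if the command should be
        issued WITHOUT auto-precharge (bank kept in row-active state)."""
        counter = self.counters.setdefault((bank, row), TwoBitCounter())
        page_hit = self.last_row.get(bank) == row
        counter.update(page_hit)
        self.last_row[bank] = row
        return counter.predicts_hit()
```

For example, four consecutive accesses to the same row of one bank return `False, False, True, True` under these assumptions: the counter needs two observed hits to move from weakly miss to weakly hit.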
Although the per-row predictor can accurately reflect the behavior of memory references to the corresponding row, its area overhead is considerable. For example, if a memory has an N-bit row address and M banks, M·2^N two-bit counters are required. To reduce the area overhead while keeping prediction accuracy acceptable, one two-bit counter is used for each bank (per-bank counter) instead of each row. As only M state machines are required in this case, a significant area reduction is achieved at the cost of slightly lower prediction accuracy. Among the M state machines, one is selected by the bank address.
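To make the area trade-off concrete, the counter storage for both organizations can be computed directly. The helper name below is hypothetical; the example configuration (13-bit row address, 4 banks) is the one assumed in Section IV.

```python
def predictor_storage_bits(row_addr_bits, banks, per_row):
    """Storage needed for the two-bit history counters, in bits."""
    counters = banks * 2 ** row_addr_bits if per_row else banks
    return 2 * counters


# Example: the SDRAM configuration assumed in Section IV.
per_row_bits = predictor_storage_bits(13, 4, per_row=True)    # 4 * 2**13 counters
per_bank_bits = predictor_storage_bits(13, 4, per_row=False)  # 4 counters
print(per_row_bits, per_bank_bits)  # 65536 8
```

For this configuration the per-row organization needs 32,768 counters (65,536 bits), while the per-bank organization needs only 4 counters (8 bits).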
IV. EXPERIMENTAL RESULTS
To evaluate the effectiveness of the proposed history-based mode prediction scheme, we measure memory latency and memory energy consumption by performing trace-driven simulation for an HDTV decoder system. Memory traces obtained by observing the shared bus are used as input to the simulation. In addition, data memory traces of five SPEC92 benchmark programs are simulated to show the effectiveness in various applications.
A. Latency Estimation
The total memory latency is calculated by summing all the individual latencies. Using the effective latencies given in Section III, it can be written as

$$\text{Total latency} = N_{idle}\,(t_{RCD} + t_{CL}) + N_{hit}\,t_{CL} + N_{miss}\,(t_{RP} + t_{RCD} + t_{CL})$$

where N_idle is the number of accesses issued in idle state, N_hit is the number of page hits in row-active state, and N_miss is the number of page misses in row-active state. In the simulation, an SDRAM that has a 13-bit row address, a 9-bit column address, and 4 banks is assumed. The precharge time (t_RP), row-activation time (t_RCD), and column select time (t_CL) are assumed to be three, three, and two (zero for write operations) cycles, respectively, which are quoted from a commercial SDRAM.
Table I and Table II show the ratio of the number of correct predictions to the number of total references for an HDTV decoder and SPEC92 benchmarks, respectively. The history-based scheme with per-row counters shows the highest hit-prediction rate, and the history-based scheme with per-bank counters predicts more accurately than the previous scheme. In the HDTV decoder system, the history-based scheme with per-bank counters predicts more accurately than the previous scheme by 20%. The large difference in hit-prediction ratio is caused by the irregularity of memory references made by several functional blocks and processors. As a result of more accurate prediction, the memory latency is significantly reduced even compared to the previous mode control scheme. The latency results are summarized in Table III and Table IV, which show that the proposed per-bank prediction scheme outperforms the scheme that always maintains the SDRAM in idle state by 18.8% and 19.0% on average for the HDTV decoder system and SPEC92 benchmarks, respectively.
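The latency estimation can be illustrated with a small trace-driven sketch for the per-bank scheme. It counts idle-state accesses, page hits, and page misses, then weights them with the effective latencies from Section III. The function names, the trace format (a list of `(bank, row)` pairs), the initial weakly-miss counter state, and the update-before-decide ordering are assumptions made for illustration; the timings are the read values quoted above.

```python
T_RP, T_RCD, T_CL = 3, 3, 2  # cycles, read accesses


def total_latency(n_idle, n_hit, n_miss):
    """Weight the three access categories with their effective latencies."""
    return (n_idle * (T_RCD + T_CL)
            + n_hit * T_CL
            + n_miss * (T_RP + T_RCD + T_CL))


def simulate_per_bank(trace, banks=4, initial_state=1):
    """Replay (bank, row) accesses with one two-bit counter per bank.

    Returns (n_idle, n_hit, n_miss).
    """
    counters = [initial_state] * banks  # 0=SM, 1=WM, 2=WH, 3=SH
    open_row = [None] * banks           # None means the bank is idle
    last_row = [None] * banks           # row of the previous access per bank
    n_idle = n_hit = n_miss = 0
    for bank, row in trace:
        # Classify the access by the bank state it finds.
        if open_row[bank] is None:
            n_idle += 1
        elif open_row[bank] == row:
            n_hit += 1
        else:
            n_miss += 1
        # Update the history with the hit/miss outcome of this access.
        same_row = last_row[bank] == row
        if same_row:
            counters[bank] = min(counters[bank] + 1, 3)
        else:
            counters[bank] = max(counters[bank] - 1, 0)
        last_row[bank] = row
        # Predicted hit: keep the row open; otherwise auto-precharge.
        open_row[bank] = row if counters[bank] >= 2 else None
    return n_idle, n_hit, n_miss


counts = simulate_per_bank([(0, 7)] * 5)
print(counts, total_latency(*counts))  # (3, 2, 0) 19
```

In this toy trace the always-idle policy would need 5 × 5 = 25 cycles, while the predictor needs 19 once it has learned that bank 0 keeps hitting the same row.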
B. Energy Estimation
The energy consumption (E) is calculated based on the equations presented in [16] , as follows.
where M is the number of banks, N_pa^i is the number of precharge/activation operations in bank i, and N_rw^i is the number of read/write operations in bank i. The energy parameters, quoted from [16], are shown in Table V. The operating frequency is assumed to be 133 MHz.
The total memory energy consumption for an HDTV decoder and SPEC92 benchmarks is summarized in Table VI and Table VII, respectively. The proposed per-bank prediction scheme reduces the energy consumption by 23.3% and 40.8% for the HDTV decoder system and SPEC92 benchmarks, respectively, over the scheme that always maintains the SDRAM in idle state.
Compared to the previous mode control scheme, the proposed mode control scheme considerably improves memory performance without increasing the energy consumption. Therefore, it can be successfully applied to HDTV decoder systems.
V. VLSI ARCHITECTURE AND IMPLEMENTATION
Fig. 7 shows the overall architecture of the memory controller. The configuration and mode registers provide parameters needed for the initialization and the control of SDRAM. The refresh unit is responsible for the periodic generation of refresh cycle requests. The power management unit brings SDRAM into power-down mode when no memory request has been received for a predefined time interval and recovers the active state when a memory operation is requested. The main control unit performs memory operations by tracking the states of SDRAM.
The main control unit performs read and write operations by regulating the command generator and the I/O control block, as shown in Fig. 8. The address alignment block divides the address into a bank address, a row address, and a column address. The hit/miss decision block compares the row address with the previous row address of the same bank to decide whether the access results in a page hit or not. The hit/miss information is fed into the history FSM and updates the corresponding history counter. Considering the state of the history counters and SDRAM, the mode control block controls the command generator and the I/O control block.
A fully synthesizable Verilog model was written at the register-transfer level to implement the memory controller. The memory controller contains about 4700 gates. The operating frequency is 133 MHz and the core size is 0.4 mm × 0.4 mm. Fig. 9 shows a layout of the circuit, which was implemented in a 0.35 µm, 3.3 V, four-layer-metal CMOS technology.
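The address alignment and hit/miss decision blocks described for Fig. 8 can be modeled in software as below. This is a behavioral sketch, not the Verilog design: the field widths follow the SDRAM assumed in Section IV (13-bit row, 9-bit column, 4 banks), but the bit ordering (bank in the upper bits, then row, then column) and all names are assumptions, since the paper does not state the address mapping.

```python
ROW_BITS, COL_BITS, BANK_BITS = 13, 9, 2  # 4 banks -> 2 bank-address bits


def split_address(addr):
    """Split a word address into (bank, row, column).

    Assumed layout, most significant bits first: bank | row | column.
    """
    col = addr & ((1 << COL_BITS) - 1)
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    bank = (addr >> (COL_BITS + ROW_BITS)) & ((1 << BANK_BITS) - 1)
    return bank, row, col


class HitMissDecision:
    """Compares each row address with the previous one of the same bank."""

    def __init__(self, banks=1 << BANK_BITS):
        self.prev_row = [None] * banks

    def check(self, bank, row):
        hit = self.prev_row[bank] == row
        self.prev_row[bank] = row  # remember for the next access
        return hit


addr = (2 << (COL_BITS + ROW_BITS)) | (5 << COL_BITS) | 3
print(split_address(addr))  # (2, 5, 3)
```

The boolean returned by `check` is the hit/miss information that would drive the history counter of the selected bank.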

VI. CONCLUSION
To reduce memory latency of synchronous memories, we have proposed a memory control scheme that predicts whether the successive memory access leads to a page hit or not and changes the memory mode according to the prediction. Two-bit state machines are employed to predict the next memory mode based on the history of memory references. Two prediction schemes that use per-row and per-bank predictors are proposed to make a compromise between prediction accuracy and area overhead. Experimental results on benchmark programs and an HDTV decoder show that the proposed scheme is effective in reducing the number of row-activations and precharges, thereby improving memory performance and energy consumption.
