Abstract-Chip Multiprocessors (CMPs) have been adopted by industry to overcome the speed limit of the single processor. However, memory access has become the performance bottleneck, especially in multimedia applications. In this paper, a set of management policies is proposed to improve cache performance on a SoC platform for video applications. Based on an analysis of the behavior of the Video Engine, memory-friendly writeback and efficient prefetch policies are adopted. The experimental platform is simulated in SystemC with an ARM Cortex-A9 processor model. The experimental study shows that the proposed mechanism improves performance in contrast to the general cache without Last Level Cache ( 
I. INTRODUCTION
In 2005, the computer industry moved away from processor clock frequency as the primary indicator of processor performance. The change reflected a looming problem: frequency, bounded by power dissipation, wire delay and the physical properties of CMOS transistors, could no longer increase at its earlier rate. Chip Multiprocessors (CMPs) have been adopted by industry to overcome the speed limit of the single processor. However, the "memory wall" problem is still an overwhelming bottleneck in current system performance [1], and over 50% of processor clock time is spent on memory access [2]. Cache optimization is an efficient way to alleviate this latency. Considering the speed, the capacity and the growing memory bandwidth requirement that comes with more cores per chip, optimization proposals mainly focus on the last-level cache (LLC) [3, 4].
As the number of integrated cores in Chip Multiprocessor (CMP) designs continues to increase, the typically shared last-level cache gradually becomes a critical performance bottleneck. From the early Least Recently Used (LRU) replacement policy to recent optimization policies [5] [6] [7] [8], most of the work has focused on two points: when or how to replace (or write back) a block, and when or how to load (or prefetch) a new one.
Although statistical methodology has been proposed by researchers [9], the high-level criterion for evaluating performance is how accurately the replacement policy captures program locality. The work mentioned above proposes general-purpose optimization policies that achieve large performance improvements. In practice, however, no single policy is sufficient for every class of workload. Furthermore, multimedia is among the most widely used and most demanding workloads for processors, especially on handheld devices. This work explores a new method for optimizing such applications, focusing on video.
Our goal in this paper is to design a set of management policies that improve cache performance for video applications. The complete set of policies includes memory-friendly writeback, multimedia-intensive prefetch and a scheduler for the last-level cache. Firstly, in modern memory systems, memory-write requests can cause significant performance loss by increasing the memory access latency of subsequent read requests targeting the same device [7]. Memory-friendly writeback aims to use idle time to execute useful writeback commands. Secondly, multimedia-intensive prefetch aims to issue useful prefetches by analyzing the behavior of the video application. Finally, the scheduler of the last-level cache manages all the commands (prefetch, writeback, critical read and critical write).
We evaluate our management policies with SystemC on an ARM Cortex-A9 processor model. The experimental study shows that the proposed mechanism improves performance in contrast to the general cache without Last Level Cache (LLC): up to 18 
II. RELATED WORK
To the best of our knowledge, little work has been done on the LLC in multimedia applications, especially for handheld devices. Restrictions on the performance of processors for handheld devices often include, but are not limited to, cost, power consumption, and functionality [10]. Digital audio and video applications require far more processing than other widely used applications on handheld devices. The required processing rate for compression ranges from 100 megaoperations per second (MOPS) to more than one teraoperation per second [11]. Considering the "memory wall" mentioned above, efforts that focus only on raising the processor's frequency, or on multimedia-dedicated processors, cannot solve all the problems. On the other hand, recent cache optimization proposals [5] [6] [7] [8] do not work well enough for multimedia applications, which have their own characteristics. We therefore discuss methods to improve cache performance for multimedia applications, beginning with closely related work on prefetch and writeback.
A. Prefetch
A handful of cache prefetch algorithms have been proposed in the literature over the past few years. These algorithms can be classified as follows.
Many previous DRAM scheduling policies were proposed to improve DRAM throughput in single-threaded [12, 16, 17], multithreaded [18, 19, 17], and stream-based [20, 21] systems. In addition, some recent works [22, 23, 24] provide methods for fair DRAM scheduling across different applications sharing the DRAM system. Some of these proposals [16, 25, 19, 22, 23, 24, 17] do not mention how to treat prefetch requests relative to demand requests. Our management policies build on these scheduling policies: they can be extended to adaptively prioritize demand and prefetch requests and to drop useless prefetch requests.
Other DRAM proposals use two different approaches to handle prefetch requests. Some proposals [12, 26, 27, 28] prioritize demand requests over prefetch requests. Other proposals [21] treat prefetch requests the same as demand requests. In both cases, these DRAM controller proposals handle prefetch requests rigidly. Rigid handling of prefetches can cause significant performance loss compared with adaptive prefetch handling. Our work improves upon these proposals by incorporating the effectiveness of prefetch into DRAM scheduling decisions.
Some previous works proposed executing the prefetch command based on dynamic information. Our work is complementary to these proposals, which are described below.
Lee et al. [14] proposed a low-cost memory controller, called the Prefetch-Aware DRAM Controller (PADC), which aims to maximize the benefit of useful prefetches and minimize the harm caused by useless ones. To accomplish this goal, PADC estimates the usefulness of prefetch requests and dynamically adapts its scheduling and buffer management policies based on the estimates. In contrast, our mechanism chooses between demands and prefetches based on stream prediction, because our research targets video applications. As a result, Lee's proposal can be combined with our prefetch scheduling policy to provide higher performance for multimedia applications.
B. Writeback
Much previous work [30, 23, 22] does not take write interference into account. Eager writeback [34] is the first work to expand write resources by using the LLC to reduce write-induced interference. Eager writeback writes back dirty cache blocks in the least-recently-used (LRU) position of the LLC sets whenever the bus is idle, instead of waiting for the block to be evicted, in order to reduce memory traffic. However, the scheduling window of eager writeback is still limited to the size of the write buffer. Thus, the scheduling decisions it makes are still far from optimal.
Stuecheli et al. [33] proposed a virtual write queue (VWQ) technique. Their technique takes a fraction of the LRU positions in the LLC as the VWQ. Dirty cache blocks with high locality in the VWQ are written back in a batch, which improves writeback efficiency. The drawback of this technique is that it must search the VWQ for dirty cache blocks that map to the same DRAM row. Although it uses the Cache Cleaner technique to help with the search, it still consumes significant LLC power and search time.
Wang et al. [7] proposed an LLC writeback technique driven by rank idle time prediction. In contrast to previous work [33, 35], which does not exploit rank idle time, the technique allows the memory to service write requests during significant rank idle periods. The technique can be used together with LLC writeback scheduling techniques to improve memory efficiency.
Based on this previous work, we propose a new writeback mechanism. When the memory is idle, writeback commands are executed according to the recorded writeback commands and the maintained bank open/close information, so that the best time and order for writeback can be chosen.
III. MECHANISM
A. DRAM Access Background [14]
An SDRAM system consists of multiple banks which can be accessed independently. Each DRAM bank comprises rows and columns of DRAM cells. A row contains a fixed-size block of data (usually several Kbytes). Each bank has a row buffer (or sense amplifier), which caches the most recently accessed row in the DRAM bank. A DRAM access can be done only by reading (writing) data from (to) the row buffer using a column address.
There are three commands which need to be sequentially issued to a DRAM bank in order to access data:
1) a precharging command to precharge the row bitlines;
2) an activating command to open a row into the row buffer with the row address;
3) a read/write command to access the row buffer with the column address.
After the completion of an access, the DRAM controller can either keep the row open in the row buffer (open-row policy) or close the row buffer with a precharging command (closed-row policy). The latency of a memory access to a bank varies depending on the state of the row buffer and the address of the request as follows:
1. Row-hit: The row address of the memory access is the same as the address of the opened row. Data can be read from/written to the row buffer by a read/write command, so the total latency is only the read/write command latency.
2. Row-conflict: The row address of the memory access is different from the address of the opened row. The memory access needs a precharging, an activating, and a read/write command in sequence. The total latency is the sum of all three command latencies.
3. Row-closed: There is no valid data in the row buffer (i.e., it is closed). The access needs an activating command and then a read/write command. The total latency is the sum of these two command latencies.
Recent microprocessors employ hardware prefetch to hide long DRAM access latencies. If prefetch requests are accurate and fetch data early enough, prefetching can improve performance. Existing DRAM scheduling policies take two different approaches to treating prefetch requests with respect to demand requests. Some policies give a prefetch request the same priority as a demand request. This can significantly delay demand requests and degrade performance, especially if prefetch requests are not accurate. Other policies always prioritize demand requests over prefetch requests so that data known to be needed by the program can be serviced earlier. One might think that this provides the best performance by eliminating the interference of prefetch requests with demand requests. However, such a rigid policy does not consider the nonuniform access latency of the DRAM system (row-hits vs. row-conflicts). A row-hit prefetch request can be serviced much more quickly than a row-conflict demand request.
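As an illustration only, the three row-buffer cases listed above can be written as a small latency model in Python. The function names and the three latency constants are hypothetical placeholders chosen for this sketch; they are not timing values from this work or from any particular DRAM device.

```python
from enum import Enum
from typing import Optional


class RowState(Enum):
    ROW_HIT = "row-hit"
    ROW_CONFLICT = "row-conflict"
    ROW_CLOSED = "row-closed"


# Hypothetical command latencies in DRAM cycles (illustrative assumptions).
T_PRECHARGE = 15
T_ACTIVATE = 15
T_READ_WRITE = 15


def classify(open_row: Optional[int], request_row: int) -> RowState:
    """Classify an access by the row currently held in the bank's row buffer.

    open_row is None when the row buffer holds no valid row (bank closed).
    """
    if open_row is None:
        return RowState.ROW_CLOSED
    if open_row == request_row:
        return RowState.ROW_HIT
    return RowState.ROW_CONFLICT


def access_latency(open_row: Optional[int], request_row: int) -> int:
    """Sum the latencies of the commands each case requires."""
    state = classify(open_row, request_row)
    if state is RowState.ROW_HIT:
        return T_READ_WRITE                              # read/write only
    if state is RowState.ROW_CLOSED:
        return T_ACTIVATE + T_READ_WRITE                 # activate, then access
    return T_PRECHARGE + T_ACTIVATE + T_READ_WRITE       # all three commands
```

With these placeholder constants, a row-conflict access costs three times as much as a row-hit, which is the nonuniformity that a rigid demand-first policy ignores.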
DRAM access time is shortest in the case of a row-hit [15]. Therefore, a memory controller can try to maximize DRAM data throughput by maximizing the Hit Rate in the row buffer. Previous work [12] introduced the commonly employed FR-FCFS (First Ready-First Come First Serve) policy, which prioritizes requests such that it services 1) row-hit requests first and 2) all else being equal, older requests first. This policy was shown to provide the best average performance in systems which do not employ hardware prefetching [12, 13]. However, this policy is not aware of the interaction and interference between demand and prefetch requests in the DRAM system, and therefore treats demand and prefetch requests equally.
In the following, the management and scheduling policies of the LLC are proposed. They are based on the features of memory access in multimedia applications. Efficient and inefficient writeback commands are differentiated and scheduled accordingly. In addition, write and read commands are scheduled in a unified way to maximize DRAM efficiency and minimize the writeback overhead.
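The two FR-FCFS rules stated above can be sketched as a single sort key. This is a minimal illustration, not the controller of [12]: the `Request` fields, the `open_rows` dictionary (bank index to currently open row, or `None` when the bank is closed) and the function name are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Request:
    arrival: int  # arrival time; a smaller value means an older request
    bank: int
    row: int


def fr_fcfs_pick(queue: List[Request],
                 open_rows: Dict[int, Optional[int]]) -> Optional[Request]:
    """Pick the next request: row-hit requests first, then the oldest."""
    if not queue:
        return None

    def key(req: Request):
        is_row_hit = open_rows.get(req.bank) == req.row
        # Tuples compare element by element: row-hit status decides first,
        # arrival time breaks ties.
        return (0 if is_row_hit else 1, req.arrival)

    return min(queue, key=key)
```

Note that the key contains no notion of demand versus prefetch, which is exactly the limitation described above.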
B. Logic Diagram of Last Level Cache Management
In the management and scheduling policies of the LLC, the cache must obtain the bank open/close information from the DRAMC in order to schedule prefetch and writeback commands more reasonably. The logic diagram of the whole LLC management unit is shown in Fig. 1.
The logic diagram contains three components: the Prefetch Manager, the Writeback Collector and the Scheduler.
Prefetch Manager: The Hit Rate of prefetch is critical for memory access performance. The Prefetch Manager is therefore designed according to the features of the video engine. It makes predictions from the current memory access behavior in order to improve the Hit Rate of prefetch.
Writeback Collector: The timing of writeback has an important effect on performance, so a Writeback Collector is designed. The collector maintains the pending writeback commands and the bank open/close information, so that the best time and order for writeback can be chosen.
Scheduler: Coordinating read and write operations reduces useless memory accesses and lowers cost. The Scheduler is designed to schedule prefetch commands together with writeback commands, and to exploit the bank operation cycles to improve the efficiency of every cycle. The behavior of the Video Engine in multimedia scenarios was analyzed when these policies were designed. Two methods are proposed for these multimedia applications.
C. Memory-Friendly Writeback
The sequence diagram of the LLC writeback mechanism is shown in Fig. 2. The LLC writeback data collector scans the cache memory when the memory is idle. If a dirty line is found, the collector records this line in its internal table, indexed by DRAM row and column, and generates a writeback command. In addition, the collector keeps a counter for each pending writeback command. This counter measures the time the command has waited in the buffer. Once the counter reaches a threshold, the command is discarded. In each cycle, the DRAM queue updates the bank open/close information for the LLC. The collector can thus select the most appropriate write command and send it to the DRAM queue. The rules for selecting a command are as follows:
1. Page-hit command: Using the bank information provided by the DRAM queue, the collector first looks for a page-hit command, because DRAM access time is shortest in the page-hit case.
2. Page-open command: If there is no page-hit command, the collector selects a page-open command.
3. A command that reaches the waiting-time threshold is discarded, because the selection never interferes with the efficiency of the DRAM queue by issuing a row-conflict command.
With the memory-friendly writeback policy, a writeback operation does not incur the extra cost of opening or closing a bank, and is executed only when an appropriate bank is open. In addition to avoiding inefficient writebacks, repeated writes to the same data are merged while waiting for the scheduler, which reduces the actual number of writeback operations.
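The selection rules above can be sketched as follows. This is a simplified software illustration under stated assumptions, not the hardware design: the class and method names are hypothetical, "page-open command" is interpreted here as a command whose target bank has no row open (so no precharge is needed), and merging is modelled as ignoring a repeated writeback to an already recorded line.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PendingWriteback:
    bank: int
    row: int
    column: int
    waited: int = 0  # cycles this command has spent in the collector table


class WritebackCollector:
    def __init__(self, wait_threshold: int):
        self.wait_threshold = wait_threshold
        self.table: List[PendingWriteback] = []

    def record_dirty_line(self, bank: int, row: int, column: int) -> None:
        """Record a dirty line; a repeat of a pending entry is merged."""
        for entry in self.table:
            if (entry.bank, entry.row, entry.column) == (bank, row, column):
                return
        self.table.append(PendingWriteback(bank, row, column))

    def tick(self, open_rows: Dict[int, Optional[int]]) -> Optional[PendingWriteback]:
        """Advance one cycle and return the writeback to issue, if any.

        open_rows maps a bank to its open row, or None if the bank is closed.
        """
        # Rule 3: age every entry and discard those that reach the threshold.
        for entry in self.table:
            entry.waited += 1
        self.table = [e for e in self.table if e.waited < self.wait_threshold]

        # Rule 1: prefer a page-hit command.
        for entry in self.table:
            if open_rows.get(entry.bank) == entry.row:
                self.table.remove(entry)
                return entry
        # Rule 2 (assumed meaning): otherwise take a command to a closed bank.
        for entry in self.table:
            if open_rows.get(entry.bank) is None:
                self.table.remove(entry)
                return entry
        # Never issue a row-conflict writeback.
        return None
```

A row-conflict entry simply waits in the table until its bank state changes or its counter expires, so the collector never forces a precharge on the DRAM queue.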
D. Prefetch Algorithm for Multimedia Intensive Access
The prefetch module issues prefetch commands according to the prefetch prediction result. We use stream prediction to predict addresses: the predicted address is the address of the next 64 bytes or 128 bytes.
In multimedia-intensive scenarios, consecutive addresses appear with high probability. However, because cache resources are limited, the cache cannot hold all prefetched data for long enough. Some prefetched data might therefore be evicted before it is used. Only addresses that are likely to be accessed in the near future should be prefetched.
A "prefetch manager" inside the prefetch module records, for each multimedia master, the cycle interval between commands to the predicted addresses. For each master, the manager counts the cycles between consecutive addresses several times and records the maximum. If this cycle value is smaller than a threshold, the prefetch module issues prefetch commands for that master's subsequent read commands.
The procedure is shown in Fig. 3.
Figure 3. The prefetch algorithm for multimedia intensive
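The interval-based decision described above might be sketched as follows. All names are hypothetical, and several details that the text leaves open are assumptions of this sketch: a 64-byte stride, a fixed number of interval samples kept per master, and no reset of the samples when a master breaks its stream.

```python
from typing import Dict, List, Optional, Tuple

LINE_BYTES = 64  # the text allows 64 or 128 bytes; 64 is assumed here


class PrefetchManager:
    def __init__(self, cycle_threshold: int, samples_needed: int = 4,
                 stride: int = LINE_BYTES):
        self.cycle_threshold = cycle_threshold
        self.samples_needed = samples_needed
        self.stride = stride
        self.last: Dict[str, Tuple[int, int]] = {}   # master -> (addr, cycle)
        self.intervals: Dict[str, List[int]] = {}    # master -> recent gaps

    def on_read(self, master: str, addr: int, cycle: int) -> Optional[int]:
        """Observe a read; return an address to prefetch, or None."""
        prev = self.last.get(master)
        self.last[master] = (addr, cycle)

        # Sample the gap only when the read continues the predicted stream.
        if prev is not None and addr == prev[0] + self.stride:
            window = self.intervals.setdefault(master, [])
            window.append(cycle - prev[1])
            del window[:-self.samples_needed]  # keep the most recent samples

        window = self.intervals.get(master, [])
        # Prefetch only if the slowest observed gap is under the threshold,
        # i.e. the next line is likely to be used before it is evicted.
        if len(window) >= self.samples_needed and max(window) < self.cycle_threshold:
            return addr + self.stride
        return None
```

A master that streams quickly through consecutive lines starts receiving prefetches after enough samples are collected, while a slow master never does, which limits the prefetched data that would be evicted unused.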
The problem with prefetch is that an invalid prefetch brings extra memory accesses. Some of the designs mentioned above treat prefetch requests the same as demand requests, so invalid prefetches degrade performance. To improve performance and reduce the cost of prefetch, prefetch and writeback commands are scheduled in a unified way to keep reads and writes consecutive. The scheduler is described as follows.
E. Scheduler for LLC Manager
The scheduler of the LLC selects one command at a time and sends it to the DRAM queue manager. The scheduler handles four kinds of commands. Besides the writeback and prefetch commands, there are critical write and critical read commands. These two kinds of commands are issued by the cache core controller. A critical read command is issued when a read miss happens, and a critical write command is issued when a write miss happens.
The main purpose of the scheduler is to keep a balance between latency and DRAM efficiency. The scheduler performs arbitration under the following rules:
1. Reads or writes are serviced continuously until the service count threshold is reached.
2. Critical reads and writes have higher priority than prefetch read and writeback commands.
3. Page-hit commands are preferred for both reads and writes to maximize DRAMC efficiency.
a) The writeback module has already provided the most appropriate command for the current DRAM page status.
b) A prefetch read command is in most cases a page-hit with respect to the original command.
With this scheduling policy, page-hit periods are fully exploited and writeback commands do not interrupt a run of continuous read commands. While critical read/write commands, which have the greatest influence on performance, are executed continuously, page-hit writeback and prefetch commands are scheduled to maximize efficiency.
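One possible reading of the arbitration rules above is sketched below. The text does not state how the three rules rank against each other, so the order used here (critical class first, then read/write continuity bounded by the service count threshold, then page-hit) is an assumption of this sketch, as are all class, field and function names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Kind(Enum):
    CRITICAL_READ = "critical_read"
    CRITICAL_WRITE = "critical_write"
    PREFETCH = "prefetch"
    WRITEBACK = "writeback"


CRITICAL = {Kind.CRITICAL_READ, Kind.CRITICAL_WRITE}
READS = {Kind.CRITICAL_READ, Kind.PREFETCH}


@dataclass
class Command:
    kind: Kind
    page_hit: bool


def direction_of(cmd: Command) -> str:
    return "read" if cmd.kind in READS else "write"


class Scheduler:
    def __init__(self, service_threshold: int):
        self.service_threshold = service_threshold
        self.direction: Optional[str] = None  # direction of the current run
        self.streak = 0                       # commands issued in that run

    def pick(self, pending: List[Command]) -> Optional[Command]:
        """Choose one pending command; the caller removes it from its queue."""
        if not pending:
            return None
        streak_full = self.streak >= self.service_threshold

        def key(cmd: Command):
            critical = 0 if cmd.kind in CRITICAL else 1          # rule 2
            same_dir = direction_of(cmd) == self.direction
            # Rule 1: stay in the current direction until the threshold,
            # then favour switching direction.
            dir_penalty = 0 if same_dir != streak_full else 1
            return (critical, dir_penalty, 0 if cmd.page_hit else 1)  # rule 3

        chosen = min(pending, key=key)
        if direction_of(chosen) == self.direction:
            self.streak += 1
        else:
            self.direction = direction_of(chosen)
            self.streak = 1
        return chosen
```

In this reading a critical command always wins over prefetch and writeback, and the service count threshold only decides when a run of reads yields to writes (or the reverse) among commands of the same class.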
IV. METHODOLOGY

A. System Model
To evaluate the improvement in CPU read latency from the LLC accurately, the ARM Cortex-A9 processor model provided by Carbon(TM) Design Systems was chosen. It is a highly accurate model with an L1 cache. We ran a synthetic program segment which behaved like an image decoder. The designed bandwidth was 0.275GB/s. All modules of the testing platform were developed in SystemC. The simulation frequency was 40 kHz, i.e., the processor's frequency was 40 kHz as seen from the real world. All modules of the platform except the CPU worked at a frequency of 266 MHz. The testing platform is shown in Fig. 4.
The System Cache in the figure was the one to be evaluated. The MT8320 External Memory Interface (EMI) connected to the cache was the DDR controller, whose total bandwidth was 8G and whose designed frequency was 266 MHz. The controller was a cycle-accurate SystemC model. The sub-modules exchanged data through the standard SystemC TLM interface. The interfaces were also cycle-accurate. After compilation, the testing platform could be used for simulation. The period needed for one run was 27 ms, which was also the period used by the MMTG module to decode one frame. We used the standard Video Decoder Engine and the standard software running on the CPU to decode an MPEG4 720P video file in the test. We evaluated the improvement by observing indicators such as Bandwidth, Hit Rate, CPU Read Latency and MM Latency.
The test load was generated by Traffic Generator (TG in the figure). Traffic Generator issued the instructions of instruction-reading, instruction-writing and data-writing according to the configuration file. The time, address ranges and total bandwidth of these instructions were defined in the configuration file. The configurations and effects of the testing traffic generators are shown as follows.
Figure 4. The testing platform (blocks include the MT8320 EMI SystemC Model and the System Cache).
V. EXPERIMENTAL EVALUATION
We analyze the experimental results on the platform in this section. To evaluate the results of LLC optimization accurately, we test LLC sizes of 1MB, 2MB, 4MB and 8MB. Fig. 5(a) and (b) show the test results for Bandwidth, Hit Rate, MM Latency and CPU Read Latency on the platform with and without LLC optimization. The experiment uses standard MPEG 720p video files. The light-gray part is with the LLC optimization technique and the dark-gray part is without it.
By comparing the two scenarios, we can conclude that cache performance is improved greatly by the optimization. As seen from the results, during the encoding phase, bandwidth grows by an average of 8.26%, Hit Rate increases by 18.87%, and MM Latency and CPU Read Latency decrease by 10.62% and 46.43%, respectively. During the decoding phase, bandwidth grows by an average of 4.23%, Hit Rate increases by 52.1%, and MM Latency and CPU Read Latency decrease by 11.43% and 47.48%, respectively.
There are two main reasons for the improvement:
1. The proposed prefetch mechanism analyzes the memory access pattern and the features of the memory addresses during the video decoding phase, which improves the Hit Rate and avoids excessive invalid prefetches.
2. The memory-friendly writeback mechanism reduces the frequency of switching between read and write operations, improves the efficiency of cache loading and reduces the CPU Read Latency substantially.
The results shown in the figure can be analyzed as follows. Firstly, overall bandwidth did not increase much, which means that the system overhead did not increase significantly. Secondly, Hit Rate increased substantially, which means that the DRAM load was reduced. Thirdly, CPU path delay decreased significantly. This is very meaningful because a high-performance CPU requires the delay of the entire data path to be low; otherwise the CPU sits idle waiting for data, and its multi-GHz clock rate is wasted. Finally, multimedia delay decreased, which means that the video can be played smoothly. Fig. 6 describes the performance enhancement per BW consumed when the size of the LLC is 1MB, 2MB, 4MB and 8MB, respectively. As seen from the figure, bandwidth utilization is improved significantly by the optimization technique.
VI. CONCLUSIONS
In this paper, we have noted that existing Last Level Cache (LLC) optimization techniques target general memory access problems. This work instead addresses video applications on SoC, which are common on handheld devices.
Based on the features of video applications, we have proposed a set of LLC management policies including memory-friendly writeback, a prefetch algorithm for multimedia-intensive access, and unified scheduling of read/write commands. The memory-friendly writeback mechanism chooses the most appropriate writeback command according to the bank information of the current memory access, avoiding inefficient row-conflict commands. The prefetch algorithm improves the Hit Rate and avoids excessive invalid prefetches. Unified scheduling of read/write requests ensures that reads and writes are serviced consecutively, takes advantage of page-hit periods, and improves the efficiency of cache loading.
These algorithms have been tested on a SoC simulation platform developed in SystemC with an ARM Cortex-A9 processor model. They achieve significant Hit Rate improvement (18.87%), MM Latency reduction (10.62%) and CPU Read Latency reduction (46.43%) with an average bandwidth increase of only 8.26% during the encoding phase. They also achieve significant Hit Rate improvement (52.1%), MM Latency reduction (11.43%) and CPU Read Latency reduction (47.48%) with an average bandwidth increase of only 4.23% during the decoding phase.
