In this paper, we present a cache-based motion compensation (MC) architecture for a Quad-HD H.264/AVC video decoder. With the significantly increased throughput requirement, VLSI design for MC is greatly challenged by the huge area cost and power consumption. Moreover, the long memory system latency degrades the performance of the MC pipeline. To solve these problems, three optimization schemes are proposed in this work. Firstly, a high-performance interpolator based on Horizontal-Vertical Expansion and Luma-Chroma Parallelism (HVE-LCP) is proposed to increase the processing throughput to more than 4 times that of the previous designs. Secondly, an efficient cache memory organization scheme (4Sx4) is adopted to improve the on-chip memory utilization, which contributes to a memory area saving of 25% and a memory power saving of 39∼49%. Finally, by employing a Split Task Queue (STQ) architecture, the cache system is capable of tolerating much longer latency of the memory system. Consequently, the cache idle time is reduced by 90%, which contributes to reducing the overall processing time by 24∼40%. When implemented with the SMIC 90 nm process, this design costs a logic gate count of 108.8 k and an on-chip memory of 3.1 kB. The proposed MC architecture can support real-time processing of 3840x2160@60 fps at less than 166 MHz.
Introduction
While 1080 HD has already become a current standard for TV broadcasting and home entertainment, even higher specifications, such as the 4Kx2K Quad-HD format, have been targeted by next-generation applications. To store and transmit this massive video content, video compression is indispensable. Compared with previous MPEG standards, H.264/AVC [1], which provides over two times higher compression ratio with better video coding quality, is a promising tool for compressing these massive data. The high coding efficiency of H.264/AVC comes from various new features, such as variable block size motion compensation, quarter-sample fractional interpolation, multi-mode intra prediction, integer transform, in-loop deblocking filter, and context adaptive entropy coding (CABAC). However, the use of these new techniques, along with the ever-increasing demand for resolution, greatly challenges the design of video decoders.
For real-time decoding of 3840x2160@60 fps, the required throughput is about 500 Mpixels/s. With current technologies, H.264/AVC high-profile decoding for 1080p@60 fps requires a DRAM configuration of 32-bit DDR-400 (1.6 GB/s) [2]. Therefore, for 3840x2160@60 fps, the bandwidth requirement increases by at least 4.3 times, proportionally to the throughput, to nearly 7 GB/s. For the cache and interpolation components, the required computational power also increases by 4.3 times when compared with the architecture applied for a 1080@60 fps decoder. Assuming the whole LSI system is working at 166 MHz, the theoretical time budget for real-time processing of a 3840x2160@60 fps sequence should be no more than 88 cycles per MB for each component. In practice, it is better to work at less than 70 cycles per MB so as to preserve enough margin for the system. The 4Kx2K motion compensation (MC), which is the speed bottleneck of the whole decoder, is mainly challenged in the following aspects.
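The figures above can be checked with a short back-of-the-envelope calculation. The sketch below is only an illustration: it assumes 16x16 macroblocks, takes the 1.6 GB/s baseline, the 4.3 scaling factor and the 166 MHz clock from the text, and the function names are hypothetical.

```python
# Back-of-the-envelope figures for 3840x2160@60 fps (illustrative only).
WIDTH, HEIGHT, FPS = 3840, 2160, 60
CLOCK_HZ = 166e6           # system clock quoted in the text
HD_BANDWIDTH_GBPS = 1.6    # 32-bit DDR-400 for 1080p@60 fps, from the text
SCALE = 4.3                # scaling factor quoted in the text


def pixel_rate(width, height, fps):
    """Pixels that must be produced per second."""
    return width * height * fps


def macroblocks_per_second(width, height, fps, mb_size=16):
    """16x16 macroblocks per second (assumes both dimensions divide evenly)."""
    return (width // mb_size) * (height // mb_size) * fps


def cycles_per_mb(clock_hz, width, height, fps):
    """Clock cycles available for each macroblock at the given clock."""
    return clock_hz / macroblocks_per_second(width, height, fps)


print(pixel_rate(WIDTH, HEIGHT, FPS) / 1e6)          # throughput in Mpixels/s
print(HD_BANDWIDTH_GBPS * SCALE)                     # scaled bandwidth in GB/s
print(cycles_per_mb(CLOCK_HZ, WIDTH, HEIGHT, FPS))   # cycle budget per MB
```

Under these assumptions the throughput comes out just under 500 Mpixels/s and the scaled bandwidth just under 7 GB/s, and the simple per-MB cycle model lands in the mid-80s, close to the budget quoted above.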
Firstly, compared with HD application, the throughput requirement for MC interpolation in Quad-HD cases is increased by at least 4 times. In order to meet this requirement, the straightforward way is parallelizing the processing unit from the previous one row [3] , [4] to four rows. Although the parallelized architecture can increase the throughput, the critical data alignment problem will lead to extra overhead on both the memory read power and interpolation processing time. This means the cost for parallelism will be larger than the enhancement in throughput.
Secondly, with higher specifications, memory bandwidth requirement increases significantly. Previous contributions proved that the cache systems [5] - [9] can be an effective way to reduce the external DRAM bandwidth. However, the on-chip memory bandwidth from the cache system to the interpolation component becomes higher and costs larger power consumption, because the width of data memory increases proportionally with the interpolation parallelism.
Thirdly, the latency between the cache sending a request and receiving the data from the memory system becomes longer for two reasons. One is that the DRAM latency increases with higher-speed DRAM specifications such as DDR2 and DDR3. The CAS latency in terms of absolute time is almost the same among DRAM generations, but as the clock frequency increases, the CAS latency in terms of the number of clock cycles increases, which influences the cache architecture. The other is that new techniques adopted to enhance the DRAM access efficiency, such as reference frame recompression [10], though they reduce the total access amount, incur a longer access delay. As a result, while the memory system latency is only around 10 clock cycles in HD decoders, it can increase to over 40 clock cycles in the new Quad-HD applications. Generally, in order to hide the DRAM latency, a task queue is utilized in a cache system. However, this architecture requires conflict checking to avoid flushing useful data in the cache, which will be described in Sect. 3. The longer memory system latency drastically increases the probability of conflict in the cache system, which results in long pipeline stalls and decreases the overall system performance.
In order to solve the above problems, three schemes are proposed in this paper to achieve an efficient MC architecture for H.264/AVC real-time decoding of Quad-HD applications. Firstly, Horizontal-Vertical Expansion and Luma-Chroma Parallelism (HVE-LCP) based interpolation is implemented to reduce the influence of the data alignment problem while increasing the decoding throughput to more than 4 times that of the previous works. Secondly, an efficient cache memory organization scheme (4Sx4) is adopted to improve the on-chip memory utilization. By applying this scheme, memory power is saved by 39%∼49%, and memory area is reduced by 25%. Finally, by employing a Split Task Queue (STQ) architecture, the cache system is capable of tolerating much longer latency of the memory system. Consequently, the cache idle time is reduced by 90%, which contributes to reducing the overall processing time by 24%∼40%.
The remainder of this paper is organized as follows. Section 2 and Sect. 3 describe the proposed design for the interpolation and cache components. Section 4 summarizes the overall MC architecture. Implementation results and conclusion are given in Sect. 5 and Sect. 6, respectively. Some terms are firstly defined. 1) When conflict happens in the cache system, some operations in the cache will stop and wait until all the conflict tasks are output to interpolation. This waiting time is defined as cache idle time. 2) For a whole decoder system, the time required for decoding the whole sequence, is named as decoding time. 3) In a cache system, the memory system latency is the time interval between the cache sending the memory request and receiving the data from the memory system. 4) Throughput is the amount of pixels which are processed in a unit processing time. 5) Cache hit ratio is the percentage of accesses that result in cache hits.
Parallelism of MC Interpolation
Most of the previous works on MC interpolation decompose an MB into 16 4x4 blocks and for each 4x4 block load an area of at most 9x9 reference pixels. As described in [3], 4 pixels in the same row are processed simultaneously to improve the data reuse and reduce the processing time. This 4-pixel parallel luma interpolator architecture is shown in Fig. 1. Luma interpolation involves using a 1-D 6-tap filter to generate the half-pel locations. A 1-D bilinear filter is then used to generate the quarter-pel locations. The data path of the luma interpolator is made up of 6x9 8-bit registers, (4+9) 6-tap filters, and four bilinear filters. The interpolator uses a 6-stage pipeline. At the input of the pipeline, for vertical interpolation, a column of 9 pixels is read from the frame buffer and used to interpolate a column of 4 pixels. A total of 9 pixels, representing the full and half-pel locations, are stored at every stage of the interpolator pipeline; specifically, the 4 interpolated half-pel pixels and the 5 center (positions 2 to 6) full-pel pixels of the 9 pixels from the frame buffer are stored. The 9 registers from the 6 stages are fed to 9 horizontal interpolators. Finally, 9:2 muxes are used to select two pixels located at full or half-pixel locations as inputs to the bilinear interpolator for quarter-pel resolution.
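The two filtering steps can be illustrated in software. The following is a minimal sketch of the one-dimensional case only, using the H.264/AVC 6-tap coefficients (1, -5, 20, 20, -5, 1) with rounding and clipping, followed by the rounded two-sample average for quarter-pel positions. It does not model the pipeline registers or the centre half-pel position, which is computed from unrounded intermediate values; the function names are hypothetical.

```python
def clip8(value):
    """Clip to the 8-bit sample range."""
    return max(0, min(255, value))


def six_tap(p):
    """Raw 6-tap filter sum with taps (1, -5, 20, 20, -5, 1)."""
    a, b, c, d, e, f = p
    return a - 5 * b + 20 * c + 20 * d - 5 * e + f


def half_pel(p):
    """Half-pel sample between p[2] and p[3]: round, divide by 32, clip."""
    return clip8((six_tap(p) + 16) >> 5)


def quarter_pel(x, y):
    """Quarter-pel sample: rounded average of two neighbouring samples."""
    return (x + y + 1) >> 1


def half_pel_line(pixels):
    """Nine full-pel samples in a row or column yield four half-pel samples."""
    assert len(pixels) == 9
    return [half_pel(pixels[i:i + 6]) for i in range(4)]


column = [10, 10, 10, 200, 200, 200, 200, 200, 200]
halves = half_pel_line(column)
print(halves)
print(quarter_pel(column[2], halves[0]))  # quarter-pel between full and half
```

The 9-to-4 mapping in `half_pel_line` mirrors the column of 9 pixels that is read to interpolate a column of 4 pixels in the text.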
However, the 4x4 block based row by row interpolation requires at most 288 clock cycles for processing one MB, which cannot meet the requirement of 4Kx2K applications. In order to increase the throughput, one solution is to expand the row of 4 pixels to 8 pixels (horizontal expansion), as shown in Fig. 2(a). However, when the partition size for inter prediction is 4x4 or 4x8, this method is not efficient. Since the 8 pixels in one row come from two different partitions, no loaded data can be shared and the processing speed cannot be improved. Moreover, when expanding one row from 8 pixels to 16 pixels, this method results in almost no improvement in throughput. Another way to increase the throughput is to process two or more rows in parallel (vertical expansion), as shown in Fig. 2(b). The processing time of 4x4 and 4x8 sized partitions can also be shortened when using the vertical expansion method. However, the data alignment problem decreases the speed. Especially for 4Kx2K applications, when four lines are parallelized, the data alignment problem becomes more serious. For example, as shown in Fig. 3, loading a vertically unaligned 4x4 block requires 2 clock cycles even when each word stores a 4x4 block. More loading clock cycles not only increase the memory power but also decrease the processing speed. Another parallelization method for the interpolation is to process two 4x4 blocks simultaneously, which is employed by Sze et al. [11]. However, the corresponding internal memory organization and data control can be very complicated.
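The alignment penalty in the 4x4 example can be expressed as a count of the memory words that a vertical span touches. The sketch below is a simplified model, assuming that one word is read per cycle, that each word covers `word_rows` consecutive rows, and that the block is horizontally aligned; the function name is hypothetical.

```python
def words_touched(start_row, num_rows, word_rows=4):
    """Number of memory words covered by rows start_row .. start_row + num_rows - 1."""
    first = start_row // word_rows
    last = (start_row + num_rows - 1) // word_rows
    return last - first + 1


# A 4-row block needs one read when aligned, two reads otherwise.
for offset in range(4):
    print(offset, words_touched(offset, 4))
```

In this model only one of the four possible vertical offsets avoids the second read, which is why the penalty grows in importance as more lines are processed in parallel.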
In order to obtain a suitable parallelization method for 4Kx2K applications, we propose to combine the horizontal and vertical expansion methods based on the following considerations. Firstly, owing to the high-level limits described in the H.264/AVC standard, although the horizontal expansion method is not so efficient for 4x4 and 4x8 partitions, it does not influence the average speed. One limit is that for level 3 or higher, which is designed for video specifications higher than or equal to 720x576@25 fps, bi-prediction MV is not allowed for partition sizes smaller than 8x8. This means the data loading times for interpolation of 8x4, 4x8 and 4x4 partitions can be less than that of the larger ones. The other is that on levels higher than 3.1, the maximum number of motion vectors per two consecutive MBs is 16, which further constrains the influence of small blocks. Moreover, a 4Sx4 internal memory organization, which is introduced in Sect. 3.2, can be utilized for the horizontal expansion method to reduce the memory data width. However, 8-pixel-parallel processing still cannot meet the throughput requirement of 4Kx2K applications. Therefore, on top of the horizontal expansion, a vertical expansion is further applied to process 2 rows in parallel, as shown in Fig. 2(c). Compared with the 4-row-parallel vertical expansion, the memory width of the proposed horizontal-vertical expansion method is reduced by half, and the influence of the alignment problem is decreased.
In order to further enhance the throughput, the interpolation of luma and chroma samples is parallelized. Since it was originally not easy to reuse the hardware resources of the luma and chroma interpolation components, the luma-chroma parallelism can provide 1.5 times the performance (for 4:2:0 sampling) with almost no hardware cost overhead.
Consequently, compared with the general 4x4 block based row by row interpolation architecture, the proposed Horizontal-Vertical Expansion and Luma-Chroma Parallelism (HVE-LCP) based interpolation can enhance the throughput to more than 4 times.
Proposed Cache Architecture
As shown in Fig. 4, for the cache system design, the cache mapping is targeted at reducing the off-chip DRAM bandwidth, and the internal memory organization is aimed at improving the data throughput and saving the on-chip memory bandwidth. Cache mapping has been well discussed in previous contributions, but few works pay much attention to the internal memory organization. In a 4Kx2K cache system, in order to meet the higher data throughput requirement, the width of the internal memory should increase proportionally. Moreover, the data alignment problem introduced in Sect. 2 further increases the internal memory bandwidth. Meanwhile, with a higher parallelism, the area increase of the other parts of the decoder is usually smaller than the speed-up [10]. Therefore, the power and cost portion of the internal memory part becomes more significant in the whole decoder system, unless a more efficient memory organization is used. The cache mapping method of this work is given in Sect. 3.1, and the proposed internal memory organization is presented in Sect. 3.2.
Moreover, the memory system latency increases from around 10 clock cycles in HD decoders to over 40 clock cycles in the new Quad-HD applications. A longer task queue is required to hide the longer memory system latency, and the longer task queue drastically increases the probability of conflict in the cache system, which results in long pipeline stalls and decreases the overall system performance. Therefore, the general conflict checking mechanism based on a single task queue is no longer efficient for the longer memory system latency. The details of the problem and the proposed solution are discussed in Sect. 3.3.
2-D Cache Mapping
The reference read operation of motion compensation (MC) makes up a dominant portion of a video decoder's DRAM traffic. To reduce this part of the DRAM bandwidth, a cache based architecture is utilized for reusing the overlapped reference samples of neighboring blocks. Figure 5(a) shows the 2-set 2x2-MBs sized 2-D cache for this work, which is similar to the design in [9]. The 2-D organization combines the lower parts of the parX and parY physical coordinates of the Access Units (AUs), which are the basic storage units in the DRAM, to form the cache index. The higher parts of the parX and parY coordinates, together with the picture ID (used to specify the physical storage slot of a decoded frame in the DRAM), are combined to form the tag. Considering the use of bi-directional inter prediction in the latest video coding standards, two cache sets are required for the two reference lists respectively. In our 4Kx2K video decoder [10], because of a wider bus width and the use of the frame recompression technique, the AU size equals the compression unit size, which is larger than that in [9]. The other difference from [9] is that the luma and the corresponding chroma samples are combined into the same AU. Hence, in this work, the AU size is 384 bits, containing the luma and chroma samples of an 8x4 block in the reference frame, as shown in Fig. 5(b). Moreover, Partial-MB reordering (PMBR) applied in our whole decoder [10] can increase the cache hit ratio. For the MC cache architecture, PMBR is only related to the cache size. In this paper, to make a fair comparison with the other works, we use a non-PMBR configuration of the cache. As a result, by applying the 2-D cache mapping, an average of 60% reduction of external DRAM bandwidth for reference frame read can be achieved, on the basis of the previous VBSMC [12] scheme.
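The index and tag construction can be sketched as follows. The bit widths are an assumption derived from the sizes given above: a 2x2-MB (32x32-pixel) set holding 8x4-pixel AUs has 4 AU columns and 8 AU rows, so 2 low bits of parX and 3 low bits of parY are used here as the index. The function and field names are hypothetical, and the selection between the two sets is not modeled.

```python
from typing import NamedTuple, Tuple

X_INDEX_BITS = 2  # assumed: 32-pixel-wide set / 8-pixel-wide AU = 4 columns
Y_INDEX_BITS = 3  # assumed: 32-pixel-high set / 4-pixel-high AU = 8 rows


class CacheAddress(NamedTuple):
    index: int
    tag: Tuple[int, int, int]


def map_au(par_x, par_y, pic_id):
    """Split AU coordinates into a 2-D cache index and a tag."""
    x_low = par_x & ((1 << X_INDEX_BITS) - 1)
    y_low = par_y & ((1 << Y_INDEX_BITS) - 1)
    index = (y_low << X_INDEX_BITS) | x_low
    tag = (pic_id, par_x >> X_INDEX_BITS, par_y >> Y_INDEX_BITS)
    return CacheAddress(index, tag)


def is_hit(tag_ram, par_x, par_y, pic_id):
    """A hit occurs when the tag stored at the index matches the request."""
    addr = map_au(par_x, par_y, pic_id)
    return tag_ram.get(addr.index) == addr.tag


tag_ram = {}
addr = map_au(5, 9, 2)
tag_ram[addr.index] = addr.tag
print(addr, is_hit(tag_ram, 5, 9, 2), is_hit(tag_ram, 9, 9, 2))
```

With this mapping, any window of 4x8 neighboring AUs occupies distinct cache lines regardless of its position, which is the property that lets overlapping reference areas of neighboring blocks stay resident together.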
Internal Memory Organization
The proposed internal memory organization aims to meet the high data throughput requirement of interpolation without significantly increasing memory area and memory power.
Generally, an MB is decomposed into 4x4 blocks. For each 4x4 block, an area of at most 9x9 pixels is loaded for interpolation. In [3], one 32-bit (4-sample) wide RAM is used, so that at least 3 cycles are needed to load 9 pixels. Thus, for each 4x4 block, 27 cycles are required for data loading. Chen et al. [9] propose an interlaced storage format to buffer the AUs in two 64-bit (8-sample) wide RAMs (hereafter 8Sx2). As shown in Fig. 6(a), by using this 8Sx2 internal memory organization, the required 9 pixels of one row can be fetched in one cycle, which enhances the data throughput. However, this is still not enough for 4Kx2K applications. As described in Sect. 2, there are two ways to increase the interpolation throughput. One is vertical expansion, as shown in Fig. 6(b). When using the 8Sx2 scheme with vertical expansion, the memory width increases proportionally with the data throughput requirement. Even though the memory size is the same, the wider memory width increases memory area and memory power.
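The loading cycle counts quoted for the 4-sample RAM and for the 8Sx2 scheme follow from a simple best-case model, sketched below. It assumes that one read per cycle returns `samples_per_cycle` usable samples of a single row and ignores any alignment penalty; the function names are hypothetical.

```python
import math


def row_load_cycles(samples_per_row, samples_per_cycle):
    """Best-case cycles to fetch one row of reference samples."""
    return math.ceil(samples_per_row / samples_per_cycle)


def block_load_cycles(rows, samples_per_row, samples_per_cycle):
    """Best-case cycles to fetch a reference area row by row."""
    return rows * row_load_cycles(samples_per_row, samples_per_cycle)


print(block_load_cycles(9, 9, 4))   # one 4-sample wide RAM, as in [3]
print(block_load_cycles(9, 9, 16))  # 8Sx2: 9 samples of a row in one cycle
```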
A 4Sx4 (interlaced storage in four 4-sample wide RAMs) scheme is designed to maintain the total memory width while expanding the horizontal parallelism. As described in Fig. 6(c), when two horizontally neighboring 4x4 blocks have the same MV (or are in one partition), at most 13 pixels per row are required for interpolation. Four 4-sample wide RAMs with an interlaced storage format are applied to ensure that the 13 samples are generated in one cycle, while maintaining the total memory width at 16 samples. Based on this 4Sx4 scheme and the interpolator described in Sect. 2, which processes two rows of luma and chroma samples at the same time, each RAM word should contain the luma and chroma samples of a 4x2 block. The proposed internal memory organization is shown in Fig. 5(c): every AU is divided into four sub-blocks, each of which contains 4x2 luma samples and the corresponding 2x1x2 chroma samples (4:2:0 sampling). These 4 sub-blocks are stored into the 4 different RAMs, while the storing sequence is determined by the lowest bit of parX, to ensure that neighboring pixels in the same two lines are not stored in the same RAM. As a result, each AU can be written to the data RAMs in one cycle, and 32 pixels in 2 lines from different AUs can be read in one cycle.
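The reason four 4-sample wide RAMs suffice for 13 samples can be seen from a simplified one-dimensional model. The sketch below assumes that consecutive 4-sample words along a line are assigned to the RAMs in round-robin order; this is an illustrative stand-in for the parX-based storing sequence and not the exact layout of Fig. 5(c). The function names are hypothetical.

```python
SAMPLES_PER_WORD = 4
NUM_RAMS = 4


def rams_for_window(start_x, width=13):
    """RAMs accessed when reading `width` consecutive samples from start_x."""
    first_word = start_x // SAMPLES_PER_WORD
    last_word = (start_x + width - 1) // SAMPLES_PER_WORD
    return [word % NUM_RAMS for word in range(first_word, last_word + 1)]


def is_conflict_free(start_x, width=13):
    """True if no RAM is accessed twice, so one cycle is enough."""
    rams = rams_for_window(start_x, width)
    return len(rams) == len(set(rams))


# Every starting position of a 13-sample window is served in one cycle.
print(all(is_conflict_free(x) for x in range(64)))
```

In this model a 13-sample window never spans more than four words, whereas a 14-sample window starting at offset 3 would span five and hit one RAM twice.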
Split Task Queue Architecture
In order to tolerate the longer memory system latency in the 4Kx2K decoder, a Split Task Queue (STQ) architecture is proposed. Figure 7 shows the previous cache architecture proposed in [9]. Firstly, tasks which describe the location and size of a reference block are sent to the JUDGE unit, which judges miss or hit of the AUs inside the reference block according to the TAG RAM. If the needed AUs are not in the data RAM, read requests are sent to DRAM, and then the fetched AUs are written to the data RAM. When all the required data for the task is available in the data RAMs and the interpolation unit is ready, the data for this task is output. Because the time from the cache sending read requests to receiving the required data from the memory system is long, a task queue is applied after the JUDGE unit to store the tasks that are waiting for data from the memory system, in order to hide the memory system latency. When using the task queue, subsequent basic blocks can be continuously processed during the waiting time. However, in this architecture, a conflict checking operation must be performed before the JUDGE unit sends the current task into the queue, to avoid flushing useful data in the cache. Conflict checking searches the task queue, which stores the previous tasks, and detects whether the required data of the current task will flush the data required by previous tasks. If there is no conflict, the current task is sent to the queue, and read requests are sent to DRAM when the AUs needed by this task are not in the data RAM. Otherwise, the JUDGE unit stops sending tasks to the queue and requests to DRAM until all conflicting tasks are output. Based on this design, the length of the task queue is decided by the memory system latency and the speed of interpolation. In the 4Kx2K decoder, the longer system latency increases the length of the queue. Consequently, the conflict checking operation, which checks all the tasks in the queue, costs a larger gate count. Moreover, a longer queue brings a higher conflict probability and results in more idle time.
In order to overcome the above problems, we separate the task queue into two queues. One, called the DUT queue, stores the data unready tasks, and the other, called the DRT queue, buffers the data ready tasks, as shown in Fig. 8. In the proposed system, the JUDGE unit continuously sends tasks to the following DUT queue, and when the required data of a task is available, this task is sent to the RECEIVE unit. Then, the RECEIVE unit checks whether the required data of the current data ready task will flush the data required by previous ones stored in the DRT queue. If a conflict happens, the RECEIVE unit stops sending the task to the DRT queue until all the conflicting tasks in the DRT queue are output. When the interpolation unit is ready, the task in the DRT queue is sent out. Thus, the DRT queue, which is utilized for conflict checking, can be shorter than the queue in the previous architecture, since the length of the DRT queue depends only on the speed of interpolation. As a result, with the STQ architecture, the influence of the longer memory system latency can be reduced, which results in fewer pipeline stalls and lower hardware cost.
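The conflict check itself, whether performed by the JUDGE unit against the single task queue or by the RECEIVE unit against the DRT queue, can be sketched in a few lines. This is a minimal software model under an assumed representation: each task is a mapping from cache index to the tag of the AU it needs at that index, and a conflict is a queued task that still needs different data at an index the new task would overwrite. The names are hypothetical.

```python
from collections import deque


def has_conflict(new_task, queue):
    """True if new_task would overwrite a cache line a queued task still needs."""
    return any(
        index in queued and queued[index] != tag
        for queued in queue
        for index, tag in new_task.items()
    )


# Queue of tasks waiting for interpolation; each maps cache index -> tag.
drt_queue = deque([{5: "A", 6: "B"}, {7: "C"}])

print(has_conflict({5: "A"}, drt_queue))  # same data at index 5: no conflict
print(has_conflict({5: "D"}, drt_queue))  # would evict "A": conflict
print(has_conflict({9: "E"}, drt_queue))  # unused index: no conflict
```

Since every entry of the queue takes part in the comparison, the cost of the check and the chance of finding a conflict both grow with the queue depth, which is the motivation for keeping the checked queue short.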
System Architecture
The whole motion compensation architecture is illustrated in Fig. 8. Firstly, tasks which describe the location and size of a reference block are sent to the JUDGE unit, which judges miss or hit of the AUs inside the reference block according to the TAG RAM. Then, the JUDGE unit sends the task information to the DUT queue and the addresses of the missed AUs to the request queue (REQ queue). According to these addresses, DRAM read requests are generated, and the burst length of the requests, which is utilized for vertically successive AUs, is checked. Together with the DRAM controller employed in our decoder, which was described in [13], the burst command can reduce the DRAM latency. When the data is fetched back from DRAM, the RECEIVE unit should detect whether all the missed AUs of the current task are available. For this purpose, a dirty code stored in the DUT queue is applied for AU ready detection. For each task, the dirty code represents the status of the required AUs. If an AU is missed, the corresponding bit in the dirty code is set to 0; otherwise, it is set to 1. When a missed AU is received, the corresponding bit in the dirty code is set to 1, and once the dirty code is all 1s, which means that all the required AUs are available, the task is finished. After conflict checking, the finished task is sent to the DRT queue. Then, when the interpolation unit is ready, the task in the DRT queue is sent out and the required data for interpolation is output. Based on the proposed internal memory organization, the interpolation unit can receive the data required for 8x8/8x4 based two-row parallel processing in one cycle.
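The dirty code mechanism amounts to a per-task bitmask. The following minimal sketch assumes that bit i corresponds to the i-th required AU of the task; the class and method names are hypothetical and the queues themselves are not modeled.

```python
class Task:
    """A task in the DUT queue with its dirty code."""

    def __init__(self, required_aus, missed_aus):
        self.required = list(required_aus)
        self.dirty = 0
        # Bit i is 1 if AU i was a hit (already available), 0 if it was missed.
        for i, au in enumerate(self.required):
            if au not in missed_aus:
                self.dirty |= 1 << i

    def receive(self, au):
        """Mark a missed AU as available once it arrives from DRAM."""
        if au in self.required:
            self.dirty |= 1 << self.required.index(au)

    def is_finished(self):
        """The task is finished when every bit of the dirty code is 1."""
        return self.dirty == (1 << len(self.required)) - 1


task = Task(required_aus=["au0", "au1", "au2"], missed_aus={"au1", "au2"})
print(bin(task.dirty), task.is_finished())
task.receive("au1")
task.receive("au2")
print(bin(task.dirty), task.is_finished())
```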
Implementation Results and Comparison
The proposed architecture is implemented in Verilog HDL on RTL level, and then synthesized with Synopsys DesignCompiler by using SMIC 90 G standard cell library. This design is verified both independently in a test environment with inputs given as software generated data, and in a whole Quad-HD video decoder architecture [10] .
Interpolation Performance
In order to efficiently enhance the interpolation throughput, an HVE-LCP scheme is proposed. Table 1 shows the average interpolation processing time for different sequences; these values cover inter MBs only. Because MVs and partition sizes differ, the interpolation processing time differs from MB to MB. In our work targeting 4Kx2K applications, considering that bi-prediction is not allowed for partition sizes smaller than 8x8 on high levels, the worst case is 80 cycles/MB. This case happens when the MB is partitioned into 16 4x4 blocks, each 4x4 block requires a 9x9 block from the reference frame, and each 9x9 block is unaligned. Hence, the probability of this case is very low. Moreover, since on level 3 or higher the maximum MV number of two consecutive MBs is limited to 16, the worst case for one MB, which is 80 cycles, only happens when the neighboring MB is intra. So, in this case, the average processing time for the two consecutive MBs is 40 cycles/MB. Considering the maximum MV number limit and that bi-prediction is forbidden for partition sizes smaller than 8x8, the worst case for two consecutive MBs is 130 cycles. Figure 10 shows an example of the worst case for two consecutive MBs. Hence, the worst-case average processing time for each MB is 65 cycles. Figure 9 shows the detailed result of the IntoTrees sequence, which has the worst performance in Table 1. As shown in this figure, most of the MBs require 56 cycles, and the average processing time is 48 cycles/MB. The speed requirement in our whole pipelined 4Kx2K decoder is 64 cycles/MB, which is described in [10]. The average speed of the proposed interpolation, and even 58 cycles/MB which has the highest occurrence probability, can meet the requirement.
Cache Memory Features
In order to reduce the internal memory power and area, the 4Sx4 scheme is proposed. Based on this internal memory organization, four 96-bit-64-word data RAMs are applied to sustain an interpolation throughput of two lines with 8 pixels in each line per cycle. By using the SMIC register file generator, the memory area of our work is 108800 µm². The other way to realize a throughput similar to that of our work is to process four lines with 4 pixels in each line in parallel. For this method, based on the 8Sx2 scheme, two 384-bit-32-word data RAMs are required. The memory area of this method is 153836 µm², which is much larger than ours. Table 2 shows the power comparison between the proposed 4Sx4 based memory organization and the 8Sx2 based one. For our work, the reading power reduction is 5∼9 mW, since both the number of reads and the unit reading power are reduced. Because the same cache size is utilized for these two methods, the total amount of written data is the same. Hence, the writing power of our work is a little higher due to the larger memory depth. However, since the unit writing power is lower and the writing ratio is much lower than the reading ratio, the increase in total writing power is not significant. Finally, the total memory power reduction can be 39∼49%.
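That the two organizations hold the same amount of data while differing in total word width can be verified directly from the RAM dimensions given above. The sketch below only does this arithmetic; the function names are hypothetical.

```python
AU_BITS = 384  # AU size used in this work


def capacity_bits(num_rams, width_bits, depth_words):
    """Total storage of a group of identical RAMs."""
    return num_rams * width_bits * depth_words


def total_width_bits(num_rams, width_bits):
    """Bits accessed when all RAMs are read in the same cycle."""
    return num_rams * width_bits


proposed = capacity_bits(4, 96, 64)     # 4Sx4: four 96-bit-64-word RAMs
reference = capacity_bits(2, 384, 32)   # 8Sx2: two 384-bit-32-word RAMs

print(proposed // 8, reference // 8)    # capacity in bytes
print(proposed // AU_BITS)              # number of AUs that fit
print(total_width_bits(4, 96), total_width_bits(2, 384))
```

Both configurations store 3072 bytes, that is, 64 AUs, while the total word width of the 4Sx4 configuration is half that of the four-line 8Sx2 configuration.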
Overall Performance
The STQ scheme is proposed to reduce the influence of the longer memory system latency, and thereby decrease the area and the cache idle time.
In the previous cache structure for comparison, as shown in Fig. 7 , the queue which is utilized for conflict checking is 60-bit wide and 20-word deep. The area cost of this queue with conflict checking is 20.8 k, when synthesized with Synopsys DesignCompiler by using SMIC 90 G standard cell library.
In the proposed STQ architecture as described in Fig. 8, the DUT queue is 60-bit wide and 16-word deep, while the DRT queue is 36-bit wide and 4-word deep. The total area of the DUT queue and the DRT queue with conflict checking is 15.7 k (10.6 k for the DUT queue and 5.1 k for the DRT queue with conflict checking). Therefore, the total area can be reduced by 25%. Besides the low area cost, the STQ architecture can significantly reduce the idle time, which contributes to reducing the overall processing time. Table 3 shows that the decoding time reduction is from 24% to 40%, compared with the architecture without STQ. The details of every frame of the IntoTrees sequence are shown in Fig. 11. For this sequence, compared with the architecture without STQ, the cache idle time is reduced by about 90%. In the whole system, the processing time for P frames is decreased by about 22%, and the B frame decoding time reduction is about 46%. Consequently, the average processing time is reduced by 39%.
Note to Table 3: The decoder system cannot support real-time decoding of this sequence because, when the QP is 24, the bit-rate of this sequence is larger than the specification of entropy decoding, so the entropy decoding component becomes the bottleneck of the whole decoder instead of MC. When the QP is larger, the whole decoder can support real-time decoding of this sequence.
Notes to Table 4: 1) … [15] and [8], respectively. 2) SP: single-port SRAM or register file with one R/W port; TP: two-port SRAM or register file with one read port and one write port. 3) Considering the bi-prediction limits on high levels, the throughput is 288 cycles/MB; otherwise, it is 384 cycles/MB. 4) Considering the maximum MV number and bi-prediction limits on high levels; the worst case per two consecutive MBs is 130 cycles.
Whole Architecture Performance Comparison
A comparison between this architecture and state-of-the-art works is shown in Table 4. In our design, the worst-case interpolation throughput is 65 cycles/MB, when considering the maximum MV number and bi-prediction limits on high levels. Compared with the previous works, the throughput is enhanced to more than 4 times. At the cost of increased parallelism, the logic gate count is also increased. When synthesized with the SMIC 90 nm process with a timing constraint of 200 MHz, the architecture costs a logic gate count of 108.8 k, including 37.6 k for the cache and 71.2 k for interpolation, which is competitive considering its high performance. Moreover, owing to the 4Sx4 based internal memory organization, the memory area and memory power are optimized. Finally, with the STQ scheme, our design can tolerate longer memory system latency and reduce the decoding time of the whole system.
Conclusion
In this paper, three schemes are proposed to achieve an efficient MC architecture for H.264/AVC real-time decoding of Quad-HD applications. Firstly, a high-performance interpolator based on Horizontal-Vertical Expansion and Luma-Chroma Parallelism (HVE-LCP) is proposed to increase the processing throughput to more than 4 times that of the previous designs. Secondly, an efficient cache memory organization scheme (4Sx4) is adopted to improve the on-chip memory utilization, which contributes to a memory area saving of 25% and a memory power saving of 39∼49%. Finally, by employing a Split Task Queue (STQ) architecture, the cache system is capable of tolerating much longer latency of the memory system. Consequently, the cache idle time is reduced by 90%, which contributes to reducing the overall processing time by 24∼40%. When implemented with the SMIC 90 nm process, this design costs a logic gate count of 108.8 k and an on-chip memory of 3.1 kB. We also verified this design both independently, in a test environment with software-generated input data, and in a whole Quad-HD video decoder architecture [10].
Program of Waseda University" of the Ministry of Education, Culture, Sports, Science and Technology, Japan, and by the CREST project of Japan Science and Technology Agency.
