Nowadays, people enjoy watching mobile videos more than ever and mobile video streaming contributes to the majority of the total mobile data traffic. However, due to the high power consumption of mobile video decoders, especially the on-chip memories, short battery life represents one of the biggest contributors to user dissatisfaction. Various mobile embedded memory techniques have been investigated to reduce power consumption and prolong battery life. Unfortunately, the existing hardware-level research suffers from high implementation complexity and large overhead. In this paper, by introducing advanced data-mining techniques, we investigate meaningful data patterns hidden in mobile video data and apply the identified patterns to implement a low-power flexible hardware design with dynamic power-quality trade-off. A 45nm 32kb SRAM is presented that enables three levels of power-quality trade-off (up to 43.7% power savings) with negligible area overhead (0.06%).
INTRODUCTION
In recent years, the growing popularity of smartphones and the wide availability of high-speed networks have resulted in exponentially increasing demand for video services on mobile devices. According to market research, by 2020 the amount of data that is created, replicated, and consumed will be as large as 40 ZB (zettabytes, where 1 ZB = 10^21 B) [14], and more than half of the data traffic will be video data [15]. However, as people enjoy watching videos anytime and anywhere, video processing has become the most important energy-intensive application used in mobile devices [16]. In particular, the intensive computations of processing these video streams need to frequently access on-chip memory, which contributes to over 30% of system power consumption and occupies more than 65% of video decoder area [17, 18]. This situation is only expected to worsen with emerging high-quality mobile video formats such as 8K Ultra HD, due to the increased complexity per pixel and the large amount of on-chip memory required for pipeline buffers [19].
Various low-power SRAM designs have been developed for mobile video applications. In [10] and [24], two hybrid SRAM structures with 6T/8T and 8T/10T bitcells are developed to optimize the power efficiency for mobile video streaming. In [16], a heterogeneous sizing scheme is presented to reduce the failure probability of conventional 6T bitcells. In [11], a two-port SRAM with majority logic and data reordering is presented to minimize power consumption. However, all of these existing techniques suffer from large silicon area overhead. Also, the power-quality trade-off is set at design time, so it cannot automatically guarantee maximum power efficiency for different video applications. Recently, Frustaci et al. [12] presented a voltage-scaled SRAM that can dynamically manage the trade-off between power and video quality, but the write-assist technique and the Error Correcting Code (ECC) encoder and decoder circuits it relies on result in large penalties in computation complexity and silicon area.
In this paper, we propose a low-cost Data-Driven power efficient Adaptable SRAM Hardware (D-DASH) design with dynamic power-quality tradeoff for mobile video applications. By introducing advanced data mining techniques, we investigate meaningful data patterns hidden in video data and incorporate these key findings into our hardware design. D-DASH enables three levels of power-quality management (up to 43.7% power savings) with negligible area overhead (0.06%). This paper is organized as follows. In Section 2, data-mining enabled mobile video data patterns are analyzed. In Section 3, we present D-DASH. The evaluation results are provided in Section 4. Finally, the conclusion is drawn in Section 5.
MOBILE VIDEO DATA PATTERN ANALYSIS
Mobile Video Data Characteristics
Mobile video application characteristics imply that it is possible to incorporate application-level video data behaviors into the hardware-level design process. Although mobile videos are delivered over different networks and are displayed on various mobile terminals, there are three common characteristics that may contribute to hardware-level memory design [20]: (1) inputs: the video data is noisy and redundant; (2) outputs: the videos on mobile devices are generated for humans, and minor variations cannot be discerned by the human eye; and (3) computation patterns: statistical computations during the video decoding process potentially result in specific data patterns, which can contribute to low-power hardware design. However, it is difficult for hardware designers to observe the inherent video data behaviors directly from the large volume of video data. To address this, we introduce data-mining techniques to comprehensively explore the characteristics of stored mobile video data. In particular, we use an association rule mining technique to explore the relationship between different video data bits and obtain data patterns for efficient hardware design.
ISLPED '16, August 08-10, 2016, San Francisco Airport, CA, USA
Data Mining Assisted Video Analysis
Today's mobile video frames are typically stored and processed in YUV format. This format includes one luma (Y) component, which contains the brightness information of the image, and two chroma components, which contain the blue-difference (Cb) and red-difference (Cr) color information. Fig. 1 shows a typical frame of video data stored in embedded memory, using a 352 × 288 resolution YUV 4:2:0 video as an example. As shown, each pixel has 8-bit luma data and 8-bit subsampled chroma data. Since video data is stored in on-chip memory as binary bits, we utilize an association rule mining technique to identify the bit-level data patterns.
Association rule mining was introduced in 1993 to discover relationships between different variables, called items, in a dataset or database [9]. A complete dataset is made up of many transactions, where each transaction contains a set of items. Each item can be associated with a binary attribute, 0 or 1, that indicates whether the item is present in the corresponding transaction. This type of data organization is illustrated in Fig. 1. Each rule generated by the association rule mining process is an implication of the form X → Y, where X and Y are disjoint sets of items (or individual items). Each rule is also accompanied by two statistics collected from the dataset, called the support and confidence values. The support value for a set of items is the proportion of transactions in the dataset that contain that set of items. The confidence value for an association rule X → Y is the proportion of transactions containing X that also contain Y, that is, the conditional probability P(Y | X).
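The support and confidence definitions above can be illustrated with a minimal Python sketch. The toy transactions and the item names used here (such as "Cr1" or "Cb2") are hypothetical and serve only to show the arithmetic; this is not the Weka/Apriori workflow used in the paper.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], items: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    wanted = frozenset(items)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(transactions: List[Transaction],
               x: Iterable[str], y: Iterable[str]) -> float:
    """Estimate P(Y | X) as support(X and Y together) / support(X)."""
    x_set, y_set = frozenset(x), frozenset(y)
    sup_x = support(transactions, x_set)
    if sup_x == 0:
        raise ValueError("X never occurs, so the confidence is undefined")
    return support(transactions, x_set | y_set) / sup_x


# Hypothetical toy dataset: each transaction lists the bits that are 1.
toy = [
    frozenset({"Cr1", "Cb2", "Cb3"}),
    frozenset({"Cr1", "Cb2"}),
    frozenset({"Cr2", "Cr3"}),
    frozenset({"Cr1", "Cb3"}),
]

print(support(toy, {"Cr1"}))              # 3 of 4 transactions contain Cr1
print(confidence(toy, {"Cr1"}, {"Cb2"}))  # 2 of the 3 Cr1 transactions contain Cb2
```

Apriori additionally prunes candidate item sets whose support falls below a threshold; the sketch only evaluates the two statistics for a rule that is already given.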
To enable association rule mining, we use different video benchmarks with various characteristics (e.g., motion and scene complexity) to build a dataset, including 12 videos from [3] and 4 videos from [4]. In total, the video data size is 415,600,000 bytes. The 12 benchmark videos from [3] were combined into a single YUV file. This combined video file contains a total of 3470 frames with 352 × 288 resolution. Each chroma bit is defined as an individual item, and a sample of the .arff format with descriptions of its different parts can be seen in Fig. 1. We used Weka [2] to run the well-known association rule mining algorithm Apriori on our large video dataset. To evaluate the impact of compression on the identified data patterns, two formats, the non-sampled YUV 4:4:4 format and the subsampled YUV 4:2:0 format, are investigated using data-mining techniques. They both contain eight bits of luma (Y) data and eight bits of each chroma component (Cb and Cr) that are shared among 4 pixels.
One interesting data pattern we obtained from the results is the strong association of chroma bits with Cr's most significant bit (MSB), Cr1. As shown in Fig. 2, if the value of Cr1 is equal to 1 (0), the remaining Cr bits have a larger probability of being 0 (1). Other than Cb1, the majority of Cb bits have a larger probability of being 0 (1) when Cr1 equals 0 (1). Based on these identified data patterns, a flexible D-DASH with power-quality adaptation is implemented, which is discussed in the following section.
Video Quality Metrics
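The well-known peak signal-to-noise ratio (PSNR) metric is widely applied to evaluate video quality [10, 12, 23]. It is defined as [21]

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{255^2}{\mathrm{MSE}}\right) \qquad (1)$$

where MSE is the mean square error between the original videos (Org) and the degraded videos (Deg), expressed as

$$\mathrm{MSE} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left(\mathrm{Org}(i,j) - \mathrm{Deg}(i,j)\right)^2 \qquad (2)$$

with M × N denoting the frame size in pixels.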
However, recent research shows that PSNR cannot describe the true human perception of videos, since it only takes the amount of error into account and not the effect that the errors have on the user's perception of the displayed image [1]. Accordingly, the structural similarity (SSIM) metric was developed to predict the perceived image quality; it combines separate calculations for luminance, contrast, and structure changes, as expressed in (3). In our analysis, we use both PSNR and SSIM to evaluate the video output quality. The quality reduction (PSNR(SSIM)% Reduction) can be calculated using (4):
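$$\mathrm{PSNR(SSIM)\%\ Reduction} = \left(1 - \frac{\mathrm{PSNR(SSIM)}_{Deg}}{\mathrm{PSNR(SSIM)}_{Org}}\right) \times 100\% \qquad (4)$$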
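The PSNR definition in (1), together with the mean square error it depends on, can be sketched in a few lines of Python. This is a minimal illustration that assumes 8-bit samples (peak value 255) supplied as flat sequences; the function names and the sample values are hypothetical.

```python
import math
from typing import Sequence


def mse(org: Sequence[int], deg: Sequence[int]) -> float:
    """Mean square error between original and degraded samples."""
    if len(org) != len(deg) or len(org) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(org, deg)) / len(org)


def psnr(org: Sequence[int], deg: Sequence[int], peak: int = 255) -> float:
    """PSNR in dB; returns infinity when the two inputs are identical."""
    err = mse(org, deg)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)


# Hypothetical 4-sample example: squared errors are 1, 1, 0 and 4.
org = [100, 120, 140, 160]
deg = [101, 119, 140, 162]
print(mse(org, deg))   # 1.5
print(psnr(org, deg))
```

A full-frame evaluation would apply the same computation over all pixels of each frame and average over frames; SSIM needs local means, variances, and covariances and is not covered by this sketch.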
PROPOSED D-DASH
This section presents D-DASH, which is enabled by the identified data patterns. D-DASH is highly flexible, with three design schemes (D-DASH-I to D-DASH-III) that provide a run-time dynamic power-quality trade-off for mobile video streaming.
D-DASH-I
D-DASH-I enables zero-cost power-efficient storage by using data-aware low-power readout buffer connections based on the data patterns obtained from mobile video data, as shown in Fig. 3. In a conventional SRAM, a readout buffer consisting of two NMOS transistors is used to access the stored value by connecting it to the inverted storage node (QB) [22], as shown in Fig. 3(a). During a read, the read bit-line (RBL) is precharged to the supply voltage (Vdd) before RWL is asserted. If a 1 is stored in the SRAM cell when RWL is asserted, RBL stays at Vdd because the bottom NMOS is turned off; there is no switching activity on RBL, which enables a low-power read. If a 0 is stored in the SRAM cell, RBL is discharged to ground (GND), resulting in large power consumption. This discharging activity during readout contributes significantly to the power consumption of mobile video memory [10, 11, 12, 24]. In this paper, data-aware low-power readout buffer connections are proposed. The traditional connection shown in Fig. 3(a) is referred to as a type-1 bitcell. Alternatively, a type-0 bitcell is presented to achieve a low-power read of 0 by connecting the readout buffer to Q, as shown in Fig. 3(b). Note that, compared to the conventional SRAM bitcell (type-1), the type-0 bitcell does not cause any silicon area overhead.
The obtained data patterns in Section 2.2 show that, in the chroma data for each pixel, there is over 70% probability that the Cr and Cb data will be 1000000001111111 in binary. Based on this, type-1 and type-0 bitcells are applied to store/load 1 and 0, respectively. Accordingly, for the majority of chroma data, the switching activity during the readout process is significantly reduced, achieving power savings without area overhead. Fig. 3 (c) shows the structure of one wordline in D-DASH-I. Based on the identified data pattern, it uses an optional combination of type-0 and type-1 bitcells to enable zero-overhead power efficient mobile video on-chip storage.
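The effect of matching bitcell types to the dominant pattern can be illustrated with a small behavioral model in Python. It assumes, following the description above, that a type-1 cell discharges RBL only when it stores a 0 and a type-0 cell only when it stores a 1; the function names are hypothetical and no electrical behavior is modeled.

```python
from typing import List

PATTERN = "1000000001111111"  # dominant Cr+Cb pattern reported in the text


def cell_types(pattern: str = PATTERN) -> List[int]:
    """Use a type-1 cell where the expected bit is 1 and a type-0 cell where it is 0."""
    return [int(b) for b in pattern]


def discharges(word: str, types: List[int]) -> int:
    """Count read bitlines that discharge when `word` is read.

    A type-t cell keeps RBL precharged when the stored bit equals t
    and discharges it otherwise.
    """
    return sum(1 for bit, t in zip(word, types) if int(bit) != t)


conventional = [1] * 16      # every bitcell is type-1
data_aware = cell_types()    # D-DASH-I style assignment

print(discharges(PATTERN, conventional))  # 8 bitlines discharge
print(discharges(PATTERN, data_aware))    # 0 bitlines discharge

complement = "0111111110000000"
print(discharges(complement, data_aware))  # 16: worst case for this assignment
```

The last line shows why data that does not follow the dominant pattern is costly under a fixed assignment: a word that is the bitwise complement of the pattern discharges every bitline.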
D-DASH-II
Based on D-DASH-I, we implement D-DASH-II for additional power savings by considering the rest of the data with the pattern of 1000000001111111. Since the most significant bit (MSB) of the Cr data largely determines the values of the lower-order bits, as shown in the identified patterns, we add a write circuit and a read circuit to the SRAM to maximize the read bitline power savings, as shown in Fig. 4. In the write circuit, we first detect the MSB of the Cr data and then determine whether to invert the input data. We use a flag-bit scheme similar to [22] to indicate whether the data is flipped: 1 means the data is not inverted and 0 means the data is inverted. The scheme implemented in [22] uses an additional bitcell to store the flag bit, causing 7% area overhead. To minimize the overhead, we instead use the least significant bit (LSB) of the chroma data to store the flag bit. Our results in Table 1 show that using the Cr LSB induces negligible video quality degradation (0.044% PSNR reduction and 0.058% SSIM reduction); therefore, we use the Cr LSB to store the flag bit, as shown in Fig. 4. Accordingly, D-DASH-II enables more power savings than D-DASH-I, at the cost of a small implementation overhead and negligible video quality degradation.
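The flag-bit scheme described above can be sketched behaviorally in Python. The sketch rests on several assumptions that the text does not state explicitly: the 16-bit chroma word holds Cr in the high byte and Cb in the low byte (matching the order of the pattern 1000000001111111), and the word is inverted when the Cr MSB is 0 so that the stored data follows the dominant pattern. The function and constant names are hypothetical.

```python
WORD_MASK = 0xFFFF
CR_MSB = 1 << 15    # assumed position of the Cr most significant bit
FLAG_BIT = 1 << 8   # assumed position of the Cr LSB, reused as the flag


def write_word(chroma: int) -> int:
    """Return the value stored in the array for a 16-bit chroma word."""
    chroma &= WORD_MASK
    if chroma & CR_MSB:
        # Already close to the dominant pattern: store as-is, flag = 1.
        return chroma | FLAG_BIT
    # Otherwise invert the word (assumption) and mark it with flag = 0.
    return (~chroma & WORD_MASK) & ~FLAG_BIT


def read_word(stored: int) -> int:
    """Undo the inversion according to the flag held in the Cr LSB."""
    if stored & FLAG_BIT:
        return stored
    return ~stored & WORD_MASK


word = 0x7F80                   # 0111111110000000, complement of the pattern
stored = write_word(word)
print(f"{stored:016b}")         # 1000000001111111
print(f"{read_word(stored):016b}")
```

Because the Cr LSB is overwritten by the flag, the value read back can differ from the original only in that one bit, which is the source of the small quality loss reported for D-DASH-II.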
D-DASH-III
To meet the high power efficiency requirement of some video applications, we further design D-DASH-III to maximize the power savings based on a bit-truncation (dropping) technique. Bit truncation has been widely used in mobile video memory [13]. To determine the number of bits to truncate, we evaluate the video quality with the bit-truncation technique; the results are listed in Table 1. When the number of truncated bits is larger than 7, the PSNR and SSIM reductions are significant (PSNR reduction ≥ 24.395% and SSIM reduction ≥ 2.512%), indicating large video quality degradation. Accordingly, the number of bits for truncation can be 5 or 7. We further evaluate the video output quality using the sign_irene video benchmark, as shown in Fig. 5. The sign_irene video contains blue and red colors that would be directly affected by the corruption of chroma data [5]. It shows that truncating 5 LSBs (2 LSBs of Cr and 3 LSBs of Cb) is an optimized trade-off between video quality and power saving.
To enable the bit-truncation technique, we implement the control circuit as shown in Fig. 4 . The control bit SW is connected to Write Enable (WE) and Read Enable (RE) of a wordline. When SW is 0, all of the 32 bitcells connected to the wordline work as traditional bitcells; when SW is 1, D-DASH-III is enabled and the outputs (WE_out and RE_out) will disable the WE and RE of the truncated bitcells (yellow bitcells shown in Fig. 4) and therefore, bit truncation is enabled to achieve additional power savings. 
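The effect of the SW control bit on a chroma word can be sketched as a simple mask in Python. This is a behavioral illustration only: it assumes Cr occupies the high byte and Cb the low byte of a 16-bit word, and that truncated bitcells read back as 0, neither of which is specified in the text. The names are hypothetical.

```python
CR_DROP = 2  # Cr LSBs truncated in D-DASH-III
CB_DROP = 3  # Cb LSBs truncated in D-DASH-III


def truncation_mask(cr_drop: int = CR_DROP, cb_drop: int = CB_DROP) -> int:
    """Build a 16-bit mask with 1s at the bit positions that are kept."""
    cr_keep = (0xFF << cr_drop) & 0xFF
    cb_keep = (0xFF << cb_drop) & 0xFF
    return (cr_keep << 8) | cb_keep


def read_chroma(stored: int, sw: int) -> int:
    """Model a read: SW = 0 behaves conventionally, SW = 1 drops the LSBs."""
    if sw == 0:
        return stored & 0xFFFF
    return stored & truncation_mask()


print(hex(truncation_mask()))        # 0xfcf8
print(hex(read_chroma(0xABCD, 0)))   # 0xabcd
print(hex(read_chroma(0xABCD, 1)))   # 0xa8c8
```

In the hardware, the saving comes from never enabling the truncated bitcells for reads or writes; the mask only models the resulting loss of information.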
SIMULATION RESULTS
To evaluate the effectiveness of the proposed technique, a 32kb SRAM is implemented using a high-performance 45-nm FreePDK CMOS process to meet the multi-megahertz performance requirement of today's mobile video decoders.
Performance
We first evaluate the performance of the proposed D-DASH. Table 2 lists the detailed performance parameters. Both the write delay and the read delay are approximately 0.15 ns, which is sufficient to support high-quality video formats such as 8K Ultra HD [19].
Layout
We also evaluate the area overhead of the proposed design. Fig. 6 shows the layout of D-DASH. For scheme I, there is no area overhead; for scheme II, the area overhead is 0.64%; for scheme III, after careful layout design, we integrate the control circuit with read decoder, without additional area overhead.
Output quality
We use different videos to verify the output quality based on the proposed SRAM scheme. Fig. 7 shows three video outputs as examples. D-DASH-I and D-DASH-II can deliver good video output quality, and D-DASH-III results in negligible video degradation to achieve optimal power efficiency.
Power savings
To evaluate the power efficiency of D-DASH, we model the read bitline (RBL) power consumption of mobile video memory as:
where Pr is the power consumption of a read operation; k is the bit number; t is the SRAM bitcell type; i is the value stored in the SRAM; F(i) is the probability of a bit being 0 or 1, as listed in Table 4; and Z(i) indicates whether the bit is truncated (Z(i) is 0 if the bit is truncated and 1 otherwise). Table 3 lists the read power consumption for the two types of D-DASH bitcells. We also extract the probability of each bit being 0 or 1 from the 12 benchmarks, and the results are listed in Table 4. The bits marked in grey are the truncated bits in D-DASH-III. Table 5 summarizes the power savings of the proposed technique over a standard SRAM design. D-DASH enables power savings from 7.82% (D-DASH-I) to 43.07% (D-DASH-III).
Comparison with prior work
Table 6 compares the performance of D-DASH with the state of the art. D-DASH exhibits the lowest implementation cost (0.06%) with a dynamic power-quality trade-off. D-DASH-I and D-DASH-II exhibit the best video quality, except for reference [11], which, however, is realized with a large area overhead (~14%). D-DASH-III demonstrates the highest power efficiency, except for [24], but the hybrid 8T/10T structure in [24] requires bitcell array modification, resulting in as much as 52% silicon area overhead.
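The read-bitline power model described under Power savings can be approximated with a short Python sketch. The form used here, a sum over bits of the truncation indicator times the probability-weighted read power of the chosen bitcell type, is an assumption inferred from the symbol definitions (Pr, k, t, i, F, Z) and not the paper's exact equation. The power numbers are hypothetical placeholders, not values from Table 3.

```python
from typing import Dict, Sequence, Tuple


def expected_read_power(cell_type: Sequence[int],
                        p_one: Sequence[float],
                        kept: Sequence[int],
                        read_power: Dict[Tuple[int, int], float]) -> float:
    """Expected RBL power for one word.

    cell_type[k]        -- bitcell type t (0 or 1) used for bit k
    p_one[k]            -- probability that bit k stores a 1
    kept[k]             -- Z for bit k: 1 if kept, 0 if truncated
    read_power[(t, i)]  -- read power of a type-t cell storing value i
    """
    total = 0.0
    for t, p1, z in zip(cell_type, p_one, kept):
        per_bit = (1.0 - p1) * read_power[(t, 0)] + p1 * read_power[(t, 1)]
        total += z * per_bit
    return total


# Hypothetical read powers (arbitrary units): a cell is cheap to read
# when the stored value matches its type.
HYPOTHETICAL_POWER = {(1, 1): 0.1, (1, 0): 1.0, (0, 0): 0.1, (0, 1): 1.0}

two_bits = expected_read_power([1, 0], [0.9, 0.2], [1, 1], HYPOTHETICAL_POWER)
one_truncated = expected_read_power([1, 0], [0.9, 0.2], [1, 0], HYPOTHETICAL_POWER)
print(two_bits, one_truncated)
```

Comparing this quantity for an all-type-1 array against a data-aware assignment, with and without truncation, mirrors how the relative savings of the three schemes would be derived from per-bit probabilities.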

CONCLUSION
This paper presented a data-driven flexible mobile video SRAM with negligible silicon area overhead (0.06%) that can be adjusted between three different levels of power-quality trade-off. Based on the data patterns obtained by data-mining techniques, a low-power data-aware readout buffer connection technique is applied to enhance the power efficiency without implementation overhead. Building on this, a bit-flipping technique is used to further improve the power efficiency at the cost of 0.06% area overhead. Finally, a bit-truncation technique is utilized to maximize the power efficiency with negligible video quality degradation. The developed D-DASH provides three flexible low-cost schemes, each with a different power-quality trade-off, to meet the requirements of various applications.
ACKNOWLEDGMENTS
This work was supported in part by NSF under grant CCF-1514780, and the Beijing Municipal Natural Science Foundation under Grant 4152004. 
