Abstract: Approximate computing is an emerging design paradigm that leverages the inherent error tolerance present in many applications to improve their power consumption and performance. Due to the forgiving nature of these error-resilient applications, precise input data is not always necessary for them to produce outputs of acceptable quality. This makes the memory subsystem (i.e., the place where data is stored) a suitable component for introducing approximations in return for substantial energy savings. To this end, this paper proposes a systematic methodology for constructing a quality-configurable approximate DRAM system. Our design is based upon an extensive experimental characterization of memory errors as a function of the DRAM refresh rate. Leveraging the insights gathered from this characterization, we propose four novel strategies for partitioning the DRAM in a system into a number of quality bins based on the frequency, location, and nature of bit errors in each of the physical pages, while also taking into account the property of variable retention time exhibited by DRAM cells. During data allocation, critical data is placed in the highest quality bin (which contains only accurate pages) and approximate data is allocated to bins sorted in descending order of quality, with the refresh rate serving as the quality control knob. We validate our proposed scheme on several error-resilient applications implemented using an Altera Stratix IV GX FPGA-based Terasic TR4-230 development board containing a 1 GB DDR3 DRAM module. Experimental results demonstrate a significant improvement in the energy-quality trade-off compared to previous work and show a reduction in DRAM refresh power of up to 73 percent on average with minimal loss in output quality.

INTRODUCTION
Many applications in the domains of machine learning, multimedia, recognition, computer vision, graphics, etc., exhibit the property of intrinsic error resilience, which is the ability to produce outputs of acceptable quality even when some of their underlying computations are performed in an approximate or inexact manner. Approximate computing leverages this intrinsic error resilience of applications to improve the energy consumption and performance of computing systems that execute them. The error-resilient nature of these applications endows them with the ability to produce good-enough results even if their inputs are noisy or slightly erroneous. This gives rise to the prospect of designing approximate memories, where the strict constraints on data integrity can be relaxed (in a controlled manner) in exchange for large savings in energy consumption.
Although several different memory technologies exist today, DRAM is still the primary choice for main memory in most embedded systems due to its high density, longevity, and low cost. However, data stored in DRAM is not persistent and the DRAM needs to be periodically refreshed to restore the charge that leaks away over time. As a result of these periodic refresh operations, DRAM consumes power during both active and standby modes, which results in a significant energy overhead. Prior work has shown that DRAM can consume up to 30 percent [1], [2], [3] of overall system power, with refresh operations being responsible for 10-50 percent [4] of the total DRAM power. In addition, DRAM refresh is also the source of increased memory access latency and lower throughput [4], [5]. An effective way of reducing DRAM power consumption is to increase the refresh interval to a value higher than the standard 64 ms used by most DRAM integrated circuits (ICs) today. As we will show, doing so provides significant power savings while introducing a small number of bit errors that can be endured by most error-resilient applications.
This paper introduces a new approach for the construction of an approximate DRAM system. We propose the notion of a quality-aware approximate DRAM and develop a novel data allocation scheme for the proposed approximate DRAM. The core idea is that at sub-optimal refresh rates, DRAM physical pages can be split into a number of quality bins based on the characteristics of the errors seen in each page. Approximate data can then be allocated to pages belonging to the bins in decreasing order of quality, ensuring that we always allocate to the least erroneous pages. The location and nature of the bit-errors are obtained through extensive error characterization of off-the-shelf DRAM ICs at various refresh rates. These errors correlate well with the eventual application-level output quality and hence, are used to guide the allocation of application data to DRAM pages based on the output quality specification. Our proposed mechanism is inherently quality configurable since it has the provision of increasing the refresh rate as needed, which increases the number of pages in higher quality bins (with lower errors), leading to better quality. Compared to prior approaches, our work requires only a single refresh interval for the entire DRAM. Therefore, it is simpler to implement and results in a better energy-quality trade-off compared to prior work. Specifically, this paper makes the following contributions:
We perform an extensive characterization of DRAM errors at various sub-optimal refresh rates using off-the-shelf DRAM ICs. We use the characterization results to derive vital insights about bit errors in DRAM; for example, words containing 1 to 0 bit-flips and 0 to 1 bit-flips are mutually exclusive. We also leverage the characterization results to analyze and subsequently alleviate the adverse impact of variable retention time [6], [7], [8] on the overall allocation scheme.
We use the insights obtained from the characterization to guide the construction of a systematic data allocation scheme for storing critical and approximate data in the approximate DRAM. In particular, we propose four novel strategies for constructing quality bins taking into account the frequency, significance, and nature of the bit-errors in DRAM pages. Subsequently, approximate data is allocated to these quality bins according to the specified output quality bound, while critical data is always allocated to fully accurate pages. Thus, these quality bins serve as key components in our quality-configurable approximate DRAM.
We implement and validate our data allocation strategy using a custom memory management unit deployed within the lightweight operating system, μC/OS-II, running on an Altera Stratix IV GX FPGA-based Terasic TR4-230 development board containing a 1 GB DDR3 DRAM module.
We conduct experiments of our data allocation strategy on eight different error-resilient applications taken from the domains of recognition, mining, and synthesis. Our experimental results clearly demonstrate a significant improvement in the energy-quality trade-off, compared to existing approximate DRAM approaches, and show average power savings of up to 73 percent with minimal loss in application quality.
RELATED WORK
Approximate computing has been explored at different layers of abstraction spanning circuits [9], [10], architectures [11], and algorithms [12]. An effective way of exploiting the intrinsic error resilience of applications is by introducing errors in the storage subsystem. In this section, we concentrate on prior work related to approximate memory, particularly approximate DRAM, since that is the focus of our work.
One of the earliest works in the field of approximate memory is Flikker [13], which deals with a low-power mobile DRAM. Flikker first partitions an application into critical and non-critical parts and uses a sub-optimal refresh rate to inject errors in the non-critical portion of the application in exchange for refresh power savings. As a result, Flikker requires the existence of two refresh controls for partitioning DRAM into an accurate section and an approximate section (available in LPDDR DRAM). Sparkk [14] is a hardware-based method of approximating a DRAM where the most significant bits of operands are refreshed at a higher refresh rate than the least significant ones. However, since this requires hardware changes, it is very difficult to realize in practice. Refs. [15], [16] propose simulation models for analyzing the impact of DRAM errors on error-resilient applications. EnerJ [17] devises a method by which programmers can annotate the error-tolerant portions of an application, which can then be placed in approximate SRAM and DRAM. This is an essential requirement for approximate memories as they need to have a notion of non-critical data whose precision can be relaxed. Ref. [18] proposes the design of an experimental platform for evaluating different approaches that aim to reduce DRAM power consumption by controlling the refresh rate. In this case, the experiments were performed using a TI MSP430F2618 microcontroller along with a small-capacity 32 KB DRAM chip, and hence the given observations and analysis have limited scope and applicability. Our work builds upon the concept of critical and non-critical data partitioning described in Refs. [13], [17] and renders a much improved energy-quality trade-off by employing a quality-aware data allocation scheme. Compared to other approaches, our proposed methodology requires only software modifications and minor changes to the DRAM controller and, hence, can be applied to COTS devices.
Note that a preliminary version of this paper appeared in Ref. [19]. Compared to that work, this version includes a mechanism for addressing the challenges posed by the phenomenon of variable retention time in the context of an approximate DRAM. It also provides a comprehensive account of a method for enabling quality configurability in an approximate DRAM, which is necessary to obtain optimal energy savings for a specified application-level quality degradation bound. This paper also shows the variation in the number of DRAM bit-errors across different temperature ranges. Finally, the allocation strategies are implemented using a real lightweight operating system, μC/OS-II, running on an Altera Stratix IV FPGA.
BACKGROUND AND MOTIVATION
This section provides a brief overview of the basic DRAM operation and motivates the importance of refresh-rate reduction in the DRAM.
DRAM Refresh
The fundamental building block of the 2D DRAM array is a DRAM bitcell, which is made up of a single capacitor and an access transistor as shown in Fig. 1. A binary data value of 1 or 0 is identified depending on whether the DRAM cell capacitor is in a fully charged or discharged state. The DRAM capacitor loses charge over time due to various factors predominantly related to the non-ideality of the access transistor, such as sub-threshold leakage and gate-induced drain leakage. This charge leakage will eventually result in the loss of data after a certain time interval (the retention time) and hence, the charge needs to be replenished (refreshed) periodically. A DRAM row is usually refreshed by simply activating it using the appropriate wordline. Commercial DRAM modules have a worst-case retention time of 64 ms, which is determined by the leakiest cell in the entire DRAM array. The memory controller maintains a single refresh rate for simplicity and refreshes each row every 64 ms (the refresh interval) to guarantee data integrity. Conventionally, there are two methods for refreshing the DRAM module: (i) Auto Refresh, and (ii) Self Refresh. In auto refresh, the external memory controller is responsible for issuing refresh commands regularly to the DRAM. On the other hand, the self refresh operation takes place within each DRAM module independent of the memory controller when the system is in standby mode. Self refresh is an extremely efficient and low-power refresh mechanism since it does not involve the external clock; however, it incurs a considerable overhead for switching in and out of the refresh mode.
Motivation for Refresh Rate Reduction
Although refresh operations are imperative from the standpoint of correctness, they also have an adverse impact on the overall energy efficiency and performance. Energy efficiency declines due to the expensive periodic activation of individual rows as well as the increase in energy consumption due to a longer execution time. The latter component is a by-product of the performance penalty that arises due to the auto refresh command. During active mode, a refresh command is normally issued to an entire rank and all the banks within that rank are unable to service any memory requests until the completion of the command. This not only increases the memory access latency, but also causes row misses due to closing previously activated rows, thereby decreasing the total memory throughput. Consequently, reducing the refresh rate improves both the overall memory power consumption and the overall system performance. Existing literature also highlights the importance of refresh-rate reduction in DRAM; e.g., [4] predicts that due to the ever-increasing capacity of DRAM, these problems will be aggravated further, with the refresh power becoming the most dominant power component and throughput losses reaching nearly 50 percent of the useful time. Another work [5] reveals that due to the increasing DRAM density and device variability, coupled with high-temperature operation and row buffer size reduction, the refresh duty cycle will continue to increase, resulting in longer periods of memory unavailability.
Motivation for Quality-Aware Allocation
A direct consequence of refreshing DRAM at intervals larger than the standard 64 ms is the occurrence of bit errors throughout the memory module. Fig. 2 shows the error map of a randomly selected 64 MB contiguous chunk of DRAM (representative of the entire DRAM) when refreshed at various intervals, where the colormap denotes the total number of bit errors occurring in each physical page. Since we assume the page size to be 1 KB, there are 65,536 rows altogether, as shown in Fig. 2. The error map reveals that bit flips occur at random locations in arbitrary pages, resulting in unpredictable errors and equally unpredictable degradation in application quality at the output. Hence, random allocation of data to the DRAM when it is refreshed at higher intervals can lead to significantly degraded application-level quality at the output and can even cause complete application failure. This necessitates the construction of a systematic method that can track the errors in different pages and allocate the application data according to their criticality. This serves as the primary motivation for constructing a quality-aware memory module where we split the entire physical memory into a number of quality bins on the basis of the error characteristics of pages. The quality bins can be used to efficiently allocate pages of different applications that can tolerate different amounts of error to generate acceptable output quality. For critical data allocation, the accurate pages (denoted by blue) can be coalesced logically into a larger contiguous chunk of completely accurate memory by keeping track of page addresses devoid of any errors. Similar records of page addresses are also kept for other approximate pages by noting the frequency, significance, and nature of the bit flips in each of them.
CHARACTERIZING DRAM ERRORS
We generate exhaustive retention time and error profiles of each DRAM module at different granularities to gain useful insights about the frequency, significance, and nature of the bit-flips when the refresh interval is modulated across a range of sub-optimal values from 1 s to 100 s. Lower refresh intervals are ignored as the results can be trivially extrapolated. It is important to note here that the refresh interval is modulated at a coarse granularity to minimize the effect of variable retention time [7]. The characterization was performed using a Terasic TR4-230 development board [20] based on the Altera Stratix IV GX FPGA consisting of a Hynix 1 GB SODIMM DDR3 DRAM module at 30 °C. In order to verify that our observations hold true universally, we performed each experiment with eight different DRAM modules (selected from five different DRAM manufacturers). These comprise 2× Hynix, Kingston, Micron, Elpida, and 3× Samsung modules. The characterization results derived from them exhibited significant variations both within and across vendors. However, due to lack of space we present the results for only two DRAM chips, one from Hynix and one from Kingston (shown later in Section 6). Note that for all our experiments we use the characterization results derived from one of the Hynix modules since it proved to be the most error-resilient. We programmed a DDR3 DRAM memory controller on the FPGA and it is used in conjunction with the soft-core Nios II processor [21] to operate the DRAM module. In addition, we also created a custom slave running on the processor which can instruct the memory controller to start and stop the auto refresh and self refresh operations. For proof of concept, we only used self refresh as the method for refreshing the DRAM modules in all our experiments.
DRAM characterization was performed sequentially as described below:
1) First, we write data containing all 1s, all 0s, or other data patterns (in separate experiments) to the entire DRAM and subsequently start self refreshing the DRAM at 64 ms so that 100 percent of the data is retained.
2) Next, the custom slave disables the self refresh, puts the DRAM into the lowest power mode, and waits for the selected refresh interval (t_r), determined by a precise timer controlled directly by the FPGA hardware, before it restarts the refresh. This has exactly the same result as issuing a refresh command after every t_r time interval.
3) This process is repeated a number of times before reading the values back from each memory location. The acquired data is then compared to the original data to check for any corruption.
4) We repeat this experiment for multiple refresh intervals and evaluate the retention time, t_r, of a DRAM cell to be the maximum refresh interval for which it can retain the stored data.
We characterize the entire DRAM to evaluate the total number of bytes, words, and pages that are erroneous at each refresh interval. Fig. 3 shows that the error rate, i.e., the fraction of erroneous bytes, words, and pages, increases with the refresh period. Note that a cell leaks to either 0 or 1 depending on whether it is a true cell or an anti-cell [6]. Without any loss of generality, each page is considered to be a contiguous block of 1 KB. The plot appearing on the left side of Fig. 3a shows that the number of 1 to 0 flips is almost equal to the number of 0 to 1 flips. However, the magnified version on the right side reveals that the number of 1 to 0 flips is slightly higher than the number of 0 to 1 flips. Fig. 3b presents the overall error rate at different refresh intervals by combining both 0 to 1 and 1 to 0 bit flips. As expected, the page error rate is larger than the word error rate, which itself is larger than the byte error rate. This is because, although the number of errors remains constant, the total number of pages is less than the total number of words, which in turn is less than the total number of bytes in the entire DRAM. It also shows that even if the refresh interval is increased by almost 1000x (t_r = 60 s), the fraction of erroneous pages is only 20 percent, which decreases further to less than 0.2 percent for bytes and words. This demonstrates the immense potential for refresh-rate reduction in the DRAM module.
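As an illustration, the comparison in step 3 and the byte, word, and page error counts could be sketched in Python as follows. This is a minimal sketch and not the code that ran on the FPGA platform: the names `compare` and `ErrorStats` are hypothetical, the 4-byte word size is an assumption, and the 1 KB page size follows the text.

```python
from dataclasses import dataclass

PAGE_BYTES = 1024  # 1 KB page, as assumed in the text
WORD_BYTES = 4     # assumption: 32-bit words


@dataclass
class ErrorStats:
    flips_1_to_0: int = 0
    flips_0_to_1: int = 0
    bad_bytes: int = 0
    bad_words: int = 0
    bad_pages: int = 0


def compare(written: bytes, read: bytes) -> ErrorStats:
    """Compare read-back data with the pattern that was originally written."""
    if len(written) != len(read):
        raise ValueError("buffers must have the same length")
    stats = ErrorStats()
    bad_words, bad_pages = set(), set()
    for i, (w, r) in enumerate(zip(written, read)):
        diff = w ^ r  # bits that changed in this byte
        if diff == 0:
            continue
        stats.bad_bytes += 1
        stats.flips_1_to_0 += bin(diff & w).count("1")  # was 1, read back as 0
        stats.flips_0_to_1 += bin(diff & r).count("1")  # was 0, read back as 1
        bad_words.add(i // WORD_BYTES)
        bad_pages.add(i // PAGE_BYTES)
    stats.bad_words = len(bad_words)
    stats.bad_pages = len(bad_pages)
    return stats
```

Dividing each count by the total number of bytes, words, or pages in the buffer gives the corresponding error rate for one refresh interval; repeating the comparison for every interval yields curves of the kind described for Fig. 3.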
We also plotted an error distribution chart showing the locations of the bit and byte errors in each word as well as the locations of word errors in each page as illustrated in Fig. 4 . The plot shows that errors occur randomly without any preference to a particular location or range of locations. This leads us to the conclusion that bit flips occur due to random variations in the DRAM and have no relation to its internal hardware implementation. Note that the DRAM characterization step also plays an important role in conceiving the four strategies used in the construction of quality bins described in Section 5.2. 
DESIGN OF APPROXIMATE DRAM
This section explains the proposed technique for constructing a quality configurable DRAM and compares it to the existing methods.
High-Level Overview
Retention-aware data placement is an effective technique to reduce refresh power consumption in DRAM, as demonstrated in some of the earlier works such as RAPID [22] and RAIDR [4]. However, by design, these techniques are oblivious to the inherent error-resilient nature of certain applications and therefore cannot maximize the energy and performance improvements possible for these approximate applications. Flikker [13] is one of the earliest works that exploits errors caused by sub-optimal refresh rates in mobile DRAM to reduce memory power consumption. Our work builds upon the aforementioned strategies by proposing a systematic design methodology that can further improve the energy-quality trade-off associated with the construction of an approximate DRAM for various error-resilient applications. Our work differs from existing works such as Flikker and RAPID in the following ways:
Unlike Flikker, our work does not make any assumptions about the type of DRAM module. Flikker is only applicable to LPDDR-type DRAM modules containing the Partial Array Self Refresh (PASR) feature, where the DRAM is partitioned into two separate sections, critical and non-critical, that are refreshed at different rates. Flikker sets the refresh interval at the standard value of 64 ms for the critical portion and refreshes the non-critical part at a sub-optimal refresh interval of 1 s. Compared to Flikker, our work utilizes only a single refresh control, which makes the design methodology generic and makes the overall hardware as well as the external memory controller simpler.
Another point of difference is that Flikker does not take into account individual physical page characteristics during data allocation, whereas we allocate data to physical pages sorted in decreasing order of quality at the selected refresh rate. In our case, the refresh rate is initially fixed only on the basis of the total amount of critical data and can be further fine-tuned based on the quality requirement of applications.
In contrast, RAPID sets the refresh rate according to the current DRAM utilization, as shown in Fig. 5, and makes no distinction between critical and non-critical data. The efficacy of RAPID is inversely related to DRAM utilization and is totally ineffective when a high percentage of the memory is in use, a common case in the smartphones and tablets of today. Our work does not suffer from such drawbacks and is applicable to the systems of today. Other mechanisms such as RAIDR have multiple refresh rates depending on the retention capacities of rows, and hence require complex hardware and software support.
In a nutshell, RAPID and RAIDR are basically retention-aware mechanisms where the physical pages are sorted according to the worst case retention time of a bit in each page, whereas our work is a quality-aware mechanism where the physical pages are sorted according to the error characteristics of each individual page such as the total number of bit flips, nature of bit flips, and the locations of bit flips in each page.
Quality Bins
Fig. 5 represents the basic idea behind our proposed design technique and compares it to the other two techniques. In the case of Flikker, approximate data is allocated to the DRAM physical pages in a sequential order where each page contains a random number of errors. In our scheme, we first characterize the entire DRAM for a range of refresh intervals (t_r) using the method described in Section 4. Subsequently, for each t_r, we sort all the DRAM pages in ascending order of the number of errors per page. Data is then allocated to each page in the sorted order, i.e., data is first assigned to the least erroneous pages and later to pages with higher errors. This ensures that for any selected t_r, the stored data always incurs the least amount of errors. In the proposed scheme, the refresh interval is first fixed to the maximum possible value, t_c, that can ensure that the critical data of each application can be allocated entirely to fully accurate pages. Usually, all physical pages whose retention times are greater than or equal to t_c are completely accurate. Fig. 3 shows that even a considerably high value of t_c can produce a large number of accurate pages sufficient for all the applications. Next, approximate data is allocated to either accurate (if free) or erroneous pages sorted according to the number of errors. In this case, the critical data size provides a hard upper bound to the value of the refresh interval, which can be further fine-tuned on the basis of the output quality specification. In general, reducing the refresh interval increases the number of accurate pages, which leads to better output quality. It is important to note that our work leverages the concepts described in [13], [17] to split an application into critical data and non-critical (or approximate) data. We explain the partitioning procedure in detail while describing the test applications in Section 8. Note that when the entire set of loaded applications is critical (requiring all allocated data to be 100 percent accurate), our strategy will result in the same energy savings as that of RAPID. However, the energy savings will still be greater than (and in very few cases equal to) that of Flikker, since Flikker can impose sub-optimal refresh rates on only select fixed-capacity memory partitions.
Fig. 5. Proposed allocation scheme and its comparison to existing works RAPID [22] and Flikker [13]. Here, (a) represents the baseline DRAM module, (b) shows the same module when data is allocated in pages sorted in decreasing order of their retention times, (c) shows RAPID, which uses the retention-aware scheme as the basis, (d) represents Flikker, and (e) represents the current work, where the refresh interval is tuned based on critical data and specified application quality.
Upon characterization, all DRAM physical pages are sorted on the basis of different parameters such as the frequency, significance, and nature of the bit-flips, and subsequently, the sorted pages are allotted to different error bins called quality bins. These quality bins are sorted according to decreasing quality (or increasing number of errors) using various strategies. Subsequently during allocation, data is placed in pages belonging to these quality bins in decreasing order of quality. The first quality bin (qbin0) contains only error-free pages, and our selected refresh interval ensures that qbin0 can accommodate the entire size of critical data. Approximate data is then put in qbin0 (if free), followed by qbin1, qbin2, and so on. We propose four strategies for allocating the pages into different quality bins as shown in Fig. 6 . The primary objective of these strategies is to determine how the bit errors correlate to the notion of output quality.
Sorting Strategies
1) Strategy 1: In the first strategy (Fig. 6a), the pages are sorted according to the total number of word errors present in each page. A word error arises if a word has at least a single bit flip. A larger number of word errors denotes higher degradation in quality and hence, those pages are assigned to lower quality bins. Note that we could have chosen bytes/halfwords instead of words as the error granularity. This decision depends entirely on the nature of the data to be stored and can be trivially extended for either of those cases.
2) Strategy 2: The second strategy (Fig. 6b) is more fine-grained and sorts the pages on the basis of the total number of bit errors in each page. Note that for Strategies 1 and 2, errors include both 0 to 1 and 1 to 0 bit flips. Allocation to the quality bins is done according to the total number of errors. For example, qbin0 consists of pages with no word or bit errors and hence has the highest quality, qbin1 has a single word or bit error, and so on. If we denote a quality bin as qbinN, then a higher value of N denotes a higher degradation in application quality since it will consist of a larger number of word or bit errors.
3) Strategy 3: This strategy (Fig. 6c) is even more fine-grained and refined than the previous two as it incorporates the significance of bit error positions in each erroneous word by assigning weights based on the bit error locations within the word. Higher weights correspond to errors at the most significant bits (MSB) and lower weights correspond to the least significant ones (LSB). The weights are added for all the words in a page and the pages are sorted into different quality bins depending on this sum. For example, in our case we assigned weights of 1, 4, 16, and 64 when errors exist at the first, second, third, and fourth byte of each word. The weighted sum acts as the quality metric for each page and is called the Bit/Byte Weighted Error or BW_m. A higher value of BW_m represents a higher numerical change of the stored data and hence those pages are placed in lower quality bins.
4) Strategy 4: Our last strategy (Fig. 6d) gives preference to pages with 0 to 1 flips over 1 to 0 flips or vice versa depending on the nature of the test dataset. Within each type, the pages are sorted by Strategy 2. One sample use case of Strategy 4 is when the data is sparse and contains a large number of zeros (or very small values). In this case, Strategy 4 allots pages with 1 to 0 bit flips to higher quality bins since they are likely to incur less error compared to pages with 0 to 1 errors.
Fig. 7 depicts the total number of pages in each quality bin when we characterize the DRAM module using all four strategies at a sample t_r = 60 s. The definition of each qbin is given in the table shown in Fig. 8. Fig. 7 shows that for all the strategies, more than 75 percent of the pages appear in qbin0 and are fully accurate. The next lower quality bin, qbin1, contains around 15 percent of the total pages. The cumulative plot in part (c) shows the fraction of pages having quality equal to or better than each qbin at t_r = 60 s. About 99 percent of the total pages belong to the high quality bins qbin0 to qbin5, thus resulting in negligible quality loss. One interesting observation is that the quality bins for Strategies 1 and 2 are almost identical, which can be attributed to the fact that in almost all cases word errors are caused by a single bit error. Hence, Strategies 1 and 2 result in an almost identical quality-efficiency trade-off. These four strategies are meant to provide four different ways of sorting physical pages suitable for different types of error-resilient applications. However, in order to gauge which sorting strategy is best for a particular application, an extensive analysis of the nature and degree of error-resiliency in each application is required, which is left for future work.
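The per-page quantities used by the four strategies can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names `page_metrics`, `sort_key`, and `rank_pages` are hypothetical, words are assumed to be 4 bytes with the first byte being the least significant, and Strategy 3 is interpreted as adding one byte weight (1, 4, 16, or 64) per erroneous byte.

```python
WORD_BYTES = 4                 # assumption: 32-bit words
BYTE_WEIGHTS = (1, 4, 16, 64)  # weights from the text; first byte assumed to be the LSB


def page_metrics(written: bytes, read: bytes) -> dict:
    """Per-page error counts; the page length must be a multiple of WORD_BYTES."""
    word_errors = bit_errors = weighted = ones_to_zeros = zeros_to_ones = 0
    for start in range(0, len(written), WORD_BYTES):
        word_bad = False
        for k in range(WORD_BYTES):
            w, r = written[start + k], read[start + k]
            diff = w ^ r
            if diff:
                word_bad = True
                bit_errors += bin(diff).count("1")
                weighted += BYTE_WEIGHTS[k]  # Strategy 3 sum (BW_m)
                ones_to_zeros += bin(diff & w).count("1")
                zeros_to_ones += bin(diff & r).count("1")
        word_errors += int(word_bad)
    return {"word_errors": word_errors, "bit_errors": bit_errors,
            "weighted": weighted, "ones_to_zeros": ones_to_zeros,
            "zeros_to_ones": zeros_to_ones}


def sort_key(m: dict, strategy: int, prefer: str = "1to0") -> tuple:
    """Smaller keys mean higher quality (lower-numbered qbin)."""
    if strategy == 1:
        return (m["word_errors"],)
    if strategy == 2:
        return (m["bit_errors"],)
    if strategy == 3:
        return (m["weighted"],)
    if strategy == 4:
        # Pages without flips in the disfavoured direction come first,
        # then pages are ordered by Strategy 2 within each group.
        bad = m["zeros_to_ones"] if prefer == "1to0" else m["ones_to_zeros"]
        return (bad > 0, m["bit_errors"])
    raise ValueError("strategy must be 1, 2, 3, or 4")


def rank_pages(pages: dict, strategy: int, prefer: str = "1to0") -> list:
    """Return page identifiers in decreasing order of quality."""
    return sorted(pages, key=lambda p: sort_key(pages[p], strategy, prefer))
```

Pages that share the same key would fall into the same quality bin; the exact bin boundaries are those defined in the table of Fig. 8, which this sketch does not reproduce.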
Overall Methodology
Algorithms 1 and 2 describe the overall procedure of data allocation. Algorithm 1 describes the creation of quality bins, where we assume that the memory characterization has already been performed and the error characteristics of the DRAM pages are available. Without any loss of generality, we have used Strategy 1 for constructing the quality bins. It basically counts the total number of word errors in each page (lines 14-18) and allocates each page to a particular qbin depending on this number. The pseudo-code for allocation in Algorithm 2 shows that the refresh interval t_c is selected on the basis of the critical data D and the specified output quality. It shows that even after guaranteeing a sufficient number of accurate pages for critical data, the refresh interval can be iteratively refined until we meet the specified quality requirement for the application. This provides the feature of quality configurability in our proposed approximate DRAM design. The initial value of t_c can be selected from the t_r versus qbin plot in Fig. 9 generated after the characterization step. Algorithm 2 shows that allocation of non-critical (or approximate) data first starts from free pages available in qbin0 and then moves to pages belonging to the next qbin (say qbin1) if free pages are no longer available in qbin0, and so on. We start allocating the non-critical data from the highest-quality qbin possible as it guarantees the least quality degradation at a particular refresh interval. The app_quality stated in Algorithm 2 can be calculated in two ways. One way is to split the entire operation into two phases: 1) a training phase and 2) a testing phase. In the training phase, the output qualities are pre-calculated for each qbin using a sample input dataset for each benchmark after assigning the approximate data repeatedly to each of the qbins.
Subsequently, based on the output quality specification, the refresh rate can be modified to ensure that we have the desired number of pages in the required qbins during the testing phase. A second way to check the app_quality is to periodically perform an output quality check after the evaluation of a fixed number of inputs during runtime and adapt the refresh rate according to the periodic quality checks. Note that in our evaluation we followed the first method for obtaining the app_quality. We wish to emphasize here that the application quality only depends on the non-critical data and the errors occurring in it, i.e., higher the frequency/significance of bit-errors, higher is the output quality degradation. On the contrary, critical data does not play any role in the output quality since it can only be allocated to qbin0 (quality bin with all accurate pages) irrespective of where the approximate/non-critical data is allocated, otherwise it will lead to a catastrophic failure of the application. Hence, when we check the app_quality for different quality bins during the training phase, we do it by allocating only the approximate data to different qbins.
Finally, Fig. 9 provides experimental values for a 1 GB DRAM module when Strategy 1 is employed as the sorting strategy. It depicts the total capacity of each of the quality bins at different refresh intervals. For the sake of explanation, assume that we have at least 1 GB of data to be allocated to the DRAM, of which 70 percent is critical data and the remaining 30 percent is approximate. Using this plot, our strategy sets the refresh interval at around 68 s (corresponding to $t_c$ in Algorithm 2), which ensures that we have 700 MB of fully accurate pages, as represented by the qbin0 size (marked as Min Bound in Fig. 9). The approximate data can then be placed in the quality bins qbin1 and qbin2 (denoted by the longer white arrow). If the output quality needs to be improved, our framework automatically decreases the refresh interval so that the number of accurate pages increases. Data that was previously allocated to qbin1 and qbin2 can now be put in qbin0 and qbin1, respectively, resulting in improved quality (denoted by the shorter white arrow). The appropriate refresh interval can be selected by analyzing the impact of the quality bins on output quality when the test applications are executed using a set of training inputs.
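The selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the capacity table `QBIN_CAPACITY_MB` is hypothetical (the measured capacities are those plotted in Fig. 9), and the function names are ours rather than part of Algorithm 2.

```python
def select_refresh_interval(qbin_capacity, critical_mb):
    """Return the largest refresh interval whose qbin0 holds all critical data."""
    feasible = [t for t, caps in qbin_capacity.items() if caps[0] >= critical_mb]
    if not feasible:
        raise ValueError("no refresh interval offers enough accurate pages")
    return max(feasible)


def place_approximate(caps, critical_mb, approx_mb):
    """Fill the free space bin by bin, starting from qbin0.

    Returns the number of MB of approximate data placed in each qbin.
    """
    free = list(caps)
    free[0] -= critical_mb  # critical data occupies qbin0 first
    placed = []
    for space in free:
        take = min(space, approx_mb)
        placed.append(take)
        approx_mb -= take
    if approx_mb > 0:
        raise ValueError("not enough pages for the approximate data")
    return placed


# Hypothetical capacities in MB of (qbin0, qbin1, qbin2), keyed by the
# refresh interval in seconds. These are NOT the measured values of Fig. 9.
QBIN_CAPACITY_MB = {
    40: [850, 120, 54],
    60: [760, 180, 84],
    68: [700, 200, 124],
    80: [620, 240, 164],
}

t_c = select_refresh_interval(QBIN_CAPACITY_MB, critical_mb=700)
placement = place_approximate(QBIN_CAPACITY_MB[t_c], critical_mb=700, approx_mb=300)
```

With these made-up capacities the sketch picks 68 s and places the approximate data in qbin1 and qbin2, mirroring the example in the text. Tightening the quality target would correspond to re-running the placement at a smaller interval.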
Algorithm 1. Pseudo-code for Quality Bin Formation from Error Characterization
Input: P: Set of physical pages in DRAM; T: Set of refresh intervals in increasing order; $P_t$: Set of physical pages with no bit flips at refresh interval t; $P_t^*$: Set of physical pages with at least 1 bit flip at refresh interval t; $E_t^n$: Error/quality bins for pages, where a lower value of n represents a higher quality level at refresh interval t
Output: Characterized DRAM
1 $P_t = \{\}$; $P_t^* = \{\}$; $E_t^n = \{\}$;
2 foreach $t \in T$ do
3   foreach $p \in P$ do
4     if check_error(p) then
5       $P_t^* = P_t^* \cup \{p\}$
6     else
7       $P_t = P_t \cup \{p\}$;
8       $E_t^0 = E_t^0 \cup \{p\}$
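As an illustration of the binning step, the following Python sketch sorts pages into quality bins by their word-error count, in the spirit of Strategy 1. The thresholds in `QBIN_UPPER_BOUNDS` and the example error counts are hypothetical; the actual bin definitions are those tabulated in Fig. 8.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical upper bounds on word errors per page for qbin0, qbin1, ...
# (an assumption for illustration; see Fig. 8 for the real definitions).
QBIN_UPPER_BOUNDS = [0, 1, 2, 4, 8]


def qbin_of(word_errors, bounds=QBIN_UPPER_BOUNDS):
    """Index of the first bin whose bound covers the error count.

    Pages with more errors than the last bound fall into one extra
    overflow bin with index len(bounds).
    """
    return bisect_left(bounds, word_errors)


def form_quality_bins(word_errors_per_page, bounds=QBIN_UPPER_BOUNDS):
    """Map qbin index -> sorted page numbers for one refresh interval."""
    bins = defaultdict(list)
    for page, errors in sorted(word_errors_per_page.items()):
        bins[qbin_of(errors, bounds)].append(page)
    return dict(bins)


# Word-error counts observed for six pages at one refresh interval.
errors = {0: 0, 1: 0, 2: 1, 3: 3, 4: 0, 5: 9}
bins = form_quality_bins(errors)
```

Running the same function once per refresh interval in T would yield the per-interval bins that the allocation step consumes.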
Quality Configurable Execution
One of the key features of our proposed allocation mechanism is its ability to adapt itself to a specified output quality. Quality configurability is essential since the intrinsic resilience of an application varies significantly depending on the context in which its outputs are consumed as well as on the nature of the inputs to the application. This requires the memory to operate at different quality levels to optimize energy consumption. To achieve quality configurable execution, we first determine the lowest quality bin that satisfies a given output quality specification. This quality bound is expressed as the normalized quality degradation with respect to the case when the data is allocated to qbin0. The quality bin can be assessed by allocating a set of training data to the physical pages belonging to different quality bins and noting the application-level quality degradation corresponding to each quality bin. Subsequently, during the evaluation phase, the lowest qbin (say $q_i$) that maintains the specified quality bound is selected, and the test data can be allocated to free pages belonging to any bin from qbin0 to $q_i$ (filled sequentially, starting from qbin0). In case there is an insufficient number of free pages belonging to these quality bins, the refresh interval can be increased dynamically to accommodate the data (till it violates the critical data condition). Algorithm 3 describes this method of constructing a quality configurable DRAM in the form of pseudo-code.
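The bin selection step described above can be sketched as follows. The function name and the training degradations are hypothetical; the sketch only assumes that the degradation measured for each qbin is normalized to the qbin0 case.

```python
def worst_case_qbin(degradation_by_qbin, bound):
    """Lowest-quality qbin such that it and every better bin meet the bound.

    degradation_by_qbin[q] is the quality loss (in percent, relative to
    qbin0) measured on training data allocated entirely to qbin q.
    """
    worst = 0
    for q, loss in enumerate(degradation_by_qbin):
        if loss > bound:
            break
        worst = q
    return worst


# Hypothetical training results for qbin0..qbin4, in percent.
training_loss = [0.0, 0.5, 1.8, 4.0, 6.5]
q_i = worst_case_qbin(training_loss, bound=2.0)
```

Test data would then be placed in free pages of qbin0 through `q_i`, in that order.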
Algorithm 2. Pseudo-Code for Allocation
Input: D: Set of pages that need to be allocated to the DRAM; T: Set of refresh intervals in increasing order; E 
Executing Multiple Applications Simultaneously
The allocation strategy can be easily extended to the case where multiple applications run simultaneously. For a multi-application scenario, the total critical data to be allocated is the sum of the sizes of the critical data of the individual applications. Hence, in this case we need to ensure that we have at least as many accurate pages (pages in qbin0) as required by the critical data of all applications. Algorithm 2 ensures this with a minor tweak, where the set of pages for critical data is taken to be the critical pages of all the applications to be allocated. It is more interesting to see how the allocation methodology works for the non-critical or approximate data. Due to different application characteristics or different quality requirements, the applications will end up having different worst-case qbins. As long as there is a sufficient number of free pages in the qbins (from qbin0 to the worst-case qbin) for each application, the methodology works as before. If a sufficient number of pages is not present in the required qbins, the refresh interval can be further decreased to increase the number of pages in higher qbins. However, we have to begin with the maximum refresh interval at which all of the critical data fits in qbin0 pages. The refresh interval can then be reduced further to meet the quality requirements. This is reflected in Algorithm 3, where we first check whether we have a sufficient number of non-critical physical pages (for all applications) in the qbins and, if not, the refresh interval is reduced further. Note that in the scenario where the data types used by the applications are different, the system designer can always leverage Strategies 1, 2, and 4, which use non-significance (non-position) based metrics, such as the total number of word errors or bit errors per page, to construct the quality bins.
Strategy 3 can be avoided in such cases as it will require extra overhead and complexity to keep track of the different quality bins formed by each variable type (such as int, float, double).
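The multi-application check described in this subsection can be sketched as a search over refresh intervals, from the largest downward. This is a simplified illustration, not Algorithm 3 itself: the capacity table, the application tuples, and the greedy serving order are all assumptions made for the example.

```python
def feasible(caps, apps):
    """Check one refresh interval.

    caps[q] is the number of free pages in qbin q; apps is a list of
    (critical_pages, approx_pages, worst_case_qbin) tuples.
    """
    free = list(caps)
    free[0] -= sum(critical for critical, _, _ in apps)
    if free[0] < 0:
        return False  # critical data of all applications must fit in qbin0
    # Serve the application with the strictest worst-case qbin first.
    for _, approx, worst in sorted(apps, key=lambda app: app[2]):
        for q in range(worst + 1):
            take = min(free[q], approx)
            free[q] -= take
            approx -= take
        if approx > 0:
            return False
    return True


def pick_interval(capacity_by_interval, apps):
    """Largest refresh interval that accommodates every application."""
    for t in sorted(capacity_by_interval, reverse=True):
        if feasible(capacity_by_interval[t], apps):
            return t
    return None


# Hypothetical free-page counts for qbin0..qbin3 at two refresh intervals (s).
capacity = {60: [800, 150, 50, 24], 40: [900, 90, 25, 9]}
relaxed = pick_interval(capacity, [(500, 200, 1), (250, 50, 3)])
strict = pick_interval(capacity, [(500, 220, 1), (250, 50, 3)])
```

In this made-up example, a slightly larger approximate footprint for the first application forces the refresh interval down from 60 s to 40 s.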
Variable Retention Time
Due to the phenomenon called variable retention time (VRT) [6], [7], [8], a DRAM cell can exhibit multiple retention times (or states) randomly over different characterization rounds when using a refresh interval higher than 64 ms. This can be a potential impediment to our allocation strategy, since a particular physical frame can now belong to arbitrary quality bins even for a fixed refresh interval. The situation becomes even more precarious when a qbin0 page exhibits VRT, since that may lead to a crash of the entire application due to incorrect allocation of critical data. This issue can be addressed to a great extent by enhancing our profiling mechanism. First, we repeat the characterization process for each refresh interval over several rounds (at least 100 times) for a sufficiently long period (3-4 days) [6], [8] and then take a union of the bit errors that occur at each bit position in each page across those rounds. This process is described in Fig. 10, where x and y denote two distinct rounds of characterization for a refresh interval $t_r\{m\}$. The final quality bin assigned to a page depends on the maximum number of bit errors that can possibly occur in the page. This is obtained by performing the OR operation (union) among the positions of the bit flips occurring in the page over all the characterization rounds. In Fig. 10, for illustration purposes, we consider only the higher quality bins.
Note that for a particular refresh interval, we only need to handle the case when VRT causes a bit flip at a lower refresh interval. The case where VRT causes a cell to have a higher retention time can be safely ignored. This is integrated into our enhanced profiling mechanism, where the quality bins for a particular refresh interval are derived by taking the union of the error bits occurring in a page at different refresh periods cumulatively, starting from the lowest characterization period up to the current one. The underlying mechanism is the same as the one presented in Fig. 10 (annotated within brackets), except that, instead of OR-ing pages only across different rounds, we now also OR the pages across different refresh intervals. This ensures that any cell that exhibits VRT is counted as a bit error for all refresh intervals equal to or greater than its lowest retention state. In this example, $t_r\{n\}$ and $t_r\{n+1\}$ represent two consecutive values in our pre-determined set of refresh intervals, and $q^{v}bin_{t_r}$ denotes the final qbin of a page after taking into account the effect of VRT. Additionally, experiments have shown that a VRT cell usually has retention times that do not fluctuate over a wide range [6], [23]; hence, performing the characterization at widely separated refresh intervals (separated by 10 s) also helps mitigate this problem to some extent. Moreover, at higher refresh intervals (more than 20 s), a large portion of the DRAM VRT cells occupy their lower retention states [23], thus enabling us to identify most of the possible bit failures. Fig. 11 depicts the change in the number of pages in different quality bins as a result of our proposed strategy. It shows that the number of pages in qbin0 decreases, while the number in quality bins qbin1 and above increases, due to the VRT phenomenon. Note that VRT does not impact the nature of the bit flips (i.e., whether the flips are 0 to 1 or 1 to 0).
For most DRAM modules, it was sufficient to perform 100 characterization rounds over 3-4 days to identify all VRT cells since the number of unique VRT cells saturated within that time frame.
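The two OR steps of the enhanced profiling can be sketched with per-page integer bitmasks, where a set bit marks a position that flipped. The data layout and function names are assumptions made for illustration, not the format of our characterization output.

```python
def union_over_rounds(rounds):
    """OR the per-page bit-error masks seen in repeated rounds at one interval."""
    merged = {}
    for masks in rounds:
        for page, mask in masks.items():
            merged[page] = merged.get(page, 0) | mask
    return merged


def cumulative_over_intervals(masks_by_interval):
    """For each interval (ascending), OR in the errors seen at all shorter ones."""
    result, running = {}, {}
    for t in sorted(masks_by_interval):
        for page, mask in masks_by_interval[t].items():
            running[page] = running.get(page, 0) | mask
        result[t] = dict(running)
    return result


def bit_errors(mask):
    """Worst-case number of erroneous bits recorded in a mask."""
    return bin(mask).count("1")


# Two rounds (x and y) at a 20 s interval; page 7 fails at different bits.
round_x = {7: 0b0001}
round_y = {7: 0b0100, 9: 0b0010}
merged_20s = union_over_rounds([round_x, round_y])

# Page 7 also showed a VRT-induced flip at a shorter, 10 s interval.
profile = cumulative_over_intervals({10: {7: 0b1000}, 20: merged_20s})
```

The page is then binned according to `bit_errors` of its cumulative mask, which can only move it toward a lower quality bin, never a higher one.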
DISCUSSIONS
This section provides an account of the additional hardware and software support required for run-time operation as well as the characterization and memory overhead required for storing the different quality bins.
Additional Hardware-Software Support
The first step in our approach is to characterize each DRAM module according to a suitable strategy over a range of refresh intervals and recording the output from each run. Custom scripts are then used to parse the characterization output and generate the quality bins. Next is the allocation step, where partitioning of each application into critical (CD) and non-critical (ND) or approximate data is performed along the lines of the methods described in [13] , [17] . An application can be divided into code, heap, stack, and global data. Data is said to be critical if even the smallest modification to it can cause the application to digress from its purpose and lead to a catastrophic failure. On the other hand, errors in approximate data only cause minor quality degradation at the output. Our design framework requires a custom allocator which will map the user annotated critical and approximate data into virtual pages (VP) indicated by a critical bit for each page. Mapping of virtual to physical pages (PP) is usually accomplished by the Operating System (OS) with the help of a page table. In our case, the OS has the added responsibility of assigning virtual pages to physical frames belonging to specific quality bins depending on the specified output quality bound.
Both the OS and the MMU have to be instrumented with additional logic to implement the proposed allocation strategy. First, the OS keeps track of the pages belonging to the different qbins using a special data structure called the qbin-map. Second, the page table in the MMU needs an extra field to specify the qbin of each virtual page obtained from the custom allocator. The OS then uses the qbin-map to find a suitable physical frame for mapping to each virtual page. The qbin-map is used in conjunction with the core map. Once a physical page is obtained from the qbin-map, the OS checks the core map to see whether the selected page is free, and then either allocates it or skips it for the next one. An overview of the qbin-map and the overall lookup procedure is provided in Fig. 13. Thus, whenever a virtual page is to be mapped to a physical page, the OS first checks whether the virtual page is critical or non-critical and, if it is non-critical, it then checks the worst-case qbin for the desired output quality as specified by the system designer. The OS subsequently uses the qbin-map to check whether there is any free page that belongs to the worst-case qbin or a higher-quality one. Note that, compared to the baseline implementation, there is a small performance penalty for looking up this qbin-map. One way to reduce this lookup latency is to look up multiple entries of the table simultaneously when determining the correct physical frame. Conceptually, our framework expects the OS to completely automate the entire allocation process, as depicted in Fig. 12. An offline analysis of output quality versus quality bin is performed over a set of training inputs, and the result can be fed to the OS for fine-tuning the refresh rate during runtime. Once an application has completed its execution, all the associated physical pages are freed.
Note that OS support is essential to make approximate DRAM a reality in high-end sophisticated systems like large-scale data-centers.
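The lookup procedure described above can be sketched as a sequential scan of the qbin-map guarded by the core map. The data layout below (a dictionary of frame lists and a list of in-use flags) is a hypothetical simplification of the structures shown in Fig. 13.

```python
def find_frame(qbin_map, core_map, critical, worst_qbin=0):
    """Return a free physical frame for one virtual page, or None.

    qbin_map[q] lists the frames in qbin q; core_map[f] is True when
    frame f is already in use. Critical pages may only use qbin0.
    """
    limit = 0 if critical else worst_qbin
    for q in range(limit + 1):  # best quality first
        for frame in qbin_map.get(q, []):
            if not core_map[frame]:
                core_map[frame] = True  # mark the frame as allocated
                return frame
    return None


# Four frames: two in qbin0, one each in qbin1 and qbin2; frame 0 is in use.
qbin_map = {0: [0, 1], 1: [2], 2: [3]}
core_map = [True, False, False, False]

first = find_frame(qbin_map, core_map, critical=True)
second = find_frame(qbin_map, core_map, critical=False, worst_qbin=1)
third = find_frame(qbin_map, core_map, critical=True)
```

The last call returns `None` because no accurate frame is left, which is the situation in which the refresh interval would have to be reduced.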
Storage, Performance, and Characterization Overheads
To keep a record of the pages belonging to different quality bins, we require a multi-column bitmap (qbin-map) where rows correspond to pages and each column corresponds to the quality bins for each strategy. In the scenario where we have 1 GB of memory, a 1 KB page size, eight quality bins per strategy, and five refresh time intervals, the total memory overhead for storing this bitmap is around 1.8 MB per strategy. The storage efficiency can be further improved by using larger page sizes (e.g., 4 KB) and coarser refresh rates. This bitmap is kept in persistent storage and can be consulted by the OS while performing the virtual to physical memory address translation. The performance overhead depends on the hardware and software architecture as well as on the characteristics of the different benchmarks. In this paper, we used μC/OS-II as the OS for allocation. In our implementation, the qbin-map is created in software and the qbin-map entries are looked up sequentially. On average, processing one input of each application required almost 800 sequential qbin-map reads, which take less than 0.1 ms overall and are negligible compared to the overall application execution time (> 10 s). Hence, we can conclude that the additional overhead is negligible from a performance and energy point of view.
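The storage figure can be reproduced with a short calculation. The sketch assumes that each page stores one bin index per refresh interval, encoded in the minimum number of bits (3 bits for eight bins); this encoding is our assumption, chosen because it is consistent with the quoted overhead.

```python
def qbin_map_bytes(memory_bytes, page_bytes, num_qbins, num_intervals):
    """Size of a qbin-map holding one bin index per page and refresh interval."""
    pages = memory_bytes // page_bytes
    bits_per_entry = (num_qbins - 1).bit_length()  # 3 bits for 8 bins
    return pages * num_intervals * bits_per_entry / 8


MIB = 2**20
overhead_1k = qbin_map_bytes(2**30, 2**10, num_qbins=8, num_intervals=5) / MIB
overhead_4k = qbin_map_bytes(2**30, 2**12, num_qbins=8, num_intervals=5) / MIB
```

Under this encoding the 1 KB page case comes to about 1.9 MB, in line with the figure of around 1.8 MB, and moving to 4 KB pages cuts the overhead by a factor of four.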
The energy and latency overheads associated with the DRAM characterization is negligible in the long run since the characterization needs to be performed only once before the DRAM module is used in the system. Variation in retention time of DRAM cells due to aging and other factors can be taken into account by performing the process again albeit after a very long period (several months/years).
Effect of Temperature
The DRAM cell retention time decreases exponentially as temperature increases, and the exact behavior can be predicted numerically as shown in [6]. To incorporate the effect of temperature during the binning process, we can always maintain a guard-band and use the characterization results derived at a temperature higher than the maximum operating temperature used during normal execution. This is possible since errors occurring at a lower temperature are a subset of those occurring at higher temperatures, and hence we can guarantee that there will be no additional bit errors in the lower temperature range. The highest temperature used for characterization can be determined empirically based on the use case, e.g., 85 °C for commercial applications. In our experiments, we performed the characterization in the room-temperature range, and the operating temperature during evaluation was always kept within this range. To demonstrate the effect of temperature on DRAM errors, we present characterization results for a higher temperature (70 °C) and compare them with results obtained at 30 °C, as shown in Fig. 14. As we can see, the total number of errors increases with temperature. Consequently, the number of pages in lower quality bins also increases. Although we do not show it explicitly, the errors at 30 °C are found to be a subset of the errors at 70 °C at each refresh interval.
Characterizing Different DRAM Modules
As stated before, random variations in DRAM bit retention time causes significant difference in DRAM error characteristics for modules not only belonging to different DRAM vendors but even across modules belonging to the same vendor. We compare the error characteristics of previously characterized Hynix module to that of a Kingston module in Fig. 15 . One can clearly observe the differences in error characteristics where the Hynix module is significantly more error-resilient than the Kingston module at a fixed refresh interval. Hence, we expect the Hynix module to result in higher energy savings for a given quality degradation bound.
Data Patterns Used in DRAM Characterization
To take into account the effect of noise on the bitline voltage generated by bitline-bitline and bitline-wordline coupling, the characterization was repeated with other non-intuitive data patterns such as checkerboard, walk, and random, as described in [6]. We then take the union of the errors in each page resulting from each data pattern to account for the worst case. This results in improved fault coverage. However, for the modules that we used in our experiments, these data patterns showed only a marginal improvement over the fixed data patterns of all 0s and all 1s.
EXPERIMENTAL SETUP
This section provides a brief description of the experimental setup used to validate our design.
Hardware Setup
For verifying our data allocation scheme, we performed all our experiments on an Altera Stratix IV GX FPGA based Terasic TR4-230 development board [20] consisting of a Hynix 1 GB SODIMM DDR3 DRAM operating at 1.5 V. The entire experimental setup is shown in Fig. 16 . It consists of an ADEXELEC DDR3-SODIMM-01 extender, which allowed us to measure the current consumption during DRAM operation with the help of a high precision Keithley-6430 SourceMeter. As stated earlier, we used the soft Nios II Processor [21] along with the UniPHY DDR3 memory controller [24] provided by Altera for controlling the DRAM module. The processor runs at 133 MHz. The temperature experiments were performed by placing the FPGA inside a Quincy Lab 12-140E Incubator as shown in Fig 16a. For all our experiments, we report the self refresh power consumption of the DRAM.
Software Setup
It is not imperative for an embedded computing system to be equipped with a well-defined memory management unit (MMU) for allocating data to physical frames. However, since we deal with data allocation at page granularity, we require a custom MMU that handles memory allocation in fixed-size blocks (1 KB in our case). To demonstrate the effectiveness of our approach, we use the lightweight operating system μC/OS-II [25]. μC/OS-II provides a comprehensive set of APIs for managing different computing resources. We primarily used the task and memory management APIs for allocating data in pages belonging to different quality bins. μC/OS-II was selected for two primary advantages. First and foremost, it supports the Nios II soft processor, which allowed us to seamlessly integrate it into the overall design flow. Second, it offers an extremely lightweight real-time multitasking kernel (with a memory footprint of only 20 KB). The lack of a well-defined MMU worked to our advantage, since we developed the custom MMU, henceforth termed the software memory allocator (SMA) (Fig. 17), from scratch and optimized it for the basic requirements of the allocation strategies.
OS Design
As stated earlier, we used the APIs provided by μC/OS-II and integrated them into our custom-built wrapper modules for systematic page allocation. In the simplest case, we created two tasks, one dedicated entirely to the SMA and the other to the error-resilient application to be evaluated. The SMA keeps track of the quality bins obtained from the characterization step and allocates data according to the quality specifications. The application task operates on the data allocated by the SMA and writes the changes back, if any, to the designated bins. We extended this further to implement the multi-application scenario, where we created three tasks: one for the SMA and two for running the error-tolerant applications. The overall software framework is shown in Fig. 17. Wrapper modules were used to create new tasks, allocate new pages, read pages, write pages, and delete existing ones. It is important to note that, since we are also required to allocate the heap into the inaccurate pages (as approximate data is mostly in the heap), we needed to emulate malloc() and free() with our wrapper modules. malloc() is implemented using the OSMemCreate() module, and free() is implemented by resetting the page counters that keep track of the quality bins. Each application is assigned its own partition consisting of a number of blocks (blocks are the same as pages). During task creation, the SMA is statically assigned a higher priority than the applications under test so that data is allocated before being used. Note that the qbin-map was also implemented in software for tracking the addresses of the physical pages that belong to the different qbins. Our OS implementation mirrors the actual process of page-based data allocation in systems irrespective of whether virtual memory is present, and hence can be scaled up to commercial state-of-the-art devices.
EXPERIMENTAL RESULTS
This section presents results from the experiments conducted to validate our proposed scheme and also provides interesting insights gained from analyzing them. The results are divided into three broad subsections. In the first part, we show the trade-off between energy and quality that exists in selecting the appropriate refresh rate, a universal trait of any approximate computing mechanism. We also show that our approach results in much higher energy savings for the same quality compared to prior work. Closely related to this trade-off, the second part shows how the output quality is affected when data is put into different quality bins. Finally, the third part describes the results obtained by using the approximate DRAM subject to different quality specifications. Note that all results presented in the next section are obtained at room temperature (24 °C).
Quality-Energy Trade-Off
Fig. 18 portrays the energy-quality relationship when we adopt Strategy 1 (or Strategy 2) for our data allocation scheme. It shows that a decrease in the refresh interval $t_r$ increases not only the number of accurate pages (and hence the output quality) but also the refresh power consumption. Note that the variation in refresh power over the range of 1 s-100 s is small, since $t_r$ is already scaled up by orders of 100-1000x. Hence, any refresh period modulation in this range causes only a minute variation, in the range of tens of mW. However, the order of $t_r$ is orthogonal to the core idea of this paper. This range is only used as an example to show a sufficient number of distinct quality bins, each containing at least some minimum number of pages. The overall concept is generic and can be easily extended to lower refresh intervals, where the refresh power varies widely. In the present case, at $t_r = 60$ s (1,000x refresh reduction), the chart shows 80 percent of the total pages to be accurate and 15 percent to be in qbin1, for a refresh power equal to only 27 percent of the original. In all cases, the refresh power is normalized to the original refresh power, which was measured to be 66 mW when the entire DRAM is refreshed at a period of 64 ms. Fig. 18 also shows that the refresh power saturates to a constant value, which is the minimum power consumed by the DRAM module when put into its lowest power mode. Fig. 19 compares the refresh power consumption of Flikker with that of our work at different critical data (CD) sizes. To acquire the power numbers for Flikker using our experimental setup, we assume that Flikker refreshes the critical and non-critical portions at 64 ms and 1 s, respectively. Our work sets the refresh rate only on the basis of CD.
Experimental results show that our work achieves a maximum refresh power reduction of 73 percent, compared to only 35 percent for Flikker, when half of the DRAM contains critical data, which is a 2.4X improvement in refresh power consumption. As the CD size decreases, Flikker's power reduction becomes comparable, since it now applies the optimal refresh rate of 64 ms to a smaller portion of the DRAM. Another important point to note is that Flikker does not have any control over the occurrence of errors in the approximate data, whereas our work always ensures that approximate data is allocated to the least erroneous pages first, thus resulting in the least possible output quality degradation for a fixed $t_r$. It is important to note that both of these results are independent of the nature of the individual applications and depend only on the total size of the critical and non-critical data present at a particular instant.
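The 2.4X figure follows from comparing the refresh power that remains after each reduction, rather than the reductions themselves. Writing the remaining power as one minus the fractional reduction (the symbols below are introduced only for this illustration):

```latex
\frac{P_{\mathrm{Flikker}}}{P_{\mathrm{ours}}}
  = \frac{1 - 0.35}{1 - 0.73}
  = \frac{0.65}{0.27}
  \approx 2.4
```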
Impact of Quality Bins on Output Quality
We tested the four proposed allocation strategies on eight different error-resilient applications taken from a variety of domains. All applications were implemented on top of the soft-core Nios II Processor in the FPGA and, subsequently, the output qualities were computed using the quality metrics stated in Table 1 and the application outputs obtained during runtime. The test applications and the approximate portion of their data are tabulated in Table 1. For simplicity, the distinction between critical and approximate data is made at a coarse granularity, where we assume all data other than the approximate data stated in Table 1 to be critical. Note that we could have chosen any mechanism for demarcating critical and approximate data in the applications; this choice is orthogonal to the core idea of this paper. Fig. 21 presents the results of running our data allocation scheme using each strategy for the case of a JPEG encoder. It illustrates how the output quality of the image (represented quantitatively by PSNR) degrades from a higher quality bin to a lower quality bin when the pixels of the test image are allocated in their entirety to each of the quality bins. Since the image pixels are declared as bytes, the $BW_m$ for Strategy 3 is calculated at byte granularity by empirically selecting the weights 1, 2, 4, and 8 for bit positions 0-1, 2-3, 4-5, and 6-7, respectively. The results verify that placing approximate data in higher quality bins results in improved application quality. As a result, at smaller refresh intervals, the presence of a larger number of pages in higher quality bins helps us achieve better output quality. Note that for qbin0, the PSNR of the image is infinite, since it does not have any errors relative to the original image output. The other PSNR values (in dB) are embedded within the plots. The graphs in Fig. 20 show the normalized application-level output qualities when non-critical data is allocated to the different quality bins qbin0-qbin4 using Strategies 1, 2, and 3 at $t_r = 60$ s. Note that the output qualities obtained using Strategies 1 and 2 are similar for higher quality bins, since each word error is caused by a single bit error in each page, and hence they are represented as a single plot in Fig. 20a. Fig. 20b shows the output qualities for Strategy 3. These plots show that, even on reducing the refresh rate by nearly 1,000x, we incur very little application-level quality degradation. The maximum quality loss turns out to be less than 7 percent (for MPEG) when data is put in qbin4 using Strategy 3. For Strategies 1 and 2, only GLVQ has significant quality degradation when allocated to bins qbin1-qbin4 and higher. The quality degrades gradually when the approximate data of these applications is placed in lower quality bins, due to the higher number of bit flips. Applications such as KNN, SOBEL, IMG-SEG, and CNN retain almost 100 percent of their quality even when put into lower quality bins for all three strategies. Thus, we achieve DRAM energy savings of up to 73 percent with virtually no loss in quality. Note that Strategy 4 is heavily data and application specific and may not be applicable in all cases. Hence, we only show a single use case for it, in Fig. 21.
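To make the role of bit significance concrete, the sketch below computes the standard PSNR for 8-bit pixels and a per-byte weighted error using the weights quoted above for Strategy 3. Summing the weights of the flipped bit positions is our simplified reading of a significance-based metric, not the exact definition of $BW_m$; the function names and pixel values are hypothetical.

```python
import math

# Weights 1, 2, 4, 8 for bit positions 0-1, 2-3, 4-5, 6-7 of a byte.
BYTE_WEIGHTS = [1, 1, 2, 2, 4, 4, 8, 8]


def weighted_bit_errors(original, observed):
    """Sum of the weights of the bit positions that differ (assumed metric)."""
    flipped = original ^ observed
    return sum(w for bit, w in enumerate(BYTE_WEIGHTS) if (flipped >> bit) & 1)


def psnr(reference, degraded, peak=255):
    """Peak signal-to-noise ratio in dB; infinite when the images match."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, degraded)) / len(reference)
    return math.inf if mse == 0 else 10 * math.log10(peak**2 / mse)


pixels = [100, 100, 100, 100]
lsb_flip = [100 ^ 0b00000001, 100, 100, 100]  # bit 0 of one pixel flips
msb_flip = [100 ^ 0b10000000, 100, 100, 100]  # bit 7 of one pixel flips

psnr_lsb = psnr(pixels, lsb_flip)
psnr_msb = psnr(pixels, msb_flip)
```

A single flip in the most significant bit costs far more PSNR than one in the least significant bit, which is why a significance-aware strategy can rank pages differently from a plain error count.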
The results in this section depict the influence of quality bins on application quality, which is necessary to understand the impact of migrating approximate data from lower to higher quality pages or vice-versa. This, in turn, is useful to comprehend how application quality changes with refresh rate which is required for fine grained refresh control. Finally, we can conclude from these results that Strategy 3 is a much more fine-grained and accurate allocation strategy than Strategies 1 and 2.
System-Level Energy Savings
We leverage a smart camera-based system built using the Terasic TR4 development board for providing an estimate of the system-level energy savings obtained from the approximate DRAM. A smart camera-based system represents a real state-of-the-art embedded system that is used extensively in image recognition, gesture detection, traffic control, home security, disaster monitoring, etc. This system can execute a host of error-resilient applications including the evaluation benchmarks we used in this paper. The smart camera-based system consists of three major subsystems: i) computational subsystem, ii) memory subsystem, and iii) sensor (or camera) subsystem. Along with the Nios II soft processor, the system consists of a Terasic TRDB-D5M 5 MP camera module and a 1 GB DDR3 SODIMM DRAM module. We evaluated the total energy consumption of the system as well as each of the subsystems in our experiments for processing a single input for each of the benchmarks. Experimental results showed that the memory subsystem contributes around 50 percent of the overall system energy consumption. DRAM refresh specifically consumes 45 percent of the total energy on an average. Since we obtain a DRAM power reduction of 73 percent, reducing the refresh rate leads to a maximum system-level energy savings up to 33 percent for minimal quality loss.
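The system-level figure can be checked with a one-line estimate, assuming that the energy of every other component is unchanged so that only the refresh share scales (the symbols are introduced here for illustration):

```latex
S_{\mathrm{system}} \approx f_{\mathrm{refresh}} \times S_{\mathrm{refresh}}
  = 0.45 \times 0.73 \approx 0.33
```

Here $f_{\mathrm{refresh}}$ is the fraction of total energy spent on DRAM refresh and $S_{\mathrm{refresh}}$ is the fractional refresh power reduction.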
Quality Modulation
This section specifies the worst-case quality bins for Strategies 1, 2, and 3 in Tables 2 and 3, which are required for configuring the output quality according to specified constraints (represented as percent quality degradation or percent Qual), as per our discussion in Section 5.3.1. As noted previously, Strategies 1 and 2 show similar results. Quality bins specified in the form qbin9+ denote that, even when approximate data is allocated to qbin9 (the last qbin taken into consideration), the output quality was within the specified quality bound, showing the extreme resilience of these applications (KNN, IMG-SEG, and CNN for both strategies, SOBEL for Strategies 1 and 2, and KMEANS and GLVQ for Strategy 3) to DRAM refresh errors. Finally, it is important to note that the quality bins acquired in this process are heavily dependent on the characteristics of the training data, and hence it becomes imperative to select the training data judiciously. Fig. 22 shows the normalized output qualities of the benchmarks when the approximate data is allocated according to the quality specifications stated in Table 2. Note that this is shown only for Strategy 1 at $t_r = 60$ s. The output qualities are normalized to the fully accurate case and are calculated by averaging over the entire test input dataset. As we can see, the output qualities respect the quality degradation bounds.
CONCLUSION AND FUTURE WORK
This paper proposes a novel method for constructing quality-aware DRAM by characterizing the DRAM errors in each physical page at sub-optimal refresh rates. It also devises four novel strategies for segregating the physical pages into different quality bins which are used for systematically allocating critical and approximate data during page mapping. With the emergence of eDRAM (embedded DRAM) as a last level (L4) or memory side cache in recent high-end commercial processors, approximating the DRAM cache requires a new approach in the future. Another interesting direction is to adopt the proposed allocation strategy for other emerging memory technologies.
