Abstract: Bounding each task's worst-case execution time (WCET) accurately is essential for real-time systems to determine if all deadlines can be met. Yet, access latencies to Dynamic Random Access Memory (DRAM) vary significantly due to DRAM refresh, which blocks access to memory cells. Variations further increase as DRAM density grows.
I. INTRODUCTION
Dynamic Random Access Memory (DRAM) has been the memory of choice in embedded systems for many years due to its low cost combined with large capacity, albeit at the expense of volatility. As specified by the DRAM standards [1], [2], each DRAM cell must be refreshed periodically within a given refresh interval. The refresh commands are issued by the DRAM controller via the command bus. This mode, called auto-refresh, recharges all memory cells within the "retention time", which is typically 64ms for commodity DRAMs under 85 °C [1], [2]. While DRAM is being refreshed, a memory space (i.e., a DRAM rank) becomes unavailable to memory requests so that any such memory reference blocks the CPU pipeline until the refresh completes. Furthermore, a DRAM refresh command closes a previously open row and opens a new row subject to refresh [3], even though data of the old row may be reused (referenced) before and after the refresh. Hence, the delay suffered by the processor due to DRAM refresh includes two aspects: (1) the cost (blocking) of the refresh operation itself, and (2) reloads of the row buffer for data displaced by refreshes. As a result, the response time of a DRAM access depends on its point in time during execution relative to DRAM refresh operations.
Prior work indicated that system performance is significantly degraded by refresh overhead [4], [5], [6], [7], a problem that is becoming more prevalent as DRAMs are increasing in density. With growing density, more DRAM cells are required per chip, which must be refreshed within the same retention time, i.e., more rows need to be refreshed within the same refresh interval. This increases the cost of a refresh operation and thus reduces memory throughput. Due to the asynchronous nature of refreshes relative to task schedules and preemptions, none of the current analysis techniques tightly bound the effect of DRAM refreshes as a blocking term on response time. Atanassov and Puschner [8] discuss the impact of DRAM refresh on the execution time of real-time tasks and calculate the maximum possible increase of execution time due to refreshes. However, this bound is too pessimistic (loose): If the WCET or the blocking term were augmented by the maximum possible refresh delay, many schedules would become theoretically infeasible, even though executions may meet deadlines in practice. Although Bhat et al. make refreshes predictable and reduce preemption due to refreshes by triggering them in software instead of hardware auto-refresh [3], the cost of refresh operations is only accounted for, not hidden. Also, a task cannot be scheduled under Bhat if its period is less than the execution time of a burst refresh.
This work was supported in part by NSF grants 1239246, 1329780, 1525609 and 1813004.
This work contributes the "Colored Refresh Server" (CRS) to remove task preemptions due to refreshes and to hide DRAM refresh overhead. Contributions: (1) The impact of refresh delay under varying DRAM densities/sizes is assessed for real-time systems with stringent timing constraints. (2) The Colored Refresh Server (CRS) for uniprocessors is developed to refresh DRAM via memory space coloring and shown to hide refresh overhead almost entirely. (3) Experiments with real-time tasks confirm that both refresh delays are hidden and DRAM access latencies are reduced.
II. BACKGROUND AND MOTIVATION
Today's computers predominantly utilize dynamic random access memory (DRAM), where each bit of data is stored in a separate capacitor within DRAM memory. To serve memory requests from the CPU, the memory controller acts as a mediator between the last-level cache (LLC) and DRAM devices. Once memory transactions are received by a DRAM controller from its memory controller, these read/write requests are translated into corresponding DRAM commands and scheduled while satisfying the timing constraints of DRAM banks and buses. A DRAM controller is also called a node that governs DRAM memory organized into channels, ranks and banks.
A. Memory Space Partitioning
We assume a DRAM hierarchy with node, channel, rank, and bank abstraction. To partition this memory space, we obtained a copy of TintMalloc [9] , a heap allocator that "colors" memory pages with controller (node) and bank affinity.
TintMalloc allows programmers to select one (or more) colors to choose a memory controller and bank regions disjoint from those of other tasks. DRAM is further partitioned into channels and ranks above banks. The memory space of an application can be chosen such that it conforms to a specific color. E.g., a real-time task can be assigned a private memory space based on rank granularity. When this task runs, it can only access the memory rank it is allocated to. No other memory rank will ever be touched by it. By design, there is a penalty for the first heap allocation request with a color under TintMalloc. This penalty only impacts the initialization phase. After a "first touch" page initialization, the latency of any subsequent accesses to colored memory is always lower than that of uncolored memory subject to buddy allocation (Linux default). Also, once the colored free list has been populated with pages, the initialization cost becomes constant for a stable working set size, even for dynamic allocations/deallocation assuming they are balanced in size. Real-time tasks, after their initialization, experience highly predictable latencies for subsequent memory requests. Hence, a first coloring allocation suffices to amortize the overhead of initialization.
B. DRAM Refresh
Refresh commands are periodically issued by the DRAM controller to recharge all DRAM cells, which ensures data validity in the presence of electric leakage. A refresh command forces a read to each memory cell followed by a write-back without modification, which recharges the cell to its original level. The reference refresh interval of commodity DRAMs is 64ms under 85 °C (185 °F) or 32ms above 85 °C, the so-called retention time, tRET, of leaky cells, sometimes also called refresh window, tREFW [1], [2], [10], [11]. All rows in a DRAM chip need to be refreshed within tRET, otherwise data will be lost. In order to reduce refresh overhead, refresh commands are processed at rank granularity for commodity DRAM [12]. The DRAM controller can either schedule an automatic refresh for all ranks simultaneously (simultaneous refresh), or schedule automatic refresh commands for each rank independently (independent refresh). Whether simultaneous or independent, each memory refresh cycle affects a contiguous area of multiple cells over consecutive cycles. This area is called a "refresh bin" and contains multiple rows. The DDR3 specification [1] generally requires that 8192 automatic refresh commands are sent by the DRAM controller to refresh the entire memory (one command per bin at a time). Here, the refresh interval, tREFI, denotes the gap between two refresh commands, e.g., tREFI = 7.8 µs, i.e., tREFW/8192 for tREFW = 64ms. The so-called refresh completion time, tRFC, is the refresh duration per bin. Auto-refresh is triggered in the background by the DRAM controller while the CPU executes instructions.
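The relation between the refresh window and the refresh interval can be checked with a short calculation. The following is a minimal sketch: the function names are ours, and the tRFC value of 160 ns is an assumed, illustrative figure rather than one taken from the text.

```python
def refresh_interval_us(t_refw_ms: float, n_commands: int = 8192) -> float:
    """Gap between consecutive refresh commands (tREFI) in microseconds."""
    return t_refw_ms * 1000.0 / n_commands


def refresh_overhead_fraction(t_rfc_ns: float, t_refi_us: float) -> float:
    """Fraction of time a rank is blocked when one bin is refreshed per tREFI."""
    return t_rfc_ns / (t_refi_us * 1000.0)


# Refresh window of 64 ms (under 85 degrees C) and 32 ms (above 85 degrees C).
t_refi_normal = refresh_interval_us(64.0)  # 7.8125 us, quoted as 7.8 us
t_refi_hot = refresh_interval_us(32.0)     # 3.90625 us

# ASSUMPTION: tRFC = 160 ns is a hypothetical per-bin refresh duration.
overhead = refresh_overhead_fraction(160.0, t_refi_normal)

if __name__ == "__main__":
    print(f"tREFI (64 ms window): {t_refi_normal} us")
    print(f"tREFI (32 ms window): {t_refi_hot} us")
    print(f"blocked fraction at assumed tRFC: {overhead:.4f}")
```

Halving the refresh window halves tREFI, so the same tRFC blocks the rank for twice the fraction of time.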
Memory ranks remain unavailable during a refresh cycle, tRFC, i.e., memory accesses (read and write operations) to this region will stall the CPU during a refresh cycle. DRAM ranks can be refreshed in parallel under auto-refresh. However, the amount of unavailable memory increases when refreshing ranks in parallel. A fully parallel refresh blocks the entire memory space for tRFC. This blocking time not only decreases system performance, but can also result in deadline misses unless it is considered in a blocking term by all tasks.
Furthermore, a side effect of DRAM refresh is that a row buffer is first closed, i.e., its data is written back to the data array and any memory access is preempted. After the refresh completes, the original data is loaded back into the row buffer again, and the deferred memory access can continue. As a result, an additional overhead of tRP + tRAS is incurred to close and re-open rows since the refresh purges all buffers. By considering both the cost of a refresh operation itself and the extra row close/re-open delay, DRAM refresh not only decreases memory performance, but also causes the response time of memory accesses to fluctuate. Due to the asynchronous nature of refreshes and task preemptions, it is hard to accurately predict and bound DRAM refresh delay. Depending on when a refresh command is sent to a bin (successive rows), two scheduling strategies exist: distributed and burst refresh (see [13] ).
III. DESIGN
The core problem with the standard hardware-controlled auto-refresh is the interference between periodic refresh commands generated by the DRAM controller and memory access requests generated by the processor. The latter ones are blocked once one of the former is issued until the refresh completes. As a result, memory latency increases and becomes highly unpredictable since refreshes are asynchronous. The central idea of our approach is to remove DRAM refresh interference by memory partitioning (coloring). Given a real-time task set, we design a hierarchical resource model [14] , [15] , [16] to schedule it with two servers. To this end, we partition the DRAM space into two colors, and each server is assigned a colored memory partition. (We show in [13] that two colors suffice, i.e., adding additional colors does not extend the applicability of the method, it would only make schedulability tests more restrictive.) By cooperatively grouping applications into two resource servers and appropriately configuring those servers (period and budget), we ensure that memory accesses can no longer be subject to interference by DRAM refreshes. Our approach can be adapted to any real-time scheduling policy supported inside the CRS servers. In this section, we describe the resource model, bound the timing requirements of each server, and analyze system schedulability.
A. Assumptions
We assume that a given real-time task set is schedulable with auto-refresh under a given scheduling policy (e.g., EDF or fixed priority), i.e., that the worst-case blocking time of refresh is taken into account. As specified by the DRAM standards [1], [2], the entire DRAM has to be refreshed within its retention time, tRET, either serially or in parallel for all K ranks. We also assume hardware support for timer interrupts and memory controller interrupts (MC interrupts).
B. Task Model
Let us denote the set of periodic real-time tasks as $T = \{T_1, \ldots, T_n\}$, where each task, $T_i$, is characterized by $(\phi_i, p_i, e_i, D_i)$ for a phase $\phi_i$, a period $p_i$, (worst-case) execution time $e_i$, relative deadline $D_i$ per job, task utilization $u_i = e_i/D_i$, and a hyperperiod $H$ of $T$. Furthermore, let tRET be the DRAM retention time, $L$ be the least common multiple of $H$ and tRET, and $K$ be the number of DRAM ranks, and let $k_i$ denote rank $i$.
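The quantities of the task model can be computed directly. The sketch below assumes integer time units (microseconds) and a hypothetical two-task set; the `Task` class and its field names are ours and only mirror the phase, period, execution time and deadline described above. It requires Python 3.9 or later for `math.lcm`.

```python
from dataclasses import dataclass
from math import lcm


@dataclass(frozen=True)
class Task:
    phase: int     # phi_i
    period: int    # p_i
    wcet: int      # e_i
    deadline: int  # D_i

    @property
    def utilization(self) -> float:
        # u_i = e_i / D_i, following the definition in the text
        return self.wcet / self.deadline


def hyperperiod(tasks) -> int:
    """H: least common multiple of all task periods."""
    return lcm(*(t.period for t in tasks))


T_RET = 64_000  # retention time of 64 ms, expressed in microseconds

# ASSUMPTION: hypothetical task set, chosen only for illustration.
tasks = [Task(0, 16_000, 2_000, 16_000), Task(0, 20_000, 5_000, 20_000)]

H = hyperperiod(tasks)  # lcm(16000, 20000) = 80000
L = lcm(H, T_RET)       # lcm(80000, 64000) = 320000

if __name__ == "__main__":
    print("H =", H, "L =", L)
    print("utilizations:", [t.utilization for t in tasks])
```

In this example, L spans four hyperperiods and five retention periods, so the interaction between the task schedule and the refresh schedule repeats every L time units.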
C. DRAM Refresh Server Model
The Colored Refresh Server (CRS) partitions the entire DRAM space into two "colors", such that each color contains one or more DRAM ranks, e.g., $c_1(k_0, k_1, \ldots, k_i)$ and $c_2$ comprising the remaining ranks $(k_{i+1}, \ldots, k_{K-1})$.
We build a hierarchical resource model (task server) [16], $S(W, A, c, p_s, e_s)$, with CPU time as the resource, where $W$ is the workload model (applications), $A$ is the scheduling algorithm, e.g., EDF or RM, $c$ denotes the memory color(s) assigned to this server, i.e., a set of memory ranks available for allocation, $p_s$ is the server period, and $e_s$ is the server execution time (budget). Notice that the base model [16] is compositional (assuming an anomaly-free processor design) and it has been shown that a schedulability test within the hyperperiod suffices for uniprocessors.
The refresh server can execute when (i) its budget is not zero, (ii) its available task queue is not empty, and (iii) its memory color is not locked by a "refresh task" (introduced below). Otherwise, it remains suspended.
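The three conditions can be expressed as a single predicate. This is a minimal sketch under simplifying assumptions: the `Server` class, its fields, and the representation of locks as a set of color identifiers are hypothetical, and each server is given exactly one color.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    color: int                    # memory color assigned to this server
    budget: int                   # remaining execution budget
    ready_queue: list = field(default_factory=list)  # tasks ready to run


def can_execute(server: Server, locked_colors: set) -> bool:
    """A server may run only if all three conditions (i)-(iii) hold."""
    has_budget = server.budget > 0                       # (i)
    has_work = len(server.ready_queue) > 0               # (ii)
    color_free = server.color not in locked_colors       # (iii)
    return has_budget and has_work and color_free


if __name__ == "__main__":
    s1 = Server(color=1, budget=3, ready_queue=["T1"])
    s2 = Server(color=2, budget=3, ready_queue=["T2"])
    locked = {2}  # color 2 is currently being refreshed
    print(can_execute(s1, locked), can_execute(s2, locked))
```

While color 2 is locked, only the server of color 1 is eligible, which is how computation on one color overlaps with the refresh of the other.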
D. Refresh Lock and Unlock Tasks
We employ "software burst parallel refresh" [3] to refresh multiple DRAM ranks in parallel via the burst pattern (i.e., another refresh command is issued for the next row immediately after the previous one finishes [13]). In our approach, there are two "refresh lock tasks" ($T_{rl1}$ and $T_{rl2}$) and two "refresh unlock tasks" ($T_{ru1}$ and $T_{ru2}$). $T_{rl1}$ and $T_{ru1}$ surround the refresh for color $c_1$ and are allocated to server $S_1$, while $T_{rl2}$ and $T_{ru2}$ surround the refresh for color $c_2$ and are allocated to server $S_2$. The top-level task set $T$ of our hierarchical model thus consists of the two server tasks $S_1$ and $S_2$ plus another two tasks per color, with the highest priority, for refresh lock/unlock: $T_{rl1}$ and $T_{ru1}$ as well as $T_{rl2}$ and $T_{ru2}$.
Fig. 1. Refresh Task with CPU Work plus DRAM Controller Work
When a refresh lock task is released (Fig. 1), the CPU sends a command to the DRAM controller to initiate parallel refreshes in a burst. Furthermore, a "virtual lock" is obtained for the colors subject to refresh. Due to their higher priority, refresh lock/unlock tasks preempt any server (if one was running) until they complete. Subsequently, the refresh lock task terminates so that a server task (of opposite color) can be resumed. In parallel, the "DRAM refresh work" is performed, i.e., burst refreshes are triggered by the controller. We use $e_{r1}$ and $e_{r2}$ to represent the duration of the DRAM refresh for colors $c_1$ and $c_2$, respectively. A CPU server resumes execution only if its budget is not exhausted, its allocated color is not locked, and some task in its server queue is ready to execute.
Once all burst refreshes have completed, an interrupt is triggered, which causes the CPU to call the refresh unlock task that unlocks the newly refreshed colors so that they become available again. This interrupt can be raised in two ways: (1) If the DRAM controller supports interrupt completion notification in hardware, it can be raised by the DRAM controller. (2) Otherwise, the length of a burst refresh, δ, can be measured and the interrupt can be triggered by imposing a phase of δ on the unlock task relative to the phase of the lock task of the same color. Interrupts are triggered at absolute times to reduce jitter (see Sect. IV). The overhead of this interrupt handler is folded into the refresh unlock task for schedulability analysis in the following. In practice, the cost of a refresh lock/unlock task is extremely small since it only programs the DRAM controller or handles the interrupt.
The periods of both the refresh lock and unlock tasks are tRET. Relative to their phase, the refresh lock tasks are released at k * tRET, while the refresh unlock tasks are released at k * tRET + δ. The phases $\phi$ of $T_{rl1}$ and $T_{rl2}$ are tRET/2 and 0, respectively, i.e., memory ranks allocated to $S_2$ are refreshed first, followed by those of $S_1$. Let us summarize: $T = \{S_1, S_2, T_{rl1}, T_{ru1}, T_{rl2}, T_{ru2}\}$, where $S_1 = (0, p_1, e_1, p_1)$, $S_2 = (0, p_2, e_2, p_2)$, $T_{rl1} = (\mathit{tRET}/2, \mathit{tRET}, e_{rl}, \delta)$, $T_{rl2} = (0, \mathit{tRET}, e_{rl}, \delta)$, $T_{ru1} = (\mathit{tRET}/2 + \delta, \mathit{tRET}, e_{ru}, \delta)$, $T_{ru2} = (\delta, \mathit{tRET}, e_{ru}, \delta)$.
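The release pattern of the lock and unlock tasks can be enumerated to see that the two colors are never locked at the same time. The sketch below is illustrative only: the helper names are ours, time is in microseconds, and the burst length of 200 µs is an assumed value.

```python
def refresh_windows(phase: int, t_ret: int, delta: int, horizon: int):
    """Return (lock_release, unlock_release) pairs with lock release < horizon."""
    windows = []
    t = phase
    while t < horizon:
        windows.append((t, t + delta))
        t += t_ret
    return windows


def overlaps(a, b) -> bool:
    """True if the half-open intervals [a0, a1) and [b0, b1) intersect."""
    return a[0] < b[1] and b[0] < a[1]


T_RET = 64_000  # retention time: 64 ms
DELTA = 200     # ASSUMPTION: hypothetical burst refresh length of 0.2 ms

# Color c1 is locked with phase tRET/2, color c2 with phase 0.
c1 = refresh_windows(T_RET // 2, T_RET, DELTA, 2 * T_RET)
c2 = refresh_windows(0, T_RET, DELTA, 2 * T_RET)

# With delta < tRET/2, the lock windows of the two colors never intersect.
disjoint = not any(overlaps(x, y) for x in c1 for y in c2)

if __name__ == "__main__":
    print("c1 windows:", c1)
    print("c2 windows:", c2)
    print("disjoint:", disjoint)
```

Because the phases differ by tRET/2 and the burst is much shorter than that, at least one color is always available to its server.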
The execution times $e_{rl}$ and $e_{ru}$ of the lock and unlock tasks are upper bounds on the cost of the respective interrupts plus, for the former, programming the memory controllers for refresh and obtaining the lock and, for the latter, just unlocking. (They are also upper bounded by δ.) The execution times $e_1$ and $e_2$ depend on the task sets of the servers covered later, while their deadlines are equal to their periods ($p_1$ and $p_2$). The task set $T$ can be scheduled statically as long as the lock and unlock tasks have a higher priority than the server tasks. A refresh unlock task is triggered by interrupt with a period of tRET. Since we refresh multiple ranks in parallel, the cost of refreshing one entire rank is the same as the cost of refreshing multiple ones. Furthermore, the cost of the DRAM burst refresh, δ, is small (e.g., less than 0.2ms for a 2Gb DRAM chip with 8 ranks).
E. CRS Implementation
Consumption and Replenishment: The execution budget is consumed one time unit per unit of execution. The execution budget is set to $e_s$ at time instants $k \cdot p_s$, where $k \geq 0$. Unused execution budget cannot be carried over to the next period.
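The consumption and replenishment rules can be sketched as a discrete-time budget object. This is a simplified illustration with hypothetical names, assuming integer time units; it is not the implementation evaluated in this work.

```python
class Budget:
    """Periodic server budget: reset to e_s at every multiple of p_s."""

    def __init__(self, period: int, capacity: int):
        self.period = period        # p_s
        self.capacity = capacity    # e_s
        self.remaining = capacity   # budget is set at k = 0

    def tick(self, now: int, wants_to_run: bool) -> bool:
        """Process the time unit starting at `now`; return True if it executed."""
        if now % self.period == 0:
            # Replenish; any unused budget from the last period is discarded.
            self.remaining = self.capacity
        if wants_to_run and self.remaining > 0:
            self.remaining -= 1     # one unit consumed per unit of execution
            return True
        return False


if __name__ == "__main__":
    b = Budget(period=5, capacity=2)
    trace = [b.tick(t, True) for t in range(10)]
    print(trace)
```

With a period of 5 and a budget of 2, a server that always wants to run executes during time units 0, 1, 5 and 6 of the first ten.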
i.e., CRS is directly applicable to them as well. Bhat et al. [3] make DRAM refresh more predictable. Instead of hardware auto-refresh, a software-initiated burst refresh is issued at the beginning of every DRAM retention period. But the memory remains unavailable during the refresh, and any stalls due to memory references at this time increase execution time. Although memory latency is predictable, memory throughput is still lower than under CRS due to refresh blocking, whereas CRS overlays (hides) refresh with computation. Furthermore, a task cannot be scheduled if its period is less than the duration of the burst refresh.
VIII. CONCLUSION
A novel uniprocessor scheduling server, CRS, is developed that hides DRAM refresh overheads via a software solution for refresh scheduling in real-time systems. Experimental results confirm that CRS increases the predictability of memory latency in real-time systems by eliminating blocking due to DRAM refreshes.
