Abstract: Synchronous dynamic random access memories (SDRAMs) are widely employed in multi- and many-core platforms due to their high density and low cost. Nevertheless, their benefits come at the price of a complex two-stage access protocol, which reflects their bank-based structure and an internal level of explicitly managed caching. In scenarios in which requestors demand real-time guarantees, these features pose a predictability challenge and, in order to tackle it, several SDRAM controllers have been proposed. In this context, recent research shows that a combination of bank privatization and open-row policy (exploiting the caching over the boundary of a single request) represents an effective way to tackle the problem. However, this approach uncovered a new challenge: the data bus turnaround overhead. In SDRAMs, a single data bus is shared by read and write operations. Alternating read and write operations is, consequently, highly undesirable, as the data bus must remain idle during a turnaround. Therefore, in this article, we propose a SDRAM controller that reorders read and write commands, which minimizes data bus turnarounds. Moreover, we compare our approach analytically and experimentally with existing real-time SDRAM controllers both from the worst-case latency and power consumption perspectives.
INTRODUCTION AND RELATED WORK
SDRAM memories are widely employed in multi- and many-core platforms, e.g., [1] and [2], due to their high density and low cost. However, their benefits come at the price of a complex two-stage access protocol, which reflects their bank-based structure and an internal level of explicitly managed caching. As a consequence, the execution time of a request depends on the history of previous requests, which poses a challenge from the real-time perspective. In order to tackle this challenge, several SDRAM controllers have been proposed [3], [4], [5], [6], [7].
The classical approach to build real-time SDRAM controllers relies on a combination of bank-interleaved address mapping and close-row policy. The former refers to a single request accessing more than one SDRAM bank, while the latter refers to the internal level of caching being flushed between consecutive requests. Extensive work has been done on strategies to select the appropriate parameters for this approach [3] , [8] , [9] . More specifically, on selecting the number of banks over which a single request is distributed and how the caching can be exploited within the boundary of a single request. After the parameters are defined, bandwidth guarantees can be extracted with the approaches from [10] , which makes the strategy attractive for streaming applications with well defined requirements.
From the real-time perspective, the combination of bank-interleaved address mapping and close-row policy is very effective in scenarios in which the SDRAM data bus is narrow and/or the SDRAM requests have a large granularity. However, if that is not the case, its ability to effectively exploit the SDRAM is compromised [11]. For instance, consider the many-core platforms from [1] and [2], which contain processors that rely on caches and that share 4 SDRAM controllers, each managing a 64-bit wide SDRAM module. In such a case, a single read or write command to one of the SDRAM banks transfers 64 bytes of data (a common cache line size) and, hence, there is no need to employ interleaving or to exploit the internal level of caching for a single request.
To address the aforementioned scenario, researchers proposed using a combination of bank privatization and open-row policy [6], [7], [11]. The former refers to granting a real-time task exclusive access to one or more banks. The latter refers to not flushing the internal level of caching between successive requests. Consequently, the locality of the caching is potentially exploited over the boundary of all requests performed by a task.
Nevertheless, with the new approach, a new challenge was uncovered: the data bus turnaround time. In SDRAMs, a single data bus is shared for read and for write operations. Hence, SDRAM controllers must enforce a minimum timing interval between the execution of a read and of a write command (or vice-versa). Such intervals are known as bus turnaround times and are required in order to change the On-Chip Termination (OCT) of SDRAM chips from input to output or from output to input.
As detailed in [12], the faster a SDRAM device is, the larger the corresponding data bus turnaround times are. Hence, the turnarounds pose a challenge that, if not dealt with, leads to poor SDRAM utilisation. In traditional real-time SDRAM controllers, which were discussed at the beginning of this section, this challenge is mitigated because each incoming request is translated into a statically computed bundle of several read (or write) commands that does not cause a turnaround.
In COTS SDRAM controllers, which are optimized for average performance and rely on the open-row policy, the bus turnaround times are mitigated by buffering pending write requests until their number reaches a certain threshold, after which they are served in a batch [13]. This approach, however, was analyzed in [14] and has been shown to be ineffective from the real-time perspective. The reason is that, in order to compute the worst-case latency of a read request, designers must assume a system backlogged with write requests, which leads to pessimistic timing bounds.
In existing open-row real-time SDRAM controllers, the turnarounds are either simply accounted for in the timing analysis [6] or are mitigated using multi-rank SDRAM modules [7], [11]. The former requires designers to assume an alternating pattern of interfering reads and writes in order to compute guarantees, which leads to poor timing bounds. The latter is not cost efficient, as multi-rank modules are expensive. Furthermore, blindly alternating between the ranks is not a scalable solution [15].
Consequently, in [12] , we proposed a real-time SDRAM controller that bundles read and write commands, i.e., minimizes the number of bus turnaround events, thus efficiently tackling the problem in the single-rank domain. To clarify the benefits of read/write bundling, consider Fig. 1 , which depicts the minimum distance between the first and the last commands of two different command sequences. Sequence 1 contains an alternating pattern of writes and reads which requires three bus turnarounds. Sequence 2 contains a bundled pattern, which only requires one bus turnaround. Notice that the bundled pattern is clearly better, being up to 35 cycles faster than the alternating one (for DDR3-2133N).
This article is an extended version of [12] . Its main contributions in comparison with the original work are:
(1) The evaluation performed in [12] was limited to an analytical comparison, i.e., a comparison of latency bounds, with a state-of-the-art real-time open-row SDRAM controller [6]. In this article, we also consider a real-time close-row SDRAM controller [5]. Moreover, we use cycle-accurate simulators to assess how tightly the corresponding timing analyses predict the worst-case behavior of an application.
(2) In our original work, our computation of the worst-case latency of read and write commands relied on a subjective not-too-late assumption (better discussed in Sections 3.3.1 and 4.2). With regard to it, we improve the original work on three fronts: first, we pinpoint the portion of the analysis that is affected by the assumption. Second, we show how a small modification in the analysis computes a worst-case bound without the aforementioned assumption. Finally, we also discuss a small architecture modification to handle read and write commands that arrive too late.
(3) We perform a comparison of power consumption trends between the real-time SDRAM controllers under consideration in this article. To our knowledge, this work is the first to compare open- and close-row controllers from the power perspective.
In order to avoid confusion and emphasize the contribution of this article, we also mention that we published a technical report [16] that proposes a multi-generation DDR SDRAM controller that implements read/write bundling. In comparison with this article, the technical report does not consider close-row real-time controllers, does not evaluate power consumption and data bus utilisation and omits a discussion about the not-too-late assumption and about how data bus turnarounds are addressed by the related work.
The rest of this article is structured as follows: in Section 2, we present the background on SDRAM systems. Then, in Section 3, we describe our SDRAM controller, followed by a timing analysis of it in Section 4. Finally, in Section 5, we present an evaluation of our approach, followed by the conclusion in Section 6.
BACKGROUND ON DRAM SYSTEMS
In this section, we first describe SDRAM memories and their timing constraints and then we discuss the advantages and drawbacks of open-row real-time SDRAM controllers. For all intents and purposes, we employ the word SDRAM to refer to DDR2 [17] or DDR3 SDRAM devices [18] . DDR4 SDRAMs [19] , which are not yet market dominant, introduced new architectural features. A proper discussion of such features is out of the scope of this article.
Naming Conventions
Double data rate (DDR) SDRAMs are identified by a string that uses the following pattern: DDRx-(speed bin)(grade). The x stands for the generation, e.g., DDR2 or DDR3. The speed bin is measured in MT/s (mega transfers per second), which corresponds to 2 times the frequency of the data bus measured in MHz (because of the double data rate). For instance, a DDR3-800E device is able to perform at most 800 MT/s and its data bus frequency is equal to 400 MHz. Finally, the grade, i.e., the letter appended to the end of the string, is used to distinguish between devices that belong to the same speed bin, but that have different timing constraints. The closer to 'A' the grade is, the smaller the timing constraints of the device are and, hence, the faster the device can execute SDRAM commands. For instance, a DDR3-800D can execute SDRAM commands faster than a DDR3-800E, even though both belong to the 800 MT/s speed bin.
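The naming pattern above can be illustrated with a small sketch. The function name `parse_ddr_name` and the returned field names are hypothetical, and the sketch assumes a single-digit generation and a single-letter grade, as in the examples of this section.

```python
import re


def parse_ddr_name(name):
    """Split a 'DDRx-(speed bin)(grade)' string into its parts.

    Assumption of this sketch: one digit for the generation and one
    upper-case letter for the grade (e.g., 'DDR3-800E').
    """
    match = re.fullmatch(r"DDR(\d)-(\d+)([A-Z])", name)
    if match is None:
        raise ValueError(f"not a DDRx-(speed bin)(grade) string: {name!r}")
    speed_bin_mts = int(match.group(2))
    return {
        "generation": int(match.group(1)),
        "speed_bin_mts": speed_bin_mts,
        # Double data rate: two transfers per data bus clock cycle.
        "bus_mhz": speed_bin_mts / 2,
        "grade": match.group(3),
    }


def is_faster_grade(name_a, name_b):
    """True if both devices share a speed bin and name_a has the grade closer to 'A'."""
    a, b = parse_ddr_name(name_a), parse_ddr_name(name_b)
    return a["speed_bin_mts"] == b["speed_bin_mts"] and a["grade"] < b["grade"]
```

For the device used as an example in the text, `parse_ddr_name("DDR3-800E")` reports a speed bin of 800 MT/s and a data bus frequency of 400 MHz, and `is_faster_grade("DDR3-800D", "DDR3-800E")` holds.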
SDRAM Organization, Commands and Constraints
We depict a SDRAM module and the logical structure of a SDRAM chip in Fig. 2 . A SDRAM module is a printed circuit board that contains one or more ranks. A rank is comprised of a set of SDRAM chips that share a clock, a command bus and a chip-select signal. Each rank is treated by the SDRAM controller as a single SDRAM chip with a larger number of data bus pins. For instance, in the figure, eight 8-bit data bus chips are used to form a 64-bit wide rank. We discuss the logical structure of SDRAM chips. SDRAM chips are divided into banks. We refer to the number of banks in a SDRAM chip as nB. In this paper, we consider DDR2 and DDR3 SDRAMs with nB=8 banks. Each bank contains a matrix-like structure and a row buffer (which is highlighted in the figure). The matrix-like structures are not visible to the memory controller. All data exchanges are instead performed through the corresponding row buffer, which represents the internal level of caching mentioned in the introduction.
There are four commands used to move data into/from a row buffer: activate, precharge, read and write. The activate (A) command loads a matrix row into the corresponding row buffer, which is known in the literature as opening a row. The precharge (P) command writes the contents of a row buffer back into the corresponding matrix, which is known in the literature as closing a row. The read (R) and write (W) commands are used to retrieve or forward words from or into a row buffer. We use the acronym Column Address Strobe (CAS) to refer to both read and write commands.
CAS commands operate in bursts, which means that each of them transfers more than one word. The exact number of words transferred by a CAS command is determined by the burst length (BL) parameter. Both DDR2 and DDR3 support BL=8, which is the configuration employed in this article. A single CAS command occupies the data bus for $t_{BURST} = BL/2 = 4$ cycles and transfers $BL \cdot W_{BUS}$ bits, where $W_{BUS}$ represents the width of the data bus.
We discuss timing constraints. There are several timing constraints that dictate how many cycles apart consecutive commands must be. We enumerate them for one DDR2 and one DDR3 device in Table 1. Notice that there are three different types of constraints: the ones that refer to the minimum distance between commands issued to the same bank (exclusively intra-bank constraints), the ones that refer to the minimum distance between commands issued to different banks (exclusively inter-bank constraints), and the ones that refer to the minimum distance between commands issued to any bank (inter- and intra-bank constraints). The constraints that refer to data bus turnarounds fall into the last category. For the interested reader, we provide a graphical depiction of the constraints in Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TC.2017.2714672.
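As a quick check of the burst arithmetic, the following sketch computes the data bus occupancy and the amount of data moved by one CAS command. The function names are hypothetical; the 64-bit bus width is the one used for the many-core platforms mentioned in the introduction.

```python
def burst_cycles(burst_length):
    """Cycles the data bus is occupied by one CAS (two transfers per cycle)."""
    return burst_length // 2


def bytes_per_cas(burst_length, bus_width_bits):
    """Bytes transferred by one CAS: BL words, each as wide as the data bus."""
    return burst_length * bus_width_bits // 8


# BL=8 on a 64-bit wide rank: 4 bus cycles and one 64-byte cache line per CAS.
cycles = burst_cycles(8)
payload = bytes_per_cas(8, 64)
```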
We discuss data bus turnarounds. The DDR2 and DDR3 standards [17], [18] specify two constraints that refer to data bus turnarounds: $t_{RTW}$ and $t_{WTR}$. The former establishes the minimum distance between a read followed by a write. The latter establishes the minimum distance between the end of the data transfer of a write command and a read. To avoid confusion and have constraints with symmetric meanings and notations, we employ the notation $t_{RtoW}$ to refer to the minimum distance between a read followed by a write, and $t_{WtoR}$ to refer to the minimum distance between a write followed by a read. Given the textual definition, notice that $t_{RtoW}$ is equal to $t_{RTW}$, while $t_{WtoR}$ amounts to $t_{WL} + t_{BURST} + t_{WTR}$.
Finally, we discuss refreshes. SDRAMs must be refreshed every $t_{REFI} = 7.8\,\mu s$ in order to prevent the capacitors that store data from being discharged. This is accomplished with the refresh (R) command. The number of cycles required for a refresh to complete (referred to as $t_{RFC}$) varies according to the SDRAM device, e.g., $t_{RFC} = 36$ cycles for a DDR3-800E.
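The effect of the two turnaround constraints on a command sequence can be sketched as follows. This is a simplified model under stated assumptions: consecutive CAS commands of the same type are assumed to be separated by t_CCD, all other constraints are ignored, and the numeric values are placeholders chosen for illustration (they are not taken from Table 1). The function names are hypothetical.

```python
def t_wtor(t_wl, t_burst, t_wtr):
    """Write-to-read distance: write latency + burst + write-to-read constraint."""
    return t_wl + t_burst + t_wtr


def min_cas_span(sequence, t_ccd, t_rtow, t_wtor_cycles):
    """Minimum distance (in cycles) between the first and last CAS of a sequence.

    `sequence` is a string of 'R' and 'W'. Same-type neighbours are assumed to
    be t_ccd apart; a type change costs the matching turnaround constraint.
    """
    span = 0
    for prev, curr in zip(sequence, sequence[1:]):
        if prev == curr:
            span += t_ccd
        elif prev == "R":
            span += t_rtow
        else:
            span += t_wtor_cycles
    return span


# Placeholder timing values (illustrative only).
T_CCD, T_RTOW = 4, 6
T_WTOR = t_wtor(t_wl=5, t_burst=4, t_wtr=4)

alternating = min_cas_span("WRWR", T_CCD, T_RTOW, T_WTOR)  # three turnarounds
bundled = min_cas_span("WWRR", T_CCD, T_RTOW, T_WTOR)      # one turnaround
```

With any values for which the turnaround constraints exceed t_CCD, the bundled sequence spans fewer cycles than the alternating one, which is the motivation for read/write bundling.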
Open-Row Real-Time SDRAM Controllers
As discussed in the introduction, this paper concentrates on open-row real-time SDRAM controllers. Such controllers only precharge row buffers if a refresh must be executed or if an incoming request needs to access a row currently not present in the corresponding row buffer. Incoming requests are then translated into either a CAS command, in case of a row buffer hit, or into a precharge-activate-CAS command sequence, in case of a row buffer miss. With regard to traditional real-time controllers, which rely on a combination of bank-interleaved address mapping and close-row policy, open-row SDRAM controllers have two main advantages: first, they do not require narrow data buses and/or large request granularities to be effective. And second, they potentially consume less power, as they avoid power-hungry closing and opening of rows.
However, also with regard to traditional real-time controllers, they have two main drawbacks: first, they demand a bank privatization setup to be effective, as it prevents different real-time tasks from destroying the row buffer locality of each other. And second, due to the bank privatization, they potentially suffer from poor memory utilisation, as a task might not need all the storage provided by a bank.
Finally, it is worth highlighting that in case data exchange between real-time tasks is necessary, one or more banks can be designated for such purpose, i.e., they can be shared. Such strategy has been discussed in [11] and is out of the scope of this article. Moreover, assigning tasks to SDRAM banks can be achieved with a software virtual addressing layer, as discussed in [20] .
SDRAM CONTROLLER ARCHITECTURE
In this section, we first provide an architectural overview of our SDRAM controller and then we discuss in detail the blocks responsible for command scheduling. Before we start our discussion, however, we highlight that our controller supports a single request granularity. Supporting different granularities, which would be necessary for instance if a DMA engine competes for the SDRAM with cache-relying processors, is out of the scope of this article (as it constitutes an orthogonal challenge already investigated in [11] ).
Architectural Overview
We depict the architecture of our SDRAM controller in Fig. 3 . Notice that the architecture is comprised of six types of blocks: bank address mapping, bank request queues, bank schedulers, command registers, data buffers and channel scheduler. Incoming requests go first through the bank address mapping block, which decodes their addresses and forwards them to the proper bank request queue. Requests are then removed one at a time from the queues by the corresponding bank scheduler, whose job is to process them. Processing a request means translating it into a set of SDRAM commands, which are then forwarded to the command registers. Each bank scheduler has its own command register and each command register stores a single command (the oldest outstanding command).
Finally, the channel scheduler, which implements the read/write bundling mechanism, arbitrates between different command registers and sends the selected command to the SDRAM module, i.e., executes it. In the rest of this section, we discuss in detail the SDRAM controller blocks that are responsible for SDRAM command scheduling (depicted in gray in the figure).
Bank Schedulers and Command Registers
The function of bank schedulers is to translate a memory request into a set of SDRAM commands that fulfill such request. If the bank scheduler employs the open-row buffer policy, which is the case considered in this article, a request is translated into either a CAS command or into a precharge-activate-CAS sequence, depending on whether it hits or misses at the row buffer. The function of the command registers is to serve as an intermediate level of buffering that decouples the implementation of the channel scheduler from the bank schedulers. There is one command register for each bank scheduler. The channel scheduler removes commands from the registers when the commands are executed (sent to the SDRAM module). This allows the bank scheduler whose register was emptied to insert a new command (after the pertinent constraints no longer pose a violation), and so on.
A bank scheduler must only place a command in its register if such command can be immediately executed by the channel scheduler without violating any exclusively intra-bank timing constraints, i.e., timing constraints that rule the minimum distance between commands issued to the same bank and to the same bank only. For instance, if the channel scheduler executes an activate from a command register, then the corresponding bank scheduler must wait at least $t_{RCD}$ cycles before inserting a write into the aforementioned command register.
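The request translation performed by a bank scheduler under the open-row policy can be sketched as below. The function name and the command letters as list items are illustrative choices; the sketch follows the two cases named in the text and does not model intra-bank timing or the state right after a refresh.

```python
def translate_request(open_row, requested_row, is_write):
    """Translate one request into the SDRAM commands that fulfill it.

    Row buffer hit  -> a single CAS command.
    Row buffer miss -> precharge (P), activate (A), then the CAS command.
    """
    cas = "W" if is_write else "R"
    if open_row == requested_row:
        return [cas]
    return ["P", "A", cas]


# A read that hits the open row, then a write that targets another row.
hit_commands = translate_request(open_row=3, requested_row=3, is_write=False)
miss_commands = translate_request(open_row=3, requested_row=5, is_write=True)
```

In the controller described here, these commands would be placed into the command register one at a time, each only once the exclusively intra-bank constraints allow its immediate execution.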
Channel Scheduler
The channel scheduler has two functions: first, to regularly refresh the SDRAM module and, second, to arbitrate between and execute commands from the command registers. Because the refresh logic is trivial, in this section, we focus on the (non-refresh) command scheduling. As we already discussed, commands placed in the command registers can be immediately executed without violating any exclusively intra-bank constraints. Hence, the channel scheduler only needs to prevent violations of the exclusively inter-bank and the intra- and inter-bank timing constraints.
We depict a block diagram of the channel scheduler in Fig. 4 . Notice that commands are arbitrated in two layers. First, they are arbitrated inside their own type arbiters, i.e., CAS commands are routed to the CAS Arbiter and activate and precharge commands go to the Activate and Precharge Arbiters, respectively. Then, in the second layer of arbitration, i.e., the Command Bus Arbiter, a command that won the arbitration in its type arbiter competes with interfering commands from other types. We now discuss each of the aforementioned arbiters individually.
CAS Arbiter
The CAS Arbiter implements the bundling of read and write commands. For that purpose, it relies on the concept of scheduling rounds, which can last several cycles. Moreover, it schedules commands according to three rules: 1) in each round, at most one CAS command from each of the command registers is executed. 2) In the beginning of each round, the CAS Arbiter selects (if existing) CAS commands that match the type of the last CAS command from the previous round. For instance, in Fig. 5, the CAS Arbiter starts round $i+1$ serving write commands, because round $i$ finished with a write command. And 3), in each round, at most one sweep of read commands and one sweep of write commands is performed. Hence, if a CAS command which is not blocked by the first rule arrives too late, e.g., a read command arrives after the end of the sweep of read commands, such command is postponed until the next round.
We make two important observations about the rules. First, they enforce that at most one turnaround happens in each scheduling round. And second, the not-too-late assumption mentioned in the introduction refers to assuming (in the timing analysis) that any CAS command that is not blocked by the first rule will also not be postponed until the next round due to the third rule.
To implement bundling, the CAS Arbiter requires a stateful architecture. The state is comprised of a vector of served flags, which keeps track of which command registers have already been served in the current round, and a bundling-type register, which defines which type (read or write) of CAS commands currently has priority. To clarify how the state is used to perform scheduling, we depict a diagram showing the operation of the CAS Arbiter in Fig. 6. In the figure, it is important to notice that the demultiplexing layer from the channel scheduler (see Fig. 4) enforces that only read or write commands arrive at the input of the CAS Arbiter. Therefore, the activate and precharge commands contained in the registers of banks 0, 1 and 7 do not arrive at the input of the CAS Arbiter (their boxes are empty).
We discuss each step performed by the CAS Arbiter. First, the CAS Arbiter masks out the command registers that have already been served in the round. This is performed using the served flags vector. Second, the arbiter performs CAS masking, which masks out pending CAS commands that do not match the bundling-type register. In the figure, the bundling-type is write; consequently, the read command from bank 5 is masked out. Third, a round-robin arbiter selects the next CAS command to be executed. Finally, in the last step (called the timing constraints checker), the selected CAS command is simply held until it causes no timing violations. For CAS commands, there are four constraints that need to be accounted for: $t_{CCD}$, $t_{BURST}$, $t_{RtoW}$ and $t_{WtoR}$. So, for instance, if the last command executed in the round was a read but the next command that the CAS Arbiter wants to execute is a write, the timing constraints checker holds the pending write for $t_{RtoW}$ cycles.
We describe how the CAS Arbiter state is updated. We first discuss the served flags vector. Every time a CAS is selected and executed by the Command Bus Arbiter, the corresponding bit in the served flags vector is set. Furthermore, the vector is cleared every time a new round starts. A new round starts either when all bits of the vector are set, or when all unset bits of the vector belong to command registers that do not have a pending CAS command.
Fig. 4. SDRAM channel scheduler. The refresh logic is omitted for the sake of simplicity.
Fig. 5. Example of read/write bundling in a system with nB=4 banks. In the figure, we consider that two command registers provide a continuous stream of writes and two other command registers provide a continuous stream of reads. Notice that at most one data bus turnaround is required in each round. Notice also that the turnaround causes an idle bubble in the data bus.
Fig. 6. CAS Arbiter operation for a system with nB=8. For the sake of simplicity, the logic that updates the CAS Arbiter state (served flags vector and bundling-type register) has been omitted.
We discuss the bundling-type register. The bundling-type register defines which type of CAS operation has priority: reads or writes. Its value is flipped (from read to write, or from write to read) if four conditions are simultaneously satisfied. First, there is at least one command register that has a pending CAS command whose type does not match the value of the bundling-type register. Second, the served flag of the command from the first condition is unset. Third, the output of the CAS Masking step is null. And finally, no flipping of the bundling-type register has taken place in the current scheduling round. These four conditions enforce that, in each round, at most one data bus turnaround is required, as shown in Fig. 5. 
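The masking steps and state updates described above can be sketched as a small behavioral model. This is a simplified sketch and not the authors' implementation: the class and method names are hypothetical, the timing constraints checker is omitted, and the handling of a CAS command that arrives too late (a new round is started before the bundling-type is flipped again) is an assumption of this sketch.

```python
class CasArbiter:
    """Behavioral sketch of the CAS Arbiter (timing constraints checker omitted)."""

    def __init__(self, n_banks, initial_type="R"):
        self.n_banks = n_banks
        self.served = [False] * n_banks    # served flags vector
        self.bundling_type = initial_type  # bundling-type register: 'R' or 'W'
        self.flipped = False               # has the type flipped in this round?
        self.next_rr = 0                   # round-robin pointer

    def _start_round(self):
        self.served = [False] * self.n_banks
        self.flipped = False

    def _unserved(self, pending):
        # Step 1: mask out registers already served in this round.
        return [i for i in range(self.n_banks)
                if pending[i] is not None and not self.served[i]]

    def _matching(self, candidates, pending):
        # Step 2: CAS masking against the bundling-type register.
        return [i for i in candidates if pending[i] == self.bundling_type]

    def select(self, pending):
        """pending[i] is 'R', 'W' or None (no pending CAS). Returns a bank index or None."""
        candidates = self._unserved(pending)
        if not candidates:
            # All unserved registers lack a pending CAS: a new round starts.
            self._start_round()
            candidates = self._unserved(pending)
            if not candidates:
                return None
        matching = self._matching(candidates, pending)
        if not matching and self.flipped:
            # Assumption of this sketch: a too-late CAS waits for a new round.
            self._start_round()
            candidates = self._unserved(pending)
            matching = self._matching(candidates, pending)
        if not matching:
            # Flip conditions hold: unserved CAS of the other type, empty
            # masking output, and no flip so far in this round.
            self.bundling_type = "W" if self.bundling_type == "R" else "R"
            self.flipped = True
            matching = self._matching(candidates, pending)
        # Step 3: round-robin selection among the remaining commands.
        winner = min(matching, key=lambda i: (i - self.next_rr) % self.n_banks)
        self.served[winner] = True
        self.next_rr = (winner + 1) % self.n_banks
        return winner
```

Feeding the model with the scenario of Fig. 5 (nB=4, two registers with a continuous stream of writes and two with a continuous stream of reads, starting with the bundling-type set to write) serves the commands in the order W W R R, R R W W, W W R R over three rounds, i.e., with one turnaround per round.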
Activate Arbiter
There are two timing constraints that dictate how far apart activate commands to different banks must be from each other: $t_{RRD}$ and $t_{FAW}$. The $t_{RRD}$ constraint dictates the minimum distance between consecutive activate commands to different banks. The $t_{FAW}$ constraint is a bit more complex. It establishes a time window in which at most 4 activate commands can be executed. As long as $t_{RRD}$ and $t_{FAW}$ are not violated, the Activate Arbiter simply forwards the oldest pending activate command to the Command Bus Arbiter.
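The check performed by the Activate Arbiter can be sketched as a function that returns the earliest cycle at which the next activate may be executed. The function name and the way the activate history is stored are hypothetical; the timing values used in the example are placeholders.

```python
def earliest_activate(now, past_activates, t_rrd, t_faw):
    """Earliest cycle at which a new activate violates neither t_RRD nor t_FAW.

    `past_activates` holds the cycles of previously executed activates in
    ascending order.
    """
    earliest = now
    if past_activates:
        # t_RRD: minimum distance to the most recent activate.
        earliest = max(earliest, past_activates[-1] + t_rrd)
    if len(past_activates) >= 4:
        # t_FAW: at most four activates inside any window of t_FAW cycles,
        # so a fifth one must wait until the window of the fourth-most-recent
        # activate has elapsed.
        earliest = max(earliest, past_activates[-4] + t_faw)
    return earliest


# Placeholder values: four activates issued back to back at the t_RRD distance.
next_slot = earliest_activate(now=13, past_activates=[0, 4, 8, 12], t_rrd=4, t_faw=20)
```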
Precharge Arbiter
There are no inter-bank timing constraints between precharge commands to different banks. For instance, a precharge command to bank 0 can be executed one cycle after a precharge command to bank 1. Hence, the Precharge Arbiter simply forwards the oldest pending precharge to the Command Bus Arbiter.
Command Bus Arbiter
The command bus can only carry one command per cycle and, hence, needs to be arbitrated. Given that the exclusively intra-bank timing constraints are handled by the bank scheduler and that inter-bank timing constraints are handled by the type arbiters, this stage of arbitration only needs to select between the output of the type arbiters. For that purpose, it prioritizes the output of the CAS Arbiter. If no pending CAS is available, the oldest non-CAS (precharge or activate) is given priority. (Here we highlight that in [12] , activates had priority over precharges, which brought no benefit).
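The priority scheme of the Command Bus Arbiter can be sketched as follows. The representation of a command as a `(name, insertion_cycle)` tuple is an assumption made for illustration, as is the tie-break in favor of the activate when both non-CAS commands are equally old.

```python
def command_bus_arbiter(cas, activate, precharge):
    """Pick the command to drive onto the command bus in this cycle.

    Each argument is None or a (name, insertion_cycle) tuple holding the
    winner of the corresponding type arbiter.
    """
    if cas is not None:
        # A pending CAS always has priority.
        return cas
    non_cas = [cmd for cmd in (activate, precharge) if cmd is not None]
    if not non_cas:
        return None
    # Otherwise the oldest non-CAS command (smallest insertion cycle) wins.
    return min(non_cas, key=lambda cmd: cmd[1])


chosen = command_bus_arbiter(cas=None, activate=("A", 12), precharge=("P", 9))
```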
TIMING ANALYSIS
In this section, we describe how to calculate the worst-case cumulative SDRAM latency of a task ($L^{SDRAM}_{Task}$) using our SDRAM controller, i.e., the maximum amount of time that a task spends idle while waiting for its SDRAM requests to be served. A task, for the sake of this article, is a processor executing a computer program.
We structure our analysis into four parts: first, in Section 4.1, we describe our processor, bank scheduler and bank address mapping assumptions. Second, in Section 4.2, we compute the worst-case latencies of individual commands. Then, in Section 4.3, we combine the worst-case latencies of individual commands with the delays introduced by the bank scheduler in order to calculate the worst-case latencies of SDRAM requests. Finally, in Section 4.4, we compute $L^{SDRAM}_{Task}$, i.e., the sum of the worst-case latencies of all SDRAM requests.
Assumptions
Our timing analysis demands no knowledge about the behavior of interfering tasks on the system. Moreover, it relies on the following assumptions:
1) The processor running the task under analysis (u.a.) relies on caches and only accesses the SDRAM to retrieve or forward cache lines.
2) The processor is fully timing compositional [21], which means that it uses in-order execution and stalls at every read request.
3) The write-buffer between the cache and SDRAM is disabled and, hence, the processor also stalls at write requests.¹ In the ARMv8-A architecture [22], for instance, this is achieved by disabling the early write acknowledgment feature.
4) No multi-threading/context switches occur due to task scheduling. This enforces that no cache related effects change the number of cache misses experienced by the task u.a.
5) The task u.a. has exclusive access to one of the banks (bank privatization) and the corresponding bank scheduler employs the open-row policy.
6) Our lemmas and equations do assume the not-too-late behavior mentioned in Section 3.3.1. However, after we present them, we also show how a simple trick can be used to compute a bound that does not rely on such behavior.
Worst-Case Latency of a Command
In this section, we calculate the worst-case latency between the insertion of a command into a register and its execution by the channel scheduler. For that purpose, we assume that each command suffers maximum interference within its type arbiter. For instance, when calculating the worst-case latency of an activate, we assume all interfering command registers also have pending activate commands.
We make the following observations about our discussion: first, to avoid confusion, we refer to the command register that holds the command u.a. as cr. Second, we refer to each of the nB-1 interfering command registers as $icr_i$, where $i$ is an index. Third, to save space, the figures in this section are depicted with nB=4 banks, even though DDR3 devices have nB=8 banks. Fourth, because different commands are subject to different timing constraints and have different priorities inside the channel scheduler, we calculate the worst-case latency individually for read, activate and precharge. The case for a write command is symmetric to the one for a read command and, hence, is only discussed in Appendix B, available online.
Finally, for CAS commands, i.e., read or write commands, we further distinguish between two types of worst-case latency: 1) the one experienced if the CAS u.a. succeeds, i.e., follows, a CAS command in cr (SC). And 2) the one experienced if the CAS u.a. succeeds a non-CAS command in cr (SNC). As will become clear, this distinction allows us to properly analyze the round-oriented operation of the CAS Arbiter. We summarize the notations used for worst-case latencies of commands in Table 2.
¹ Notice that the presence of a write-buffer allows a processor to keep executing while a write request is being processed, potentially hiding the latency of a write request. Consequently, making a no write-buffer assumption is conservative.
We now calculate the worst-case latency of a read. For that purpose, we first describe an expression that computes the latency of a read as a function of $t_{DELAY}$, i.e., the distance between the insertion of the read u.a. into cr and the execution of the previous CAS command that occupied cr. We refer to such expression as $L^{R}(t_{DELAY})$. $L^{R}_{SC}$ and $L^{R}_{SNC}$ are then computed using it with different parameters.
We depict a scenario that induces the worst-case latency of a read in Fig. 7 . The intuition is that the read u.a. is potentially blocked twice by each interfering command register (depending on how small t DELAY is). Moreover, we assume that data bus turnarounds happen as often as possible, i.e., one at every scheduling round. With that in mind, we state Lemma 1.
Lemma 1. The worst-case latency of a read (that is inserted into cr $t_{DELAY}$ cycles after the previous CAS that occupied cr is executed) is calculated with
$L^{R}(t_{DELAY}) = \max\{t_{\text{prev-round}} - t_{DELAY},\ 0\} + t_{\text{curr-round}}$ (1)
where:
Proof. It is trivial to observe that, in order to maximize L R ðt DELAY Þ, the previous CAS that occupied cr should be the first to be served in its scheduling round and that the read u.a. should be the last to be served in its scheduling round. Moreover, to enforce one turnaround in each of the scheduling rounds, the previous CAS in cr must also be a read. Consequently, the rest of this proof consists in addressing the correctness of Eq. (1).
The worst-case latency of a read command is given by the sum of the blocking experienced by it in two distinct and consecutive rounds of the CAS Arbiter (see Fig. 7): first, in round i − 1, i.e., the round in which the previous read that occupied cr is executed, and then in round i, i.e., the round in which the read u.a. is executed.
We first discuss the blocking in round i − 1. In round i − 1, the blocking experienced by the read u.a. amounts to max{t prev-round − t DELAY, 0} cycles. The t prev-round portion is computed according to Eq. (2) and represents an upper bound on the time required for round i − 1 to finish because: 1) all icrs provide an interfering CAS, and 2) one data bus turnaround is required. t DELAY is then subtracted from t prev-round, which follows from our assumption that the processor stalls at every request.
We now discuss the blocking in round i. In round i, i.e., the round in which the read u.a. is executed, the blocking suffered by the read u.a. amounts to t curr-round, which is calculated according to Eq. (3). Again, the equation computes an upper bound on the blocking experienced by the read u.a. in round i because: 1) all icrs provide an interfering CAS command, and 2) a turnaround is required. This concludes, by construction, the proof of correctness of Eq. (1).
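The two-round structure argued in the proof can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `read_latency` is ours, and the two round lengths are passed in as plain numbers because their definitions (Eqs. (2) and (3)) depend on the timing parameters of the module; the values in the usage lines are hypothetical.

```python
def read_latency(t_delay: int, t_prev_round: int, t_curr_round: int) -> int:
    """Blocking of the read u.a. summed over rounds i-1 and i, in cycles."""
    # Round i-1: only the part of the previous round that is still pending
    # when the read is inserted can block it; the term is never negative.
    blocking_prev = max(t_prev_round - t_delay, 0)
    # Round i: the read may be the last command served in its own round.
    blocking_curr = t_curr_round
    return blocking_prev + blocking_curr


# Hypothetical round lengths of 40 cycles each.
for t_delay in (1, 20, 60):
    print(t_delay, read_latency(t_delay, t_prev_round=40, t_curr_round=40))
```

With a small t DELAY the read is blocked in both rounds, whereas once t DELAY reaches t prev-round only the blocking of the current round remains.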
From the definition of L R (t DELAY), we can derive Theorem 1.
Theorem 1. The worst-case latency of a read is given by Eq. (4) if it succeeds a CAS, and by Eq. (5) if it succeeds a non-CAS
Proof. Both Eqs. (4) and (5) use the function from Lemma 1 with different values of t DELAY. For the calculation of L R SC, i.e., the worst-case latency of a read that succeeds a CAS in cr, we know t DELAY is at least t RL + t BURST cycles (as depicted in Fig. 8a), because of our assumption of a processor that tolerates at most one outstanding request. However, for the calculation of L R SNC, we know there is an activate and a precharge before consecutive CAS commands. Hence, we add the corresponding latencies, as depicted in Fig. 8b. □
We now discuss the not-too-late assumption. For that purpose, consider the example from Fig. 7 and the equation from Lemma 1. Notice that if t DELAY is sufficiently large, the equation only accounts for the blocking experienced in a single scheduling round (given by t curr-round). However, this only remains accurate if the third operational rule of the CAS Arbiter (see Section 3.3.1) is never invoked. In theory, it is possible to carefully handcraft a scenario in which the third rule must be used by the CAS Arbiter. This is demonstrated in Fig. 9. In practice, cycle-accurate simulations have shown such a scenario to be quite unlikely. Hence, it can be safely ignored, given that the end result of our analysis is computed over all requests from a task, the vast majority of which will not experience the too-late behavior of a read command. (Alternatively, one could employ FCFS instead of round-robin in the CAS Arbiter and modify its second rule of operation so that if a CAS arrives too late, then its type determines how the new scheduling round starts, e.g., in Fig. 9, round i would start by executing the read u.a. This would have minimal impact on Lemma 1.)
Assuming no modifications in the CAS Arbiter, a bound independent of the not-too-late assumption can be calculated by enforcing that the L R (t DELAY) function is only used with t DELAY = 1. The number one is used instead of zero because, in order to construct the scenario, we must assume that the read u.a. is inserted into cr one cycle after the decision to end the sweep of read commands is made.
We now discuss the worst-case latency of activates and precharges. For that purpose, we first state Lemma 2, which captures the effect of the lower priority of activates and precharges with regard to CAS commands.
Lemma 2. Given a sequence of n precharge and/or activate commands that can be immediately executed without violating timing constraints and that can only postpone each other for one cycle due to command bus contention, the maximum timing interval (in cycles) required to execute such sequence is given by
Proof. Activate and precharge commands have lower priority than CAS commands. However, any two consecutive CAS commands must be executed at least t BURST cycles apart (or by even more cycles if a data bus turnaround is required). Hence, in any interval of t BURST cycles, at least t BURST − 1 cycles will be free for the execution of activates and precharges. Eq. (6) comes directly from such observation (for the interested reader, a graphical depiction is available in Appendix D, available online). □
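The observation used in the proof can be checked with a small cycle-by-cycle simulation. The sketch below does not reproduce Eq. (6); it only counts cycles under two assumptions of ours: a CAS occupies the command bus in exactly one cycle out of every t BURST cycles, and the first cycle of the interval is already taken by a CAS. The function name `cycles_for_low_priority` is hypothetical.

```python
def cycles_for_low_priority(n: int, t_burst: int) -> int:
    """Cycles needed to issue n activate/precharge commands when a CAS
    takes the command bus in one cycle out of every t_burst cycles."""
    if t_burst < 2:
        raise ValueError("t_burst must be at least 2 so that free cycles exist")
    issued = 0
    cycle = 0
    while issued < n:
        # Assumed alignment: CAS commands land on cycles 0, t_burst,
        # 2 * t_burst, ...; every other cycle is free for a
        # lower-priority activate or precharge.
        if cycle % t_burst != 0:
            issued += 1
        cycle += 1
    return cycle


# With t_burst = 4, three commands fit in one window of four cycles,
# and the fourth command has to wait for the next window.
print(cycles_for_low_priority(3, 4), cycles_for_low_priority(4, 4))
```

In every window of t BURST cycles, t BURST − 1 commands are issued, which matches the counting argument of the proof.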
We now compute the worst-case latency of an activate command.
Theorem 2. The worst-case latency of an activate command is calculated using
where:
Proof. In order to assist our proof, we depict an example of the worst-case latency of an activate in Fig. 10. Notice that the figure has three main features: (1) four activates are executed as late as possible before the insertion of the activate u.a.; (2) when the activate u.a. is inserted into cr, each of the nB − 1 interfering banks has an older pending activate; and (3) after the activate command(s) from interfering bank(s) are executed, such bank(s) provide higher-priority CAS commands. We now discuss why such features lead to the worst case. The first feature forces us to account for a residual latency (a consequence of t FAW). The second feature comes from the observation that a CAS or precharge command in an interfering bank can only postpone the activate u.a. by one cycle (due to command bus contention), while an activate in an interfering bank can postpone the activate u.a. by at least t RRD cycles. The last feature enforces that activates are blocked as often as possible by higher-priority CAS commands (by as often as possible, we mean as long as the second feature is not compromised).
That being said, we now prove the correctness of the equation that computes L A, which has two main terms. The leftmost term accounts for the residual latency mentioned in the first feature. The rightmost term (max operator) accounts for the remaining latencies by selecting the largest value between two expressions. The first one (exp1) considers that t FAW is hidden by the occurrences of t RRD and the blocking due to higher-priority CAS commands (which is not the case in Fig. 10). The second one (exp2) considers that t FAW is not hidden and basically just replaces (4 · t RRD + 3 · D A) by t FAW in exp1 for each of the K times in which the t FAW constraint is activated. □

Fig. 9. An example of a read arriving too late for scheduling.

Fig. 10. Worst-case latency of an activate command in a hypothetical system in which nB = 5. The letter C refers to a CAS command.
Finally, we compute the worst-case latency of precharge commands with Theorem 3.
Theorem 3. The worst-case latency of a precharge command is calculated using L P = a PA (nB).
Proof. Precharge commands can be executed back-to-back (one per cycle). In the worst case, the precharge u.a. is blocked once by older non-CAS commands in interfering banks. Moreover, interfering non-CAS commands can be blocked by higher-priority CAS commands once every t BURST cycles. Consequently, we compute L P by invoking a PA with nB as argument. Notice that we employ nB (instead of the nB − 1 interfering non-CAS commands) as the argument to the a PA (n) function, as we also account for the one cycle required to execute the precharge u.a. □
Worst-Case Latency of a Request
We define the worst-case latency of a request as the time between the request arriving at the SDRAM controller and the corresponding data transfer being completed. This latency is influenced by three factors:
1) The latencies imposed by the bank scheduler, which enforce that a command is only inserted into the corresponding command register if it can be immediately executed without violating any intra-bank timing constraints.
2) The commands required to fulfill the request and their corresponding worst-case latencies, which were calculated in the previous section.
3) The cycles required to perform the data transfer.
Because of the open-row policy, the commands required to fulfill a request depend on whether it hits or misses at the row buffer and whether it is a read or a write. Hence, we classify requests into four different types, as described in Table 3 . For instance, a read request that does not target a currently opened row is referred to as a Read Miss (RM), it requires a P-A-R command sequence to be fulfilled and its worst-case latency is given by L RM Req . In this section, we compute the worst-case latency for read misses and read hits (the case for write misses and write hits is similar and, hence, is only provided in Appendix C, available online).
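The classification described above can be illustrated with a short Python sketch. The read cases follow the text (a read miss needs a P-A-R sequence, a read hit only a read command); the write cases and the abbreviations WH and WM are filled in by analogy and are assumptions on our side, as are all identifiers used in the code.

```python
from enum import Enum
from typing import Optional, Tuple


class RequestType(Enum):
    READ_HIT = "RH"
    READ_MISS = "RM"
    WRITE_HIT = "WH"    # abbreviation assumed by analogy with RH
    WRITE_MISS = "WM"   # abbreviation assumed by analogy with RM


# Command sequence needed to fulfil each request type
# (P = precharge, A = activate, R = read, W = write).
COMMAND_SEQUENCE = {
    RequestType.READ_HIT: ("R",),
    RequestType.READ_MISS: ("P", "A", "R"),
    RequestType.WRITE_HIT: ("W",),
    RequestType.WRITE_MISS: ("P", "A", "W"),
}


def classify(is_read: bool, target_row: int, open_row: Optional[int]) -> RequestType:
    """Classify a request against the row currently held in the row buffer."""
    # Simplification: a bank with no open row (None) is treated as a miss.
    hit = open_row is not None and target_row == open_row
    if is_read:
        return RequestType.READ_HIT if hit else RequestType.READ_MISS
    return RequestType.WRITE_HIT if hit else RequestType.WRITE_MISS


def commands_for(is_read: bool, target_row: int, open_row: Optional[int]) -> Tuple[str, ...]:
    """Return the command sequence required by the request."""
    return COMMAND_SEQUENCE[classify(is_read, target_row, open_row)]


# A read to row 7 while row 3 is open is a read miss.
print(commands_for(is_read=True, target_row=7, open_row=3))
```

Because the open row changes with every miss, the type of a request depends on the request that preceded it in the same bank, which is why the analysis works on a per-type basis.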
We discuss L RM Req . For ease of comprehension, we depict the factors that contribute to L RM Req in Fig. 11a . Notice that, in the figure, there is a number below every latency that contributes to L RM Req . These numbers correlate each latency with one of the factors described in the beginning of this section. For instance, L P , L A and L R SNC are command latencies (factor 2) and, hence, have a 2 below them. Also, observe that, in the worst case, the request u.a. arrives exactly after the previous request was served. This is a direct result of our timing compositional processor assumption. Finally, notice that we use a pattern of white and gray to depict the previous request. Such color scheme is employed to represent that the previous request can be either a read or a write request (and, hence, the letter C, which stands for CAS, is used inside the command box).
We first describe t Residual . To fulfill a RM request, the bank scheduler first needs to precharge the row buffer. However, in order to enforce that no t WR or t RAS violations occur (see Table 1 ), the bank scheduler needs to delay the insertion of the required precharge into the command register by t Residual cycles. To calculate t Residual , we consider both the case in which the previous request was a read and the case in which the previous request was a write, as displayed in Eqs. (13), (14), and (15)
If the previous request was a read, then t Residual is a consequence of the t RAS constraint and is given by Eq. (14) , which comes directly from the semantics of the used constraints. If the previous request was a write, then t Residual is given by Eq. (15), which simply corresponds to the write recovery time.
We highlight that, in order to compute the latency of the request u.a. independently from the previous request, we conservatively employ the max function in Eq. (13) to select the larger of the two cases. In the next section, when we combine the latencies of all requests to extract guarantees for a task, we describe a correction term that compensates for this overly conservative assumption.
We now discuss the computation of L RM Req. As we already discussed, there are three factors contributing to L RM Req. In order to compute L RM Req, we simply add all three factors. This is formalized with Lemma 3.
Lemma 3. The worst-case latency of an RM request is given by Eq. (16). In the equation, the leftmost, the middle and the rightmost portions compute the influence of factors 1, 2 and 3, respectively
Proof. Eq. (16) is trivial, as it simply sums the command latencies, the bank scheduler delays and the time required to perform a data transfer. □
We now compute L RH Req. An RH request only requires a read command and, hence, its worst-case latency is simpler, as depicted in Fig. 11b. As soon as the request arrives, the bank scheduler inserts the read u.a. into the command register. The read u.a. is executed after at most L R SC cycles and the data transfer is completed after t RL + t BURST cycles. Notice that, because no precharge is required, the t Residual latency does not need to be taken into account. Moreover, a possible data bus turnaround is already accounted for inside L R SC. These observations trivially yield Lemma 4.
Lemma 4. The worst-case latency of an RH request is given by Eq. (17). In the equation, the leftmost and the rightmost portions compute the influence of factors 2 and 3, respectively

$$L^{RH}_{Req} = L_R^{SC} + (t_{RL} + t_{BURST}) \qquad (17)$$
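The composition stated in Lemma 4 follows the description of the read hit: at most L R SC cycles until the read is executed, plus t RL + t BURST cycles until the data transfer is completed. A minimal Python sketch is given below; the function name and the numeric values are hypothetical and do not correspond to any of the evaluated modules.

```python
def read_hit_latency(l_r_sc: int, t_rl: int, t_burst: int) -> int:
    """Worst-case latency of a read hit, in cycles."""
    command_latency = l_r_sc        # factor 2: worst-case latency of the read command
    data_transfer = t_rl + t_burst  # factor 3: read latency plus burst duration
    # Factor 1 does not appear: no precharge is needed, so t_Residual is absent.
    return command_latency + data_transfer


# Hypothetical values: L_R^SC = 50, t_RL = 10, t_BURST = 4 cycles.
print(read_hit_latency(50, 10, 4))
```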
Worst-Case Cumulative SDRAM Latency of a Task
We define the worst-case cumulative SDRAM latency of a task (L SDRAM Task) as the maximum amount of time that a task spends idle waiting for its SDRAM requests to be served. For that purpose, we assume that we know the pattern of SDRAM requests performed by a task that leads to its worst-case latency. By pattern, we mean the information enumerated in Table 4. Because computing such pattern is out of the scope of this article (we refer the interested reader to [23]), we extract the pattern from execution traces. We highlight that the same assumption has been made in [11], [12], which also employed a trace-based approach.
In order to compute L SDRAM Task, we first multiply the numbers from the first portion of Table 4 by the corresponding request latencies. Then, we correct the overly conservative result (which is a consequence of Eq. (13)) with a correction term.
where:

Notice that, for the sake of simplicity, our theorem purposely disregards the effect that refreshes have on L SDRAM Task. This is because, as discussed in [24], the effect of refreshes is negligible in comparison with other command delays, provided that the execution time of the task u.a. is not too short. For tasks that fall into the short scenario, a software approach for predictable refreshes is available in [25].
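The two steps used to obtain the cumulative latency (multiplying request counts by per-type latencies, then subtracting a correction) can be sketched as follows. This is an illustration only: the correction term is passed in as a plain number because its formula is not restated here, the request-type keys are labels of our choosing, and all numeric values are hypothetical.

```python
from typing import Mapping


def cumulative_latency(counts: Mapping[str, int],
                       latencies: Mapping[str, int],
                       correction: int = 0) -> int:
    """Sum of (number of requests x worst-case latency) per request type,
    reduced by a correction term supplied by the caller."""
    missing = set(counts) - set(latencies)
    if missing:
        raise KeyError(f"no latency given for request types: {sorted(missing)}")
    total = sum(counts[kind] * latencies[kind] for kind in counts)
    return total - correction


# Hypothetical pattern: 3 read hits and 2 read misses.
print(cumulative_latency({"RH": 3, "RM": 2}, {"RH": 10, "RM": 50}, correction=5))
```

Refreshes are left out of the sketch, in line with the simplification discussed above.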
EVALUATION
In this section, we compare our approach with two different real-time SDRAM controllers: the controller from Wu et al. [6] and the Analyzable Memory Controller (AMC) [5]. The SDRAM controller from Wu et al. also uses the open-row policy and its analysis assumes that the task under consideration has exclusive access to one of the banks, i.e., it considers bank privatization. Its main difference in comparison with ours is that older CAS commands always have priority over newer ones, regardless of whether they force a bus turnaround or not.

The second portion of the table is derived from the first portion and is only employed to calculate the correction term for the overly conservative computation of t Residual (see Eq. (13) in Section 4.3).
The AMC employs close-row policy and originally relied on an interleaved address mapping. However, with a 64-bit data bus (the scenario considered in this evaluation), an interleaved address mapping is not useful when SDRAM requests have the size of a cache line. Hence, in order to perform a comparison with our approach, we adopt the strategy employed in [6]: we implement AMC with a bank privatization setup in which each incoming request is ultimately translated into a static command group containing an activate-(CAS with Auto-Precharge) sequence. The oldest pending command group, regardless of whether it forces a bus turnaround or not, is given priority.
We now discuss the evaluation setup. Our evaluation is based on SDRAM request traces obtained using the Gem5 platform [26]. The traces are then employed to calculate analytical timing bounds, i.e., worst-case cumulative latencies (L SDRAM Task), which rely on the first five assumptions from Section 4.1.² Moreover, the traces are also used as stimuli for cycle-accurate simulators of our controller and the other two controllers under consideration in this section. From the simulators, we obtain the following information: 1) the observed cumulative worst-case latency of each application, which we compare with the corresponding analytical bound, 2) the data bus utilisation that each controller is able to maintain, which allows us to compare scheduling efficiency, and 3) SDRAM command traces, which serve as input for a power estimation tool (DRAMPower [27]).
Application Traces
In our evaluation, we employ a total of eight applications from Mibench [28] , a number that matches the quantity of SDRAM banks present in the modules under consideration. In order to collect the traces, the applications are executed in Gem5 in isolation with a 1 GHz timing compositional ARM processor (see Section 4.1) that relies on a 64-kb L1-cache with a line size of 64 bytes. The ratio of request types (see Table 4 ) exhibited by each application is depicted in Fig. 12 . Notice that the profile of the applications varies greatly in terms of row buffer hit ratio and number of write requests.
We make one important remark about the collected traces. Because the number of requests of each application varies drastically, we summarize the traces. More specifically, we generate artificial traces, each containing 5,000 requests, but respecting the proportions depicted in Fig. 12. Moreover, when executing the traces in simulation, we eliminate time intervals between successive requests, i.e., for each application, we inject a new request as soon as the previous one is served.³ We perform these steps because equalizing the number of requests and eliminating inter-request time maximizes the interference that the applications can exert on each other during simulation.
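One possible way of building such a summarized trace is sketched below in Python. The text only fixes the trace length and the proportions; the rounding scheme (largest remainder) and the seeded random ordering of the requests are assumptions made purely for illustration, as are the function name `make_trace` and the ratios in the usage lines.

```python
import random
from typing import Dict, List


def make_trace(ratios: Dict[str, float], length: int = 5000, seed: int = 0) -> List[str]:
    """Build a list of request types whose proportions follow `ratios`."""
    total = sum(ratios.values())
    if total <= 0:
        raise ValueError("ratios must sum to a positive value")
    # Ideal (fractional) number of requests of each type.
    ideal = {kind: length * share / total for kind, share in ratios.items()}
    counts = {kind: int(value) for kind, value in ideal.items()}
    # Hand out the requests lost to rounding, largest remainder first.
    leftover = length - sum(counts.values())
    by_remainder = sorted(ideal, key=lambda kind: ideal[kind] - counts[kind], reverse=True)
    for kind in by_remainder[:leftover]:
        counts[kind] += 1
    trace = [kind for kind, count in counts.items() for _ in range(count)]
    # Assumed ordering: a reproducible shuffle of the request types.
    random.Random(seed).shuffle(trace)
    return trace


# Hypothetical ratios for one application.
trace = make_trace({"RH": 0.5, "RM": 0.25, "WH": 0.125, "WM": 0.125})
print(len(trace), trace.count("RH"), trace.count("WM"))
```

Note that the ordering matters in practice, since whether a request hits or misses depends on the preceding request; a faithful generator would have to emit row addresses consistent with the intended hit ratio.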
SDRAM Modules
For our evaluation, we consider four SDRAM modules manufactured by Micron [29], which are enumerated in Table 5. All modules have 64-bit wide data buses and were selected according to their commercial availability at the time of writing. The data sheets for the selected modules, which contain the electrical parameters used by the DRAMPower tool, can be retrieved using the corresponding Part Numbers.
Comparison of Worst-Case Performance
We now compare the analytical bounds and the latencies observed in simulation for every combination of SDRAM module and SDRAM controller investigated in this article. For our controller, we compute two different analytical bounds: one that relies on the not-too-late assumption and one that does not (see Sections 3.3.1 and 4.2). For the other controllers, we employ the timing analyses presented in the corresponding papers. As for the observed latencies, we perform the following procedure: for each combination of SDRAM controller and SDRAM module, we execute all eight application traces simultaneously in the corresponding controller simulator and measure the delays that each trace experiences.
We present the results in Figs. 13a, 13b, 13c , and 13d. In the figures, solid colors are used to represent analytical bounds, while pattern fills represent latencies observed in simulation. Moreover, for each application, results are normalized to the analytical bound for AMC. From the analytical perspective, we observe the following trends:
(1) For the open-row controllers, applications that have a larger number of row buffer hits have smaller timing bounds than the ones with a low number of row buffer hits. Moreover, the advantage of open-row controllers over close-row ones is larger in high-speed modules, e.g., compare the results obtained for DDR2-800C with the ones for DDR3-1866M. (For the DDR2-800C, our controller provides worse bounds than AMC for most of the applications.) This is because in high-speed modules, the overhead for closing and opening rows is larger (see Table 1).

3. Notice that such setup (of injecting a request as soon as the SDRAM controller acknowledges the service of the previous request) limits the gains that could be achieved with a write-buffer. As a matter of fact, modifying the SDRAM controller simulators in order to perform early acknowledgment of write requests (see Section 4.1) brought negligible changes in the simulation results.
(2) The advantage of our controller is better highlighted in high-speed modules. This is because, in the worst-case scenario, a CAS command in our controller can be potentially blocked twice by other CAS commands in each of the interfering banks (see Fig. 9), while in the other controllers, a CAS command can only be blocked once by CAS commands in interfering banks. Consequently, in order for the CAS command reordering to pay off, the overhead for data bus turnarounds must be large, which is the case in high-speed modules such as the DDR3-1866M (moreover, we also provide results for a DDR3-2133N module in Appendix E, available online).
(3) As expected, the bounds provided by our controller are better if we assume the subjective not-too-late behavior. This is because if such assumption is made, a portion of the interference suffered by a CAS command is hidden by t DELAY. Here we highlight that Section 4.2 discusses a CAS Arbiter modification to handle CAS commands that arrive too late.

From the perspective of latencies observed in simulation, we observe the following trends:
(1) The observed latencies for AMC vary according to the number of write requests in the application under analysis. The larger the number of write requests, the larger is the probability that the application forces a bus turnaround (hence, experiencing a larger delay). Such effect would be hidden in systems with narrow data buses because of the static bundling of reads and writes (see Section 2.3).
(2) For some applications and the DDR2-800C module, the AMC actually provides better observed latencies than open-row controllers. This is because for slower SDRAM devices, the overhead to close and open rows is smaller (see Table 1).
[...] Fig. 12), such scenario does not correspond to the common case.
Data Bus Utilisation
From the simulations performed in the last section, we extract SDRAM command traces. Each command trace contains commands from requests belonging to all eight applications. Scrutinizing the quantity and time stamp of all commands in a trace, we can compute the average data bus utilisation that each controller can maintain. We depict the results in Figs. 14a, 14b, 14c, and 14d. Notice that the figures measure time in data bus clock cycles (and not in nanoseconds). Hence, even though using a DDR2-800C takes fewer data bus cycles than a DDR3-1866M to serve the same set of requests, the latter takes fewer nanoseconds, as it has a clock period of only 1.07 ns, while the former has a period of 2.5 ns.
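The distinction between cycles and nanoseconds, and one plausible way of deriving an average utilisation from a command trace, can be sketched as follows. The utilisation formula is an assumption of ours (each CAS command keeps the data bus busy for t BURST cycles, and the total is taken over the whole time span of the trace); the function names and the cycle counts in the usage lines are hypothetical, while the two clock periods are the ones quoted above.

```python
def data_bus_utilisation(num_cas: int, t_burst: int, total_cycles: int) -> float:
    """Fraction of cycles in which the data bus carries data, assuming each
    CAS command occupies the bus for t_burst cycles."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return num_cas * t_burst / total_cycles


def cycles_to_ns(cycles: int, clock_period_ns: float) -> float:
    """Convert a duration in data bus clock cycles to nanoseconds."""
    return cycles * clock_period_ns


# Hypothetical cycle counts for serving the same set of requests.
ddr2_ns = cycles_to_ns(100_000, 2.5)    # DDR2-800C, 2.5 ns clock period
ddr3_ns = cycles_to_ns(150_000, 1.07)   # DDR3-1866M, 1.07 ns clock period
print(ddr2_ns, ddr3_ns, ddr3_ns < ddr2_ns)
```

In this made-up example the faster module needs 50 percent more cycles yet still finishes earlier in absolute time, which is the effect the cycle-based figures do not show directly.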
We observe the following trends:
(1) For SDRAMs with higher operating frequencies, it becomes harder to keep high data bus utilisations, e.g., compare Figs. 14a and 14d. This is because the timing constraints are larger for them.
(2) For DDR2-800C, AMC actually performs better than the controller from Wu et al. As a matter of fact, its utilisation mostly overlaps with the one displayed by our controller. Again, this is because the overhead to close and open rows is smaller for devices with reduced operating frequency.
(3) For the remaining modules, AMC performs worse than the controller from Wu et al. Moreover, regardless of the SDRAM module, our controller consistently maintains higher utilisation than the other two investigated controllers.

Hence, the smaller the bar, the better the result. For our controller, we compute bounds both assuming and not assuming the not-too-late behavior.
Power Consumption
Finally, using the same command traces employed in the last section, we compute a power consumption estimate using the DRAMPower tool [27] . The results are depicted in Fig. 15 and represent the total amount of energy (in micro joules) required to serve all requests. We observe the following trends:
(1) Regardless of the controller, DDR2-800C has by far the worst power consumption. This is because DDR2 memories are simply not as energy efficient as DDR3. Moreover, they have an operating voltage of 1.8 V (see Table 5), against 1.35 V for the investigated DDR3 modules.
(2) For all SDRAM modules, the AMC consumes more power than open-row controllers. This is because closing and opening rows is an energy-costly operation.
CONCLUSION
In this article, we propose a real-time SDRAM controller that bundles read and write commands. We describe the controller architecture and provide a detailed timing analysis of it. We compare our approach analytically and experimentally with two other controllers: the open-row one from Wu et al. [6] and the close-row AMC [5]. The main results of our evaluation are:
As the clock speed of SDRAM devices increases, the penalty for data bus turnarounds becomes more significant and, consequently, can limit data bus utilisation. Such challenge is addressed by our controller. In scenarios with high operating frequencies, i.e., all investigated modules with the exception of DDR2-800C, the 
ACKNOWLEDGMENTS
This work was partially funded within the EMC2 project by the German Federal Ministry of Education and Research with funding ID 01|S14002O and by the ARTEMIS Joint Undertaking under grant agreement n. 621429. The responsibility for the content remains with the authors.
