In the arsenal of resources for improving computer memory system performance, predictors have gained an increasing role in recent years. They can hide latencies when accessing cache or main memory. Paper [1] shows how temporal parameters of cache memory access, defined as live time, dead time and access interval, can be used for predicting data prefetching. This paper examines the feasibility of applying an analogous technique to controlling the opening and closing of DRAM memory rows, with various improvements. The results described herein confirm the feasibility, and allow us to propose a DRAM controller with predictors that not only close the opened DRAM row, but also predict the next row to be opened.
Introduction
The desire to make better use of ever faster processors demands a memory system with similar performance enhancements. A critical link in the hierarchically organized memory system is main memory, implemented with dynamic memory chips (DRAM, Dynamic Random Access Memory). In order to achieve as much bandwidth as possible, contemporary DRAM chips are organized with several independent memory banks, allow memory access pipelining, and buffer the data from the last activated row in each bank. Although these solutions increase the memory bandwidth, they make the performance of contemporary DRAM memories dependent on memory access patterns. Contemporary DRAM memories are not truly random access memories, which are characterized by identical access times to all locations. They are actually three-dimensional memories, with banks, rows, and columns as dimensions. A DRAM data access that requires opening a row takes the following time: Ta = Trp+Trcd+Tcl, where Trp is the row precharge time, Trcd (RAS to CAS Delay) is the row access time and Tcl (Column Latency) is the column access time.
Using read and write commands with autoprecharge eliminates the precharge time when the next access occurs, reducing the access time to Trcd+Tcl. Data accesses to already opened rows eliminate both the precharge time and the row access time, reducing the access time to Tcl. As a result, consecutive accesses to different rows in a single memory bank have larger latencies than consecutive accesses to the same row. Maximizing DRAM performance demands minimizing the share of precharges and row openings. This means that we can influence DRAM performance (latency) by controlling the data placement into banks and rows. This is the basis of papers that consider address remappings, which transform memory addresses into banks, rows and columns so as to optimize DRAM performance for certain memory access patterns [3], [4].
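The three latency cases described above can be summarized in a short sketch. The names `BankState` and `access_latency` are hypothetical, introduced only for illustration, and the default timings of 20 processor cycles each are the values used for the simulated DRAM later in this paper.

```python
from enum import Enum


class BankState(Enum):
    """State of the addressed bank when a request arrives (illustrative names)."""
    ROW_HIT = "row_hit"            # the requested row is already open
    PRECHARGED = "precharged"      # the bank was closed, e.g. by autoprecharge
    ROW_CONFLICT = "row_conflict"  # a different row is open in the bank


def access_latency(state, t_rp=20, t_rcd=20, t_cl=20):
    """Return the access latency in processor clock cycles.

    Defaults assume Trp = Trcd = Tcl = 20 cycles, as in the simulated DRAM.
    """
    if state is BankState.ROW_HIT:
        return t_cl                      # Ta = Tcl
    if state is BankState.PRECHARGED:
        return t_rcd + t_cl              # Ta = Trcd + Tcl
    return t_rp + t_rcd + t_cl           # Ta = Trp + Trcd + Tcl


if __name__ == "__main__":
    for state in BankState:
        print(state.name, access_latency(state))
```

With equal timing parameters, a row conflict costs three times as much as a row hit, which is why the state in which a bank is left after an access matters.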
DRAM memory latency can be decreased if the opened row is closed before the occurrence of the next data access, directed to a same bank, but to different row. In that way the precharge time Trp is being hidden, so the latency is practically reduced to Ta = Trcd+Tcl. This latency could be additionally reduced to Ta = Tcl, by hiding the row access time. This demands the next row that will be accessed, to be opened in advance. In-time closing of the opened row demands a prediction when to close the opened row. Opening in advance of the next row demands a prediction which row should be opened and when.
Papers [1] , [2] deal with possibilities to predict the moment when the data block in the cache memory is to be declared 'dead' (i.e. not to be used in near future) and when and which data block to fetch to the cache in advance. Those ideas could be applied to DRAM memories. That inspired us to investigate the possibilities of applying some of those ideas to DRAM memory performance optimization. In this paper we have defined proper characteristic time parameters for DRAM memories. By simulation, we have concluded that DRAM memory accesses have some regularity that can be used for prediction when to close the opened row, and which is the next row to be opened. Based on those results, we have proposed a set of predictors, which not only predict when to close the opened row, but also predict the next row to be opened. These predictors could be integrated into existent DRAM memory controllers.
The paper is organized as follows. In Sect. 2 we consider the existing DRAM controller policies. Sect. 3 presents the basic idea, and Sect. 4 the predictors' design and implementation. Section 5 describes the simulation model used, Sect. 6 reviews the obtained results, and Sect. 7 covers related work of other authors. Section 8 is the conclusion.
DRAM Controller Policies
DRAM controllers commonly use one of two policies: Open Page (Row) Policy (Optimistic Policy) and Close Page (Row) Autoprecharge Policy (Pessimistic Policy). With the first one, the accessed row is kept open, which decreases the latency if the next DRAM access is directed to that same row, and increases the latency if the next DRAM access is directed to some other row. In the first case the latency is equal to Tcl, and in the second it is equal to the sum Trp+Trcd+Tcl. With the second policy, each row is closed after every access, so the latency is always the same: the sum Trcd+Tcl. Open Row Policy gives good results if there is good memory access locality, and Close Row Autoprecharge Policy gives good results if DRAM accesses are mostly random. In some of our previous papers [4], [5] we have already considered various possibilities for obtaining hybrid policies, which use the advantages of both policies. The goal is to achieve a policy more efficient than both the Open Row and Close Row Autoprecharge Policy and, in that way, to decrease the DRAM latency. In the ideal case the opened row should be kept open for as long as there are accesses to it, and not to some other row, and it should be closed after the last access to it. In that way the system would be prepared for the next row access. If this were achieved, further improvements could be made by predicting the next row to be opened, and then opening that row in advance.
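As a rough illustration of the trade-off between the two policies in this section, the sketch below computes their expected latency as a function of the open row hit probability. The function names are hypothetical, the default timings of 20 cycles are those of the simulated DRAM described later, and the break-even expression is simply derived from the two latency formulas; it is not taken from the paper.

```python
def open_row_latency(p_hit, t_rp=20, t_rcd=20, t_cl=20):
    """Expected latency under Open Row Policy for a given row hit probability."""
    return p_hit * t_cl + (1.0 - p_hit) * (t_rp + t_rcd + t_cl)


def close_row_latency(t_rcd=20, t_cl=20):
    """Latency under Close Row Autoprecharge Policy (independent of locality)."""
    return t_rcd + t_cl


def break_even_hit_probability(t_rp=20, t_rcd=20):
    """Hit probability at which both policies give the same expected latency.

    Setting p*Tcl + (1-p)*(Trp+Trcd+Tcl) = Trcd+Tcl gives p = Trp / (Trp+Trcd).
    """
    return t_rp / (t_rp + t_rcd)


if __name__ == "__main__":
    for p in (0.2, 0.5, 0.9):
        print(p, open_row_latency(p), close_row_latency())
    print("break-even:", break_even_hit_probability())
```

Under these assumptions, Open Row Policy wins only when more than half of the accesses hit the open row, which is the gap a hybrid or predictive policy tries to close.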

Basic Ideas
Since we want to apply metrics analogous to those from [1] in order to improve DRAM memory performance, let us first define those metrics for DRAM memory. Live time is the time interval that elapses from opening a row in a bank until the last access to that row before its closing. Dead time is the time that elapses from the last access to an open row until the moment of its closing. Access interval is the time interval that elapses between two consecutive accesses to an open row in a bank. These metrics are presented in Fig. 1. In this paper we consider a DRAM controller with 2 predictors: a close-page predictor and an open-page predictor. First the close-page predictor predicts when to close the currently opened DRAM row. After that, the open-page predictor predicts the next row to be opened. In case of accurate predictions the latency can be reduced to only Tcl.
The close-page predictor consists of two predictors: a zero-live-time predictor and a dead-time predictor. The first predictor is used whenever a new row is opened, and it predicts whether its live time will be a zero live time or not. If yes, that row is closed immediately after completing the DRAM access. If not, the row is kept open after that access as well, and during further accesses the dead-time predictor is used to predict whether that row has entered its dead time. If it has, the row is closed; if not, it is kept open.
When a prediction closes the row, the open-page predictor is activated. This predictor consists of two tables, the Row History Table and the Pattern History Table, which remember the history. Based on these tables, the next row to be opened is predicted, and opened.
Some of these ideas have already been presented in our previous paper [6]. There are 3 main differences between this paper and [6]. First, in [6] there are no zero-live-time predictors. Second, only latencies for the Open Page strategy and the complete predictor are shown in [6]. In this paper we also show the latencies for the predictors in between: the dead-time predictor and the full close-page predictor. Third, [6] contains results for only 1 cache configuration. In this paper we consider 4 cache configurations.
In the next section all the predictors are described in detail.
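To make the three metrics concrete, the following minimal sketch computes them for one open period of a row. The function name `row_metrics` and the convention that the first access coincides with the row opening are assumptions made here for illustration.

```python
def row_metrics(open_time, access_times, close_time):
    """Compute (live_time, dead_time, access_intervals) for one row-open period.

    access_times must be sorted; by assumption the first access coincides
    with the row opening, so a row accessed only once has zero live time.
    """
    if not access_times:
        raise ValueError("a row is opened by at least one access")
    last_access = access_times[-1]
    live_time = last_access - open_time      # opening until last access
    dead_time = close_time - last_access     # last access until closing
    intervals = [b - a for a, b in zip(access_times, access_times[1:])]
    return live_time, dead_time, intervals


if __name__ == "__main__":
    print(row_metrics(100, [100, 110, 125], 200))
    print(row_metrics(0, [0], 50))
```

In the first call the live time is 25 cycles, the dead time is 75 cycles and the access intervals are 10 and 15 cycles; the second call shows a zero live time.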
Predictors' Design and Implementation
As already stated, we basically use two predictors: a close-page predictor and an open-page predictor. The close-page predictor in turn consists of two predictors: a zero-live-time predictor and a dead-time predictor. In this paper we consider three variants of zero-live-time predictors. The first one has a bit for each row in the DRAM, which tells whether its last live time was a zero live time or not. When opening a row, it is predicted that its live time will be a zero live time if it was a zero live time the last time it was opened, and vice versa. The starting prediction for all rows is that the live time will not be a zero live time, since that corresponds to Open Row Policy. Each of the other two variants has two bits for every row in the system. Those two bits are used as a saturating counter, with values from 0 to 3. Every time a zero live time occurs the counter is incremented, unless its previous value was 3. Every time a nonzero live time occurs the counter is decremented, unless its previous value was 0 (second variant), or the counter is reset to 0 (third variant). A zero live time is predicted if the counter's value is 2 or 3, and a nonzero live time if the counter's value is 0 or 1. The starting counter value is 0.
Implementation of this predictor is simple. It may take the form of an SRAM memory with suitable organization integrated into the DRAM controller, since the number of rows in the system may be large. For example, a rank of DRAM chips that has 4 banks with 4096 rows each demands 16 Kb, or 2 KB. The other two variants are similar to this one. They are slightly more complicated and demand twice as much memory, since each row has a two-bit counter. Changing the values of the counter (incrementing, decrementing, resetting) is done by read-modify-writes. When predicting, a read is performed, and depending on the value being read, the controller will issue commands with autoprecharge, or not.
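The three zero-live-time predictor variants can be sketched as follows. This is a software model under simple assumptions, not a description of the hardware: the class name, the variant labels `"1bit"`, `"2bit_dec"` and `"2bit_reset"`, and the flat row indexing are hypothetical.

```python
class ZeroLiveTimePredictor:
    """Per-row zero-live-time predictor (illustrative software model)."""

    VARIANTS = ("1bit", "2bit_dec", "2bit_reset")

    def __init__(self, num_rows, variant="1bit"):
        if variant not in self.VARIANTS:
            raise ValueError("unknown variant: " + variant)
        self.variant = variant
        # Initial state 0 predicts a nonzero live time (Open Row behaviour).
        self.state = [0] * num_rows

    def predict_zero(self, row):
        """Return True if a zero live time is predicted for this row."""
        if self.variant == "1bit":
            return self.state[row] == 1
        return self.state[row] >= 2          # counter value 2 or 3

    def update(self, row, was_zero):
        """Record whether the live time that just ended was zero."""
        if self.variant == "1bit":
            self.state[row] = 1 if was_zero else 0
        elif was_zero:
            self.state[row] = min(3, self.state[row] + 1)   # saturate at 3
        elif self.variant == "2bit_dec":
            self.state[row] = max(0, self.state[row] - 1)   # saturate at 0
        else:
            self.state[row] = 0                             # reset variant


if __name__ == "__main__":
    p = ZeroLiveTimePredictor(num_rows=8, variant="2bit_dec")
    for outcome in (True, True, False, True):
        p.update(3, outcome)
    print(p.predict_zero(3))
```

The two-bit variants need two zero live times before they start closing a row immediately, which makes them more conservative than the one-bit variant.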
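The storage figures quoted above follow from simple arithmetic, reproduced here as a small check. The function name is hypothetical.

```python
def zlt_storage_bytes(banks, rows_per_bank, bits_per_row):
    """Storage needed by a zero-live-time predictor for one rank, in bytes."""
    total_bits = banks * rows_per_bank * bits_per_row
    return total_bits // 8


if __name__ == "__main__":
    # One-bit variant: 4 banks x 4096 rows = 16384 bits = 16 Kb = 2 KB.
    print(zlt_storage_bytes(4, 4096, 1))
    # Two-bit variants need twice as much.
    print(zlt_storage_bytes(4, 4096, 2))
```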
Our dead-time predictor is based on access interval time values. Our simulation results showed that the average dead time is much larger than the average access interval time, so that fact is used for dead time prediction. When a value that is the last access interval time, multiplied by 2 or 4, elapses, it is predicted that the row has entered its dead time. So the only value that is being taken care of is the last access interval time. We used two solutions for storing the access interval time. The first one uses only one common value of access interval, which is defined by any appearance of new access interval in any bank. In the second solution there is one value of access interval for each bank in the system.
The implementation of the dead-time predictor demands the DRAM controller to have one counter for each bank (to take care of the elapsed time since last access), one common register for all banks, or one register for each bank, for storing the last access interval value, and one comparator for each bank (for comparing the access interval register value with the counter). In order to minimize the counters' length, they could be triggered with a signal derived by dividing the DRAMs clock. A simple shift operation by 1 or 2 positions over the access interval register would be needed for defining the boundary value. By comparing this value with the counter the controller would decide whether to issue a precharge command or not. A controller that implements Open Row Policy has a register for each bank for storing the last open row index, and a comparator for comparing the current access row index with that register. Compared to that, we could say that a controller with the dead-time predictor would have similar complexity and price, which would be slightly increased.
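The counter, register and comparator arrangement described above can be modelled in software roughly as follows. This is a sketch under stated assumptions: the class and method names are hypothetical, time is an integer cycle count supplied by the caller instead of a hardware counter, and no prediction is made until a first access interval has been observed.

```python
class DeadTimePredictor:
    """Predicts dead time once 2x or 4x the last access interval has elapsed."""

    def __init__(self, num_banks, shift=1, per_bank=False):
        self.shift = shift                      # 1: interval * 2, 2: interval * 4
        self.per_bank = per_bank                # separate or common interval register
        self.interval = [None] * (num_banks if per_bank else 1)
        self.last_access = [None] * num_banks   # time of last access to the open row

    def _slot(self, bank):
        return bank if self.per_bank else 0

    def on_row_open(self, bank, now):
        """First access to a newly opened row."""
        self.last_access[bank] = now

    def on_row_hit(self, bank, now):
        """Further access to the open row; must follow on_row_open."""
        self.interval[self._slot(bank)] = now - self.last_access[bank]
        self.last_access[bank] = now

    def on_row_close(self, bank):
        self.last_access[bank] = None

    def should_close(self, bank, now):
        """Return True if the open row is predicted to be in its dead time."""
        interval = self.interval[self._slot(bank)]
        last = self.last_access[bank]
        if interval is None or last is None:
            return False                        # no interval seen yet: no prediction
        return now - last >= (interval << self.shift)


if __name__ == "__main__":
    d = DeadTimePredictor(num_banks=4, shift=1)
    d.on_row_open(0, 0)
    d.on_row_hit(0, 10)
    print(d.should_close(0, 25), d.should_close(0, 30))
```

The left shift mirrors the shift by 1 or 2 positions over the access interval register mentioned in the text.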
The structure of the open-page predictor is presented in Fig. 2. It consists of two tables: the Row History Table (RHT) and the Pattern History Table (PHT). RHT stores the last k rows that were activated in each of the banks, so there are k fields in an item for each of the banks. PHT contains the predictions. It has m ≤ n items, where n is the number of rows in a bank. Each item contains j two-part fields: row and next predicted row (r k and r nxt). The PHT access index is obtained as the t least significant bits of the sum (truncated addition) of the last k row indices from the proper item for that bank in RHT, so m = 2^t.
The open-page predictor works through two basic functions: Update and Lookup. In the description of these functions below, it is supposed that a new row access has just occurred, and that current bank and new row are the bank and row address of the new row, respectively.
Update: This operation refreshes RHT and PHT when accessing a new row, so that the history stored in RHT and PHT is always valid.
1. First, current bank is used for an RHT access, and a row sequence (row 1, row 2 ... row k) is located. This sequence is replaced with (row 2 ... row k, new row).
2. The row sequence (row 1, row 2 ... row k) is used for indexing PHT and an item in PHT is located.
3. Among the two-part fields in the located PHT item, the field tagged with row k is located.
4. Finally, the part of the located field that predicts the next row is replaced with new row. By this, new row is defined as the row that follows the sequence (row 1, row 2 ... row k).
Lookup: This operation predicts the next row, knowing the current row, based on the information that the previous row sequence accessed in the given bank is (row 1, row 2 ... row k).
1. The row sequence (row 1, row 2 ... row k) is used as an index for locating an item in PHT.
2. The field of this item whose first part is tagged with row k is selected, and its second part, r nxt, is chosen as the row that follows the sequence (row 1, row 2 ... row k).
3. Finally, the DRAM controller uses r nxt to open the proper row in the given bank.
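The Update and Lookup operations can be sketched in software as shown below. Several details are assumptions made for illustration, since the text does not specify them: the history starts as all zeros, a missing field is allocated on Update, and when an item already holds j fields the oldest one is evicted. All names are hypothetical.

```python
class OpenPagePredictor:
    """Software model of the RHT/PHT next-row predictor (illustrative)."""

    def __init__(self, num_banks, k=4, t=12, j=1):
        self.j = j
        self.mask = (1 << t) - 1                       # keeps the t low bits
        self.rht = [[0] * k for _ in range(num_banks)]  # last k rows per bank
        # Each PHT item holds up to j fields of the form [row_k, r_nxt].
        self.pht = [[] for _ in range(1 << t)]

    def _index(self, history):
        """Truncated sum of the last k row indices."""
        return sum(history) & self.mask

    def update(self, bank, new_row):
        history = self.rht[bank]
        item = self.pht[self._index(history)]
        tag = history[-1]                               # row k
        for field in item:
            if field[0] == tag:
                field[1] = new_row                      # new row follows the sequence
                break
        else:
            if len(item) >= self.j:
                item.pop(0)                             # assumed policy: evict oldest
            item.append([tag, new_row])
        self.rht[bank] = history[1:] + [new_row]        # shift the history

    def lookup(self, bank):
        """Return the predicted next row, or None if there is no prediction."""
        history = self.rht[bank]
        for tag, r_nxt in self.pht[self._index(history)]:
            if tag == history[-1]:
                return r_nxt
        return None


if __name__ == "__main__":
    p = OpenPagePredictor(num_banks=1, k=2, t=4, j=1)
    for row in (1, 2, 3, 1, 2):
        p.update(0, row)
    print(p.lookup(0))
```

After the repeating pattern 1, 2, 3, 1, 2 the model predicts row 3, since that row followed the sequence (1, 2) the last time it was seen.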
Implementation of the open-page predictor would demand g · b · k · log2 n bits for RHT (g is the number of DRAM chip ranks, b is the number of banks per rank) and m · j · 2 · log2 n bits for PHT. Also, one t-bit adder and a multiplexer of type (k, 1) × t are needed, for a control block implemented as a finite state machine. For the adopted DRAM structure of 512 MB with 4 ranks of DRAM chips and k = 4, 768 bits are needed for RHT and 12 KB for PHT with m = 4096 and j = 1.
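The quoted sizes can be checked with a few lines of arithmetic. The function names are hypothetical; the parameter values are the ones given in the text, with b = 4 banks and n = 4096 rows per bank taken from the simulated DRAM.

```python
def rht_bits(g, b, k, n):
    """RHT size in bits: g * b * k * log2(n), for n a power of two."""
    return g * b * k * (n - 1).bit_length()


def pht_bits(m, j, n):
    """PHT size in bits: m * j * 2 * log2(n), for n a power of two."""
    return m * j * 2 * (n - 1).bit_length()


if __name__ == "__main__":
    print(rht_bits(g=4, b=4, k=4, n=4096))          # 768 bits
    print(pht_bits(m=4096, j=1, n=4096) // 8)       # 12288 bytes = 12 KB
```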
System Simulation Model
For simulation we have used the program Sim-Outorder from the SimpleScalar Tool Set [7]. We have integrated this simulator with programs that simulate DRAM memories, written by ourselves. This integrated simulator performs an execution-driven simulation, which is much more accurate than trace-driven simulation. The simulated processor is a superscalar processor that issues at most 4 instructions per clock cycle and supports out-of-order instruction execution. A two-level branch predictor was used. There are two levels of cache memories. The first one contains separate instruction and data caches. They are both 16 KB large and use direct mapping. The second level contains a unified cache, 2 MB large, which uses set-associative mapping with 4 lines per set. Each cache memory uses a write-back policy. We have tried 4 combinations of cache line sizes. In the first three combinations the first level cache line size is 32 B, and the second level cache line sizes are 64 B, 128 B and 256 B. In the fourth combination the first level cache line size is 16 B, and the second level cache line size is 64 B.
The simulated DRAM memory has following characteristics: there are 4 banks in one chip, 4096 rows in a bank, row capacity is 1 KB, precharge time, row access time, and column access time are 20 processor clock cycles each, the memory bus has 128 data lines.
DRAM memories have gone through several generations with improved architectures [14]. Contemporary SDRAM memories are of the DDR2 and DDR3 types [17], [18]. From the standpoint of the SDRAM control methods presented in this paper, DDR2/DDR3 memories are improved by the following solutions:
- increased number of independent banks, from 4 to 8,
- working only in burst mode, with a burst length of 4 or 8 words,
- posted CAS additive latency (AL).
More independent banks potentially increase the amount of data that can be accessed with Tcl. Burst mode with a length of 4 or 8 shortens the time to transfer the cache line to/from DRAM memory. Posted CAS additive latency enables the DRAM controller to better utilize the address and control lines for issuing read/write commands to different DRAM banks. Although we have used DDR SDRAM in our simulations, the solutions proposed in this paper are also applicable to DDR2 and DDR3 memory types.
We have simulated executions of 6 benchmark programs from SPEC95 suite: cc1, compress, ijpeg, li, m88ksim, and perl. Their characteristics can be found in [4] - [6] .
Results
The results shown in this paper were obtained gradually, step by step. First we did some measurement simulations to evaluate the possibility of using predictions, then we implemented the dead-time predictor, then we added the zero-live-time predictor, and finally we added the open-page predictor. We will present the results in the same order, always comparing the performance of the next predictor with the previous one, so the reader can see how and to what extent each predictor contributes to overall performance. As already said, we have tried four combinations of cache memory line sizes. Varying the cache line sizes affected DRAM references, changing mostly the open row hit probabilities, which are given in Table 1. As can be seen from Table 1, the open row hit probabilities for the various cache configurations do not differ much for the same programs. Programs cc1, ijpeg and perl have lower hit probabilities for all the cache configurations, and programs compress, li, and m88ksim have higher hit probabilities. As a consequence, the results for the various cache configurations described in this paper are very similar for the same programs. For that reason, all the results presented in the remainder of this paper apply to the cache configuration with the second level line size of 128 B, unless stated otherwise. This configuration, as can be seen in Table 1, is the 'middle one', considering its line size and the DRAM open row hit probabilities it yields. Because of lack of space, we excluded the results for all the cases where different cache configurations did not give different results. However, some of the results, which we consider interesting and important, are included.
As already said, at the beginning of our research we did some measurement simulations to evaluate the possibility of using predictions. This evaluation included measuring the following parameters: number of accesses with zero/nonzero live times, and average values for access interval time, live time and dead time, measured in processor clock cycles. The results are shown in Table 2. It can be seen that in benchmark programs with small open row hit probabilities (cc1, ijpeg, perl) the number of zero live times is much greater than the number of nonzero live times, which is reasonable. In benchmarks with large open row hit probabilities (compress, li, m88ksim) there are many more nonzero live times than zero live times. These results, with the number of zero/nonzero live times varying from program to program, were one of the reasons that motivated us to investigate the possibility of designing a zero-live-time predictor. If the other parameters are observed, it can be noticed that in all the cases, regardless of the open row hit probability, the average access interval time is much smaller than the average dead time. This suggests the possibility of defining a simple predictor. If a certain amount of time (equal to some boundary value) has elapsed since the last access to an open row, then that row should be closed, since it has probably entered its dead time. If that amount of time has not yet elapsed, the row is to be kept open. As a boundary, a value of the same order of magnitude as the last access interval should be used. For instance, it could be the last access interval multiplied by 2 or 4.
We have tried 2 variants for boundary levels -last access interval time multiplied by 2 and 4. The results were practically the same, i.e. the differences were insignificant. In this paper we show the results when the boundary value is equal to access interval multiplied by 2. For both combinations we have tried another two possible solutions. The first one uses only one common value of access interval time, which is defined by every appearance of a new access interval in any bank. In the second solution there is one value of access interval time for each bank in the system.
Average DRAM latencies, in processor clock cycles, are shown in Fig. 3 . This figure shows average DRAM latencies when using the Open Row Policy (Open Row), the policies with the proposed dead-time predictor with a common value and with separate values of access interval time (Common and Separate), and the policy with an ideal predictor, i.e. a predictor whose close-row prediction accuracy would be 100% (Ideal). It can be seen that the proposed solutions, although simple, give good improvements.
If we compare the solutions with a common value and with separate values of access interval, there are almost no differences between them. In the solution with a common value there is access interval interference from different banks. That interference is removed when using separate values for each bank. This interference is not significant in a single program environment, which was the case in our simulations. In two cases (li and m88ksim) the results are worse for Separate than for Common. This could be explained by longer negative influences of extreme, relative to average, values of access interval time. In a multiprogram environment access intervals of different programs can differ a lot. In that case the solution with a common value would probably have lower prediction accuracy. We can conclude this from Table 2, which shows that average access interval values for different programs can vary up to 1:290 (compress and li).
Table 3 shows prediction accuracy and coverage when using one common register for all banks. Coverage is the fraction of accesses for which the predictor made a given prediction, starting from the first appearance of an access interval value. Prediction accuracy and coverage when using separate values for each bank are very similar to these, so we omit them. In Table 3, close row is the probability of an accurate prediction that the row should be closed, and not close row is the probability of an accurate prediction that the row should be kept open. The corresponding coverage is given in the last 2 rows. By simple addition of these coverage percentages it can be concluded that the percentage of accesses not covered by the predictor is negligible: in almost all the cases it is about 1% or less. Only in the case of li is this percentage about 5%. The accesses not covered by the predictor comprise all the initial accesses with zero live times, until the appearance of the first nonzero live time, i.e. the first access interval value, which is the moment when the predictor starts the prediction process. Looking at the prediction accuracies themselves, in 7 of 12 cases they exceed 70%, and in 5 of 12 cases they exceed 80%. These are rather good values. The high prediction accuracies also have high coverage in most of the cases. However, in benchmarks with low open row hit probabilities, the accuracy of the prediction that the row should be kept open can be very low while having rather high coverage (cc1: 0.43 (63%), ijpeg: 0.34 (80%) and perl: 0.08 (78%)).
These cases caught our attention, and we wanted to see whether we could improve them. As can be seen in Fig. 3, the latencies in these 3 cases are still far from the latencies that an ideal close-page predictor would have. The first logical idea we had was to try a zero-live-time predictor. In all of these cases there are many more zero live times than nonzero live times, so a good zero-live-time predictor would close the row in many of the cases where the dead-time predictor failed to do so. As we already stated, when using the dead-time predictor, we have tried 2 variants for boundary levels: the last access interval multiplied by 2 and by 4. Since the results were practically the same, we decided to use only one boundary level, the last access interval multiplied by 2, in further research. We also decided to use only the solution with one common value of access interval time, for the same reasons. (The values in Table 5 are given in thousands.)
We have added the 3 zero-live-time predictors described in Sect. 4 to the dead-time predictor, and in that way obtained a full close-page predictor. This predictor first uses the zero-live-time predictor each time a new row is opened to predict its live time. If it predicts the live time to be zero, the row is closed immediately after the access is finished. If not, the dead-time predictor is activated, and it closes the row or not depending on its prediction. Table 4 shows the prediction accuracies (number of correctly predicted zero live times divided by the number of predicted zero live times) of the zero-live-time predictors. The labels in this table are: ZLT1, the 1b per row zero-live-time predictor; ZLT2, the 2b per row zero-live-time predictor which decrements its counter on a nonzero live time occurrence; ZLT2', the 2b per row zero-live-time predictor which resets its counter on a nonzero live time occurrence. In Table 4 there are no data for compress and li for ZLT2 and ZLT2', since there were no predictions in these cases (it never happened that some row had two more zero live times than nonzero live times, or two consecutive zero live times). As expected, ZLT2 and ZLT2' are better than ZLT1 in most of the cases. In the case of m88ksim one would conclude from the numbers in Table 4 that ZLT1 is slightly better than ZLT2 and ZLT2', but that is not quite a correct conclusion. Namely, ZLT2 and ZLT2' show lower prediction accuracy in percentages than ZLT1, but actually dramatically reduce the number of prediction misses in this case. This can be seen from Table 5, which shows the number of prediction hits and prediction misses. It can be seen from this table that ZLT2 and ZLT2' are better than ZLT1 not only for cc1, ijpeg, and perl, but also for compress, li, and m88ksim. Let us comment here on the results for m88ksim. Although the prediction accuracy is slightly higher for ZLT1 than for ZLT2 and ZLT2', 0.012 compared to 0.00 (Table 4), ZLT2 and ZLT2' are actually much better than ZLT1. ZLT1 gives 1 hit and 80 misses, and ZLT2 and ZLT2' give 0 hits and only 1 miss each. The row is misclosed 80 times when using ZLT1 and only once when using ZLT2 and ZLT2'. It is much better for the zero-live-time predictor to fail to close a row (which should be closed), since then there is a chance for the dead-time predictor to close that row. If the zero-live-time predictor miscloses the row, there is nothing the dead-time predictor can do to correct this error.
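The m88ksim example shows why accuracy alone can mislead. The short sketch below recomputes the accuracies from the hit and miss counts quoted above; the function name is hypothetical, and a predictor that made no predictions is reported as having no accuracy value.

```python
def prediction_accuracy(hits, misses):
    """Correct zero-live-time predictions divided by all such predictions."""
    total = hits + misses
    return hits / total if total else None


if __name__ == "__main__":
    zlt1 = {"hits": 1, "misses": 80}   # counts quoted for m88ksim
    zlt2 = {"hits": 0, "misses": 1}
    print(round(prediction_accuracy(**zlt1), 3), "misclosed rows:", zlt1["misses"])
    print(round(prediction_accuracy(**zlt2), 3), "misclosed rows:", zlt2["misses"])
```

ZLT1 has the higher ratio, but each miss is a wrongly closed row that the dead-time predictor cannot undo, so the absolute number of misses is the more relevant cost here.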
Although ZLT1 is the worst, in some cases it gives fairly good prediction accuracies -cc1, ijpeg and perl. These results and the fact that ZLT1 is the simplest and the cheapest, show that this predictor can be a good choice in some cases. If we compare ZLT2 and ZLT2' themselves, it can be observed that they practically have equal accuracies, with ZLT2' being slightly better in one case (ijpeg). These results are expected, since the two predictors are very similar in complexity and price.
Results from Tables 4 and 5 show that in some cases the proposed zero-live-time predictors show very good prediction accuracies. However, in some cases the prediction accuracies are rather low, even for ZLT2 and ZLT2'. This shows that further investments should be made in order to find out some more efficient strategies which would gain higher prediction accuracies.
Average DRAM latencies, in processor clock cycles, are shown in Figs. 4 and 5. These figures show average DRAM latencies when using Open Row Policy (OR), the policy with a Dead Time Predictor (DTP), the policies with a full Close Page Predictor (CPP1: DTP with ZLT1, CPP2: DTP with ZLT2, CPP2': DTP with ZLT2'), and the policy with an ideal close-page predictor, i.e. a predictor whose prediction accuracy would be 100% (Ideal). Several things can be seen from these figures. CPP1 shows good results in all the cases: it either improves DTP or does not degrade it much. These results confirm what was already said about ZLT1. This predictor, although the simplest, can be a good choice. However, it can also slightly decrease the performance of DTP. The two-bit zero-live-time predictors (ZLT2 and ZLT2') correct this. In practically all the cases CPP2 and CPP2' are better than or equal to DTP, which was our goal: to improve DTP if possible, and otherwise to retain its performance. This also applies to the cases mentioned in connection with Table 3 (cc1, ijpeg, and perl), which were the main motive for developing zero-live-time predictors. In all these cases the zero-live-time predictors really improve the dead-time predictor. It is interesting to notice that in Fig. 5 DTP does not improve OR at all for ijpeg, but the zero-live-time predictors correct that, and give performance close to Ideal. The latter can be said for all the other cases in both figures. Namely, in most of the cases the Close Page Predictors have performance close to Ideal, which is the theoretical maximum that can be attained if we only use row closing. The question we asked ourselves, considering this fact, was whether these results could be further improved by using an open-page predictor. Namely, now that we have rather good predictions about closing the row, further improvement is possible if we can predict the next row that will be opened, and then open that row in advance.
Before we show the results with the open-page predictor, we will first present an analysis considering the average latency. The average latency of a DRAM controller that uses both a close-page and an open-page predictor (which we can refer to as a complete predictor) can be calculated by the Eq. (1).
The meanings of the used variables in (1) are: phit (pmiss) -open row hit (miss) probability, porc (porw) -probability of a correct (wrong) prediction that the row should be kept open, pcrc (pcrw) -probability of a correct (wrong) prediction that the row should be closed, pnrh (pnnrh) -probability that there is (not) a prediction of the next row to be opened in a case where a row hit would occur if Open Row policy was used, pnrm (pnnrm) -probability that there is (not) a prediction of the next row to be opened in a case where a row miss would occur if Open Row policy was used, pnrc (pnrw) -probability of a correct (wrong) prediction of the next row to be opened.
When considering these, next equations should be kept in mind: phit + pmiss = 1, porc
In a case where an open row hit (phit) would occur if the Open Row Policy was used, the latency is Tcl if it was correctly predicted that the row should be kept open (porc). If not, then the row was closed because of a wrong prediction (pcrw) and the latency increases to Trcd+Tcl, if there was not a prediction which is the next row to be opened (pnnrh), apropos to Trp+Trcd+Tcl, if there was such a prediction (pnrh). In a case where an open row miss (pmiss) would occur if the Open Row Policy was used, if there was a correct prediction that the row should be closed (pcrc) and if there was a prediction which is the next row to be opened (pnrm), then the latency will amount only Tcl if that prediction was correct (pnrc), apropos Trp+Trcd+Tcl, if the prediction was wrong (pnrw). If there was no such a prediction (pnnrm), then the latency will amount Trcd+Tcl. If there was a wrong prediction that the row should be kept open (porw), the latency will amount Trp+Trcd+Tcl.
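The case analysis above can be turned directly into an expression for the average latency. The sketch below is a reconstruction from the prose description only and should not be read as the authors' exact Eq. (1). It assumes that the complementary pairs are porc/pcrw, pcrc/porw, pnrh/pnnrh, pnrm/pnnrm and pnrc/pnrw, and it uses 20 cycles for each timing parameter, as in the simulated DRAM. The function name and parameter names are hypothetical.

```python
def average_latency(p_hit, p_orc, p_crc, p_nrh, p_nrm, p_nrc,
                    t_rp=20, t_rcd=20, t_cl=20):
    """Average DRAM latency for a complete (close-page + open-page) predictor.

    Reconstructed from the textual case analysis; complements are assumed.
    """
    p_miss = 1.0 - p_hit
    p_crw = 1.0 - p_orc      # hit case: row wrongly closed
    p_orw = 1.0 - p_crc      # miss case: row wrongly kept open
    p_nnrh = 1.0 - p_nrh     # hit case: no next-row prediction
    p_nnrm = 1.0 - p_nrm     # miss case: no next-row prediction
    p_nrw = 1.0 - p_nrc      # next-row prediction wrong

    closed = t_rcd + t_cl            # bank precharged, row must be opened
    conflict = t_rp + t_rcd + t_cl   # another row open, precharge needed first

    hit_case = p_orc * t_cl + p_crw * (p_nnrh * closed + p_nrh * conflict)
    miss_case = (p_crc * (p_nrm * (p_nrc * t_cl + p_nrw * conflict)
                          + p_nnrm * closed)
                 + p_orw * conflict)
    return p_hit * hit_case + p_miss * miss_case


if __name__ == "__main__":
    # Perfect predictors reach the theoretical minimum Tcl.
    print(average_latency(0.5, 1.0, 1.0, 0.0, 1.0, 1.0))
    # Never closing and never predicting reduces to Open Row Policy.
    print(average_latency(0.5, 1.0, 0.0, 0.0, 0.0, 0.0))
```

The two calls illustrate the limiting cases: with all predictions correct the result is Tcl, and with the close-page predictor never closing a row the expression reduces to the Open Row Policy latency.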
As expected, the latency is smaller when the predictor probabilities (porc, pcrc, pnrc) are larger. The number of situations in which there is, or is not, a prediction of the next row to be opened also has an influence. Interestingly, in the case of a row hit where the row is closed by a wrong prediction of the close-page predictor (pcrw), the latency is smaller when a prediction of the next row to be opened is made less often. Table 6 shows the characteristic probability values obtained by simulation. In this table, CP1 and CP2 are abbreviations for Complete Predictor 1 and 2. They are obtained by combining the Close-Page Predictor variants denoted CPP1 and CPP2 with the described Open-Page Predictor. Since the results for CPP2 and CPP2' were very similar, we did not include CPP2' in our further investigation. It can be seen that, again, the benchmark programs can be divided into two groups. The first group (cc1, ijpeg and perl) comprises the programs with low page hit values (phit). The second group comprises compress, li and m88ksim, which have much larger page hit values.
The values of porc and pcrc in Table 6 refer to the full close-page predictor, which means that they include both the zero-live-time and the dead-time predictors. Values for porc range from 0.53 for perl to 0.95 for m88ksim. Values for pcrc are better, ranging from 0.80 for ijpeg to 0.98 for compress.
The last three columns in Table 6 refer to the Open-Page Predictor. Values for pnrh range from 0.26 for cc1 and m88ksim to 0.43 for li. These values are relatively low but, considering Eq. (1), that is good for performance. Values for pnrm range from 0.37 for cc1 to 0.86 for ijpeg. The last column, pnrc, shows the probability of a correct prediction of the next row, and it has rather good values, from 0.45 for li to 0.86 for perl. It is very difficult to obtain improvements in the programs with high open row hit values, since improvements are possible only when the opened row is changed, which happens very rarely. This is corroborated by the fact that the theoretical minimum latency is Tcl, which is 20 cycles, while the Open Page strategy (OP) by itself gives latencies of about 22-25 cycles.
In programs with lower open page hit values (cc1, ijpeg, perl) in Fig. 6 there are visible improvements when using the predictors. This is true for all the programs in Fig. 7. The explanation is that with larger L2 cache line sizes the open row hit probabilities are smaller (Table 1), so there is room for improvement over the Open Page strategy.
Comparing the Complete predictors with the Close-Page predictors, we can conclude that the Complete predictor improves on the performance of the Close-Page predictor in the benchmark programs with low open row hit probabilities in Fig. 6 (cc1, ijpeg, perl) and in four programs in Fig. 7 (the exceptions are cc1 and li). The programs with high open row hit probabilities in Fig. 6 (compress, li, m88ksim) already perform very well, close to the ideal 20 cycles, and the fact that the predictors do not degrade this performance may be considered a success. The reason that the Complete predictor is worse than the Close-Page predictor for cc1 and li in Fig. 7 is the low pnrm values in these cases, which means that the Open-Page predictor was rarely activated. This is probably due to the lower number of DRAM accesses: as already explained at the beginning of Sect. 6, increasing the L2 cache line size causes some of the DRAM hits not to occur, which effectively means a smaller number of DRAM accesses.
The average improvement of the strategy with the Complete predictor over the basic Open Page strategy is 29.7% for an L2 line size of 64 B, ranging from −4.9% (li) to 80.3% (perl), and 25.9% for an L2 line size of 256 B, ranging from 6.0% (li) to 56.9% (perl).
Related Work
Exploiting DRAM row buffers to reduce the last-level (L2 or L3) cache miss penalty has been studied by several researchers. Zhang, Zhu and Zhang [3] analyzed the sources of row-buffer conflicts in the context of superscalar processors, and proposed a permutation-based page interleaving scheme to reduce row-buffer conflicts and to exploit data access locality in the row buffer. Prefetch reordering to exploit DRAM row buffers was previously explored by Zhang and McKee [10]. They interleave the demand miss stream and several strided prefetch streams, generated dynamically in the memory controller using a reference prediction table. Lin, Reinhardt and Burger [8] proposed and evaluated a prefetch architecture, integrated with the on-chip L2 cache and memory controllers, which aggressively prefetches large regions of data on demand misses. Yu and Kedem [16] described and evaluated DRAM-page based prefetching, which prefetches data from main memory to the L2 cache. The scheme strives to fetch two cache blocks from the same DRAM page at a time. The design utilizes DRAM timing to reduce the prefetch overhead and memory bus occupancy.
To reduce memory latency, Park and Park [15] proposed a memory control scheme that predicts whether the next memory access will lead to a page hit or not, and changes the memory mode according to the prediction. Two-bit state machines are employed to predict the next memory mode based on the history of memory references. Cuppu and Jacob [9] investigated mechanisms to reduce request latency and the portion of the main memory system's overhead that is not due to DRAM latency, but rather to other factors. Rixner et al. [11] proposed memory access scheduling that greatly increases the bandwidth utilization of DRAMs by buffering memory references and completing them in an order that both accesses the internal banks in parallel and maximizes the number of column accesses per row access, resulting in improved system performance.
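To illustrate the idea of a two-bit state machine that predicts page hits from reference history, the following sketch uses a generic two-bit saturating counter. The class and method names are hypothetical, and the exact states and transitions used in [15] may differ; this is only one common way to build such a predictor.

```python
class TwoBitPagePredictor:
    """Generic two-bit saturating counter (an assumption, not the design of [15]).

    States 0 and 1 predict a page miss (close the row),
    states 2 and 3 predict a page hit (keep the row open).
    """

    def __init__(self, state=2):
        self.state = state

    def predict_hit(self):
        return self.state >= 2

    def update(self, was_hit):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if was_hit:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

With such a counter a single misprediction does not immediately flip a strongly biased state, so an isolated page miss inside a long run of hits changes the prediction only if a second miss follows.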
Several researchers have studied data cache prefetching using history data in a dynamically populated table [1], [2], [12], [13]. Because the efficiency of both cache memories and DRAM row buffers is based on the locality of program memory references, we have drawn an analogy with these solutions for predicting DRAM row closing and opening. Lai, Fide and Falsafi [2] proposed Dead-Block Predictors (DBPs), trace-based predictors that accurately identify "when" an L1 data cache block becomes evictable or "dead". They also proposed Dead-Block Correlating Prefetchers (DBCPs), which use address correlation to predict which subsequent block to prefetch when a block becomes evictable. Hu, Kaxiras and Martonosi [1], [12] proposed a family of timekeeping techniques that optimize cache behavior based on observation of the cache access interval or cache dead time. Nesbit and Smith [13] used a FIFO table, the Global History Buffer (GHB), to hold address history. GHB history information is maintained in linked lists, which are accessed indirectly via a hash table. This method reduces stale history data, allows a more accurate reconstruction of the history of access patterns, and leads to more effective prefetching algorithms.
Conclusion
In this paper we have considered techniques for decreasing DRAM latency with a controller that uses various predictors to predict whether the opened DRAM row should be kept open or closed, and also which row should be opened next. First we considered solutions for a dead-time predictor, which predicts when to close the open row based on access interval values. The two solutions considered (with a common register, and with separate registers for access interval storage) are rather simple and give good performance improvements. Then we augmented this predictor with a predictor that predicts whether the live time will be a zero live time. The zero-live-time predictor complements the dead-time predictor. The results are encouraging, since they show that zero-live-time predictors correct the dead-time predictor's shortcomings by using relatively simple strategies. However, the results also show that there is room for increasing the prediction accuracy, which will be the subject of our future work. Finally, we added an open-page predictor, which predicts the next row to be opened. The considered solution gives performance improvements compared both to the basic Open Page strategy and to the strategy with only the close-page predictor. The exceptions are programs with very low latencies, which are close to the ideal case. In these programs the complete predictor does not degrade the already excellent performance.
Contemporary DRAM controllers have buffers for pending memory references from processors or L2 caches. Depending on the time intervals between references arriving at the DRAM controller and the speed of their completion, the buffer may contain one or several such references. When there are several references, some type of scheduling, as proposed by Rixner et al. [11], may be efficient. When there is only one reference in the DRAM controller buffer, such scheduling is not possible. In this case, our predictors enable the controller to prepare the DRAM for an optimal response before the next reference arrives. Our execution-driven simulations showed that these situations can occur quite often, which is corroborated by the decreased DRAM latency when using predictors. It would be interesting to compare our results with those of [11]. However, the authors of [11] did not aim to decrease DRAM latency but to increase DRAM bandwidth, and they did not use execution-driven simulation, so a valid comparison cannot be made. A promising approach would be to use a scheme like the one in [11] together with our predictors. The two mechanisms complement each other, since they target different situations, and together they would probably be effective in decreasing DRAM latency on a large set of programs.
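The distinction between the two situations can be illustrated with a deliberately simplified row-hit-first selection rule. This is a sketch in the spirit of the scheduling in [11], not their algorithm; the Request type, the pick_next function and the open_rows mapping (bank number to currently open row) are hypothetical names.

```python
from collections import namedtuple

# A pending memory reference, already decoded into DRAM coordinates.
Request = namedtuple("Request", ["bank", "row", "column"])


def pick_next(pending, open_rows):
    """Return the index of the request to serve next, or None if the buffer is empty.

    pending is ordered from oldest to newest. The oldest request that hits
    a currently open row is preferred; otherwise the oldest request is taken.
    """
    for i, req in enumerate(pending):
        if open_rows.get(req.bank) == req.row:
            return i
    return 0 if pending else None
```

With a single pending reference the function can only return index 0, so reordering brings no benefit; this is the situation in which preparing the row state in advance by prediction is the only available lever.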
For the adopted DRAM structure of 512 MB with 4 ranks of DRAM chips, 4 banks per chip, 4 K rows per bank and k=4, implementation of all three predictors (dead-time, zero-live-time, and open-page) requires about 30 KB of memory and relatively simple additional logic circuits. For comparison, the solution from [15] would require 16 KB of memory and additional logic circuits for the same DRAM structure. Based on their results in Table 2, their average improvement over the Open Page strategy is 15.8%, while ours is in the range from 25.9% to 29.7%. In paper [11] the amount of scheduler hardware is not given, but its structure allows one to conclude that it is large. The same conclusion holds for the off-the-shelf components described in [19], [20]. This tendency towards more complex DRAM controllers opens a possibility for implementing the proposed predictors in the near future.
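To put these storage figures in perspective, the following back-of-the-envelope sketch converts them into bits per DRAM row. It assumes binary units (1 KB = 1024 bytes, 1 MB = 2^20 bytes) and that the memory is partitioned into ranks × banks × rows; the function name is hypothetical, and the per-row view is only a normalization, not a statement about how the predictor tables are actually organized.

```python
def predictor_budget(capacity_bytes, ranks, banks, rows_per_bank, storage_bytes):
    """Normalize predictor storage to the number of DRAM rows (illustrative only)."""
    total_rows = ranks * banks * rows_per_bank
    return {
        "total_rows": total_rows,
        "row_bytes": capacity_bytes // total_rows,
        "bits_per_row": storage_bytes * 8 / total_rows,
    }


ours = predictor_budget(512 * 2**20, 4, 4, 4096, 30 * 1024)
theirs = predictor_budget(512 * 2**20, 4, 4, 4096, 16 * 1024)
```

Under these assumptions there are 65536 rows of 8 KB each, so about 30 KB of predictor storage corresponds to roughly 3.75 bits per row, and 16 KB corresponds to 2 bits per row.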
