Abstract-Scaling down DRAM technology degrades cell reliability due to increased coupling between adjacent DRAM cells, commonly referred to as crosstalk. Moreover, a high access frequency of certain cells (hot cells) may cause data loss in neighboring cells in adjacent rows due to crosstalk, which is known as row hammering. In this work, the goal is to mitigate row hammering in DRAM cells through a Counter-Based Tree (CBT) approach. This approach uses a tree of counters to detect hot rows and then refreshes neighboring cells. In contrast to existing deterministic solutions, CBT uses fewer counters, which makes it practical to implement on-chip. Compared to existing probabilistic approaches, CBT more precisely refreshes rows vulnerable to row hammering based on their access frequency. Experimental results on workloads from three benchmark suites show that CBT can reduce the refresh energy by more than 60 percent and nearly 70 percent in comparison to leading probabilistic and deterministic approaches, respectively. Furthermore, hardware evaluation shows that CBT can be easily implemented on-chip with only a nominal overhead.
INTRODUCTION
Dynamic random-access memory (DRAM) has been a widely used storage element in computer systems. Over time, process technology has scaled DRAM cells towards greater information density by reducing the technology feature size. However, shrinking process technologies cause DRAM cells to become significantly less reliable [1], [2], [3]. As chip density increases with technology scaling, the interaction between circuit components, such as transistors, capacitors, or wires, increases, leading to voltage fluctuations. Specifically, when the cumulative interference effects of a word line in DRAM become strong enough, the state of nearby cells can change, leading to memory errors. Fast, repeated accesses to a small number of words in memory exploit this crosstalk, which is known as row hammering. Vulnerability to row hammering exists in recent sub-40 nm commodity DRAM chips due to physical limitations of these technologies. This crosstalk is expected to increase as feature size shrinks [4]. Yoongu Kim et al. [5] showed that by frequently alternating the charge of specific memory locations, electromagnetic coupling can be used intentionally to affect the charge of adjacent cells.
One simple solution to mitigate row hammering is to increase the refresh rate for all rows. Although this approach can be successful, it imposes an unnecessary power and performance overhead on the system [6], [7], [8], [9]. Currently, there are two main approaches to mitigate row hammering in DRAM. The first is to use a probabilistic row activation approach [5] via a biased random number generator in the memory controller to refresh the word lines, or victim rows, that are adjacent to the most frequently accessed lines, or aggressor rows. Although the idea behind this approach is simple, it results in an early refresh of victim rows [6]. The second approach is to detect the aggressor rows and then refresh the victim rows. A simple method to deterministically recognize the aggressor rows, called Static Counter Assignment (SCA), is to dedicate a counter per row to keep track of the number of row activations. However, having one counter per row imposes a significant area and power overhead on the memory system. To avoid this high cost, row activation counters can be stored in a reserved area of the main memory. To mitigate the performance penalty of retrieving counter values from main memory, a dedicated on-chip counter cache was proposed in [4].
Due to row access locality in DRAM [10], many counters in SCA would go unused. Thus, we propose a Counter-Based Tree (CBT) approach, which dynamically assigns counters to frequently accessed "hot" rows. Hence, with a small number of on-chip counters it is possible to deterministically refresh victim rows while achieving low latency, low power consumption, and acceptable area overhead. When CBT results in a highly unbalanced tree, it provides a significant advantage in refresh energy over SCA, while CBT converges to a balanced tree when accesses in memory are uniform.
INEFFICIENCY OF STATIC COUNTER ASSIGNMENT
DRAM is built from a two-dimensional array of cells, with memory cells at the intersections of bit lines and word lines. Each memory cell comprises an access transistor and a data storage capacitor. Due to capacitive coupling between cells on adjacent word lines, if an aggressor row is accessed frequently, voltage levels on victim rows can be disturbed through crosstalk (row hammering). The row hammering threshold is defined as the minimum number of row activations in the aggressor row before one or more cells in the victim rows are disturbed. Mitigating row hammering is possible by refreshing the victim rows before the aggressor rows reach the row hammering threshold. SCA deterministically prevents bit flips resulting from row hammering by counting the number of accesses to each row. Unfortunately, this solution requires a large area and power overhead [4], [5].
One intuitive solution, derived from SCA, is to use fewer counters by partitioning the rows in each memory bank into fixed-size groups and assigning one counter per group. To illustrate this generalization of SCA, we assume that every bank in DRAM includes $N$ rows and uses $M$ counters. The row hammering threshold, $T$, determines the size of every counter as $\log_2 T$ bits. This approach, called $\mathrm{SCA}_M$, divides the rows into $M$ groups, each including $N/M$ rows. Specifically, Fig. 1 depicts $M = 8$ active counters for a given DRAM bank with $N$ rows that $\mathrm{SCA}_8$ partitions into eight equal-size groups. For every row activation, the row address maps to the appropriate counter. Then, the corresponding counter counts the number of accesses. When the counter reaches the threshold, it is reset and a refresh signal is sent to the memory controller to refresh $N/M + 2$ rows: the $N/M$ rows in the group plus the two rows adjacent to the group, which guarantees the refresh of any row in or adjacent to the group subjected to row hammering.
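The grouping scheme above can be sketched in a few lines of Python. This is an illustrative model only, not the authors' hardware: the class name `SCABank`, its method names, the requirement that M divides N, and the clamping of the refresh range at the bank edges are all assumptions made for the sketch.

```python
class SCABank:
    """Illustrative SCA_M model: M static counters over N rows (hypothetical names)."""

    def __init__(self, n_rows, n_counters, threshold):
        if n_rows % n_counters:
            raise ValueError("this sketch assumes M divides N")
        self.n_rows = n_rows                      # N
        self.group_size = n_rows // n_counters    # N / M rows per group
        self.threshold = threshold                # T
        self.counts = [0] * n_counters

    def activate(self, row):
        """Count one activation; return the rows to refresh, or None."""
        g = row // self.group_size                # map the row address to its counter
        self.counts[g] += 1
        if self.counts[g] < self.threshold:
            return None
        self.counts[g] = 0                        # counter is reset on reaching T
        lo = g * self.group_size - 1              # row just below the group
        hi = (g + 1) * self.group_size            # row just above the group
        # Clamp at the bank edges (assumption: edge groups have one neighbor only).
        return range(max(lo, 0), min(hi, self.n_rows - 1) + 1)
```

For an interior group, the returned range covers N/M + 2 rows, matching the count given in the text; groups at either end of the bank return one row fewer in this sketch.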
While the area overhead of SCA is directly proportional to the number of counters, the energy overhead in SCA originates from activating the counters when memory is accessed and refreshing $N/M + 2$ rows when a counter reaches $T$. Fig. 2 breaks down the energy overhead of $\mathrm{SCA}_M$ during a 64 ms auto-refresh period when the number of counters $M$ ranges from 16 to 65,536. For a small number of counters, the energy spent refreshing victim rows dominates the energy spent activating counters. In contrast, the energy of activating counters dominates when the number of counters increases significantly. The key observation from Fig. 2 is that the total energy is minimized when SCA uses 64 counters. In this case, $\mathrm{SCA}_{64}$ not only reduces the total energy overhead in comparison to $\mathrm{SCA}_{65536}$, but also decreases the area overhead by more than two orders of magnitude.
The analysis of row access frequency of DRAM banks on real workloads reveals that row accesses during the refresh interval are not uniform: mostly, a small group of rows is activated in each DRAM bank. For example, Fig. 3 depicts the row access frequency of a given bank for two typical real workloads, blackscholes and facesim, within one refresh interval (64 ms). It shows not only that a small group of rows is activated during the refresh interval, but also that a small number of activated rows dominate overall accesses. Accordingly, allocating a large number of counters to the many infrequently activated rows is inefficient in both power consumption and area. However, it should be noted that SCA has a potential downside: it limits the number of counters assigned to the aggressor rows and may use counters inefficiently if there is a poor match between the group partitioning and the distribution of aggressor rows. To address this limitation, we propose a dynamic row partitioning scheme called the Counter-Based Tree. The Counter-Based Tree retains the benefits of SCA but also dynamically adjusts the group sizes so that they are better correlated with access frequency in order to save more refresh energy.
THE COUNTER-BASED TREE APPROACH
In order to better assign row partitions to access counters, we introduce the Counter-Based Tree, a new and practical dynamic row partitioning technique that considers the access frequency of rows to assign counters with improved energy and area efficiency. The key insight behind CBT is that row partitioning can be made in a fine-grained manner by tracking the number of row accesses and finding hot rows with the highest access frequencies. To split an initial group of rows (e.g., a bank or some other uniform coarse partition) into groups of suitable sizes, CBT defines different sub-thresholds that identify access frequency stages prior to reaching the row hammering crosstalk threshold. These sub-thresholds are used to build a non-uniform binary tree structure that maps hot rows to smaller groups, while cold rows, i.e., rows with nominal access frequencies, are mapped to larger groups. This aligns access counters to increasingly small groups of rows that contain an aggressor row to more precisely identify actual victim rows. Fig. 4 depicts two binary trees built by CBT, where a terminal node represents an active counter and a non-terminal node represents an expired counter that has been split into two counters. The level of a node is defined as its distance from the root, with the root being at level zero. The levels of CBT are associated with unique thresholds such that when a node reaches the corresponding sub-threshold, it splits, generating two child counters initialized to the current count value (i.e., it activates a second counter as a clone of the existing counter). This approach grows the tree until all available counters are activated or a counter reaches the refresh threshold.
A Simplified CBT Example
More precisely, assume that we limit the number of levels in the tree to $K$. We define $K$ sub-thresholds $T_0, \ldots, T_{K-1}$, where $T_0 \le \cdots \le T_{K-1}$ and $T_{K-1} = T$. Each of the $M$ counters in a bank, $C_0, \ldots, C_{M-1}$, has $\log_2 T$ bits and, initially, only $C_0$ is in active mode. When a counter at level $k$ reaches sub-threshold $T_k$, it splits and two counters are activated at level $k + 1$. This process continues until all the counters are activated. For example, Fig. 4 shows two CBTs for $K = 6$ and $M = 8$. The CBT in Fig. 4a results from a non-uniform row access pattern, which causes more counters to be allocated to the hot row area (smaller blocks) and grows the tree through level 6. In contrast, when the row access frequency is uniform, counters are distributed uniformly throughout the bank addresses as shown in Fig. 4b. In this case, the CBT approach grows the tree only through level 4 and mimics SCA (see Fig. 1).
In CBT, all $N$ rows in one bank are initially treated as a single group to which $C_0$ is allocated. As soon as $C_0$ reaches $T_0$, CBT splits $C_0$ into $C_0$ and $C_1$ with the same starting value of $T_0$. In this case, $C_0$ counts the number of accesses when the row address is between $0$ and $N/2 - 1$, and $C_1$ counts the number of accesses when the row address ranges from $N/2$ to $N - 1$. CBT continues this process until it activates all counters and no group can be split into smaller sub-groups. At this point, the sub-thresholds of all counters are set to $T$. Note that the minimum number of rows in a given group depends on the number of defined sub-thresholds. Specifically, with $K$ sub-thresholds, the minimum number of rows per group is $N / 2^{K-1}$. The CBT tree is rebuilt at intervals equal to the refresh interval (64 ms for DRAM generations [7], [11]). The active counters in CBT conservatively count the number of row accesses for the corresponding rows even if those rows are auto-refreshed.
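As a worked illustration of the splitting rule, the group size at each level follows from repeated halving of the bank's address range. The value N = 65,536 rows per bank in the last line is an assumption chosen only for illustration; K = 10 matches the number of sub-thresholds used in the evaluation.

```latex
\begin{align*}
\text{level } 0 &: [0,\ N-1] && N \text{ rows} \\
\text{level } 1 &: [0,\ N/2-1],\ [N/2,\ N-1] && N/2 \text{ rows each} \\
\text{level } k &: \text{groups of } N/2^{k} \text{ rows} \\
\text{level } K-1 &: \text{groups of } N/2^{K-1} \text{ rows (minimum size)} \\
K = 10,\ N = 65{,}536\ (\text{assumed}) &: N/2^{9} = 128 \text{ rows}
\end{align*}
```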
CBT Algorithm
Algorithm 1 shows the process for refreshing rows under the CBT structure per memory bank. The CBT structure has two main parts: the Counter Module (CM), which records the number of row accesses, and the Reconfiguration Counter Module (RCM), which activates and initializes counter modules. Assuming $M$ counters in a given bank, CBT requires an array of $M$ counter modules (ACM) and one RCM. Each counter module $CM_i$ maintains two registers, $L_i$ and $U_i$, to store the lower and upper row addresses assigned to this counter, and a register $k_i$ to store the index of the sub-threshold used for that counter. While the tree is being constructed, $k_i$ stores the level of the counter. After the tree is constructed (all counters are deployed), $k_i = K - 1$ to enforce the threshold $T$. Initially, only the first counter module, $CM_0$, is activated with $L_0 = 0$, $U_0 = N - 1$, and $k_0 = 0$. The RCM maintains a register, last_activated (initialized to 0), as a pointer to the last counter that was activated.
Algorithm 1. CBT row refresh per memory bank. Recoverable steps of the listing: line 18 sets $U_{last\_activated} = U_i$; line 23 tests whether last_activated equals $M - 1$; line 24 iterates over $i = 0, \ldots, M - 1$.
After initialization, every time a row is accessed, its address is located in the range $[L_i, U_i]$ of some active $C_i$, and this counter is incremented (lines 5-7). When $C_i$ reaches $T_{k_i}$, $flag_i$ is raised (lines 8-10), which triggers the RCM to activate a new counter as long as the number of active counters is less than $M$ and the counter level $k_i < K - 1$ (lines 15-16). When a new counter is activated, it is initialized with the value of $C_i$ (line 17), and the interval between $L_i$ and $U_i$ is split into two equal-size ranges: the lower bound of $C_i$ remains unchanged and the upper bound of $C_i$ is assigned to the upper bound of the new counter. Then, $U_i$ shrinks to $\lfloor (L_i + U_i)/2 \rfloor$ and the lower bound of the new counter is set to $U_i + 1$ (lines 18-20). The sub-threshold indices of both counters are set to $k_i + 1$ (lines 21-22). When $CM_i$ has reached the highest threshold, $T_{k_i} = T$ (lines 10-12), $C_i$ is reset and the signal $R_i$ is raised to cause the memory controller to refresh all existing rows in the address range from $L_i - 1$ to $U_i + 1$ (see footnote 2). Note, however, that when all counters are activated, CBT sets the index of all sub-thresholds to $K - 1$. Meanwhile, the lower and upper bounds of the counters remain unchanged until the CBT is rebuilt.
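The per-bank procedure described above can be modeled in software as follows. This is a minimal behavioral sketch, not the authors' Algorithm 1 or their Verilog: the names `CounterModule`, `CBTBank`, `activate`, and `rebuild` are hypothetical, the sketch assumes strictly increasing sub-thresholds, and two behaviors are choices made here rather than taken from the text (clamping the refresh range at the bank edges, and promoting a single-row group that cannot split directly to the final threshold).

```python
from dataclasses import dataclass


@dataclass
class CounterModule:
    lo: int         # L_i: lowest row address covered
    hi: int         # U_i: highest row address covered
    level: int      # k_i: index of the sub-threshold in force
    count: int = 0  # C_i: activations counted so far


class CBTBank:
    """Behavioral sketch of one bank's counter-based tree (hypothetical API)."""

    def __init__(self, n_rows, n_counters, sub_thresholds):
        ts = list(sub_thresholds)                 # T_0 .. T_{K-1}, with T_{K-1} = T
        if any(a >= b for a, b in zip(ts, ts[1:])):
            raise ValueError("sketch assumes strictly increasing sub-thresholds")
        self.n_rows, self.m, self.ts = n_rows, n_counters, ts
        self.last_level = len(ts) - 1             # K - 1
        self.rebuild()

    def rebuild(self):
        """Start over with a single counter covering the bank (once per refresh interval)."""
        self.cms = [CounterModule(0, self.n_rows - 1, 0)]

    def activate(self, row):
        """Record one activation; return (first_row, last_row) to refresh, or None."""
        cm = next(c for c in self.cms if c.lo <= row <= c.hi)
        cm.count += 1
        if cm.count < self.ts[cm.level]:
            return None
        if cm.level == self.last_level:           # reached T: reset and refresh
            cm.count = 0
            return max(cm.lo - 1, 0), min(cm.hi + 1, self.n_rows - 1)
        self._split(cm)
        return None

    def _split(self, cm):
        if len(self.cms) < self.m and cm.lo < cm.hi:
            mid = (cm.lo + cm.hi) // 2
            # The new counter clones the count and takes the upper half of the range.
            self.cms.append(CounterModule(mid + 1, cm.hi, cm.level + 1, cm.count))
            cm.hi, cm.level = mid, cm.level + 1
        else:
            cm.level = self.last_level            # sketch-specific fallback
        if len(self.cms) == self.m:               # all counters deployed: enforce T
            for c in self.cms:
                c.level = self.last_level
```

As a small usage example, with 16 rows, 4 counters, and sub-thresholds 2, 4, 6, 8, hammering row 5 eight times splits the bank into the ranges 0-3, 4-5, 6-7, and 8-15, and the eighth activation returns the refresh range (3, 6). A linear search over the active counters is used here for clarity; a hardware design would compare the address against all ranges in parallel.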
EVALUATION
To evaluate the effectiveness of our proposed technique, we performed simulations using the memory system simulator USIMM, modeling 55 nm DRAM [11], for 18 workloads from the Memory Scheduling Championship [13]. These workloads cover a variety of benchmarks, including commercial applications and the PARSEC, SPEC, and Biobench suites. We simulated a system with the settings listed in Table 1. Verilog implementations were synthesized using Synopsys Design Compiler targeting a 45 nm FreePDK standard cell library [14] (see footnote 3). For CBT, we considered 10 sub-thresholds and varied the number of counters to study the trade-off between performance and hardware overhead. In our study, we assume that $T = 32{,}768$ [4], which implies 15-bit counters. Table 2 shows the area overhead of CBT and SCA when the number of counters ranges from 16 to 256. Results show that the main part of the CBT overhead originates from the registers required for counting and for recording the lower and upper bounds of counters. Due to on-chip area and power overheads, only CBT with a relatively small number of counters is practical. Since every bank requires $M$ counter modules, the total CBT area overhead is correlated with the number of counters. Note that, assuming 16 DRAM banks with 64 counters per bank, the total CBT area is 0.94 mm², which is about 1.29 percent of the die area of commodity DDR3 [15]. When 256 counters are used per bank, the CBT area increases fourfold to occupy 5.2 percent of the die area.
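To make the register budget behind these area figures concrete, the following sketch counts the state bits of one counter module: the counter itself plus the lower bound, upper bound, and sub-threshold index registers described earlier. It is a rough estimate under stated assumptions, not the synthesized design: the function names are hypothetical, N = 65,536 rows per bank is assumed for illustration, and control logic and comparators are ignored.

```python
def ceil_log2(n):
    """Smallest b such that 2**b >= n, for integer n >= 1."""
    return (n - 1).bit_length()


def counter_module_bits(n_rows, threshold, n_levels):
    """Estimated register bits in one counter module (storage only)."""
    count_bits = ceil_log2(threshold)   # log2(T) bits, e.g. 15 for T = 32,768
    addr_bits = ceil_log2(n_rows)       # one row address, used for both L_i and U_i
    level_bits = ceil_log2(n_levels)    # index k_i into the K sub-thresholds
    return count_bits + 2 * addr_bits + level_bits


# Assumed N = 65,536 rows per bank; T and K as stated in the evaluation.
bits_per_module = counter_module_bits(n_rows=65_536, threshold=32_768, n_levels=10)
bits_per_bank = 64 * bits_per_module    # 64 counter modules per bank
```

Under these assumptions each module holds 51 bits, of which 32 are the two bound registers, which is consistent with the observation that counting and bound registers account for most of the overhead.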
Hardware Overhead
With respect to static power, both CBT and SCA add little overhead. For example, CBT with 64 counters increases static power by 0.294 mW, which is roughly 1.37 percent of the static power consumption per bank [16], [17]. Also, results show that with 16, 64, and 256 counters, the energy per access [7], [17], [18] is increased by 0.73, 2.86, and 9.44 percent, respectively. Our implementation shows that the maximum latency for CBT is 5.11 and 5.01 ns when 256 and 64 counters are used, respectively, which is much lower than the row activation latency in the DRAM memory system [19]. Note, however, that updating the CBT and activating the row can be done in parallel. Given that the leading counter-based approach dedicates a 128 KB on-chip counter cache [4] (equivalent storage to 8,192 counters per bank), the 64 or 256 counters per bank for CBT should be feasible.
2. We assume that either the memory controller knows which rows are physically adjacent to each other [5] or the DRAM chip is directly responsible for refreshing the row and its neighbors [12].
3. It is commonplace for DRAM to trail CMOS by a technology generation. Systems with 45 nm CPUs were concurrent with 55 nm main memories.
Refresh Energy Overhead
In the evaluation, the refresh energy overhead during a refresh period (64 ms) is the energy consumption due to refreshing victim rows (obtained via USIMM). Ten sub-thresholds are used for CBT and all results are normalized to $\mathrm{PARA}_{0.001}$. Fig. 5 shows the normalized refresh energy overhead for different techniques, where the biased random number generators for PARA [4], [5] are set to 0.001 and 0.005, and 64 counters are used for CBT and SCA. The figure shows that CBT achieves the best energy overhead reduction in comparison to other approaches. CBT reduces the energy overhead by about 62, 69, and 92 percent compared to $\mathrm{PARA}_{0.001}$, SCA, and $\mathrm{PARA}_{0.005}$, respectively. We also conducted a sensitivity analysis on the number of counters, since the refresh energy reduction in CBT is correlated with the number of counters. According to Fig. 6, CBT with 16 counters and $\mathrm{PARA}_{0.005}$ have similar refresh energy, while SCA has the worst refresh energy overhead. When the number of counters increases, CBT can significantly reduce energy overhead in comparison to PARA, whose implementation is independent of the number of counters and depends on the probability of the biased random number generator. Also, CBT outperforms SCA due to non-uniform partitioning of the row address space. When the number of counters ranges from 16 to 256, the 10 defined sub-thresholds allow CBT to split the row addresses more finely than SCA and thus reduce the refresh energy overhead further. For example, CBT with 128 counters reduces the energy overhead by about 63, 81, and 96 percent against $\mathrm{SCA}_{128}$, $\mathrm{PARA}_{0.001}$, and $\mathrm{PARA}_{0.005}$, respectively.
CONCLUSION AND FUTURE WORKS
Ensuring reliability is a key challenge for DRAM, as technology scaling exacerbates the word-level crosstalk problem exploited by row hammering. We presented CBT, a counter-based tree approach that detects frequently accessed rows and refreshes heavily victimized rows. A suitable selection of counters and sub-thresholds allows CBT to deterministically mitigate row hammering while being feasible for on-chip implementation at the expense of a nominal overhead. Experimental results on real workloads show that, on average, CBT reduces the refresh energy by about 62 percent against the leading approach while incurring area, latency, and energy overhead per access of about 0.94 mm², 5.01 ns, and 0.28 nJ, respectively. Note that as the core/thread count of processors increases, the number of hot regions in memory also increases. In our future work, we plan to experiment with larger numbers of cores. We also plan to find the optimum number and values of the sub-thresholds for CBT. Furthermore, we plan to study in more detail the impact of the mapping of the address space to physical rows in the memory hardware. Finally, we will explore the trade-offs between hardware overhead and CBT performance.
