ABSTRACT Recent advances in non-volatile memory (NVM) technology offer more capacity, scalability, and durability than regular volatile memory. However, adopting emerging memory introduces new challenges such as limited endurance and performance degradation. This paper proposes a software wear leveling technique designed specifically for the scenario where the NVM is managed by generational garbage collectors, which divide the heap into separate young and old spaces. Prior work found that the majority of write traffic goes to the young generation, and thus proposed a hybrid generational heap in which NVM serves as the old space and DRAM as the young space. We further investigate the access pattern in the NVM-resident old generation. This paper first highlights a common observation across various workloads: the write distribution of the old space is highly unbalanced. Specifically, among all writes to the NVM, 56%-96% go to 2% of memory pages, and in the worst cases, 83%-96% of the traffic goes to only 0.5% of pages. This imbalance can be leveled by swapping only a very small fraction of NVM pages during garbage collection cycles, which makes a low-overhead runtime approach possible. We maintain write intensity counters for each thread and insert a new garbage collection phase, in which a few of the hottest pages are selected and remapped. Results show that our approach significantly mitigates the skew, with the Gini coefficient dropping from 0.95 to 0.60, which extends lifetime by 41 times on average. Meanwhile, it incurs a performance cost of 7.2% longer mutation time, measured on emulated memory without performance penalty, and 3.31% extra stop time for garbage collection cycles.
I. INTRODUCTION
DRAM scalability can no longer keep pace with the ever-growing need for more memory as the working sets of modern applications continue to expand [1]. This motivates researchers to explore integrating emerging memory technologies in order to benefit from the advantages they offer: high density, scalability, lower standby power, and persistence. However, these advantages come at a price. For example, phase change memory (PCM) [2], [3], one representative NVM technology, has a write latency 2 to 5 times longer than DRAM and endures 10^8 writes per cell versus more than 10^15 for DRAM. Prior work [4]-[7] shows that existing hardware and software would suffer severe performance degradation and a short usable lifetime if PCM were integrated without enhancements that address these shortcomings.
While manufacturing keeps evolving, research efforts at the architectural and software levels aim to solve the endurance problem. In general, there are two methods: 1) reducing the total number of wear-inducing memory writes and 2) spreading them evenly across the device, also known as wear leveling.
Traditional NVM wear leveling mechanisms are hardware-based solutions implemented at either the controller or the chip level. Basically, they fall into two categories: those using static algebraic mappings to determine the target memory locations, and those using associative mappings and dynamically selecting relocation targets. Start-gap [4] is a representative method of algebraic mapping. It shifts an empty line (called the gap line) to its adjacent line every 100 writes. Once the gap line has traversed the whole memory, every memory line has rotated to its neighboring location. No address mapping table is needed, as physical lines are tracked by two pointers, START and GAP, which point to the current locations of the first line and the gap line respectively. Although it costs few hardware resources, start-gap alone is not an effective wear leveling method for the scenario where NVM holds the old space of a generational heap, as explained later in this section.
On the other hand, methods in the second category distribute wear dynamically at the cost of more hardware resources, i.e., an associative mapping table storing the mapping from logical to physical addresses, write counters for triggering wear leveling, and circuits implementing complex control logic. Segment swapping [8] maintains a counter per segment and triggers a swap when any counter crosses the threshold; the least written segment is then selected as the migration target. PCM-aware swap [9] sacrifices accuracy in exchange for lower hardware cost. It triggers a swap when the total number of writes crosses the threshold, which requires only a single global counter, and instead of searching for the lowest segment counter, it selects a random target location.
Besides all prior work on architecture and the software stack, we identify that automatic memory management, also known as garbage collection, cannot properly handle the unique characteristics of NVM. To deal with the endurance issue, Gao et al. [10] proposed a hole-tolerant garbage collector which allows applications to execute on PCM with imperfect pages. We consider this approach not viable in cloud computing environments, where multiple tenants may run various types of applications, and both native and managed ones may run on the same host, sharing the same physical resources. In addition, the collector lacks the ability to utilize hybrid memory for performance gains. The write-rationing garbage collector [11] organizes DRAM-PCM hybrid memory with a generational heap scheme, where a small DRAM contains the young space and a large PCM holds the old space. It monitors objects that survive the latest minor GC and moves those identified as cold to the PCM heap, assuming that perfect wear leveling is provided by the underlying OS or hardware. Although it reduces the total write traffic to PCM, it ignores the actual access pattern on the PCM heap.
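As a minimal sketch of the algebraic mapping just described, the following Python model keeps N logical lines in N+1 physical lines and moves the gap every psi writes. The class and attribute names are illustrative assumptions, not taken from [4], and the model ignores all hardware details; it only shows that two pointers suffice to translate addresses while lines rotate.

```python
class StartGap:
    """Toy model of start-gap: N logical lines stored in N+1 physical lines."""

    def __init__(self, n_lines, psi=100):
        self.n = n_lines
        self.psi = psi                    # gap moves once every psi writes
        self.start = 0                    # START pointer
        self.gap = n_lines                # GAP pointer: index of the empty line
        self.mem = [None] * (n_lines + 1)
        self.writes = 0

    def translate(self, logical):
        """Map a logical line to its current physical line without a table."""
        physical = (logical + self.start) % self.n
        if physical >= self.gap:
            physical += 1                 # skip over the gap line
        return physical

    def _move_gap(self):
        if self.gap == 0:
            # Gap wraps around: the last physical line moves into line 0.
            self.mem[0] = self.mem[self.n]
            self.gap = self.n
            self.start = (self.start + 1) % self.n
        else:
            # The neighbor below the gap moves up into the gap.
            self.mem[self.gap] = self.mem[self.gap - 1]
            self.gap -= 1

    def write(self, logical, value):
        self.mem[self.translate(logical)] = value
        self.writes += 1
        if self.writes % self.psi == 0:
            self._move_gap()

    def read(self, logical):
        return self.mem[self.translate(logical)]
```

In this model a line written repeatedly only changes physical location when the gap passes over it, which is why a very hot line in a large memory can accumulate many writes between moves.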
In response, we construct a similar hybrid heap, and then further examine the access patterns in the old space by measuring the write counts of each 4K page. The results show that the write distribution is highly unbalanced: 96% of writes go to 2% of pages, and in some extreme cases as few as 0.5% of pages endure 86% to 99% of mature space writes. In contrast, we find that non-generational heaps receive much more balanced wear, but the total number of writes is about 3.3 times more than that of the mature space. This is because the newly created objects residing in the nursery space endure the most writes. The result is consistent with previous work.
This paper presents a wear leveling system implemented in a generational garbage collector (GC). The main idea is very similar to the second category of hardware wear leveling: it captures heavily written segments and distributes them to cold ones at GC cycles. Facilities such as the associative mapping and the write counters are ordinary software objects, which are much easier to deploy and more flexible to change. This is viable because the observed unbalanced access pattern can be exploited to relieve the inherent performance concerns of a software approach. Specifically, the fact that hot areas are rare and very prominent among all the segments (see the peaks in Figure 1a) makes the following operations cheap. First, it is easy to capture the outstanding peaks without sorting the counters. In fact, we achieve sub-linear time complexity by exploiting the sparse access pattern. Second, wear leveling only needs to migrate a small number of pages during GC cycles. Lastly, it is reasonable to choose random migration targets without searching for the least written segments, since most segments are cold. However, one concern remains: is the frequency of GC high enough for wear leveling to tackle the imbalance in time? Post-run simulations in Section II-D confirm that this approach successfully reshapes the distribution, and Section III-B.1 gives an optimization for the situation where some areas are too hot to be handled in time.
We do not consider start-gap an effective method for this situation. As Qureshi et al. [4] state, start-gap is vulnerable to the repeat address attack, which can cause line failures by repeatedly writing to the same line. This attack is possible if the memory is large enough that a line wears out before migrating to its next location. The unbalanced access pattern we observe resembles this kind of attack to some extent. In response, start-gap divides memory into multiple regions and performs wear leveling independently within each region, such that each line is guaranteed to migrate at least once before reaching a threshold. The problem is that, while it manages to equalize extremely unbalanced writes in a localized manner, the cold regions outside the hot areas cannot be exploited to absorb the wear. Nevertheless, localized start-gap is complementary to our method, which aims to distribute writes globally and lacks the ability to further balance them within local regions. Results show that the benefit is greatest when our global wear leveling is deployed on NVM equipped with this low-cost hardware technique. Figure 1c illustrates how the proposed wear leveling reshapes the distribution under workload h2. Figures 1b and 1d compare the effects of start-gap on the original heap and on the wear-leveling-enabled heap. Note that for better visual effect, each bar represents the average number of writes over one megabyte. Here the mature space size used for workload h2 is 1000 MB. Section IV-B.2 compares plain localized start-gap, plain GC wear leveling, and the combined method.
In order to monitor the hotness of the mature space, the runtime maintains write counter arrays on a per-mutator basis. Each element holds the accumulated number of writes to the corresponding memory region since its last swap. We enable the write barrier to sample the writes and update the counters. Although the JVM samples on the critical path, the performance cost is reasonable, because 1) writing to PCM is much slower than writing to DRAM, where the counters are updated, 2) the sampling interval can be fairly large, and 3) as mentioned above, only a small proportion of writes go to the mature space. In addition, the space overhead is negligible. Each counter is 64 bits wide and corresponds to 8K of memory, thus introducing merely 0.1% extra space for each mutator.
We insert a dedicated wear leveling phase into GC cycles. All counter arrays are scanned in parallel, and the few most written pages are identified and swapped with randomly picked cold pages.
To quantitatively evaluate how effectively write traffic is reshaped on the PCM heap, we use the Gini coefficient as a measure of equality. The mature space measures 0.95, versus 0.65 for the full heap under non-generational GC. We mitigate the inequality of the old generation space, as the mean Gini coefficient drops from 0.95 to 0.60. NVM lifetime is extended by 41 times on average, given a worn-out threshold of 1%. Meanwhile, the performance overhead is reasonable: 7.2% longer mutator execution time, emulated by DRAM without inserting any simulated access delay, and 3.31% extra GC stop time, obtained by emulating 10 times lower write throughput for PCM.
To the best of our knowledge, this is the first attempt to solve the unbalanced wear problem in a managed runtime system. In summary, the contributions are:
• Identifying the common write imbalance problem in mature space, which is exploited in the design of our algorithm.
• Proposing a novel garbage collector wear leveling mechanism that exploits the facilities provided by the managed runtime system.
• Demonstrating through detailed experiments that it effectively reshapes the write traffic and improves lifetime while incurring a reasonable performance penalty.
The remainder of this paper is organized as follows. Section II describes the background of NVM, generational garbage collectors using hybrid memory, and our motivation. Section III presents the design and implementation. Section IV gives the evaluation results and section VI summarizes this paper.
II. BACKGROUND AND MOTIVATION

A. Emerging Memory Technology
DRAM technology is facing difficulty in scaling cells to smaller feature sizes. As cells shrink further, the capacitor can hold less electric charge, which causes reliability issues or even leads to security exploits [12], [13]. Emerging memories such as PCM are drawing attention from industry and academia as they offer more storage density and non-volatility. They blur the boundary between the conventional concepts of volatile memory and persistent storage, since NVM can be utilized both as a non-volatile facility, as in [5], [6], [14], and as a constituent of applications' virtual memory [15]-[18]. Non-volatile memory stores data by a different principle than DRAM. In the case of PCM, zeros and ones are encoded by resistance. Writing to PCM sets the resistance by heating up cells and then cooling them down into an amorphous or crystalline state. Due to the physical nature of the state transition on writes, two obstacles hinder the practical use of PCM. First, access latency, especially for writes, is 2-5 times longer than DRAM, and write throughput is about 10 times lower, which results in drastic performance degradation if no optimization is applied. Second, cells are much more vulnerable to wear: the estimated average lifetime of PCM is 10^8 writes per cell versus more than 10^15 for DRAM. This paper focuses on enabling the managed runtime system to prolong the lifetime of its NVM heap.
B. Generational GC
Automatic heap management, one of the key features that managed runtime systems provide, abstracts raw pointers away from programmers, who are relieved of manual life cycle management of heap-resident data structures. Our JVM platform, Jikes RVM [19], supports 'stop-the-world' garbage collectors. Stop-the-world refers to a runtime execution scheme in which all application threads must stop before the collector threads are woken up to do their jobs. The garbage collection literature often uses the term 'mutator' to refer to the entity that allocates new objects and updates pointers. In this paper, mutators are equivalent to JVM threads that execute application code.
Modern garbage collectors usually follow the generational hypothesis that most objects die shortly after their allocation. Consequently, the managed heap is divided into young and old generations, also called nursery and mature. Mutators by default allocate new objects into the young generation by bumping the pointer that points to the free area. Once the nursery runs out, a minor GC cycle is triggered, at which point most young objects have usually died, and the survivors are promoted into the mature space. Then the entire nursery is reclaimed simply by resetting the bump pointer to the nursery's start address. Previous work [11] found that young objects are more active than older ones, as 70% of total write traffic goes to the nursery space. It is therefore intuitive to put the nursery in DRAM and the mature space in PCM. Nevertheless, as we will describe in section III-B, the old generation still occupies a small volume of DRAM to contain the hot old objects. Write barriers intercept all updates to reference fields that point from the mature space to the nursery, and record them in data structures for use by the next minor GC.
C. OS support for hybrid memory
The hybrid memory heap and the proposed wear leveling mechanism assume the underlying operating system is capable of mapping a portion of the heap's virtual address space to a specified type of physical memory (DRAM or PCM), which enables the JVM to organize generations in separate address ranges and on different memory substrates. Another requirement for the OS is the ability to dynamically switch the memory type of a given page, for the purpose of swapping hot PCM pages with DRAM ones. Kannan et al. [17] propose using NVM as a separate NUMA node and applying custom NUMA policies. Their design allows applications to scale seamlessly across DRAM and PCM nodes, and hence meets our requirements of the underlying OS for allocating and switching to explicitly specified memory resources. However, to the best of our knowledge, there is no literature discussing how the OS manages the wear information of each PCM page and how it interacts with a wear-aware application above it. So far, no interface exists by which applications can pass memory wear information to the operating system, nor can they request memory of a certain wear degree from the OS. This OS-JVM co-design for wear-aware memory management is left to future work. In the current design, we assume that all memory newly allocated from the OS is fresh and intact. The wear leveling aims to balance the writes occurring locally within the JVM process.
D. UNBALANCED WRITE DISTRIBUTION
To examine how workloads spread writes over the virtual memory space, we develop a write monitor to count mutators' writes to each 4K page. An overview of this monitor is presented in section IV-A. We test the runnable workloads 1 from DaCapo benchmark version 9.12 with 3 generational collectors (gencopy, genms, genimmix) and 4 non-generational ones (mark-sweep (ms), mark-compact (mc), immix, semispace). Table 1 shows the ratio of writes to the most heavily written pages. In general, mature heaps exhibit more concentrated wear distributions, making them more prone to worn-out holes. In the worst case, as little as 0.5% of the mature space endures almost all the wear, which motivates a GC-time wear leveling mechanism that only swaps a small number of the hottest pages with other locations.
To demonstrate that this approach is viable, we implement an optimistic post-run simulation to estimate upper-bound results. First we run each workload without wear leveling, in order to record the number of GCs triggered during execution and the write distribution of the NVM heap. Algorithm 8 takes the data obtained above as input 2. It simulates the ideal case by selecting the top hottest pages, dividing the writes of each into gc equal parts, and then adding each part to one of gc randomly selected pages. Fundamentally, the difference between this post-run simulation and actual GC wear leveling is that the former has knowledge of the final distribution after the workload finishes. Figure 2 shows the ratio of writes to the top 0.5% of pages on both generational and non-generational heaps with a conservative value of top. This proof of concept demonstrates that it is possible to reshape the unbalanced distribution by merely moving a very small number of pages at the frequency of GC cycles.
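The redistribution step of the post-run simulation can be sketched as follows. This is an illustration under stated assumptions rather than a transcription of Algorithm 8: the function name and parameters are hypothetical, the hot page's recorded writes are assumed to be fully redistributed, and the random targets are drawn without repetition for each hot page.

```python
import random


def simulate_postrun(writes, gc_count, top, seed=0):
    """Spread the writes of the `top` hottest pages over random pages.

    writes   -- recorded per-page write counts of the NVM heap
    gc_count -- number of GC cycles observed in the recorded run
    top      -- how many of the hottest pages to redistribute
    """
    rng = random.Random(seed)
    result = list(writes)
    hottest = sorted(range(len(writes)), key=lambda i: writes[i], reverse=True)[:top]
    for page in hottest:
        share = writes[page] / gc_count   # one equal part per GC cycle
        result[page] -= writes[page]      # assumption: the hot page keeps none
        # Assumption: gc_count distinct targets (requires gc_count <= pages).
        for target in rng.sample(range(len(writes)), gc_count):
            result[target] += share
    return result
```

For instance, a single page with 10000 writes among 999 pages with one write each, redistributed over 20 GC cycles, leaves no page above 501 writes while the total is preserved.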
III. GC-TIME WEAR LEVELING
This section presents the design and implementation of GC-time wear leveling. It operates in three stages: 1) sampling the write intensity on a per-mutator basis during mutation cycles, 2) detecting the most written memory regions, and 3) copying and remapping the selected regions. The latter two operate during GC cycles.
We insert a composite phase WEARLEVEL into the phase stack between completeClosurePhase and finishPhase. This means our wear leveling phase is scheduled right after regular collection tasks complete. WEARLEVEL consists of three simple phases: scanCardTable is executed in parallel by all collector threads within mutator context; prepareMap is a single thread task executed within global context; and mapCards operates in parallel by collector threads within collector context.
A. SAMPLING THE WRITE INTENSITY
1) CARD TABLES
The purpose of this stage is to record the accumulated wear intensity of each virtual memory region since its last swap. By using the term 'region' rather than 'page', we intend to count writes at a coarser granularity than a page. We refer to this basic unit as a region, which in the current implementation is 8K in size. Each counter is referred to as a card 3, which itself is 64 bits wide, more than enough to store the number of writes. We refer to the array of cards (counters) as the card table. It is maintained on a per-mutator basis, in order to avoid synchronization when updating cards. A coarser counting granularity is adopted because: (1) it reduces the additional space overhead, as each card table requires less than 1% of memory for 8K regions; (2) hot areas tend to be clustered in space, so larger regions lead to fewer cards being touched. Since updating cards is executed on the mutators' critical execution path, keeping it cache-friendly is crucial to performance. On the other hand, overly coarse regions may result in regions containing cold pages being constantly swapped, incurring unnecessary performance cost.
2) WRITE BARRIERS
We extend the write barriers to sample the workload's writes to the mature heap and update the corresponding cards. Initially, our implementation did the bookkeeping on every write of interest. We found the resulting mutation performance degradation unacceptable (see figure 12). To reduce the overhead of bookkeeping, write barriers ignore most writes and sample at a fixed interval. Specifically, each write barrier keeps a counter. When invoked, the barrier first increments the counter by one and checks whether it is a multiple of the given sampling interval. If so, the barrier then determines whether the current write is of interest by checking whether the destination address is in the range of the mature heap, and updates the cards accordingly. Results show that with a sampling interval of 32, this optimization reduces the performance cost from 110% to 7.1%, measured on a DRAM-based NVM emulator.
Omitting some writes of interest may lead to failing to capture certain hot regions. Besides sampling, we explored another option: capturing only writes of selected types. The write-rationing garbage collector [11] reports that monitoring only reference writes has negligible negative effects in their design. Unfortunately, we found that this approach incurs a significant loss in wear leveling effect (see figures 4 and 5), even though it successfully reduces the performance penalty (results are omitted in the evaluation). We also tried the combination of reference and int types. It performs better than the former, but still suffers from relatively low lifetime improvements. Therefore, our final implementation adopts the sampling approach, which trades a small loss in leveling effect for a significant performance boost.
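The sampling logic described above can be sketched in Python as follows. The class name, the address arithmetic, and the choice to add one to the card per sampled write are assumptions for illustration; the actual barrier lives inside the JVM and is not shown in the text.

```python
REGION_SHIFT = 13  # 8K regions: address >> 13 gives the region index


class SamplingBarrier:
    """Per-mutator write barrier that samples one write in every `interval`."""

    def __init__(self, mature_start, mature_end, interval=32):
        self.mature_start = mature_start
        self.mature_end = mature_end
        self.interval = interval
        self.tick = 0
        # One card (counter) per 8K region; assumes an 8K-aligned mature space.
        self.cards = [0] * ((mature_end - mature_start) >> REGION_SHIFT)

    def on_write(self, address):
        self.tick += 1
        if self.tick % self.interval != 0:
            return                      # most writes take only this fast path
        if self.mature_start <= address < self.mature_end:
            index = (address - self.mature_start) >> REGION_SHIFT
            self.cards[index] += 1      # assumption: one count per sampled write
```

Because the range check and the card update run only on sampled writes, the common case costs an increment and a comparison.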
B. DETECTING THE HOTTEST REGIONS
The rest of this section describes the design of the GC-time wear leveling operations. Phase scanCardTable is scheduled first when WEARLEVEL commences. In this phase, all mutators' card tables are scanned in parallel by all collector threads. The goal is to select the top N cards of each table for swapping. Although many classic algorithms have been proposed to solve the selection problem [20], we only need to pick out a limited number of cards whose values are significantly greater than the others, so we use a threshold-based approach, which is much simpler and well suited to the access pattern (see Algorithm 19). It scans the card tables linearly and puts any card whose value is greater than the threshold min_card, set empirically to 1000, into an array called the candidate set. Candidate sets are also maintained on a per-mutator basis and store at most N candidate cards. When a new card is to be inserted into a full candidate set, the algorithm picks a random member of the set and replaces it with the new card if the member's value is less than the new card's. At most 10 such attempts are made (see lines 8-16 in Algorithm 19). Note that min_card should be temporarily updated to the value of the newly inserted card, to avoid unnecessary attempts for cards selected afterwards. This simple algorithm exploits an implication of the mature space's highly skewed distribution: the difference between the top few cards and the rest is vast. Therefore, a high threshold filters out the cards of no interest.
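A simplified Python sketch of this threshold-based selection is given below. It is not a line-by-line rendering of Algorithm 19: the function name and parameters are hypothetical, and the temporary update of min_card mentioned above is deliberately left out to keep the sketch short.

```python
import random


def select_candidates(cards, max_candidates, min_card=1000, max_attempts=10, rng=None):
    """Return up to max_candidates (index, value) pairs for cards above min_card."""
    rng = rng or random.Random(0)
    candidates = []
    for index, value in enumerate(cards):      # single linear scan, no sorting
        if value <= min_card:
            continue                           # cold card: filtered by threshold
        if len(candidates) < max_candidates:
            candidates.append((index, value))
            continue
        # Candidate set is full: try to evict a randomly chosen smaller member.
        for _ in range(max_attempts):
            slot = rng.randrange(len(candidates))
            if candidates[slot][1] < value:
                candidates[slot] = (index, value)
                break
    return candidates
```

The random eviction does not guarantee that the exact top N cards are kept, but with a highly skewed distribution the set is rarely full, so the approximation seldom matters.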
1) OPTIMIZATION 1: REMAPPING TOO HOT REGIONS TO DRAM
We find that under certain workloads, the basic Algorithm 19 does not work as well as under others. In other words, some regions are so hot that the interval between consecutive GCs is too long, and too many writes accumulate before the next GC is triggered. In addition, NVM-resident hot regions cause major performance degradation. Therefore, we extend the basic algorithm with a few extra operations: before determining whether a card is a candidate (line 4), it first checks whether the card should be remapped to DRAM by comparing it with the threshold TOO_HOT_COUNT, which is set to 10000. If the condition is met, the system then checks whether the card has already been remapped or is being remapped by another thread. If not, the system performs the remapping in 3 steps: 1) it copies the data of the card's region to DRAM; 2) it requests the OS to remap the virtual address to DRAM; 3) it zeros the card and sets its highest bit to mark the 'in-DRAM' state of this card. These operations are wrapped together in a mutual exclusion lock to ensure they are executed atomically. When these marked cards are encountered at the next GC cycle, all bits but the marker are cleared.
We reclaim DRAM resources at full GC, wherein each marked card is checked to see whether it was still hot during the last mutation cycle by comparing it to TOO_HOT_COUNT. Cards that are deemed not too hot are migrated back to PCM by a similar atomic operation. We find that even if the to-DRAM remapping is permanent, it consumes very little DRAM. In most cases, less than 4 MB of DRAM is allocated for the hot mature heap, except for xalan, which requests 18 MB of memory in total. Therefore, given the relatively low pressure on DRAM, it is reasonable to reclaim at a frequency as low as that of full GC.
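The card encoding and the three-step remapping can be illustrated with the sketch below. The two callbacks stand in for the data copy and the OS remapping request, which are not modeled; their names, the function names, and the strict comparison against TOO_HOT_COUNT are assumptions. For simplicity the sketch takes the lock around the checks as well as the three steps.

```python
import threading

IN_DRAM_BIT = 1 << 63     # highest bit of a 64-bit card marks the 'in-DRAM' state
TOO_HOT_COUNT = 10_000    # threshold value given in the text


def remap_if_too_hot(cards, index, lock, copy_to_dram, os_remap):
    """Remap the region of cards[index] to DRAM if it is too hot; return True if done."""
    with lock:                          # checks and steps appear atomic to other scanners
        card = cards[index]
        if card & IN_DRAM_BIT:
            return False                # already remapped by this or another thread
        if card <= TOO_HOT_COUNT:
            return False                # assumption: strictly greater means too hot
        copy_to_dram(index)             # 1) copy the region's data to DRAM
        os_remap(index)                 # 2) ask the OS to remap the virtual address
        cards[index] = IN_DRAM_BIT      # 3) zero the count and set the marker bit
        return True


def count_of(card):
    """Write count stored in a card, ignoring the in-DRAM marker bit."""
    return card & ~IN_DRAM_BIT
```

Keeping the marker in the card itself means the scanner needs no separate structure to remember which regions currently live in DRAM.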
2) OPTIMIZATION 2: SPEEDING UP LINEAR SCANNING
The time complexity of Algorithm 19 is O(N), i.e., linear in the number of cards scanned. Results show that the card tables are very sparse: on average, only 3.5% of the cards scanned are non-zero at each GC. This reveals an opportunity to reduce the time complexity to sub-linear. To accelerate the linear scan, zero cards should be skipped as much as possible. To this end, a second-level card table, called the dirty table, is introduced to mimic the function of dirty bits in a page table, but at a coarser granularity. Each element of the dirty table is one byte long and corresponds to R cards. Write barriers are also modified to set these elements 'dirty' when any of the corresponding regions is sampled. Consequently, scanners scan the dirty tables first, and once a dirty element is encountered, they turn to scan the corresponding R cards.
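The two-level scan can be sketched as follows; the function names are hypothetical, and the returned count of touched entries is added only to make the saving visible.

```python
def mark_dirty(dirty, card_index, r):
    """Called from the write barrier when a card is updated."""
    dirty[card_index // r] = 1


def scan_with_dirty_table(cards, dirty, r):
    """Return (indices of non-zero cards, number of table entries touched)."""
    nonzero = []
    touched = 0
    for d, flag in enumerate(dirty):
        touched += 1                      # one dirty-table element inspected
        if not flag:
            continue                      # skip R clean cards at once
        base = d * r
        for i in range(base, min(base + r, len(cards))):
            touched += 1                  # one card inspected
            if cards[i]:
                nonzero.append(i)
    return nonzero, touched
```

With 256 cards, R = 64 and two dirty elements, the scan touches 4 dirty-table entries plus 128 cards instead of all 256 cards.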
Parameter R is the resolution of the dirty table, which determines the acceleration ratio of this optimization. The proportion of scanned array items (including card tables as well as dirty tables), p(R), estimates the acceleration. Suppose the mature space of workload w has n cards, and the function D_w(R) denotes the ratio of marked bytes in the dirty table at each scan. The original number of accessed items is n. With the help of the dirty table, the accessed items total n/R · D_w(R) · R cards and n/R dirty table elements. Therefore the proportion is expressed as:
\[ p(R) = \frac{\frac{n}{R} \cdot D_w(R) \cdot R + \frac{n}{R}}{n} = D_w(R) + \frac{1}{R} \]
In principle, to maximize acceleration for workload w, p(R) should be minimal. As the resolution R increases, D_w(R) goes higher and 1/R goes lower. We find R = 64 works best for most workloads, except avrora, which performs best at R = 32. Figure 10 presents detailed results. VOLUME 6, 2018
C. REMAPPING HOT REGIONS
Upon completion of the first phase scanCardTable, each mutator context holds its own list of candidates to be processed. Before the actual remapping work, phase prepareMap is scheduled to build a single global view of all candidates. First it copies the cards from each mutator's candidate set into a buffer and removes duplicates. This step is necessary because we need to ensure that there are no races among the remapping threads and that no card is remapped twice. Next, the system generates random destinations. The following condition should be satisfied: if we denote the final swap as a mapping X → Y, where X is the set of regions to be mapped and Y is the set of random destination regions, then X ∩ Y should be empty, and Y should contain no duplicates. To this end, an auxiliary bitmap is introduced to assist the production of Y. It records the members of X and the destinations already generated, in order to ensure that a new destination is neither a candidate itself nor a duplicate before it is inserted into Y. After generating Y, the final step pushes all members of X and Y into two queues, from which they are retrieved by the next phase, mapCards.
In the last phase, the swapping work is executed by all collector threads in parallel. To swap one region, a worker thread retrieves a source and a destination region from the respective global queues, swaps the contents of the two regions via a third DRAM buffer, and then asks the operating system to swap the virtual memory mappings of source and destination. For example, if the virtual-to-physical mappings of the source and destination regions are V_src → P_src and V_dst → P_dst before the swap, they become V_src → P_dst and V_dst → P_src afterwards. Since we emulate PCM using regular DRAM and an NVM-aware OS is not yet available, we emulate this operation with the system call mremap(2). Currently, the Linux kernel does not provide convenient ways for user space programs to manipulate virtual memory mapping information. We emulate one swap by invoking mremap(2) 3 times: mremap(V_src, p), mremap(V_dst, V_src), mremap(p, V_dst). Results show that this expedient method takes more time than copying the data does. Nevertheless, it only extends GC stop time by less than one percent.
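The generation of Y under the constraints above can be sketched as follows. The function name is hypothetical, and a bytearray with one flag per region stands in for the auxiliary bitmap.

```python
import random


def pick_destinations(sources, n_regions, rng=None):
    """Deduplicate sources (X) and draw distinct random destinations (Y), X and Y disjoint."""
    rng = rng or random.Random(0)
    sources = list(dict.fromkeys(sources))       # remove duplicates, keep order
    if 2 * len(sources) > n_regions:
        raise ValueError("not enough regions for disjoint destinations")
    taken = bytearray(n_regions)                 # stand-in for the auxiliary bitmap
    for s in sources:
        taken[s] = 1                             # members of X cannot be destinations
    destinations = []
    for _ in sources:
        while True:
            d = rng.randrange(n_regions)
            if not taken[d]:
                taken[d] = 1                     # also blocks duplicates within Y
                destinations.append(d)
                break
    return sources, destinations
```

Since only a few regions are candidates at each GC, almost every random draw is accepted on the first try.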
IV. EVALUATION
This section evaluates our wear leveling mechanism and answers the following questions:
• How effectively does it mitigate the concentrated wear, and by how much does it improve the device lifetime?
• How do write sampling approaches impact system performance and wear leveling effects?
• How much does it impact GC latencies?
We first describe the experimental methodology and then analyze the results.
A. METHODOLOGY
a: PCM EMULATION AND EXPERIMENTS SETUP
Since PCM devices are not yet commercially available, they are usually emulated by DRAM. Prior JVM-related work [11], [21] reports that the full system simulator Sniper [22], [23] works fine with Jikes RVM. Another option is the NVM emulator Quartz [24]. However, we leave precise PCM performance emulation on the JVM to future work, since neither of these tools can be adopted without major customization. In the current evaluation methodology, the emulated PCM has the same performance model as DRAM.
We implement the proposed wear leveling mechanism in the latest Jikes RVM 3.1.4 using the state-of-the-art generational garbage collector GenImmix [25]. We pick 6 workloads from the bugfix version of the DaCapo suite 9.12 [26]: avrora(50), h2(1000), lusearch(150), jython(200), sunflow(280), and xalan(400), all of which are multi-threaded except for jython. The numbers in parentheses are the sizes (in MB) of the mature spaces used by each workload. To eliminate as much as possible the uncertainty introduced by the dynamic optimizing just-in-time (JIT) compiler [27], [28], we follow established practice for Java performance evaluation [29]. In the profiling run, each workload is executed 10 times to fully warm up the JVM, and then an optimization plan called compilation advice data is recorded. Next, in the testing run, each workload is executed for 6 iterations. In the first iteration, the compilation advice is applied to the JIT compiler to produce optimal code all at once. The statistics averaged over the remaining 5 iterations are reported. We leverage the callback mechanism provided by DaCapo to reset the performance-related counters and force a full GC at the end of each iteration. All tests are run on a host with one Intel Core i7-6700 processor that supports 4 hyper-threaded cores, and 4x32Kb 8-way L1 data/code cache and 32 16-way L3 cache.
b: EVALUATION METRIC AND LIFETIME MODEL
In economics, Gini coefficient is a measure of statistical dispersion to represent the wealth distribution of a nation's residents. The upper bound value 1.0 expresses completely inequity among numbers, e.g., for a large population, just one has the total income. Meanwhile lower bound zero represents the absolute equity. We borrow this idea to evaluate the effectiveness of our wear leveling approach. Note that calculation excludes those pages that endure no writes, since we only care about the equality among impaired pages.
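For reference, the metric can be computed from per-page write counts as in the sketch below, which uses the standard sorted-values formula for the Gini coefficient and drops pages with zero writes as stated above. The function name is an assumption, and the sketch is not taken from the evaluation scripts.

```python
def gini(write_counts):
    """Gini coefficient of per-page write counts, ignoring unwritten pages."""
    values = sorted(w for w in write_counts if w > 0)
    n = len(values)
    if n == 0:
        return 0.0
    total = sum(values)
    # Standard formula on ascending values x_1..x_n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum((i + 1) * v for i, v in enumerate(values))
    return 2 * weighted / (n * total) - (n + 1) / n
```

Perfectly even counts give 0, while a distribution in which one page out of many takes nearly all writes approaches the upper bound.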
Although the Gini coefficient quantifies the effect of wear leveling, a reduction in this metric does not necessarily reflect the magnitude of the lifetime improvement. We consider a PCM device unusable when t% of its pages reach their endurance limits. This ratio is referred to as the worn-out threshold. Suppose $S_t^{*}$ and $S_t$ denote the sets of the top t% hot pages with and without wear leveling, and the ith page endures $w_i^{*}$ and $w_i$ writes respectively; then we estimate the lifetime improvement (X) as follows:
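One form consistent with these definitions, under the assumption that lifetime is inversely proportional to the writes absorbed by the hottest t% of pages, is the ratio of the two write totals:

$$X = \frac{\sum_{i \in S_t} w_i}{\sum_{i \in S_t^{*}} w_i^{*}}$$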
This model assumes a fault-tolerant OS that can replace worn-out pages without terminating or crashing the affected processes.
B. WEAR LEVELING EFFECTIVENESS
1) GINI COEFFICIENTS
Figure 4 shows the Gini coefficients of 7 configurations at a sampling interval of 1. Writes are highly unbalanced: in most cases the no-WL coefficient is close to 1.0. WL reduces the coefficient from 0.95 to 0.60 on average. jython is the most resistant to wear leveling in both the actual experiments and the simulation. This reveals the major limitation of our wear leveling approach: it has to observe PCM enduring writes for at least one mutation cycle before it can react. jython exhibits a large number of dense 'one-shot' bursts of writes that occur within a single mutation cycle, which leaves PCM defenseless. Under some workloads, WL works better than the simulation. There are two reasons for this: 1) the parameter top in Algorithm 8 is too conservative, and 2) the simulation lacks the capability to remap overly hot regions to DRAM (see avrora, h2).
WL-hot disables optimization-1, which has an evident impact on avrora, h2, and sunflow. On average the coefficient increases by 0.10 compared with WL. We find no clear correlation between the impact and the number of remapped regions. For example, WL-hot increases the coefficients of both avrora and h2 by 0.16, although the optimization remaps about 9 times more regions to DRAM for avrora. Fortunately, this optimization only requires 1.1 to 18.2 MB of additional DRAM.
WL-ref and WL-ref-int only sample selected types of writes. We are surprised to see that neither has an effect comparable to WL. In particular, WL-ref nearly fails to reduce the coefficients compared with no-WL. Prior work suggests that frequent reference writes, observed on objects that survive minor GCs, predict hot objects. In contrast, our results lead to the opposite conclusion: monitoring reference writes alone cannot capture most hot regions of the mature space. Nevertheless, WL-ref-int works better than WL-ref, as it reduces the coefficient by 0.15 compared to no-WL. Unfortunately, the absolute numbers are still high under most workloads, except for xalan. The results suggest that the write barrier should monitor all write types in order to capture hot regions accurately for the purpose of wear leveling.
Figure 3 shows the impact of the sampling interval. In general, lower sampling frequencies do compromise equality, but only by a very small margin. Although a higher frequency produces better equality, it incurs unbearable mutation overhead, as Figure 12 shows. Therefore, we accept slightly weaker wear leveling in exchange for much better performance by picking a sampling interval of 32. The following experiments are all conducted with this parameter unless explicitly mentioned.
2) LIFETIME IMPROVEMENTS
WL improves lifetime by 15x to 85x. We are not surprised to see this significant boost considering the extreme inequity of the original access pattern. As a consequence of being unable to handle overly hot regions in time, WL-hot drops the mean ratio from 85x to 13x. This reveals the vital importance of capturing and redirecting hot regions. In addition, for WL-hot, the ordering of workloads sorted by Gini coefficient is consistent with the ordering sorted by lifetime improvement, e.g., lusearch and xalan have the lowest Gini coefficients and also the largest lifetime improvements. WL-ref has a very limited effect, as its ratios range from 1.09 to 5.86. In contrast, WL-ref-int increases lifetime by 5x to 59x.
Figure 6 plots lifetime as a function of the worn-out threshold t. As t increases, the lifetime ratio first drops and then stabilizes. The reason is that as t increases, more pages are included, and the difference in total writes between no-WL and WL shrinks. Hence, the lines drop at first. As t continues to increase, more irrelevant pages are included and the ratios become flat. All WL-hot lines eventually approach 1.0.
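The shape of these curves can be reproduced with a small sketch. It assumes that the lifetime ratio at threshold t is the total writes on the hottest t% of pages without wear leveling divided by the same total with wear leveling. That reading of the model, the helper names, and the synthetic write counts are all assumptions made for illustration.

```python
import math


def top_writes(write_counts, t_percent):
    """Total writes absorbed by the hottest t% of pages (at least one page)."""
    k = max(1, math.ceil(len(write_counts) * t_percent / 100.0))
    return sum(sorted(write_counts, reverse=True)[:k])


def lifetime_ratio(no_wl, wl, t_percent):
    """Assumed model: lifetime scales inversely with writes on the top t% pages."""
    return top_writes(no_wl, t_percent) / top_writes(wl, t_percent)


# Synthetic example: 100 pages, one page takes every write without leveling,
# while leveling spreads the same 1000 writes evenly.
no_wl = [1000] + [0] * 99
wl = [10] * 100
curve = {t: lifetime_ratio(no_wl, wl, t) for t in (1, 10, 100)}
```

Because the synthetic leveled heap keeps the total write count unchanged, the ratio falls toward 1.0 as t grows, which mirrors the behavior described for the WL-hot lines.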
a: COMPARING TO IDEAL WL
We compare our WL mechanism with a hypothetical ideal WL (iWL) that distributes writes perfectly evenly over the PCM space. The lifetime ratio of WL to iWL is defined as follows:
where avg denotes the average number of writes per PCM page, and $P_t$ equals the number of members in $S_t^{*}$. Figure 7 plots the ratio X as a function of the worn-out threshold t. At t = 10%, avrora and sunflow achieve 95% and 97% of the iWL lifetime, and the average over all workloads is 55%. It is worth noting that at a worn-out threshold of 15%, avrora and sunflow exceed 1.0, because WL also reduces the total write count.
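A form consistent with the symbols defined next, under the assumption that every page endures avg writes with ideal leveling and that lifetime is inversely proportional to the writes on the top t% of pages, is:

$$X = \frac{avg \cdot P_t}{\sum_{i \in S_t^{*}} w_i^{*}}$$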
b: COOPERATING WITH START-GAP HARDWARE WL
We implement localized start-gap in the counting space. The size of each independent wear leveling region is 8 MB. We apply start-gap to no-WL and WL respectively. Figure 8 compares the relative lifetime extensions on heaps with and without WL at a worn-out threshold of 1%. Start-gap extends lifetime by 8.91x for heaps governed by WL, but by only 4.19x for no-WL. This shows that the hardware method is more effective when combined with our software wear leveling, which indicates that start-gap works better on a heap with more balanced writes. Table 3 shows the lifetime relative to ideal WL (also at a 1% worn-out threshold) with plain start-gap, plain WL, and the combined method. Start-gap alone achieves only 1-2% of the lifetime of ideal wear leveling, while the combined method achieves 33-400%. The results show that while start-gap hardware alone handles the access patterns of the mature space poorly, it improves significantly when it cooperates with software that not only migrates hot regions but also reduces total writes.
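For readers unfamiliar with start-gap, the following sketch models a single wear leveling region. It follows the commonly described scheme with a Start register, a Gap register, and one spare line. The class name, the list-backed storage, and the small `psi` used in the demonstration are illustrative assumptions and do not reproduce the counting-space implementation evaluated here.

```python
class StartGapRegion:
    """One start-gap region: n logical lines stored in n + 1 physical slots."""

    def __init__(self, n_lines, psi=100):
        self.n = n_lines
        self.psi = psi                    # writes between two gap movements
        self.start = 0                    # Start register
        self.gap = n_lines                # Gap register: the spare slot
        self.mem = [None] * (n_lines + 1)
        self.wear = [0] * (n_lines + 1)   # physical writes per slot
        self.writes = 0

    def translate(self, la):
        """Map a logical line address to its current physical slot."""
        pa = (la + self.start) % self.n
        return pa + 1 if pa >= self.gap else pa

    def _move_gap(self):
        if self.gap == 0:
            # Wrap around: the last slot moves to slot 0 and Start advances.
            self.mem[0] = self.mem[self.n]
            self.wear[0] += 1
            self.gap = self.n
            self.start = (self.start + 1) % self.n
        else:
            # Shift the line just before the gap into the gap.
            self.mem[self.gap] = self.mem[self.gap - 1]
            self.wear[self.gap] += 1
            self.gap -= 1

    def write(self, la, value):
        pa = self.translate(la)
        self.mem[pa] = value
        self.wear[pa] += 1
        self.writes += 1
        if self.writes % self.psi == 0:
            self._move_gap()

    def read(self, la):
        return self.mem[self.translate(la)]


# Hammer one logical line; the rotation spreads the wear over all slots.
region = StartGapRegion(n_lines=16, psi=4)
for la in range(16):
    region.write(la, ("init", la))
for k in range(10000):
    region.write(0, k)
```

The rotation is oblivious to which lines are hot, which is why a heap whose hottest regions have already been migrated by software leaves less imbalance for it to absorb.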
C. PERFORMANCE
Performance evaluation consists of two aspects. First we measure the GC delay, then we present mutation overheads.
1) GC STOP TIME
The prolongation of the GC cycle originates from the newly inserted phase WEARLEVEL. Figure 9 illustrates the normalized time consumed by WEARLEVEL and compares WL with WL-hot. In the worst case, WL and WL-hot cost xalan 5.34% and 5.65% extra delay, and on average GC stop time is prolonged by 3.31% and 3.78% respectively. WL-hot takes 0.02% to 1.6% more time than WL because an extra small group of overly hot regions, which would otherwise be mapped to DRAM, has to be scanned and swapped to other PCM regions. We divide the delay into two parts: scan time, which is induced by scanCardTable, and mapping time, which is the sum of prepareMap and mapCards. In most cases, scanning operations dominate the GC overhead. The most extreme case is jython, where the least time is spent on mapping, which corresponds to its worst wear leveling effect. On the other hand, avrora is exceptional in that it spends almost as much time remapping hot regions as scanning.
WL-dirty disables the dirty table and instead scans the entire sparse card tables. As described in section III-B.2, we hypothesize that minimizing the expression $p(R) = D_w(R) + 1/R$ leads to the highest acceleration ratio. Figures 10b and 10c confirm this hypothesis. For most workloads, p(R) hits bottom at R = 32 or 64, where the acceleration ratio also peaks. The only exception, avrora, works best at R = 16. Furthermore, it exhibits the most drastic curve in the former figure, while showing no evident changes in acceleration ratio. Although the average speedup ratio reaches its maximum of 17.05 at R = 32, we pick 64 in our implementation, which achieves a slightly lower mean speedup of 16.86 but requires only half the space for dirty tables.
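The trade-off behind p(R) can be illustrated with a short sketch. It assumes that D_w(R) is the fraction of dirty-table entries that are marked when one entry covers R cards, so that 1/R stands for the cost of scanning the dirty table and D_w(R) for the cost of scanning the cards beneath marked entries. This interpretation, the function names, and the synthetic write pattern are assumptions. Section III-B.2 holds the actual definition.

```python
def dirty_fraction(written_cards, n_cards, r):
    """Fraction of dirty-table entries marked when one entry covers r cards."""
    n_entries = -(-n_cards // r)  # ceiling division
    marked = {card // r for card in written_cards}
    return len(marked) / n_entries


def scan_cost(written_cards, n_cards, r):
    """Assumed normalized cost p(R) = D_w(R) + 1/R."""
    return dirty_fraction(written_cards, n_cards, r) + 1.0 / r


def best_ratio(written_cards, n_cards,
               candidates=(1, 2, 4, 8, 16, 32, 64, 128, 256)):
    """Pick the candidate R with the smallest assumed scan cost."""
    cards = set(written_cards)
    return min(candidates, key=lambda r: scan_cost(cards, n_cards, r))


# Synthetic pattern: one cluster of 64 written cards in a 4096-card table.
chosen = best_ratio(range(64), 4096)
```

A small R makes the dirty table itself expensive to scan, while a large R marks many clean cards as candidates, so the minimum falls in between.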
2) MUTATOR EXECUTION TIME
The sampling operation of the write barriers described in section III-A extends mutation time. Figure 11 plots the relative numbers of three sampling policies at the default sampling interval of 32. Although the overhead of WL-ref is about half that of WL, it sacrifices much of the fidelity in capturing hot regions of the mature space, as Figure 4 indicates. Note that these results are measured using DRAM as emulated PCM, and no synthetic delay is inserted to simulate the long write latency of PCM. Considering the 10 times longer write delay of a real PCM device, the mean 7.2% overhead of the write barriers is acceptable. However, the lack of a PCM simulator prevents us from evaluating the performance gain from remapping overly hot mature regions to DRAM. We plan to build a PCM performance emulator for managed runtime systems like Jikes RVM, which will take an approach similar to Quartz, which manages NVM through a remote NUMA node. The major differences are that it will support the mmap(2) API and dynamically insert delays at GC time to avoid conflicts with the JVM's own timer system.
The impact of the sampling interval is presented in Figure 12. If every write is sampled, the overhead is unbearable: most workloads (sunflow, h2, xalan, and avrora) take more than 200% of the no-WL time. Nevertheless, the overhead drops evidently as the interval grows from 1 to 16. Starting from 32, the mutation time of all workloads stabilizes. In conclusion, Figure 12 and Figure 3 show that sampling at a lower frequency can successfully throttle the mutation overhead while maintaining an acceptable wear leveling effect.
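The effect of the interval on bookkeeping work can be seen in a minimal sketch of an interval-sampled barrier. It is not the Jikes RVM barrier from section III-A: the class name, the 4 KB region granularity, and the single shared counter are assumptions chosen for illustration.

```python
from collections import Counter

REGION_SHIFT = 12  # hypothetical granularity: one counter per 4 KB region


class SamplingWriteBarrier:
    """Record only every `interval`-th write to approximate region hotness."""

    def __init__(self, interval=32):
        self.interval = interval
        self.pending = 0          # writes seen since the last sample
        self.hotness = Counter()  # region index -> sampled write count

    def on_write(self, address):
        self.pending += 1
        if self.pending < self.interval:
            return                # fast path: no bookkeeping
        self.pending = 0
        self.hotness[address >> REGION_SHIFT] += 1

    def hottest(self, k):
        """Return the k regions with the most sampled writes."""
        return [region for region, _ in self.hotness.most_common(k)]


barrier = SamplingWriteBarrier(interval=32)
for _ in range(3200):
    barrier.on_write(0x5000)      # heavily written region
for _ in range(320):
    barrier.on_write(0x9000)      # lightly written region
```

With an interval of 32, only one write in 32 touches the counters, yet the relative ranking of the two regions is preserved, which is the property the wear leveling phase relies on.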
3) VIRTUAL MEMORY REMAPPING OVERHEAD
Swapping the underlying physical memory of two virtual regions involves updating page table entries, multiple lock acquisitions, flushing the CPU cache, and translation lookaside buffer (TLB) invalidations, all of which are performed by the system call mremap(). First we evaluate the overhead caused by mremap itself. Table 4 shows the proportion of time spent on this operation in the phase mapCards. It reveals that the overhead of virtual memory remapping overwhelms that of data copying, as over two thirds of the mapCards time is consumed by calls to this system call, which is not designed for user space programs to manipulate virtual memory mappings. We expect that a new dedicated system call could mitigate this problem.
In order to evaluate the impact of cache and TLB misses on mutators, we compare mutator behavior between the full-function JVM and a modified one that cancels the last phase, mapCards. We find that under all workloads the differences in execution time stay within 0.2%. Since this difference is too small to be conclusive, we additionally record the numbers of level-1 data cache misses and TLB misses at the end of each mutation cycle using the Linux profiling tool perf. Table 5 lists the normalized results of the full-function JVM as well as the fraction of swapped pages. The numbers show that swapping 1% of virtual pages during GC cycles causes only 1.6% more L1 cache misses and 0.5% more TLB misses, leading to negligible mutator performance penalties.
V. RELATED WORK
NVM lifetime improvement techniques. Existing wear leveling techniques fall into two categories: algebraic-based and request-based [30]. The former aims to design a static algebraic mapping scheme that applies to all cases, while the latter seeks to make dynamic remapping decisions in response to runtime access patterns. Fine-grained static methods [8], [31], [32] redirect writes within a line, row, and page respectively, and can complement our method, which rebalances writes across pages. Start-gap [4] is a typical algebraic-based method that moves one line to its neighboring location after every 100 writes, so that the whole memory is repeatedly rotated and writes are spread uniformly over all lines. Although algebraic-based methods are relatively easy to implement and require fewer resources, they lack optimizations for worst-case or highly repetitive access patterns.
Zhou et al. [8] proposed a request-based algorithm that periodically swaps high-write and low-write memory segments. It relies on a custom memory controller that keeps track of the write count of each segment and stores the mapping between virtual and physical segment addresses. A similar hardware request-based proposal [33] maintains a table in memory that records write counts. When a page's write count exceeds the threshold, the controller interrupts the processor, which in turn allocates a new page via a DRAM allocator and copies the PCM page to the newly allocated DRAM page. Considering that the data access pattern in embedded systems is fixed, a software wear leveling algorithm [34] generates an assignment plan in PCM for each variable such that the write count of each address is less than or equal to a given threshold. Long et al. further explored a compiler-level technique that transforms a frequently written variable into an array, thus reshaping write traffic to a single memory location into an evenly distributed pattern. Recent work [35]-[37] identified endurance issues in existing persistent memory file systems and proposed new randomized allocators or space management schemes for data and metadata organization. Our work is another request-based software method, one that applies to managed runtime systems.
a: INTEGRATION OF EMERGING MEMORY TO MANAGED RUNTIME SYSTEM
Gao et al. [10] explored a novel approach in which, instead of applying wear leveling to avoid worn-out holes in PCM, the heap manager is utilized to tolerate the failures with the help of hardware and the OS. When a failure happens, the controller interrupts the OS, which records the defective lines of each map. Then the runtime is notified, and the objects located in the affected lines are moved to other lines. The limitation of this approach is that it only considers the PCM-only use case, so it cannot benefit from the performance advantage of hybrid memory. Moreover, it may leave a holey NVM device that cannot be shared with other native applications running on the same host, which creates a memory provisioning problem in cloud computing scenarios. Jantz et al. [38] designed a cross-layer framework to optimize energy efficiency. The runtime system first gathers the access patterns of objects in a profiling run and then, with the support of the OS, allocates hot and cold objects into separate ranks that have different power configurations. Although the framework targets DRAM power saving, its principles have the potential to apply to PCM/DRAM hybrid memory systems. Akram et al. [11] found that most write traffic occurs in the nursery space. They proposed a write-rationing garbage collector that uses PCM as mature space storage. It monitors the objects that survive a minor collection. Those identified as cold are moved to the PCM mature space and the rest are moved to the DRAM mature space. This work inspired us to further investigate the access pattern in the mature space. We focus on resolving the endurance issue that this collector ignores. Besides using emerging memory as a traditional volatile heap, Wu et al. [39] extended the JVM to enable Java programmers to exploit the persistence that emerging memory provides. They designed a persistent heap to manage non-volatile data as normal Java objects. Moreover, this new heap is equipped with a recovery mechanism to ensure crash consistency for metadata. Their method achieves a significant performance boost because it bypasses the serialization and file system involvement that are inevitable in traditional Java persistence programming. Yet, this persistent heap lacks optimization for the endurance issue.
VI. CONCLUSIONS
We first discovered the highly unbalanced write distribution on PCM when it serves as the substrate for the old generation heap. We then presented a pure software wear leveling solution for a generational hybrid memory garbage collector. It samples runtime information during mutation cycles and swaps hot memory regions during GC cycles. The inequality is evidently mitigated, as indicated by Gini coefficients dropping from 0.95 to 0.60, and lifetime improves 41 times on average. At a worn-out threshold of 10%, our system achieves 55% of the lifetime of ideal wear leveling, which produces a uniform write distribution. Meanwhile, the performance cost for mutators is 7.2% longer execution time on regular DRAM, and GC cycles incur 3.31% extra delay.
