Abstract: As the descendant of spin-transfer torque random access memory (STT-RAM), racetrack memory stores data in magnetic domains along nanoscopic wires. This unique structure achieves unprecedentedly high storage density while inheriting the promising features of STT-RAM, such as fast access speed, non-volatility, zero standby power, hardness to soft errors, and compatibility with CMOS technology. Moreover, the recent success in planar racetrack nanowires demonstrates its fabrication feasibility and continued scalability. In this paper, we investigate the design and optimization of racetrack memory as last-level cache by embracing design considerations across multiple abstraction layers, including the cell design, the array structure, the architecture organization, and the data management. With this cross-layer optimization, the racetrack memory based last-level cache achieves a 6.4× reduction in area, a 25 percent enhancement in system performance, and a 62 percent saving in energy consumption compared to an STT-RAM cache design. Its benefit over SRAM technology is even more significant.

INTRODUCTION
ON-CHIP cache memory has been widely adopted in computing and embedded systems to improve performance by bridging the gap between main memory and the CPU. The ever-increasing amount of data processed by CPUs demands a continuous increase in embedded storage resources. As a result, traditional SRAM-based cache has become the dominant contributor to the overall chip area and, therefore, to the power and thermal budget. Concerns about the continued scaling of SRAM technology have triggered investment in emerging memory technologies. In particular, spin-transfer torque RAM (STT-RAM) has been identified as a promising candidate for embedded caches because of its high density, competitive access speed, and ultra-low power consumption at scaled fabrication process nodes [1], [2]. In November 2012, Everspin began shipping working samples of 64MB STT-RAM [3], opening the commercialization era after many years of joint effort from both academia and industry [4], [5], [6].
However, restricted by the theoretical limit of $9F^2$ in unit cell area ($F$ represents the technology feature size), further shrinking the STT-RAM cell size, and hence improving performance and power consumption, is difficult [7]. To offer a "faster-than-Moore's law" scaling path, a team led by Dr. Parkin at IBM proposed racetrack memory [8]. As the descendant of STT-RAM, racetrack memory integrates many magnetic storage elements (called magnetic domains) in one strip (or racetrack), which is connected to a few access transistors. Data is accessed by shifting the magnetic domains along the racetrack and aligning the target domain with an access device. The area of a magnetic domain can be as small as $1F^2$. Moreover, the continuous progress in device physics [9], [10], [11] and the recent successes in fabrication processes [12], [13], [14] support the feasibility of racetrack memory.
Recently, TapeCache was the first to present the use of racetrack memory in data caches [15]. As an early-stage estimation, TapeCache investigated the sub-array design, data organization, and tape head management in racetrack-based last-level cache (LLC) design, and obtained 2.3× higher density, 1.4× power reduction, and the same performance compared to an STT-RAM based LLC. However, without comprehensively considering the design requirements across different abstraction layers, the potential of racetrack memory has not been fully explored.
Notably, compared with array-style random access memory, including STT-RAM, integrating the tape-style racetrack memory faces several unique design challenges: (1) To effectively utilize the stripe structure, new circuit layouts and optimizations distinct from array-based memories might be necessary. (2) The stripe-based memory structures might require new logical abstractions of memories. (3) Moving from random accesses (i.e., through wordlines and bitlines) to sharing limited access devices (i.e., writing/reading requires shifting) requires careful data management and scheduling.
In this work, we explore these design considerations across different layers and propose an ultra-dense on-chip racetrack memory that enables significant improvements in system performance and energy saving: compared to an STT-RAM cache design, the racetrack memory based cache achieves a 6.4× area reduction, a 25 percent performance enhancement, and a 62 percent energy saving. The benefit over SRAM technology is even more significant. More specifically, the primary design considerations and contributions of this work can be summarized as follows:
1) Efficient cell and array designs are presented that eliminate the area constraint imposed by the access transistor size and enable uniform access ports for read and write operations.
2) An optimized racetrack architecture, HDART, is presented after carefully exploring sub-array and architecture configurations. Unlike TapeCache, which simply distributes a cache block into different sub-arrays, the proposed work introduces a new physical-to-logic mapping scheme. Such a general set/way mapping strategy can greatly benefit the architecture development of racetrack-based cache design.
3) An application-driven data management policy is proposed. Besides the tape head selection used in TapeCache, this work aims to allocate the access-intensive data blocks close to the uniform access ports so as to minimize racetrack shifting and the induced overhead.
4) The impact of racetrack cell size under the proposed architecture is evaluated by comparing two racetrack geometrical dimensions. The benefits and potential design challenges are discussed.

The rest of the paper is organized as follows. Section 2 gives a brief introduction to the fundamentals of racetrack memory. The proposed cell and array designs, architecture, and data management policy are presented in Sections 3, 4, and 5, respectively. Section 6 introduces the simulation setup, and the simulation results are reported and analyzed in Section 7. A summary of related work is given in Section 8. Finally, Section 9 concludes the paper.
BACKGROUND
Racetrack memory comprises an array of magnetic stripes, namely racetracks (RTs), arranged vertically [8] or horizontally [12] on a silicon chip. Fig. 1 illustrates the horizontal RT structure discussed in this work. It consists of many magnetic domains separated by ultra-narrow domain walls. Each domain has its own magnetization direction. As in STT-RAM, binary values are represented by the magnetization direction of each domain. Several domains share one access port for read and write operations. A select device together with a magnetic tunneling junction (MTJ) sensor is built at each access port. The read operation is similar to that of STT-RAM: if the magnetization direction of the corresponding domain cell is anti-parallel (or parallel) to the magnetization direction of the pinned layer at the access port, the cell exhibits high (or low) resistance. Current-induced switching has been realized in [16], so a bi-directional current can be used to switch the magnetization direction of each domain cell for write operations.
Accessing a domain cell needs two steps: (1) Shifting: Based on the position of the target domain cell, the whole RT is shifted until the target domain cell is moved to an access device. The RT shifting can be manipulated by applying $I_{shift}$ at the head or tail of the RT. (2) Read/write: Once the domain cell is shifted to an access point, it can be either read or written by applying the corresponding current ($I_R$ or $I_W$) with appropriate amplitude and duration, controlled by the read driver and write driver.
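The two-step access can be illustrated with a small behavioral model. This is only a sketch under simplifying assumptions: the class name `Racetrack`, its fields, and the nearest-port selection rule are hypothetical, and the model ignores RT-overhead limits, currents, and timing.

```python
from dataclasses import dataclass


@dataclass
class Racetrack:
    """Toy model of one racetrack: a strip of domains plus fixed access ports."""
    domains: list          # bit stored in each domain (0 or 1)
    ports: list            # domain indices under an access port when offset == 0
    offset: int = 0        # current displacement of the strip relative to the ports
    total_shifts: int = 0  # accumulated shift steps (proxy for shift latency/energy)

    def _align(self, index):
        """Step 1 (shifting): move the strip until domain `index` is under a port."""
        position = index + self.offset                      # where the domain sits now
        port = min(self.ports, key=lambda p: abs(p - position))
        delta = port - position                             # signed shift distance
        self.offset += delta
        self.total_shifts += abs(delta)
        return abs(delta)

    def read(self, index):
        """Step 2 (read): return the stored bit and the shifts this access needed."""
        shifts = self._align(index)
        return self.domains[index], shifts

    def write(self, index, bit):
        """Step 2 (write): store `bit` and return the shifts this access needed."""
        shifts = self._align(index)
        self.domains[index] = bit
        return shifts


rt = Racetrack(domains=[0] * 16, ports=[0, 8])
print(rt.write(3, 1))   # shifts needed to bring domain 3 to a port
print(rt.read(3))       # already aligned, so no further shift
```

The model makes the cost structure visible: every access pays a shift proportional to the distance between the target domain and a port, which is the overhead the later sections try to reduce.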
As with STT-RAM, the device engineering of RT memory can be classified into two types, in-plane and perpendicular, according to the anisotropy direction of its magnetic layer. The low intrinsic energy barrier $E$ of in-plane materials leads to a shorter retention time, which makes them unsuitable for constructing memory devices at the nanometer scale. In contrast, perpendicular magnetic anisotropy (PMA) RT memory composed of a CoFeB magnetic stripe and a CoFeB/MgO MTJ can provide a higher $E$ even when the volume of the domain cell is very small, enabling the continued scalability of racetrack memory [16].
CELL DESIGN AND ARRAY ORGANIZATION
Racetrack Cell Design
The accelerated storage density improvement introduced by the advanced racetrack memory enlarges the area gap between small memory element and relatively large NMOS select transistor. In Figs. 2a and 2b, we illustrate the schematic of a column of the baseline racetrack memory design and the corresponding layout. The table at the bottom of the figure summarizes the components and symbols in use. The layout design shows that a RT (the blue strip) covers only a small portion of the space above access transistors. The area highlighted in the gray shadow, however, is wasted.
A straightforward way to improve the racetrack layout area efficiency is to decrease the size (width) of the NMOS select transistors. However, the driving current provided by the smaller transistor decreases and is therefore not sufficient to switch magnetic domains. Such a design can be used only for read operations, which require a small amount of current for data detection. We call it a read access point, or R-port. To support write operations, which require large switching currents, at least one read-and-write access point (RW-port) associated with a large access transistor is necessary for each RT. This is how the macro cell was designed in TapeCache [15].
However, such a macro cell with one RW-port and multiple R-ports encounters the following design challenges: (1) Since there is only one writable port, at least as many dummy bits (named RT-overhead) as the RT capacity must be added at the ends of RTs to hold shifted-out data during write accesses, which greatly expands the array size and increases the average shift distance. (2) As we shall show in Section 7, the RT shift energy is a major contributor to the total cache energy. Because the shift energy is proportional to the RT length, prolonging the RT results in more shift energy consumption in read and write accesses. (3) The wide transistor of a RW-port makes the space between adjacent RTs larger, further jeopardizing the area efficiency of RT-based cache design.
Without re-engineering the racetrack layout design, it is difficult to increase array area efficiency and read/write accessibility at the same time. Here, we propose a new cell design and array organization to achieve better optimization. Fig. 2c depicts the layout of the proposed racetrack memory design, which adopts only large select transistors supporting both read and write operations. Multiple RTs are arranged side by side to cover the whole space above select transistors. Their access points to the corresponding select transistors are placed in a diagonal manner. The 3D structure and cross sections in Fig. 2d illustrate the placement and route of metal wires and RTs.
In this design, the number of RTs per column is determined by the widths of the RT and the select transistors. For example, Fig. 2c assumes four RTs per column, resulting in 4× the memory density of the baseline design in Fig. 2b. Note that the four transistors share one source-line (SL) but correspond to different RTs and hence different bitlines (BLs). The proposed RT design maximizes the utilization of the space above the CMOS layer. The size of the access transistors, as long as it is large enough for write operations, is no longer the limiting design factor. In fact, the selection of the access transistor size becomes more flexible and can be used to facilitate architecture optimization.
Note that the 3D structure in Fig. 2d is used to illustrate the relation between the RTs and the other metal layers. Similar to STT-RAM, the RT fabrication process is compatible with CMOS technology and does not involve any special 3D technique (e.g., chip stacking). Several studies have successfully demonstrated the integration of real RT nanowires on silicon [12], [13], [14].
Scaling of Magnetic Domains
Note that the CMOS and the magnetic racetrack technologies may have different minimal dimensions, which are determined by the development status of the fabrication processes. More specifically, the advanced racetrack memory is still at the early stage of device enhancement and small array demonstration [17]. Thus, enlarging the magnetic domains may help improve device reliability, at the cost of some of its potential storage density.
In this work, we examine two types of magnetic domains with areas of $4F^2$ and $1F^2$, where $F$ represents the feature size of the CMOS technology in use. Fig. 3 illustrates both the layouts and the cross sections of these two racetrack memory cell designs. The $4F^2$ design refers to a racetrack in which each domain is $2F$ in width and $2F$ in length. The $1F^2$ racetrack domain is the minimal allowable design, of which both width and length are only $1F$. The domain wall can be as thin as 1 nm, which is negligible [8].
Once the size of the NMOS select transistor is determined, the number of domains contained within the region above an access transistor is determined by the size of the magnetic domains as well as the layout strategy of the racetrack nanowires. The processing of these nanowires shall follow the given design rules. Based on the spacing rule [18], two adjacent objects in the same layer require a spacing of 2 to 3 $\lambda$ ($\lambda = 0.5F$). For the $4F^2$ design, we propose to integrate RTs on different layers to avoid the gap between adjacent RTs, as shown in Fig. 3a. The cross-section in Fig. 3b indicates that at least $1F$ spacing between two adjacent RTs is necessary in the $1F^2$ design, whether the RTs are allocated on a single layer or on two different layers.
Compared with SRAM or STT-RAM technologies, both RT designs in Fig. 3 obtain much higher storage density. Moreover, the two cell designs adopt the same select transistor, offering the same switching current in write operations, while the difference in domain volume results in different switching time requirements. The detailed comparison and analysis of area and access timing are summarized in Table 1 (see Section 6).
Racetrack Array Organization
Fig. 4 shows the circuit schematic of a basic RT memory array, which supports the following three basic operations:
Shift. Shifting a RT up (su) or down (sd) is realized by a bi-directional shifting current ($I_{shift}$), which is controlled by the signals 'su[n:1]+', 'su[n:1]-', 'sd[n:1]+', and 'sd[n:1]-'. Here, $n$ is the number of RT nanowires placed above one transistor width.
Write. The write current ($I_W$) in a write-1 (or write-0) operation is provided by enabling 'wr1+' and 'wr1-' (or 'wr0+' and 'wr0-').
Read. A small read current $I_R$ can be supplied to a target cell by turning on 'rd' and 'wr1-'. The voltage generated on its BL is then delivered to a sense amplifier (not included in the figure) for data detection.

Except for the shift drivers, the circuit scheme of the RT array is very similar to that of SRAM/STT-RAM designs. The number of shift control signals determines how many RT nanowires within a sub-array can be shifted simultaneously. For example, assume there are $m$ RTs in a sub-array and each group contains four RTs, that is, $n = 4$. Then the number of RTs that can be shifted together is $m/n$.
RT-overhead denotes the extra magnetic domains at the end(s) of a RT. The RT-overhead provides the extra space to store the bits shifted out of the original data portion during accesses. Thus, the minimal RT-overhead of a RT shall equal the number of magnetic domains between two adjacent access ports of the same type (named the port distance).
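As a worked illustration of this rule, the following sketch computes the minimal RT-overhead from the port distance. The function names are hypothetical, and the 64-domain track with eight evenly spaced ports is an assumed example configuration; the single-port case mirrors the statement that one RW-port needs as much overhead as the RT capacity.

```python
def min_rt_overhead(data_domains, num_rw_ports):
    """Minimal RT-overhead (in domains), equal to the port distance.

    Assumes the writable ports are evenly spaced along the track.
    """
    if num_rw_ports <= 0 or data_domains % num_rw_ports != 0:
        raise ValueError("ports must evenly divide the data domains")
    return data_domains // num_rw_ports


def overhead_fraction(data_domains, overhead_domains):
    """Fraction of the total track length that is spent on RT-overhead."""
    return overhead_domains / (data_domains + overhead_domains)


# Eight RW-ports on a 64-domain track: the port distance is 8 domains.
uniform = min_rt_overhead(64, 8)
# A single RW-port: the overhead equals the full data capacity.
single = min_rt_overhead(64, 1)

print(uniform, round(overhead_fraction(64, uniform), 3))
print(single, round(overhead_fraction(64, single), 3))
```

Under these assumptions, more writable ports shorten the track, which matters because the shift energy grows with the RT length.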
For example, the proposed design in Fig. 2c has each access port in charge of more magnetic domains and hence requires a longer RT-overhead than the baseline design in Fig. 2b. For the same reason, the RT-overhead in TapeCache is not determined by the R-port distribution: the only RW-port requires the same amount of RT-overhead as its storage capacity.

Based on the proposed racetrack array organization, we propose the indent racetrack overhead design to reduce runtime track shifting. As shown in Fig. 5a, the design allocates RT-overhead at both ends of a RT. The capacity of the RT-overhead at the top end is the same as the port distance, while the RT-overhead at the bottom end is half that size. The design uses a track status register $T_{reg}$ together with a shifting direction sign bit to represent the position of a RT. Initially, all the RTs sit unshifted at location '0' with $T_{reg}$ pointing to '0'.

Two examples of RT shifting are presented in Fig. 5b. When accessing way 7 of set 7, the first track needs to move up seven units in order to align the target data bits with an access port. Similarly, reading the data in way 30 of set 0 would have to shift the fourth track up by six units if there were no bottom RT-overhead. In the proposed indent racetrack overhead design, we can instead simply push it down by only two units to reach a closer access port. As demonstrated in the examples, the extra RT-overhead at the bottom provides more flexible shifting operations and hence reduces the overhead in access latency and energy.

The indent design might not be the optimal solution in terms of area efficiency. In fact, it can be crafted in an aligned manner by applying different shift offsets to the racetracks within a single group.
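The choice between shifting up and pushing down can be sketched as follows. This is an illustrative reconstruction, not the paper's controller: it assumes 32 ways spread over four RTs with eight domains per access port and eight ports per RT, that way `w` sits `w % 8` domains away from its own port on track `w // 8`, and that a downward shift is only possible when a neighboring port exists for the next set and the distance fits in the half-size bottom overhead. All names are hypothetical.

```python
PORT_DISTANCE = 8                     # assumed domains served by one access port
NUM_PORTS = 8                         # assumed access ports (sets) per racetrack
BOTTOM_OVERHEAD = PORT_DISTANCE // 2  # bottom RT-overhead is half the port distance


def plan_shift(set_index, way):
    """Return (track, direction, units) for accessing `way` of `set_index`.

    Assumption: shifting up by `offset` aligns the block with its own port,
    while shifting down by `PORT_DISTANCE - offset` aligns it with the
    neighboring port, if that port exists and the bottom overhead is
    large enough to hold the displaced domains.
    """
    track = way // PORT_DISTANCE
    offset = way % PORT_DISTANCE
    up = offset
    down = PORT_DISTANCE - offset
    has_neighbor_port = set_index + 1 < NUM_PORTS
    if offset and has_neighbor_port and down <= BOTTOM_OVERHEAD and down < up:
        return track, "down", down
    return track, "up", up


print(plan_shift(7, 7))    # way 7 of set 7
print(plan_shift(0, 30))   # way 30 of set 0
```

Under these assumptions the sketch agrees with the two cases described for Fig. 5b: seven units up on the first track (index 0) and two units down on the fourth track (index 3).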
In summary, our proposed memory cell and array designs significantly improve the area efficiency of RT memory, leading to an unprecedentedly high density. The design is the first to enable read and write operations at every access port without any side effect on the area efficiency.
HDART ARCHITECTURE
In this work, we explore the RT memory architecture based on the proposed cell and array design. The efficiency of the RT architecture is determined by the basic array configuration, the architectural organization, and the physical-to-logical mapping. The complexity of the hardware design is also an important factor. After comprehensively investigating these design considerations and their impacts on floorplan utilization, performance optimization, and energy consumption, we propose an optimized hierarchical and dense architecture for RT, named HDART. Fig. 6a illustrates the overall architecture of HDART. To ease technology adoption, HDART maintains the same I/O interface as the existing memory hierarchy. However, a more flexible bank organization is provided within the architecture. For instance, an entire cache architecture can be physically partitioned into $N_B$ banks, each of which has its own I/O ports to support concurrent transactions. The details of the HDART architecture are presented and discussed in this section.
Sub-Array Design and Architecture Exploration
Basically, the RT sub-array can be very similar to typical SRAM/STT-RAM design. The only extra component required in a sub-array is the shift drivers as shown in Fig. 6b . As we shall explain in Section 5, BCT (block counter) records the data access intensity used for data management. However, it does not have to be embedded within each sub-array. As the smallest component in architecture construction, a sub-array can significantly affect the overall performance of the entire architecture. The sub-array based on the memory array in Fig. 4 (e.g., the RT length) shall be carefully configured according to design requirements. We use the following three parameters to evaluate and compare the efficiency of various sub-array configurations: (a) the RT shifting energy, (b) the sub-array area efficiency defined as the ratio of the data array area and the peripheral circuit area, and (c) the RT overhead ratio which is the ratio of the RT-overhead and the total length of RT.
The RT length directly determines the RT shift energy: a longer RT produces higher runtime shifting energy, while a shorter RT degrades the sub-array area efficiency and has a higher RT-overhead ratio. We can also divide a long RT into several segments. Each segment requires its own shifting controller, while the entire RT shares one set of read/write drivers. In general, more segments mean lower shifting energy but also lower sub-array area efficiency and higher RT overhead. Fig. 6c compares the different sub-array configurations on a normalized scale.
Here, we use a simple metric to determine the optimal configuration, given by the formula $x \cdot b / (a \cdot c)$, where $a$ refers to the RT shifting energy, $b$ is the ratio of the data array area to the peripheral circuit area, $c$ is the ratio of the RT-overhead to the total RT length, and $x$ is a normalization coefficient. Based on the result in Fig. 6(c-2), the 64-bit long RT is chosen as the optimal sub-array configuration in the following analysis.
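A minimal sketch of how this metric could be used to rank candidate configurations is shown below. The function names are hypothetical and the candidate values are placeholders chosen only to exercise the code; they are not the normalized values from Fig. 6c.

```python
def figure_of_merit(shift_energy, area_efficiency, overhead_ratio, x=1.0):
    """Evaluate x * b / (a * c) for one sub-array configuration.

    a = shift_energy (RT shifting energy)
    b = area_efficiency (data array area / peripheral circuit area)
    c = overhead_ratio (RT-overhead / total RT length)
    x = normalization coefficient
    """
    return x * area_efficiency / (shift_energy * overhead_ratio)


def best_configuration(candidates):
    """Return the name of the candidate with the highest figure of merit."""
    return max(candidates, key=lambda name: figure_of_merit(*candidates[name]))


# Placeholder (a, b, c) triples in normalized units, NOT data from the paper.
candidates = {
    "A": (0.50, 0.5, 1.000),   # short track: low shift energy, poor area, high overhead
    "B": (0.75, 0.9, 0.600),   # medium-length track
    "C": (1.00, 1.0, 0.625),   # long track: high shift energy, best area efficiency
}

for name, values in candidates.items():
    print(name, round(figure_of_merit(*values), 3))
print("best:", best_configuration(candidates))
```

The metric rewards area efficiency and penalizes both shift energy and overhead, so a configuration wins only if it balances all three.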
Based on the basic array design, the memory architecture configuration, including banks, sub-banks, arrays, etc., can be adjusted to satisfy different design specifications, including performance, energy, and area constraints. We evaluated and compared 4MB RT LLC designs based on three typical basic array configurations, representing designs with (1) more segments in each RT, (2) medium-length RTs (e.g., 64 bits long), and (3) long RTs, respectively. The hexagon graphs in Fig. 6d summarize the detailed comparisons in terms of area efficiency, access latency parameters, and energy consumption at the LLC architectural level. The dotted border of a hexagon map indicates the optimal expectation. Notably, a design cannot reach the optimal expectation on all design metrics at once because the metrics are interdependent.
Configuration (1) divides a RT into multiple segments. More shift drivers are required to individually control the shifting of every RT nanowire. On the one hand, a short RT nanowire requires less shifting energy, and hence the dynamic energy consumption can be reduced. On the other hand, the additional transistors in the shift drivers result in higher leakage power consumption. In addition, more transistors in the peripheral circuitry increase the design size and lengthen the routing. Configuration (3) represents the other extreme, with longer RTs, leading to a lower cost in shift drivers and lower leakage energy consumption. However, it incurs more dynamic energy consumption due to the increased runtime shifting energy, which is proportional to the length of the RT. Although the configuration requires fewer shift drivers, the area gain is offset by the decoder-cell mismatch (e.g., low floorplan area utilization), so the routing delay does not show much improvement. Configuration (2) represents a trade-off between (1) and (3) and is selected as the optimized architecture for its lower access delay and energy consumption.
Physical-to-Logic Mapping Strategy
Previously, a bit-interleaved data array organization was demonstrated in TapeCache [15], which distributes one cache block into multiple macro-cells. The approach can also be extended to different sub-arrays as shown in Fig. 7: every sub-array contains all the ways of set 0 to set 7. In the example, a sub-array contains 64 groups of RTs, each of which is composed of four RTs. An RT includes 64 bits associated with eight access points. The magnetic domains connected to a single RW-port in the physical design correspond to the same bit position of cache blocks within the same set from different ways. For instance, as illustrated in the figure, bit 0 (b0) of all the 32 ways (w0 to w31) within set 0 (s0) is mapped to the first group of RTs in Sub-Array 0 of Array 0. Similarly, all the b0's in s15 are placed in the first group of RTs in Sub-Array 0 of Array 1. The RT shifting during an access can be controlled by a physical-to-logic mapping unit, e.g., the look-up table (LUT) in Fig. 7. The shifting distance is determined by the difference between the block's way number and the current track position stored in $T_{reg}$.
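The mapping described above can be sketched as a small address-translation function. This is an illustrative assumption rather than the LUT of Fig. 7: the text fixes only the cases for bit 0, so placing bit `b` in group `b % 64` of sub-array `b // 64`, way `w` on track `w // 8` at offset `w % 8`, and set `s` at port `s % 8` of array `s // 8` is an inferred layout. All names are hypothetical.

```python
from typing import NamedTuple

SETS_PER_ARRAY = 8         # each sub-array holds all ways of eight sets
GROUPS_PER_SUBARRAY = 64   # groups of RTs per sub-array
WAYS_PER_PORT = 8          # 64-bit RT with eight access points


class Location(NamedTuple):
    array: int      # which array holds the set
    sub_array: int  # which sub-array holds this bit position (assumed layout)
    group: int      # RT group inside the sub-array
    track: int      # RT inside the four-RT group
    port: int       # access port that serves this set
    offset: int     # distance (in domains) from the port when the track is at '0'


def locate(set_index, way, bit):
    """Translate a logical (set, way, bit) into an assumed physical location."""
    return Location(
        array=set_index // SETS_PER_ARRAY,
        sub_array=bit // GROUPS_PER_SUBARRAY,
        group=bit % GROUPS_PER_SUBARRAY,
        track=way // WAYS_PER_PORT,
        port=set_index % SETS_PER_ARRAY,
        offset=way % WAYS_PER_PORT,
    )


def shift_distance(location, t_reg):
    """Signed shift needed, given the current track position stored in T_reg."""
    return location.offset - t_reg


print(locate(0, 31, 0))                     # bit 0 of way 31 in set 0
print(locate(15, 0, 0))                     # bit 0 of way 0 in set 15
print(shift_distance(locate(0, 30, 0), 0))  # track at its original position
```

With this layout, all ways of a set share the same array, sub-array, and group for a given bit, which is why way reordering stays local to one group of RTs.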
Because all the same bits from the different ways are within the same array and controlled by a single access port, such a design is in favor of reordering data blocks among different ways which can be easily implemented with the tag mechanism. Nevertheless, the set reordering requires a set re-mapping table which results in extra area, delay and energy overheads. Thus, we select the way reordering when managing and optimizing data arrangement that shall be discussed in Section 5.
Notably, more flexible set/way mapping crafted in various forms can be realized in RT memory design. For instance, the sub-arrays can be constructed to contain all the sets of way 0 to way 7 so that bits of the same way come from the same RT nanowire. More thorough comparison of different physical-to-logic mapping schemes can be found in our latest research [19] .
Hardware Design Complexity
Besides the racetrack memory arrays, the hardware overhead of all the related components shall be considered. For example, RT memory offers tens of times more storage density than an SRAM design. Although the tag array contributes only 5 percent of the total chip area in an SRAM-based LLC, it is infeasible to adopt an SRAM tag array with RT memory considering the unbalanced scaling trends of the data storage and tag arrays. To alleviate the impact, our design implements the tag array in STT-RAM technology, which is compatible with racetrack technology and offers random accessibility.
The proposed HDART architecture introduces several new components, including a track status register ($T_{reg}$) to trace RT positions, a look-up table to assist physical-to-logic mapping, and a block counter (BCT) monitoring the data access intensity used for data management in Section 5. The hardware overhead and the associated impact on access latency and energy consumption have been included in system evaluations.
DATA MANAGEMENT VIA HBWBR
In a traditional random access memory, every storage element has its own access path. In contrast, many magnetic domains in a RT memory share one RW-port. Accessing a domain requires shifting it to a RW-port, inducing extra overhead in access latency and energy consumption. Similar to the head management in TapeCache [15], the following two basic track shifting policies are adopted in the design:
TS1. After an access is completed, a RT stays where it is. A register is needed to record the track position.
TS2. A RT always returns to its original position after being accessed.
Comparing the two track shifting policies, TS1 benefits more when the cache accesses show strong spatial locality, but it generates many extra RT shifts when randomly distributed cache accesses dominate. Resetting the RT in TS2 potentially increases the frequency of RT shifting. Nevertheless, the fixed relation of the memory cells to their RW-ports makes the data management of TS2 much easier.
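The trade-off between the two policies can be explored with a simple shift-count model. This is a sketch under stated assumptions: each access is described only by the offset of its target from the track's original position, a single port is considered, and for TS2 the return shift is assumed to be hidden from the access latency (while still costing energy). The function names and the example traces are hypothetical.

```python
def shifts_ts1(targets, start=0):
    """TS1: the track stays where the last access left it.

    Returns the total number of shift steps, all of which are on the
    access path because the track must move from its current position.
    """
    position, total = start, 0
    for target in targets:
        total += abs(target - position)
        position = target
    return total


def shifts_ts2(targets):
    """TS2: the track returns to its original position after every access.

    Returns (visible, total): `visible` counts only the shifts toward the
    target, `total` also counts the return shifts, which are assumed to
    overlap with other delays but still consume shift energy.
    """
    visible = sum(abs(target) for target in targets)
    return visible, 2 * visible


clustered = [5, 5, 5, 6]   # strong spatial locality
scattered = [7, 0, 7, 0]   # alternating far and near targets

print("TS1:", shifts_ts1(clustered), shifts_ts1(scattered))
print("TS2:", shifts_ts2(clustered), shifts_ts2(scattered))
```

In this toy model TS1 needs far fewer shifts on the clustered trace, while TS2 exposes fewer shifts on the access path for the scattered trace, at the price of extra return shifts.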
We propose a RT data management policy based on TS2, namely hardware based way block reorder (HBWBR), to reduce the RT shifting cost in the HDART design. Because the cache accesses in many applications are unevenly distributed and only a small portion of cache blocks are frequently accessed [20], HBWBR traces the data access pattern to identify the cache blocks with intensive accesses and then places or swaps these cache blocks into physical locations close to RW-ports.
Driven by the large variation in interconnect latency among different banks, DNUCA [21] has been widely adopted in SRAM-based LLCs. It tends to allocate access-intensive data to physical locations closer to the upper-level cache and hence reduces the access latency of both tag accesses (due to tag accesses to multiple caches) and data transfers (across long interconnects). In our design, the tag access latency is constant since we use an STT-RAM-based tag structure. HBWBR aims to reduce the shift latency, which plays a role similar to the data transfer latency in DNUCA. The RT shift latency of a data block is determined by its physical address, the given array design, and the track position, which can change at any time.

Fig. 8 depicts the data access flow when applying TS2 and HBWBR in the HDART LLC. Note that the track shifting policy (TS2) and the data management (HBWBR) are two orthogonal schemes that can be executed simultaneously. As such, the access to the block counter (BCT) for data intensity prediction does not introduce any extra latency overhead. An access hit in HDART triggers the examination of the BCT. If the accessed block is predicted to be access-intensive, we swap it with the one at the RW-port. On a cache miss, the least-recently used (LRU) replacement policy is adopted.
The effectiveness of HBWBR relies on the efficiency of both the intensive block prediction and the data swap. We use a simple counter-based scheme for cache access intensity prediction. In the design, each data block is associated with a block counter (BCT). When a cache hit occurs, the corresponding counter of the data block is incremented by one. All the counters are decremented periodically until they reach '0'. A data block is considered an access-intensive block once its counter exceeds the predefined threshold.
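The counter-based prediction can be sketched as follows. The class and method names are hypothetical, and the saturating upper bound `max_count` is an added assumption (a hardware counter has finite width) that the text does not specify.

```python
class BlockCounters:
    """Per-block access counters used to flag access-intensive cache blocks."""

    def __init__(self, num_blocks, threshold, max_count=255):
        self.counts = [0] * num_blocks
        self.threshold = threshold
        self.max_count = max_count   # assumed saturation limit of the counter

    def on_hit(self, block):
        """Increment the block's counter on a cache hit; report if it is intensive."""
        if self.counts[block] < self.max_count:
            self.counts[block] += 1
        return self.is_intensive(block)

    def decay(self):
        """Periodic decrement of every counter, stopping at zero."""
        self.counts = [max(count - 1, 0) for count in self.counts]

    def is_intensive(self, block):
        """A block is access-intensive once its counter exceeds the threshold."""
        return self.counts[block] > self.threshold


bct = BlockCounters(num_blocks=4, threshold=2)
for _ in range(3):
    hot = bct.on_hit(1)
print(hot)                    # block 1 has crossed the threshold
bct.decay()
print(bct.is_intensive(1))    # one decay period later it no longer qualifies
```

The periodic decay lets the predictor forget blocks whose burst of accesses has ended, so only recently hot blocks compete for the locations near the ports.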
Recall that the physical-to-logic mapping method adopted in HDART (see Section 4) maps the same bits of all the ways within one set into one racetrack sub-array. Such a mapping scheme results in unbalanced access latencies across different ways. More specifically, a small portion of the cache ways sit right on RW-ports. The data associated with these locations can be read or written directly, without requiring any track shifting. We name these ways 'fast ways'. Therefore, we propose to swap an access-intensive block with a block on a fast way. Such way-based block swapping exchanges data within the same sub-array, which is convenient and energy efficient. When a cache block is regarded as access-intensive, it is swapped with the data block that belongs to the same set and has the fewest accesses.
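A way-based swap of this kind might look like the sketch below. It assumes that the fast ways are those whose domains sit on a port when the track is at its original position (ways 0, 8, 16, and 24 under an eight-ways-per-port layout), that the swap partner is the fast-way block with the fewest recorded accesses, and that access counters travel with their blocks. The names are hypothetical.

```python
WAYS_PER_PORT = 8   # assumed: every eighth way sits directly on a RW-port


def is_fast_way(way):
    """A way is 'fast' if it needs no shift when the track is at position 0."""
    return way % WAYS_PER_PORT == 0


def pick_swap_partner(hot_way, counters):
    """Choose the fast way in the same set whose block has the fewest accesses.

    Returns None when the hot block already sits on a fast way.
    """
    if is_fast_way(hot_way):
        return None
    fast_ways = [way for way in range(len(counters)) if is_fast_way(way)]
    return min(fast_ways, key=lambda way: counters[way])


def swap_blocks(blocks, counters, hot_way):
    """Swap the access-intensive block into a fast way; counters follow the blocks."""
    partner = pick_swap_partner(hot_way, counters)
    if partner is None:
        return None
    blocks[hot_way], blocks[partner] = blocks[partner], blocks[hot_way]
    counters[hot_way], counters[partner] = counters[partner], counters[hot_way]
    return partner


blocks = list(range(32))   # block identifiers for one 32-way set
counters = [5] * 32        # per-way access counts
counters[8] = 1            # least-accessed block among the fast ways
counters[13] = 9           # access-intensive block on a slow way
print(swap_blocks(blocks, counters, hot_way=13))
print(blocks[8], blocks[13])
```

Because both blocks belong to the same set, the swap only updates the way assignment of that set, which is consistent with the choice of way reordering over set reordering.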
As shown in Fig. 7, each set contains four RTs. Accordingly, a data swap can occur within the same track or between two different tracks. Fig. 9 depicts the execution details of the two swap operations. The related latency and energy overhead caused by these two types of data swap operations are included in the system evaluation in Section 6.

Fig. 10 shows the cache access timing diagram when utilizing TS2 and HBWBR in the HDART design. Its timing flow and critical path are similar to those of an STT-RAM cache access, except that an extra RT shifting delay is included in every data access. The track reset delay can be hidden within the routing to output (RTO) delay during reads. A data swap induces a relatively large delay overhead, but it occurs rarely relative to the total number of accesses. A detailed explanation of the timing components is listed at the bottom of Fig. 10 for reference, and the timing component parameters of the various designs can be found in Table 1.
HBWBR is an efficient data management scheme for RT memory, which has variable access latencies. In the previous RT memory design, a RT macro cell has only one RW-port but multiple R-ports [15]. Data management in such a design must distinguish between read-intensive and write-intensive data blocks and allocate them to different ports. In particular, the limited number of writable ports constrains the optimization space, so that design is less amenable to efficient data management. In contrast, our proposed HDART supports both read and write operations at every access port, reducing the design complexity and enhancing the efficiency of data management. Therefore, the data management can be naturally integrated into the proposed HDART.
SIMULATION SETUP
We evaluate and compare 4MB LLC designs using different memory technologies, including SRAM, STT-RAM, and RT memory. The cache configuration is set as $N_B = 4$, $N_{SB} = 8$, and $N_S = 8$ (see Fig. 6). The cache area estimation, latency, and energy parameters obtained from the modified NVsim [22] and SPICE simulation are summarized in Table 1.
NVsim that has been widely adopted is the best toolset for area, delay, and energy estimation in nonvolatile memory architectures, such as PCM and STT-RAM. We modified NVsim to obtain the area estimation of peripheral circuits including x-y decoder, sense amplifier, and write driver. The area overhead induced by the shift drivers and the global controller logic was also included. The explicit explanation of access delay breakdowns in Table 1 can be found in Fig. 10 . Here, the same sense amplifier circuit is utilized to all the memory technologies for fair comparison. The domain wall shifting energy was calculated from micro-magnetic simulations. The energy per access refers to the total energy consumed on the accessed cell and the associated peripheral circuit component. The leakage power consumption also includes both RT arrays and the peripheral circuit.
We performed the evaluations on an eight-core Ultra-SPARC T1 processor by adopting various memory technologies as 4 MB LLC. Table 2 summarizes the process configurations. The cache model of Simics toolset [23] was modified according to the different memory requirements. The multi-threaded benchmarks from Parsec Benchmark Suite [24] were adopted in simulations. For each benchmark, we fast-forward to region of interest, warm up the cache for 200 million instructions, and then execute 500 million instructions.
The following baseline memory technologies were selected for comprehensive comparisons.
SRAM. SRAMs have the lowest access latency but incur much higher leakage power consumption and larger chip area. Here, we select the SRAM-based cache as the performance baseline.
STT-RAM. As the most popular replacement for SRAM technology, STT-RAM demonstrates ultra-low leakage power. However, the long write operation severely limits its application.
OP-STT represents the STT-RAM design with enhanced write performance, resulting from the latest technology developments, such as the perpendicular MTJ [25] and the retention-relaxed MTJ [20].
Baseline RT is the racetrack memory design in the baseline layout in Fig. 2b. The RT designs are evaluated with a moderate domain size of 4F² to reflect the current device engineering and an aggressive design with a domain size of 1F² (refer to Section 3.2). We use the design in Fig. 2b as the baseline because the recently proposed TapeCache [15] did not release cell layout details. Fig. 11a illustrates the side and layout views of a structure similar to TapeCache, which contains one RW-port and multiple R-ports (1 RW-port + N R-ports). For comparison, the baseline design is shown in Fig. 11b. As mentioned earlier, a write access in (a) can cause more shifts, while the design in (b) potentially incurs fewer runtime shifts. Moreover, the baseline RT demonstrates better area efficiency, eventually resulting in lower access latency and energy than the structure in (a).
After considering different combinations of the domain sizes, the track shifting policies, and the data management scheme, a total of six RT memory configurations were examined.
7 SIMULATION RESULTS
Comparison Among Various Memories
To demonstrate the potential of RT memory, we first compare HDART with the baseline memory technologies. Fig. 12a shows the performance results, represented by instructions per cycle (IPC) normalized to that of the SRAM-based cache design. HDART (4F²) with the simple track shifting policy TS1 achieves a 10~15 percent IPC enhancement over the SRAM and OP-STT LLCs. The smaller chip size of HDART and the resulting shorter routing latency dominate the performance improvement in read and write operations, even though the extra delay caused by track shifting slightly offsets the benefit. Fig. 12b shows the average energy consumption of the various LLC designs across all the selected benchmarks, normalized to the energy consumption of the SRAM-based LLC. The energy of HDART (4F²) is 40× less than that of the SRAM cache. Even compared to the most advanced OP-STT, HDART + TS1 achieves an average 19 percent energy saving. The detailed energy breakdowns of the three most energy-efficient memory designs are shown in Fig. 12c.
The leakage power saving of HDART over the baseline RT mainly comes from fewer access transistors and smaller area. As illustrated in Figs. 2b and 2c, to hold the same number of data bits, the baseline RT requires 4× as many access transistors as HDART. The port sharing scheme of HDART makes it smaller and more efficient in leakage power reduction.
The baseline RT has more RW ports and hence consumes ~2× less track shifting energy than HDART. However, HDART (4F²) + TS1 still obtains an 18 percent reduction in overall cache energy consumption compared with the baseline RT. The saving comes from the lower leakage and dynamic energies that result from its higher storage density (refer to Table 1).
Effectiveness of HBWBR
7.2.1 The Selection of Swap Threshold
In HBWBR, a data swap is not free but comes with delay and energy overheads. Thus, the control scheme shall avoid unnecessary data swaps, which depends on the prediction of access-intensive blocks. From the designer's point of view, the swap threshold directly determines the effectiveness of the data intensity prediction and the frequency of runtime data block swaps. Intuitively, a large threshold makes it harder to swap data blocks and results in fewer swaps. However, increasing the threshold could also reduce the accesses to fast ways, inducing more RT shifts. Thus, the swap threshold shall be carefully selected to speed up the read and write operations of frequently accessed cache blocks while minimizing the delay and energy overheads induced by unnecessary data swaps.
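The trade-off between swaps and shifts can be illustrated with a toy model. The sketch below is not the HBWBR implementation. It assumes, purely for illustration, a single cache set whose way 0 is the fast way, a shift cost equal to the way index, and a per-block access counter that triggers a swap into the fast way once it reaches the threshold. The names `simulate_set`, `num_ways`, and `threshold` are hypothetical.

```python
def simulate_set(trace, num_ways=4, threshold=10):
    """Toy model of threshold-based promotion within one cache set.

    Assumptions (not from the paper): blocks 0..num_ways-1 are preloaded,
    way 0 is the fast way, and accessing way i costs i shifts.
    Every block id in `trace` must be one of the preloaded blocks.
    """
    ways = list(range(num_ways))            # ways[0] holds the fast-way block
    counters = {block: 0 for block in ways}  # slow-way accesses per block
    stats = {"fast_hits": 0, "shifts": 0, "swaps": 0}

    for block in trace:
        pos = ways.index(block)
        if pos == 0:
            stats["fast_hits"] += 1
            continue
        stats["shifts"] += pos               # assumed shift cost
        counters[block] += 1
        if counters[block] >= threshold:     # block looks access-intensive
            ways[0], ways[pos] = ways[pos], ways[0]
            counters[block] = 0
            stats["swaps"] += 1
    return stats


# One hot block: a larger threshold delays promotion and costs more shifts.
hot = [3] * 20
low = simulate_set(hot, threshold=5)
high = simulate_set(hot, threshold=10)

# Two alternating blocks: a tiny threshold swaps on every access.
ping_pong = [1, 2] * 3
eager = simulate_set(ping_pong, threshold=1)
lazy = simulate_set(ping_pong, threshold=4)
```

In this toy setting, the hot-block trace shows the cost of a large threshold (more shifts before promotion), while the alternating trace shows the cost of a small one (a swap on every access).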
We vary the swap threshold from 3 to 32 and analyze its impact on the cache hits on fast ways, the RT shifts, and the number of data swaps. The simulation results of all the benchmarks are summarized in Fig. 13. The results show that the shift number is directly related to the hit number on fast ways. When the swap threshold exceeds 10, the hit number on fast ways decreases dramatically, resulting in a significant increase in racetrack shifts. Moreover, the number of data swaps drops quickly when the swap threshold is small and changes only gradually once the threshold exceeds 10. The evaluations therefore restrict the swap threshold to the range of 7~11. As demonstrated in Fig. 13, the optimal threshold falls into the selected range. Here, the configuration with a swap threshold of 6 is used as the normalization baseline. Fig. 14 shows that as the swap threshold increases, the average IPC performance improves due to the reduction of swaps. However, this benefit is gradually canceled out by more RT shifts. When the swap threshold approaches 11, the overall system performance, averaged over all the benchmarks, begins to degrade. Moreover, the cache energy consumption grows with the swap threshold because shifts, which occur in almost every access, have a bigger cumulative impact than a small number of data swaps. Our design chooses a swap threshold of 10 as the optimal setup, which obtains the lowest EDP as shown in Fig. 14c.
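Selecting the threshold by lowest EDP (energy-delay product) amounts to a simple argmin over the measured configurations. The following is a minimal sketch under stated assumptions: the function name `pick_threshold` is hypothetical, and the numbers in `measurements` are made-up placeholders, not values from Fig. 14.

```python
def pick_threshold(measurements):
    """Return the swap threshold with the lowest energy-delay product.

    `measurements` maps threshold -> (normalized_energy, normalized_delay).
    For a fixed instruction count, delay is proportional to 1 / IPC.
    """
    edp = {th: energy * delay for th, (energy, delay) in measurements.items()}
    best = min(edp, key=edp.get)
    return best, edp


# Placeholder values for illustration only (NOT measured results).
measurements = {
    7: (1.00, 1.00),
    10: (1.02, 0.95),
    11: (1.05, 0.96),
}
best, edp = pick_threshold(measurements)
```

With these placeholder inputs, energy rises monotonically with the threshold while delay first falls and then rises, so the product is minimized at an intermediate threshold.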
Comparison of RT Memory Configurations
Section 7.1 compares various memory technologies to the HDART design equipped with a simple track shifting policy, TS1. This section presents a more thorough comparison of racetrack memory configurations and demonstrates the effectiveness of HBWBR. Fig. 16 shows the IPC performance comparison of HDART under different configurations. The corresponding energy breakdowns can be found in Fig. 17. The track shifting policy TS2 alone suffers more from track shift operations, resulting in a 2.5 percent performance degradation relative to TS1. TS2 + HBWBR together can effectively reduce the shifting overhead and improve IPC by 7.5 percent on average. In summary, the HDART design 4F² + TS2 + HBWBR achieves a 4.2× area reduction, a 20 percent improvement in IPC performance, and a 49 percent saving in LLC energy consumption, compared to the STT-RAM cache design. Compared to SRAM, the performance is improved by 13 percent and the energy consumption is reduced by 40×.
Recall that Fig. 12c shows that a big portion of the HDART energy is consumed by track shifting. Thus, reducing RT shifting is necessary to enhance energy efficiency. Fig. 15a shows the trend of RT shift numbers at the beginning of application execution for the selected benchmarks. Sratio here is the ratio of the number of shifts that occur within a time interval T to the total number of shifts within the entire observation window. To better illustrate the change of the shift number in each benchmark, we set different observation windows for different benchmarks. The time interval T is 1/1,000 of the length of the observation window.
At the beginning of a benchmark execution, the data access pattern is not yet revealed. Thus, the data management scheme allocates cache blocks by following the common LRU policy. As more instructions and data are processed, a clearer data access pattern emerges, and HBWBR starts demonstrating its efficiency by allocating the access-intensive cache blocks close to the access ports. The shift numbers in some benchmarks (e.g., x264 and fluid) drop quickly with execution time, indicating that the data accesses are concentrated within a small portion of the cache and hence HBWBR has high efficiency. Benchmarks without a large occupation of cache blocks, such as body and swap, need quite a long time to settle, so their observation window is expanded to 50 percent of the execution time.
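The Sratio definition can be written out directly. The sketch below assumes a hypothetical input format, a list of timestamps (one per RT shift) in arbitrary time units, and the helper name `sratio_series` is illustrative. The default of 1,000 intervals follows the T = 1/1,000 window length stated in the text.

```python
def sratio_series(shift_times, window_start, window_end, num_intervals=1000):
    """Per-interval share of RT shifts within an observation window.

    Element i is (# shifts in interval i) / (# shifts in the whole window).
    Shifts outside [window_start, window_end) are ignored.
    """
    interval = (window_end - window_start) / num_intervals
    counts = [0] * num_intervals
    total = 0
    for t in shift_times:
        if window_start <= t < window_end:
            # Clamp to guard against floating-point rounding at the edge.
            idx = min(int((t - window_start) / interval), num_intervals - 1)
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * num_intervals
    return [c / total for c in counts]


# Small example with 4 intervals of length 2 over the window [0, 8).
series = sratio_series([0, 1, 2, 7], window_start=0, window_end=8,
                       num_intervals=4)
```

By construction the values sum to 1 over the observation window, so a curve that falls quickly indicates that most shifts happen early in the window.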
Further Analysis on Cache Access Statistics
The cache access statistics of a benchmark can reflect its execution behavior and performance. We calculate the ratio of read and write numbers (r/w ratio) of all the benchmarks and summarize the results in Fig. 18a. The r/w ratio largely affects the performance of racetrack memory. Since writing to a magnetic domain in general takes longer than reading data from it, the benchmarks with more writes show less performance improvement in HDART. For example, swap, which has the smallest r/w ratio, has the least performance enhancement. Relatively speaking, the frequency and latency of read operations have more impact on the system performance. Thus, the HDART design with HBWBR demonstrates higher IPC improvement in the benchmarks with a large number of read operations.
Moreover, Fig. 18b shows statistics of the cache access intensity, which is characterized by the ratio of the number of cache accesses to the total number of instructions. Here, we use black as the baseline for normalization. We note that the access intensity determines the effectiveness of the proposed data management scheme in HDART: the benchmarks with higher access intensity have more opportunity to reduce the cost of RT shifts. For example, HBWBR can eliminate more than 60 percent of the RT shifts in ferret but only 45 percent in black. Fig. 19 shows the access distribution over all cache blocks for three benchmarks with different characteristics. The access distributions reflect how many data blocks have the potential to be moved into the fast way, as discussed before. Benchmarks with strong spatial locality have more chance to hit in the fast way with the assistance of HBWBR. swap has a lower relative number of fast-way hits than freq. A benchmark like body also has a relatively low hit ratio in the fast way due to its poor spatial locality.
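Both statistics are simple ratios over per-benchmark counters. The sketch below shows one way to compute them; the function names and all counter values are hypothetical placeholders rather than data from Fig. 18, and only the normalization to black follows the text.

```python
def rw_ratio(reads, writes):
    """Ratio of read accesses to write accesses."""
    return reads / writes


def access_intensity(cache_accesses, instructions):
    """Cache accesses per executed instruction."""
    return cache_accesses / instructions


def normalize(values, baseline):
    """Scale every entry so that the baseline benchmark equals 1.0."""
    ref = values[baseline]
    return {name: value / ref for name, value in values.items()}


# Placeholder counters for illustration only (NOT measured results).
counters = {
    "black":  {"reads": 300, "writes": 100, "instructions": 100_000},
    "ferret": {"reads": 900, "writes": 300, "instructions": 100_000},
}

ratios = {name: rw_ratio(c["reads"], c["writes"])
          for name, c in counters.items()}
intensity = {name: access_intensity(c["reads"] + c["writes"],
                                    c["instructions"])
             for name, c in counters.items()}
normalized_intensity = normalize(intensity, baseline="black")
```

In this made-up example both benchmarks have the same r/w ratio, but ferret issues three times as many cache accesses per instruction as black, which is the kind of difference the normalized intensity is meant to expose.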
Impact of RT Geometrical Dimensions
All the above simulations assume a moderate domain size of 4F², which reflects the current device engineering (refer to Section 3.2). In this section, we further investigate an aggressive design with a domain size of 1F². All the evaluations are conducted under the same technology node (e.g., F = 45 nm).
Smaller magnetic domains lead to an even more compact LLC with smaller chip size, lower access latency, and less energy consumption (including both dynamic and leakage energies). Moreover, smaller domain dimensions also help lower the energy per shift. On the other hand, for a given layout design, each RW port has to be shared by more magnetic domains, resulting in more RT shifts in data accesses.
We investigate the HDART design under the TS2 + HBWBR configuration with domain sizes of 4F² and 1F². The performance and energy comparisons can be found in Figs. 16 and 17, respectively. Simulation results show that the design utilizing the 1F² domain size obtains only a slight improvement in IPC performance (3 percent) but more significant savings in LLC energy (11 percent), compared to the design with 4F² magnetic domains. Compared to the STT-RAM cache design, HDART (1F²) + TS2 + HBWBR achieves a 6.4× area reduction, a 25 percent performance enhancement, and a 62 percent energy saving. The benefit over conventional SRAM is even more significant.
RELATED WORK
Over the last decade, we have witnessed the success of magnetic memory (MRAM) technologies, including the toggle-mode MRAM and STT-RAM. The nonvolatile MRAM can achieve significant improvement in energy efficiency and hence is taken as an alternative to SRAM for future on-chip caches. For instance, Dong et al. [26] modeled the circuit design parameters of STT-RAM and analyzed its potential by comparing it with embedded SRAM. Wu et al. [27] evaluated the utilization of STT-RAM as a last-level cache. To overcome the system performance degradation induced by the slow writes of MRAM and STT-RAM, comprehensive efforts across device, circuit design, and architectural exploration have been conducted. Examples include the SRAM/STT-RAM hybrid cache hierarchy [2], [28], the retention-reduced STT-RAM hierarchy [1], [20], [29], and NoC design for STT-RAM [30].
Recently, racetrack memory as the third-generation magnetic technology is being widely investigated. Racetrack memory uses a spin-coherent electric current to move magnetic domains along a nanoscopic permalloy wire for data storage [8] . Various forms of storage applications based on racetrack memory have been demonstrated, such as the array integration at standard IBM 90 nm technology [12] , the content addressable memory (CAM) design and fabrication [14] , and shift register realized by PMA racetrack technology [17] . These works demonstrate the fabrication feasibility of racetrack memory and trigger the circuit and architecture level design exploration.
Venkatesan et al. [15] first pointed out the impact of shift operations and analyzed the effect of shift status updates in a racetrack memory design named TapeCache. Shift policies that either return or do not return the racetrack to its original position after every access were proposed and compared. Sun et al. [31] initiated the cross-layer design for racetrack memory. An in-depth exploration of layout and architecture co-optimization was then presented in [19]. That work studies and compares several different physical layout strategies and array organizations. From this evaluation, a workload-oriented racetrack LLC architecture was proposed that combines different array types, each of which is tailored to a specific data access pattern. Further, a resizable cache access strategy was applied to reduce shifting overheads at runtime.
Besides, more racetrack memory design applications have been investigated. An all-spin cache design was developed by using domain wall shift based writes [32] . The devices can also be applied to big-data computing [33] , image processing [34] , and energy efficient recognition and mining [35] . Furthermore, a multi-level magnetic RAM using domain wall shift was proposed, potentially offering even higher storage density [36] .
CONCLUSION
In this paper, a cross-layer design exploration and optimization is performed for racetrack memory. We initiate the design exploration with a novel layout approach that enables a memory array structure in which all ports are R/W ports. A flexible hardware architecture (HDART) is also proposed based on the memory cell and array design. Based on the proposed hardware architecture, a data management scheme is proposed to further improve the efficiency of the RT-based LLC. State-of-the-art memory technologies are selected for comparison with the RT-based HDART. The RT-based HDART with data management can achieve a 6.4× area reduction, a 25 percent performance enhancement, and a 62 percent energy saving, compared to the STT-RAM cache design. The improvement obtained from the proposed HDART is much higher than that of TapeCache [15].
