Recent trends of increasing core-count and bandwidth/memory wall have motivated researchers to explore novel memory technologies for designing processor components such as cache, register file, shared memory, and so on. Domain-wall memory (DWM), also known as racetrack memory, is a promising emerging technology due to its non-volatility and very high density. However, use of DWM presents challenges due to characteristics of both DWM itself (e.g., requirement of shift operations, variable latency) and processor components. Recently, several techniques have been proposed to address these challenges. This article presents a survey of architectural techniques for using DWM for designing components in both CPU and GPU. We discuss techniques related to performance, energy, and reliability and also discuss works that compare DWM with other memory technologies. We also highlight the opportunities and obstacles in using DWM for designing processor components. This survey is expected to spark further research in this area and be useful for researchers, chip designers, and computer architects. 
INTRODUCTION
As the quest for higher performance and core-count confronts the formidable challenge of plateauing resource budgets, research into high-density and low-leakage memory technologies has become not merely attractive but imperative. Conventional memory technologies such as static random access memory (SRAM) and dynamic random access memory (DRAM) suffer from high leakage power and the need for refresh, respectively, which presents a serious obstacle to their use in next-generation processors. These factors have motivated researchers to explore emerging memory technologies such as domain-wall memory (DWM), spin transfer torque RAM (STT-RAM), and resistive RAM (ReRAM).
Among these, DWM [Parkin et al. 2008] is a promising memory technology due to its attractive properties. DWM is close to SRAM in its read/write latency and has better write performance than STT-RAM [Venkatesan et al. 2016]. Also, it can provide up to 28-32× and 8-12× area advantage compared to SRAM and STT-RAM, respectively [Zhang et al. 2015a; Ranjan et al. 2015]. Use of DWM has been proposed for all levels in the memory hierarchy, including scratchpad memory, register file, first- and last-level cache, main memory, and storage (refer to Table II in Section 3), and, thus, DWM promises to be a truly universal memory technology. In fact, it is predicted that, assuming current scaling trends, DWM can provide a terabyte of on-chip memory with petabit/second bandwidth within a power budget of a few watts [Ghosh 2013]. Thus, DWM is expected to provide a better-than-Moore's-law scaling path.
Use of DWM, however, also presents unique challenges. Unlike conventional array-style random access memories (RAMs) such as ReRAM or STT-RAM, which provide random access to data, DWM is a tape-style racetrack memory in which an access port is shared between multiple cells and, hence, reading/writing requires shift operations. Shift operations incur latency/energy overhead and present reliability challenges. For example, the worst-case shift+read latency of a multi-bit DWM (with 32 bits per tape) can be 26 times that of an iso-capacity SRAM. In fact, shift operations may consume more than 50% of the energy in DWM-based last-level caches (LLCs). Further, on incomplete shifting, domains may not align with access ports, which leads to read/write failure [Zhang et al. 2015b]. Clearly, architecture-level techniques are required so that DWM may augment and possibly even replace conventional memory technologies in the design of processor components. Several techniques have recently been proposed to address these needs.
Contributions: In this article, we present a survey of techniques for using DWM for designing processor components. Figure 1 presents the overview of this article.
Section 2 discusses the background on DWM design and some recent DWM prototypes. It also compares DWM with other memory technologies and presents the factors involved in the design of DWM-based components. Section 3 presents a classification of research projects based on key parameters and also summarizes the main ideas of these projects. Sections 4 and 5 review techniques for using DWM to design processor components in CPU and GPU and application-specific processors, respectively. Sections 6 and 7 look at these research projects from the perspective of performance and reliability improvement techniques, respectively. Note that the research projects reviewed in these sections are deeply intertwined and although they are summarized in a single category, many of these projects fall under multiple categories. Section 8 concludes this article with a discussion of future challenges.
Fig. 2. Schematics of (a) 1-bit DWM and (b) multi-bit DWM [Venkatesan et al. 2016].
Scope of the article: For the sake of concise presentation, we limit the scope of this article in the following manner. We discuss the use of DWM in CPUs and GPUs but not in field-programmable gate arrays (FPGAs). We discuss microarchitecture- and system-level techniques and not circuit/device-level techniques. Different research works use different experimentation platforms, and, hence, we primarily focus on their key ideas and insights and include only selected quantitative results to show the scope of improvement. We hope that this survey will be useful for chip designers, system engineers, researchers, and others.
BACKGROUND
We now provide a brief background on DWM and refer the reader to prior work for more details [Venkatesan et al. 2016; Parkin et al. 2008; Zhang et al. 2016] .
DWM Design and Organization
DWM works by controlling domain-wall motion in ferromagnetic nanowires [Parkin et al. 2008] . A DWM cell can store single or multiple bits. We now discuss both 1-bit and multi-bit DWM. 1-bit DWM: As shown in Figure 2 (a), 1-bit DWM cell has a ferromagnetic wire, two access transistors, and an MTJ [Venkatesan et al. 2016] . Depending on whether the magnetic orientation of the free layer is parallel or anti-parallel to that of the fixed layer, the resistance of MTJ is low (value "0") or high (value "1"), respectively. During R/W operations, the direction of current is controlled using T1 (read access transistor) and T2 (write access transistor), respectively. For reading the cell, T1 is turned ON and a suitable read voltage is applied to bit line (BL). The magnitude of current flowing from BL to ground (GND) depends on the MTJ resistance, which determines the data stored in the cell. For writing "0," T2 is turned ON, bitline BL is driven high, and source line (SL) is connected to GND. Thus, the domains get shifted towards the left, which writes 0 in the cell. By reversing the voltage conditions of bitlines, value "1" can be written. Thus, the write operation is realized by shifting the suitable magnetization from the fixed domains to the free domain of the ferromagnetic wire [Venkatesan et al. 2016] . 
Multi-bit DWM: In a ferromagnetic wire, multiple domains can be separated by domain walls, and each domain can be separately programmed to a specific magnetization direction for storing a single bit. This allows the design of a multi-bit DWM, and its schematic is illustrated in Figure 2(b). The multi-bit DWM consists of a ferromagnetic wire that can store multiple bits, shift ports, and an R/W port [Venkatesan et al. 2016]. The R/W port consists of two fixed domains, four access transistors, and one MTJ. Shift ports consist of an access transistor at each end of the wire. Multi-bit DWM allows read, write, and shift operations. R/W operations to the domain at the R/W port take place similarly to those in a 1-bit DWM cell. To shift the bits towards the right, BL is connected to VDD and SL to GND. By reversing the voltages of the bitlines, bits can be shifted in the opposite direction. Thus, bit-shifting in multi-bit DWM is performed by turning ON the shift access transistors and driving the bitlines to suitable voltage values [Venkatesan et al. 2016]. Figure 3 presents the conceptual view of a multi-bit DWM. From an architecture viewpoint, DWM appears like a tape that can store multiple (up to hundreds of) bits with one or more access port(s). Since the bits can be shifted in either direction, access ports can be shared between bits in a tape, leading to much higher density than with SRAM and even STT-RAM.
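The tape-like view described above can be illustrated with a toy software model. The following is a minimal sketch, assuming a single R/W port and a cost of one shift per domain position moved; the class name `DWMTape` and its methods are hypothetical and do not come from any of the surveyed works.

```python
class DWMTape:
    """Toy model of a multi-bit DWM tape with one R/W port.

    Assumption: `aligned` is the index of the domain currently under the
    port, and moving to a neighbouring domain costs exactly one shift.
    """

    def __init__(self, num_bits):
        self.bits = [0] * num_bits
        self.aligned = 0        # domain currently aligned with the port
        self.total_shifts = 0   # running count of shift operations

    def _align(self, index):
        """Shift the tape until domain `index` is under the port."""
        if not 0 <= index < len(self.bits):
            raise IndexError("domain index out of range")
        shifts = abs(index - self.aligned)
        self.aligned = index
        self.total_shifts += shifts
        return shifts

    def read(self, index):
        """Return (bit value, number of shifts needed for this access)."""
        shifts = self._align(index)
        return self.bits[index], shifts

    def write(self, index, value):
        """Write a bit and return the number of shifts needed."""
        shifts = self._align(index)
        self.bits[index] = value
        return shifts


tape = DWMTape(32)
print(tape.write(5, 1))   # 5 shifts from the initial position 0
print(tape.read(5))       # (1, 0): already aligned, no shift needed
print(tape.read(20))      # (0, 15): 15 shifts from position 5
```

The model makes the key architectural property visible: the cost of an access depends on where the previous access left the tape, not only on the address being requested.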
The effective latency and density of DWM depend on its organization, the number of access ports, and other optimizations. Different DWM designs with cell-size per bit of 1F², 2.5F², 4F², 4.5F², 5F² [Zhang et al. 2015a], 6F², 24.7F² [Chung et al. 2016; Venkatesan et al. 2013], 32F² [Zhang et al. 2015a], 40F², 48F² [Venkatesan et al. 2013], and so on, have been studied that provide different tradeoffs of capacity (e.g., single vs. multiple bit), performance, area, and energy.
Recent DWM Prototypes
In recent years, several DWM prototypes have been demonstrated [Annunziata et al. 2011; Fukami et al. 2009; Fukami et al. 2013]. For example, Fukami et al. [2013] study the write and retention properties of DWM devices with dimensions down to 20nm. They show that write current and write time (which depends on domain-wall motion velocity) continue to scale with the size of the device. At 20nm, write time becomes nearly 1ns, which makes DWM a suitable replacement for SRAM. Also, write energy at 20nm is only 1.8fJ, which is an order of magnitude smaller than that for . This is due to the smaller resistance of the write-current path of the three-terminal domain-wall motion device [Fukami et al. 2013]. Also, the DWM device shows good thermal stability and small error rates even at small feature sizes. Overall, the study of Fukami et al. [2013] confirms the superior performance, energy efficiency, and reliability of DWM.
Comparison with Other Memory Technologies
To assess the opportunities and obstacles in use of DWM, it is illuminating to see the properties of DWM vis-à-vis other memory technologies. Table I compares several memory technologies on key parameters (PCM = phase change memory). Note that the actual speed and energy with DWM depends on the number of shift operations. From Table I , it is clear that although DWM offers many advantages over other memories, addressing its shift overhead is vital to fully realize its potential.
In what follows, we first compare DWM with other non-volatile memory (NVM) in terms of their working mechanisms. We then present the comparison with two memories that bear similarity to DWM, viz. STT-RAM and SRAM/DRAM (S/D) memory [Yu et al. 2011] .
Comparison with other NVMs: Different from charge-based memories such as DRAM, NVMs store data as a change in physical state [Mittal 2014c; Mittal and Vetter 2015a]. Since write operations involve changing the physical state, read and write operations to NVMs generally have asymmetric latency and energy. Similar asymmetry may also exist for accessing 0 and 1 values. Similarly to other NVMs, DWM can store multiple bits in a cell, which increases its density significantly. By virtue of their non-volatility, DWM and other NVMs are useful for realizing high-density memory systems that reduce energy consumption and checkpointing overhead. The key difference between DWM and other NVMs (STT-RAM, ReRAM, and PCM) is that accessing DWM requires shift operations since many magnetic domains share one read/write port for achieving high density. By comparison, in other NVMs, each storage element has its own access path and, thus, an entire block (e.g., 64B) can be simultaneously accessed.
Comparison with STT-RAM: STT-RAM uses MTJ as the main storage element. MTJ is composed of two ferromagnetic layers, a free layer and a reference layer, and a thin dielectric tunneling barrier. The magnetization of the free layer can be changed using spin-polarized current while that of the reference layer is fixed. The MTJ resistance is a function of the relative magnetic orientation of the two ferromagnetic layers. When the spin alignment of the two layers is in anti-parallel and parallel directions, the junction resistance is high ("1" state) and low ("0" state), respectively. Both STT-RAM and DWM are spin-based memories. The limitation of STT-RAM is that its R/W paths cannot be separately optimized, which harms its performance and reliability. DWM can provide higher density than STT-RAM by virtue of sharing access ports, although STT-RAM has the advantage of not requiring shift operations. By mitigating shift overhead, DWM can achieve better write performance than STT-RAM.
Comparison with S/D memory: Yu et al. [2011] propose an SRAM-DRAM memory array that integrates N DRAM branches (2N 1T1C DRAM cells) into every SRAM cell (note that S/D memory is not an NVM). In S/D memory, data can be locally copied between the SRAM cell and one selected DRAM branch within a cell. The externally accessible active context is stored in the SRAM cell. To access a DRAM branch, its value needs to be copied into SRAM since the DRAM branch is externally inaccessible. Thus, S/D memory hides DRAM latency by only exposing SRAM to the core. In both DWM and S/D memory, multiple contexts exist, of which one is active and can be accessed with low latency. Shift operations are required to access other contexts and, thus, these memories trade off data accessibility for efficiency. DWM differs from S/D memory in that, for S/D memory, the switching latency/energy between the active and requested context is fixed, whereas, for DWM, it varies based on the distance between the active and requested context.
Factors and Tradeoff in DWM
As we show below, leveraging the full potential of DWM requires accounting for several tradeoffs.
Performance/energy impact of shifts: Accessing a DWM bit requires shifting the tape until the desired bit reaches an access port. Each shift operation consumes some time (e.g., 0.5ns [Sun et al. 2014] to 2ns) and, thus, the worst-case access latency of DWM can be very poor. Shift operations also consume energy. For example, read, write, and shift operations may consume 0.084, 0.62, and 0.62nJ, respectively, in a 4MB DWM cache [Sun et al. 2014], and, thus, shift energy can be equal to the write energy and more than 7 times larger than the read energy. Due to this, the performance and energy efficiency of a DWM device depend on the number of shift operations required per access. Further, the optimal data placement problem on DWM for minimizing the number of shift operations is NP-complete and, hence, only heuristic-based solutions may be feasible to keep computation time manageable.
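The arithmetic behind these figures can be made concrete with a short calculation. The sketch below uses the energy values quoted above from Sun et al. [2014] and the lower end of the quoted per-shift latency range; it assumes, purely for illustration, that the quoted shift energy applies to each single-position shift and that energy and latency add up linearly with the shift count. The function names are hypothetical.

```python
READ_NJ = 0.084   # read energy quoted above (nJ)
WRITE_NJ = 0.62   # write energy quoted above (nJ)
SHIFT_NJ = 0.62   # shift energy quoted above (nJ); assumed per one-position shift
SHIFT_NS = 0.5    # lower end of the per-shift latency range quoted above (ns)


def access_energy_nj(op, num_shifts):
    """Energy of one access: base read/write energy plus shift energy.

    Assumption: shift energy grows linearly with the number of shifts.
    """
    if op not in ("read", "write"):
        raise ValueError("op must be 'read' or 'write'")
    base = READ_NJ if op == "read" else WRITE_NJ
    return base + num_shifts * SHIFT_NJ


def shift_latency_ns(num_shifts, per_shift_ns=SHIFT_NS):
    """Latency added by shifting, assuming a constant time per shift."""
    return num_shifts * per_shift_ns


for shifts in (0, 1, 4):
    total = access_energy_nj("read", shifts)
    share = shifts * SHIFT_NJ / total
    print(f"{shifts} shifts: {total:.3f} nJ per read, "
          f"{share:.0%} of it spent on shifting, "
          f"+{shift_latency_ns(shifts):.1f} ns")
```

Under these assumptions, even a single shift dominates the energy of a read, which is why most techniques surveyed here focus on reducing the number of shifts per access.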
Variable latency due to shifts: Shift operations not only increase the latency but also make it variable, which presents many challenges. For example, variable latency complicates scheduling of dependent instructions and makes DWM difficult to use for real-time applications that demand timing predictability for providing quality-of-service guarantees. Thus, DWM may be better suited for components where non-uniform latency is acceptable, such as NUCA (non-uniform cache access) caches. Since first-level cache generally shows higher temporal locality than last- (or medium-) level cache [Mittal 2014b], shift overhead is lower in first-level cache than in LLC. However, for CPUs and GPUs, first-level cache (L1) is optimized for latency, and, hence, it may be designed with 1-bit DWM [Venkatesan et al. 2016] to avoid the shift overhead completely.
Choice of number and type of access ports: An access port is used to read and write the bits aligned with the port. Although using a larger port count reduces the shift distance, it may increase the latency of peripheral circuitry and shift operations [Mao et al. 2014]. In fact, since the access transistor dominates the area of a DWM tape [Venkatesan et al. 2016; Xu et al. 2016], a large port count increases area and power consumption. Similarly, there are tradeoffs in using a read-only port or a read/write port [Sun et al. 2014]. These factors call for a careful choice of the number and type of access ports.
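The effect of port count on shift distance can be explored with a small brute-force calculation. This is only a sketch: it assumes that ports are placed at the centre of equal-sized segments of the tape and that each bit is accessed through its nearest port, neither of which is prescribed by the works cited above. The function name `worst_case_shift` is hypothetical.

```python
def worst_case_shift(num_bits, num_ports):
    """Largest distance from any bit to its nearest port (brute force).

    Assumption: ports sit at the centre of equal-sized tape segments.
    """
    segment = num_bits / num_ports
    ports = [int(segment * (p + 0.5)) for p in range(num_ports)]
    return max(min(abs(bit - port) for port in ports)
               for bit in range(num_bits))


for ports in (1, 2, 4, 8):
    print(f"{ports} port(s) on a 64-bit tape: "
          f"worst-case shift distance = {worst_case_shift(64, ports)}")
```

In this model, doubling the port count halves the worst-case shift distance, while the number of access transistors, which the text identifies as the dominant area component, grows with every added port.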
Choice of DWM parameters: Several higher-level objectives such as lowering read/write latency/energy or leakage may dictate the choice of optimal DWM parameters, such as transistor size, partitioning of the DWM array into sub-arrays, peripheral circuitry, and so on [Poremba et al. 2015]. For example, increasing the write current reduces write latency at the cost of increased power consumption and vice versa. Similarly, the characteristics of the processor component (e.g., register file vs. cache, L1 vs. L2 cache, tag vs. data array) where DWM is used have a crucial impact on the design and choice of the optimal DWM bitcell (refer to Sections 4 and 5). These design considerations may force designers to use different DWM variants (e.g., 1bitDWM and multibitDWM) and different memory technologies for designing different processor components, cache levels, or cache portions (tag and data). For example, some researchers use STT-RAM or SRAM for designing the tag array (refer to Table III) and DWM for designing the data array of the cache. This, however, increases the design complexity significantly.
Dependence on application characteristics: The shift overhead of DWM is typically mitigated by leveraging temporal locality of applications (refer Section 3). Some memory structures inherently perform serial access (e.g., FIFO and LIFO [Chung et al. 2016] ) and, hence, they are well suited for DWM. However, for applications with little locality (e.g., random memory access), DWM incurs large shift overhead [Park et al. 2014] and the benefit of these techniques reduces. Thus, a co-design approach may be necessary to match the characteristics of DWM with the requirements of the applications.
Requirement of overflow bits: Shift operations necessitate additional overflow bits outside the shift ports to avoid data loss [Venkatesan et al. 2013, 2016; Zhang et al. 2015b]. The number of these bits may be as large as the number of actual (useful) data bits. Nevertheless, overflow bits may not lead to area overhead since complementary metal-oxide semiconductor (CMOS) access transistors dominate the cell area. Also, overflow bits can be reduced by using a larger port count and intelligent policies for deciding the port from which an access is performed [Venkatesan et al. 2013].
Reliability challenges: DWM also suffers from reliability issues. For example, process variation, which refers to deviation in device parameters from their nominal value [Mittal 2016b], can lead to large variation in read and write latency of DWM. This can degrade the performance of a DWM cache significantly (e.g., up to 13%). Similarly, Zhang et al. [2015b] note that on incomplete shifting, the domains in DWM may not align with access ports, which leads to failure in reading and writing correct data. This is referred to as "position error." Also, if the misaligned bit has the same value as the stored bit, conventional ECC cannot detect the error in a timely manner. Even if ECC detects the misaligned bit as incorrect, it cannot determine the direction and number of steps of the shift error. Further, correcting the error would require accessing the stored bits, which itself requires shift operations, and these operations may introduce further position errors. Thus, due to the unique shift-based functioning of DWM, conventional reliability techniques such as ECC or duplication [Mittal and Vetter 2016] may not be effective for DWM. In fact, due to position errors, the mean-time-to-failure (MTTF) for DWM may be as low as a few microseconds, compared to the goal of 10 years [Zhang et al. 2015b].
The techniques presented in next sections seek to address these challenges.
CLASSIFICATION AND OVERVIEW
A Classification Based on Key Characteristics
Table II organizes the works based on their optimization objective and the processor component where DWM is used. It is clear that DWM has been used at all levels of memory hierarchy. Table III shows the works which compare DWM with other memory technologies. Also, since drop-in replacement of DWM is unlikely to be efficient for all processor components, some works suggest using DWM for only certain components/portions. For example, since the tag array contributes only around 5% of the cache area, designing it with multibitDWM does not provide a large area advantage but only incurs latency penalty due to shifting operations in multibitDWM. Since access to the tag array happens on the critical access path, SRAM or 1bitDWM are better suited for designing the tag array. STT-RAM is also suitable for the tag array due to its compatibility with DWM and its property of allowing random access. These works are also highlighted in Table III.

Table II. Classification by optimization objective and by processor component where DWM is used
Optimization objective
  Performance: [Atoofian and Saghir 2015; Mao et al. 2014; Zhang et al. 2015a; Venkatesan et al. 2013; Atoofian 2015; Moeng et al. 2015; Ranjan et al. 2015; Wang et al. 2015; Gu et al. 2015; Liang and Wang 2016; Mao et al. 2015; Park et al. 2014; Xu et al. 2016; Zhang et al. 2016]
  Energy: [Mao et al. 2014; Zhang et al. 2015a; Venkatesan et al. 2013; Atoofian 2015; Moeng et al. 2015; Ranjan et al. 2015; Wang et al. 2015; Chung et al. 2016; Park et al. 2014; Xu et al. 2016; Zhang et al. 2016]
  Reliability: [Zhang et al. 2015b]
Processor component where DWM is used
  First-level cache: [Venkatesan et al. 2013; Moeng et al. 2015; Xu et al. 2016]
  Last-level cache: [Atoofian and Saghir 2015; Venkatesan et al. 2013; Xu et al. 2016; Zhang et al. 2016]
  Scratchpad memory: [Mao et al. 2015; Moeng et al. 2015]
  CAM: [Zhang et al. 2012]
  Main memory: [Wang et al. 2015]
  Storage: [Park et al. 2014]
  GPU cache: [Atoofian and Saghir 2015]
  GPU RF: [Mao et al. 2014; Atoofian 2015; Moeng et al. 2015; Liang and Wang 2016]

Table III. Works comparing DWM with other memory technologies, and technologies used for tags in DWM caches
Comparison with
  SRAM: [Atoofian and Saghir 2015; Zhang et al. 2015a; Venkatesan et al. 2013; Atoofian 2015; Sun et al. 2014; Ranjan et al. 2015; Zhang et al. 2015b; Chung et al. 2016; Venkatesan et al. 2016; Zhang et al. 2012, 2016]
  STT-RAM: [Zhang et al. 2015a; Venkatesan et al. 2013; Sun et al. 2014; Ranjan et al. 2015; Zhang et al. 2015b; Chung et al. 2016; Venkatesan et al. 2016]
  ReRAM
  eDRAM: [Zhang et al. 2015a]
  PCM: [Zhang et al. 2015a]
  DRAM: [Wang et al. 2015]
  NAND Flash: [Park et al. 2014; Zhang et al. 2015a]
  S/D memory: [Moeng et al. 2015]
Tags in DWM cache designed with
  1bitDWM: [Venkatesan et al. 2016; Ranjan et al. 2015; Venkatesan et al. 2013]
  SRAM
  STT-RAM

Table IV. Key ideas of DWM management techniques
  Pre-shifting: [Liang and Wang 2016; Venkatesan et al. 2013; Atoofian and Saghir 2015; Venkatesan et al. 2016; Moeng et al. 2015]
  Tape head management: [Venkatesan et al. 2016]
  Data placement: [Mao et al. 2015; Liang and Wang 2016]
  Use of data compression
  Reducing shifting granularity: [Moeng et al. 2015]
  Use of bit-interleaving: [Venkatesan et al. 2016; Wang et al. 2015; Ranjan et al. 2015; Mao et al. 2014; Atoofian 2015]
  Write-back buffers to coalesce the accesses: [Mao et al. 2014; Moeng et al. 2015]
  Data comparison write
  Activating fewer tracks on each access
  Access frequency-based management: [Venkatesan et al. 2016]
  Different optimizations for reads/writes: [Venkatesan et al. 2013; Mao et al. 2014]
  Different optimizations for instruction/data blocks: [Sun et al. 2014]
  Cache reconfiguration: set-based [Ranjan et al. 2015] and way-based [Sun et al. 2014]
  Reducing shifting distance by GPU warp scheduling: [Mao et al. 2014]
  Shift and write current boosting
  Addressing process variation
  Optimization heuristics: genetic algorithm [Mao et al. 2015], integer linear programming, integer non-linear programming, greedy strategy, and graph algorithms [Liang and Wang 2016]
  Use of DWM in domain-specific computing: [Gu et al. 2015; Chung et al. 2016; Park et al. 2014]
Similarly, in a domain-specific recognition-mining processor, the first-level FIFO memory can be designed with DWM, whereas the second-level data memory can be designed with STT-RAM. Venkatesan et al. [2016] design a few ways of the L2 data array with 1bitDWM and the remaining ways with multibitDWM.
Main Ideas of DWM Management Techniques
Table IV highlights many key ideas that are common to several DWM management techniques. We now discuss them briefly.
1. Tape head management and pre-shifting: Shift latency can be hidden by performing pre-shifting [Liang and Wang 2016; Venkatesan et al. 2013; Atoofian and Saghir 2015; Venkatesan et al. 2016]. For example, in a GPU RF, only a few banks are accessed at a time and the remaining banks are used at a future time instance. Based on this, DWM shifting latency can be hidden [Liang and Wang 2016]. Similarly, shifting latency can be overlapped with operand collector queuing delay by using pre-shifting in a GPU RF [Moeng et al. 2015]. Also, address predictors can be used to predict addresses that will be accessed in the near future [Atoofian and Saghir 2015].
After an access, the tape head can be moved to the original location to reduce metadata overhead [Venkatesan et al. 2016]. Alternatively, the tape head can be left at the same location since, due to data locality, the next access is likely to be to the same or a nearby block [Venkatesan et al. 2016].
2. Data placement and migration: Intelligent data placement can be performed to group data items that are frequently accessed together [Mao et al. 2015; Liang and Wang 2016] . In a merged two-level (L1/L2) cache, blocks are migrated to effectively transfer a block from L1 to L2 and vice versa [Xu et al. 2016] .
3. Reducing shifting operations: By using data compression, the number of bits stored in and accessed from each block can be reduced. Also, shifting granularity can be reduced to lower the shift overhead [Moeng et al. 2015].
4. Bit-interleaving: In comparison to mapping all the bits of a block to a single tape, interleaving them across multiple tapes allows parallel access to those bits but causes contention on concurrent accesses to different cache lines in the same DWM tape. Also, storing the most frequently accessed bits close to ports reduces access latency to them. Several techniques use these ideas to reduce shift overhead [Venkatesan et al. 2016; Wang et al. 2015; Ranjan et al. 2015; Mao et al. 2014; Atoofian 2015].
5. Reducing write overhead: To reduce writes to DWM, some researchers use write-back buffers to coalesce the accesses to DWM [Mao et al. 2014; Moeng et al. 2015], whereas others perform data comparison write. In cache writes, least significant bits (LSBs) generally change more frequently than most significant bits (MSBs) and, based on this, LSBs can be mapped to lower-energy states of multi-bit DWM to save energy. DWM access energy can also be decreased by reducing the number of tracks activated on each access.
6. Access frequency-based management: It is well known that the access frequency of different cache blocks differs [Venkatesan et al. 2016]. Based on this, frequently accessed blocks can be migrated closer to an access port, from multibitDWM to 1bitDWM [Venkatesan et al. 2016], and from slower DWM ways to faster DWM ways. This is illustrated in Figure 4.
7. Criticality and access-pattern-based management: Since reads are more performance critical than writes, some researchers propose different optimizations for them [Venkatesan et al. 2013; Mao et al. 2014]. Similarly, instruction and data blocks have different R/W characteristics and, based on this, the cache can be partitioned, and different types of ports can be used in different cache partitions to reduce shift overhead for them [Sun et al. 2014].
8. Cache reconfiguration: Some researchers use reconfiguration in DWM caches to reduce access latency/energy of active data bits [Ranjan et al. 2015; Sun et al. 2014] . Note that DWM has near-zero leakage power and, hence, unlike SRAM caches, where reconfiguration is used primarily to reduce leakage energy by power-gating unused blocks [Mittal et al. 2013; Mittal 2014b] , DWM cache reconfiguration is performed for reducing access latency and dynamic energy. Also, while Ranjan et al. [2015] perform set-based cache reconfiguration, Sun et al. [2014] perform way-based cache reconfiguration.
9. GPU warp scheduling scheme: In GPUs, warp scheduling can be altered to preferentially issue a warp that generates RF requests with the smallest distance to the present position of access ports [Mao et al. 2014] .
10. Shift and write current boosting: Shift and write current can be changed to trade off shift/write latency against power consumption.
11. Improving reliability: Decoupling read/write paths allows separately optimizing them for higher reliability and performance. Since process variation is a crucial reliability challenge [Mittal 2016b], techniques to address it have also been proposed.
12. Search heuristics: Many works use heuristics/algorithms for performing DWM management such as reducing shifting overhead. These heuristics include genetic algorithm [Mao et al. 2015], integer linear programming, integer non-linear programming, greedy strategy, and graph algorithms [Liang and Wang 2016].
13. Application domain: Some works use DWM for specific domains or application-specific processors, for example, recognition and mining processor, application-specific embedded systems, digital signal processor [Chung et al. 2016], non-volatile sensor node, and graph processing [Park et al. 2014].
14. Others: Some researchers use domain-wall nanowires for both memory and logic/computation [Wang et al. 2015] .
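The warp scheduling idea in item 9 above can be sketched as a greedy selection over ready warps. The following is a simplified illustration rather than the scheme of Mao et al. [2014]: it assumes a single access port, one register request per warp, and a hypothetical mapping from warp identifiers to tape positions.

```python
def issue_order(ready_warps, port_position):
    """Greedily issue the warp whose request is nearest to the port.

    ready_warps: dict mapping a warp id to the tape position it requests
    (hypothetical representation). Returns (issue order, total shifts).
    """
    pending = dict(ready_warps)
    order = []
    total_shifts = 0
    while pending:
        # Choose the warp with the smallest shift distance from the port.
        warp = min(pending, key=lambda w: abs(pending[w] - port_position))
        total_shifts += abs(pending[warp] - port_position)
        port_position = pending.pop(warp)   # port is now aligned here
        order.append(warp)
    return order, total_shifts


def in_order_shifts(ready_warps, port_position):
    """Total shifts when warps are issued in their given order."""
    total = 0
    for position in ready_warps.values():
        total += abs(position - port_position)
        port_position = position
    return total


ready = {"w0": 12, "w1": 5, "w2": 30}
print(issue_order(ready, port_position=7))      # (['w1', 'w0', 'w2'], 27)
print(in_order_shifts(ready, port_position=7))  # 37
```

In this small example, the distance-aware order needs 27 shifts against 37 for in-order issue. A practical scheduler would also have to weigh fairness and latency hiding, which this sketch ignores.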
USING DWM IN CPUS
We now discuss the use of DWM in CPU components. Note that many of these techniques may also be applicable to GPU components.
DWM-Based Processor Cache
Venkatesan et al. [2016] present a cache design that uses management schemes at the device, circuit, and architecture levels to address the shift overhead of DWM. At the device level, they use domain-wall-based shift to perform write operations [Fukami et al. 2009], which consumes much lower time and current compared to the conventional MTJ-based write approach where domain wall shifts are used only for aligning the desired bit with the access port. At the circuit level, they propose two DWM bitcell designs, viz., 1bitDWM and multibitDWM, which have cell area of 40F² and 18F² per bit, respectively (refer to Section 2.1 for background). MultibitDWM stores multiple bits in a single cell and requires shift operations. 1bitDWM has low latency since it does not require shift operations and, hence, they use it for designing both tag and data arrays in the latency-optimized L1 cache. For the L2 cache, the tag array is designed using 1bitDWM to quickly make a hit/miss decision. Also, a few ways of the L2 data array are designed with 1bitDWM (called fast ways) and the remaining ways are designed using multibitDWM (called dense ways). Storing hot cache blocks in fast ways reduces access latency to them, and using dense ways increases the cache density. On two consecutive accesses to a block in a dense way, its data are swapped with the least-recently used (LRU) block among the fast ways. This data migration policy accounts for changing working set size and primarily uses fast ways for storing hot blocks.
They also explore the impact of bit-mapping policy. A naive scheme that maps all bits of a block to a single tape leads to very high shift latency. Instead, they map K 64-bit blocks in 64 DWM tapes, each with K bits of data. Thus, all bits of a block can be accessed in parallel. This scheme requires between 0 and K shift operations, whereas naive scheme always requires K shift operations. A limitation of their bit-interleaving scheme is that it requires shift operations in multiple tapes for each access; however, this does not lead to energy penalty due to efficiency of shift operations.
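The bit-interleaved mapping described above can be written down directly. The sketch below assumes that bit j of every block is stored on tape j at a position equal to the block index, and that all 64 tapes share a head position and shift in lockstep; these layout details and the function names are illustrative assumptions, not taken from Venkatesan et al. [2016].

```python
BITS_PER_BLOCK = 64


def interleaved_layout(num_blocks):
    """Build 64 tapes; tapes[j][i] holds (block i, bit j)."""
    return [[(block, bit) for block in range(num_blocks)]
            for bit in range(BITS_PER_BLOCK)]


def access_block(layout, block, head):
    """Gather all bits of `block`, one from each tape, in parallel.

    `head` is the block position currently aligned with the ports
    (assumed identical on all tapes). Returns the gathered cells, the
    shift distance per tape, and the total shifts summed over tapes.
    """
    distance = abs(block - head)
    cells = [tape[block] for tape in layout]
    return cells, distance, distance * len(layout)


layout = interleaved_layout(num_blocks=8)
cells, per_tape, total = access_block(layout, block=5, head=2)
print(per_tape, total)   # 3 shifts on each tape, 192 shifts over 64 tapes
```

The example shows both sides of the tradeoff noted in the text: the access completes after a short lockstep shift, but every one of the 64 tapes has to perform that shift.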
They further propose a static and a dynamic policy for deciding which tape heads on a cell are utilized for accessing a specific bit in the multi-port multibitDWM. In the static policy, a tape head is statically assigned to a cache block based on its initial location; thus, in this policy, the cache block address alone decides the tape head. In the dynamic policy, the head nearest to the requested block at runtime is chosen to access the bit. For deciding tape head position after cache access, they propose three policies, viz., eager, lazy, and pre-shifting. In the eager policy, a tape head is moved to its default location after each access, whereas in the lazy policy, it is not moved. In the pre-shifting policy, the next block access is predicted, and the tape head is aligned with that block. The eager policy leads to simple shift control logic and allows static assignment of the optimal tape head of each block. The lazy policy leverages spatial locality, which indicates that the current head location is likely to be close to the next block to be accessed. Their experiments show that without harming performance, their cache design achieves large area and energy reduction compared to both SRAM and STT-RAM caches.
Venkatesan et al. [2013] also propose another L2 cache design where the tag array is designed with 1bitDWM to allow quick hit/miss determination, and the data array is designed with multibitDWM to achieve high density. To hide shift latency in multibitDWM, they use a pre-shifting approach. Initially, based on the location of a cache block, an R/W port is statically assigned to it. On a cache access, the next block expected to be accessed is predicted and is pre-shifted to its statically assigned port. On an accurate prediction, the shift operation is avoided. On an inaccurate prediction, a read operation is performed through the nearest read port since read operations are critical to performance. By comparison, a write operation is performed through the statically assigned write port to reduce the number of extra bits required for avoiding data loss during shifting (since write operations will require a higher number of shifts due to the presence of read-only ports). Their technique improves performance and energy efficiency compared to SRAM and STT-RAM caches.
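To make the difference between the eager and lazy head-management policies concrete, the following sketch counts shift steps on a single tape with one head. It is a simplified model under stated assumptions: the function name and the trace format are hypothetical, the restore shifts of the eager policy are counted even though they may occur off the critical path, and the pre-shifting policy is omitted because it depends on a predictor.

```python
def total_shifts(trace, policy, default_pos=0):
    """Count shift steps for a trace of domain offsets on one tape.

    policy == "lazy":  the head stays where the last access left it.
    policy == "eager": the head returns to default_pos after each access.
    """
    if policy not in ("lazy", "eager"):
        raise ValueError("unknown policy")
    pos = default_pos
    shifts = 0
    for target in trace:
        shifts += abs(target - pos)          # move to the requested domain
        pos = target
        if policy == "eager":
            shifts += abs(pos - default_pos)  # restore to the default location
            pos = default_pos
    return shifts
```

For a trace with spatial locality such as `[5, 6, 7]`, the lazy policy needs far fewer shift steps in this model, which matches the intuition given in the text.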
Zhang et al. [2015] present a DWM track organization for reducing DWM access energy. In the current DWM designs, multiple heads of a single track share the same bitline and sourceline, and, hence, only one head can be active in any track at a time. They propose vertically connecting multiple heads of one track to different bitlines while the sourceline is still shared between them. This alleviates contention on the shared bitline and allows concurrent access to multiple bits through different heads. Also, they store Q bits of a cache line in each track and interleave a P-bit cache block in P/Q tracks and not P tracks. Thus, only P/Q tracks (and not P tracks) need to be activated for accessing a P-bit cache line from Q heads. They show that their design reduces the number of shift operations required and improves energy efficiency of DWM cache.
DWM-Based Scratchpad Memory
Mao et al. [2015] present three data allocation strategies for reducing shift overhead in DWM-based scratchpad memory. The first strategy, named "first come first store," stores data in the order they appear in the trace. However, this strategy is inefficient for loop accesses. The second strategy, named "most access in middle," places data in decreasing order of frequency from the middle of DWM to both ends. The third strategy, named "most access first," places data in decreasing order of frequency from one end to another. A common limitation of these three strategies is that once the working set changes, the allocated pattern no longer remains optimal. Hence, they propose using a genetic algorithm to perform data allocation in DWM. They encode different data positions as different genes. The fitness of a gene is determined by its shift overhead, and the best genes are selected on that basis. They also apply crossover and mutation to generate new genes. After a sufficient number of iterations, the algorithm terminates with the best data allocation pattern (gene). They also use the program access pattern to guide the crossover operation to improve solution quality and reduce execution time. Their evaluations show that the first strategy has lower shift overhead than the last two, which have nearly equal overhead. Also, the shift overhead in the genetic algorithm is nearly half of that of the last two strategies and is close to that of an optimal solution found by exhaustive search.
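The three static allocation strategies of Mao et al. [2015] can be compared on a toy trace with the sketch below. It is not the authors' code: the function names are hypothetical, the model assumes a single access port so that the cost of an access is the distance between consecutive positions, and ties in frequency are broken by first appearance.

```python
from collections import Counter


def first_come_first_store(trace):
    """Place items in the order of their first appearance in the trace."""
    order = list(dict.fromkeys(trace))
    return {item: pos for pos, item in enumerate(order)}


def most_access_first(trace):
    """Place items from one end in decreasing order of access frequency."""
    freq = Counter(trace)
    order = sorted(freq, key=lambda item: -freq[item])
    return {item: pos for pos, item in enumerate(order)}


def most_access_in_middle(trace):
    """Place the hottest item in the middle and alternate toward both ends."""
    freq = Counter(trace)
    order = sorted(freq, key=lambda item: -freq[item])
    left = (len(order) - 1) // 2
    right = left + 1
    placement = {}
    for rank, item in enumerate(order):
        if rank % 2 == 0:
            placement[item] = left
            left -= 1
        else:
            placement[item] = right
            right += 1
    return placement


def shift_cost(trace, placement):
    """Total shift distance; the head is assumed to start at the first item."""
    pos = placement[trace[0]]
    cost = 0
    for item in trace:
        cost += abs(placement[item] - pos)
        pos = placement[item]
    return cost
```

A genetic algorithm such as the one described above would use `shift_cost` (or a similar measure) as its fitness function and search over placements instead of fixing one by rule.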
DWM-Based Storage
Park et al. [2014] present an approach for accelerating graph processing by using DWM and pointer-assisted graph representation. DWM provides byte addressability and has low read/write latency, which boosts the performance of graph algorithms with a large amount of random traffic. Also, high capacity is achieved by virtue of high density, and high bandwidth can be achieved by accessing multiple racetracks in parallel. Byte addressability and low latency of DWM allow modeling the graph using pointers and, using this, only the desired data can be accessed from the storage without requiring data rearrangement, sorting, and search steps, as required in conventional graph processing. They show that, compared to a graph processing approach based on NAND Flash-based solid state drive (SSD), their technique improves performance due to low-latency DWM access. Both SSD and DWM consume similar energy, since shift operations account for a large portion of DWM energy in algorithms with random accesses. However, due to lower execution time with DWM, total energy (including CPU, system, and SSD or DWM storage) is smaller on using DWM storage. With DWM storage, they also use a controller that has an access controller, input/output (I/O) controller, and in-storage computing module. For simple kernels, such as pagerank, computation can be performed by the in-storage computing module itself. They show that utilizing this in-storage capability provides additional energy and performance gains.
USING DWM IN GPUS AND APPLICATION-SPECIFIC PROCESSORS
In recent years, the extreme performance demands placed on GPUs have shaped their design to be optimized for higher performance. This, however, has come at the cost of a sharp increase in their power consumption [Mittal and Vetter 2015b] . To tackle this issue, several researchers have proposed the use of DWM for designing various processor components in GPUs, such as GPU cache, register file, and shared memory. We now review some of these techniques.
Using DWM in GPU Cache
As the scope of GPUs expands from graphics applications to general-purpose applications, the size of LLC on GPUs is on the rise [Mittal 2014a]. For example, the GF100 GPU (Fermi architecture, compute capability 2.0) had 768KB LLC, whereas the GK210 GPU (Kepler architecture, compute capability 3.7) has 1536KB LLC [NVIDIA 2014]. Clearly, DWM is suitable for designing GPU caches by virtue of its non-volatility and high density. use DWM for designing the entire cache hierarchy in GPU. They note that storing larger numbers of bits in each DWM tape increases density at the cost of increased latency. Hence, they use 1bitDWM for designing L1 data cache, instruction cache, shared memory, texture, and constant cache. L2 cache is designed using a multibitDWM (e.g., 32 bit) data array and a 1bitDWM tag array to reduce tag lookup latency. To reduce shift latency in the data array, they propose architectural management policies. First, the bits of every cache block are mapped to a group of tapes in a bit-interleaved manner, and, thus, all of them can be accessed in parallel. They note that eager (also called "restored head") and lazy (also called "leave-in-place") head management policies [Venkatesan et al. 2016] do not work well for GPUs due to interleaving of cache requests coming from multiple warps [Mittal 2014a]. Instead, they use a stride-based address predictor with a confidence mechanism that leverages intra-warp locality to predict the next block access and uses this to pre-shift the track head. However, since different warps run on a streaming multiprocessor (SM) in an interleaved manner, their cache requests are also interleaved. This interferes with the prediction mechanism and leads to large shift penalties in several cases. To address this issue, they use a fully associative buffer that is designed with 1bitDWM and shared by all SMs. L2 cache blocks that are accessed with large latency and are expected to see reuse are selectively promoted to this buffer to allow accessing them with low latency. They show that their techniques provide higher performance and energy saving than using SRAM, STT-RAM, or drop-in replacement of DWM.
Atoofian and Saghir [2015] propose use of address predictors for hiding shift latency overhead in DWM L2 cache in GPU. They use three address predictors: a stride predictor (which detects the difference between current and previous addresses), a context predictor (which detects patterns in the address stream), and a hybrid predictor (which uses the context predictor if its confidence counter is high and otherwise uses the stride predictor). When a warp becomes ready for execution, the corresponding PC and warp index (ID) are sent to the predictor. Based on it, the predictor estimates the memory cell to be shifted in L2 cache and, while the instruction goes through pipeline stages, the track head is shifted. They show that by virtue of combining the best of stride and context predictors, the hybrid predictor achieves the highest performance and brings the performance of DWM cache very close to that of SRAM cache.
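The selection rule of a hybrid address predictor of the kind described above can be sketched as follows. This is a minimal model, not the design of Atoofian and Saghir [2015]: the class names, the 2-entry context history, the saturating counter range (0 to 3), and the confidence threshold are all assumptions, and indexing by PC and warp ID is left out.

```python
class StridePredictor:
    """Predicts last address plus the most recently observed stride."""

    def __init__(self):
        self.last = None
        self.stride = 0

    def predict(self):
        return None if self.last is None else self.last + self.stride

    def update(self, addr):
        if self.last is not None:
            self.stride = addr - self.last
        self.last = addr


class ContextPredictor:
    """Remembers which address followed the last `depth` addresses."""

    def __init__(self, depth=2):
        self.depth = depth
        self.history = ()
        self.table = {}       # history tuple -> address seen next
        self.counters = {}    # history tuple -> saturating confidence

    def predict(self):
        return self.table.get(self.history)

    def confidence(self):
        return self.counters.get(self.history, 0)

    def update(self, addr):
        if len(self.history) == self.depth:
            if self.table.get(self.history) == addr:
                self.counters[self.history] = min(self.confidence() + 1, 3)
            else:
                self.table[self.history] = addr
                self.counters[self.history] = 0
        self.history = (self.history + (addr,))[-self.depth:]


class HybridPredictor:
    """Uses the context predictor when confident, else the stride predictor."""

    def __init__(self, threshold=2):
        self.stride = StridePredictor()
        self.context = ContextPredictor()
        self.threshold = threshold

    def predict(self):
        if self.context.confidence() >= self.threshold:
            return self.context.predict()
        return self.stride.predict()

    def update(self, addr):
        self.stride.update(addr)
        self.context.update(addr)
```

In a cache controller, the predicted address would be translated to a domain offset and used to start shifting the track head before the request arrives.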
Using DWM in GPU Register File
The size of GPU RF is much larger than that of CPU RF [Mittal 2016a ]. For example, NVIDIA's GK210 GPU has a total RF size of 7680KB [NVIDIA 2014] . By comparison, Intel's 32nm Itanium 9560 processor has 22KB integer RF and 20KB floating point RF [Intel 2012 ]. Thus, the high capacity requirement of GPU RF makes high-density memory technologies such as DWM attractive options for designing GPU RF [Mittal 2016a ]. We now discuss techniques for using DWM in GPU RF. These techniques seek to reduce shifting overhead by using pre-shifting [Atoofian 2015; Moeng et al. 2015; Liang and Wang 2016] , intelligent register mapping [Mao et al. 2014; Liang and Wang 2016] , warp scheduling [Mao et al. 2014] , regulating TLP [Liang and Wang 2016] , using a write buffer [Mao et al. 2014; Moeng et al. 2015] and using a suitable shifting granularity [Moeng et al. 2015] . Mao et al. [2014] present three techniques to mitigate shift overhead in DWM-based GPU RFs. While the conventional mapping scheme may map the registers of every warp across the RF banks in a consecutive manner, their register remapping technique maps the registers around the access ports of DWM banks to reduce the shift distance. Their warp scheduling technique preferentially issues a warp that generates RF requests with smallest distance to the present position of access ports. Since write requests interfere in the scheduling technique, they also use a write buffer that stores the write requests. These requests are issued to RF only when the corresponding register is aligned with the access ports. The buffer may occasionally overflow and, in such a case, writes are issued to RF regardless of the alignment. Use of write-buffer helps in removing the impact of write operations in scheduling and also reducing writes to RF. With this support, only read requests need to be accounted for when computing the distance to access ports. They show that their technique boosts energy efficiency and performance.
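The shift-aware warp scheduling and write buffering of Mao et al. [2014] can be illustrated with the sketch below. It is a simplified model under stated assumptions: the names `pick_warp` and `WriteBuffer` are hypothetical, each ready warp is assumed to request a single register offset, and on overflow the oldest buffered write is the one forced out.

```python
def pick_warp(ready_warps, port_pos):
    """Choose the ready warp whose register is closest to the access port.

    ready_warps maps warp id -> track offset of the register it will read.
    """
    return min(ready_warps, key=lambda w: abs(ready_warps[w] - port_pos))


class WriteBuffer:
    """Holds writes until their register is aligned with the access port."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = []  # list of (offset, value) in arrival order

    def add(self, offset, value):
        """Buffer a write; return any write forced to the RF by overflow."""
        self.pending.append((offset, value))
        if len(self.pending) > self.capacity:
            return [self.pending.pop(0)]  # issued regardless of alignment
        return []

    def drain_aligned(self, port_pos):
        """Remove and return the writes that are aligned with the port."""
        aligned = [w for w in self.pending if w[0] == port_pos]
        self.pending = [w for w in self.pending if w[0] != port_pos]
        return aligned
```

Because writes wait in the buffer, `pick_warp` only needs to consider the read offsets when computing the distance to the access port, as noted in the text.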
Atoofian [2015] presents pre-shifting schemes to hide shifting delay in DWM-based GPU RF. To allow concurrent access to all bits of the register, he uses a bit-interleaved design where P 4B registers are stored in 32 P-bit tracks instead of storing each 4B register in a single 32-bit track. For predicting future register access from past behavior, it is observed that, except in the case of branch divergence, the register IDs of source and destination operands accessed by instructions in a GPU program are identical since the threads in a kernel execute the same set of instructions. To pre-shift the track head, three schemes are used, which leverage data locality at the intra-warp level (i.e., between threads in a warp), intra-SM level (i.e., between warps in an SM), and inter-SM level (i.e., between warps in different SMs), respectively. The length of RF tracks is suitably selected to completely hide the shift delay. The experiments reveal that intra-SM, inter-SM, and intra-warp schemes provide performance in decreasing order. Also, compared to SRAM-based RF, the intra-SM scheme saves large amounts of energy.
Moeng et al. [2015] present techniques for designing GPU RF using memory technologies with non-uniform access behavior, viz. DWM and S/D memory. P registers corresponding to P register-contexts are stored in each memory element. They form a multicontext group (MCG) since concurrent access to them cannot take place. When a register-context is not at the track head, the corresponding instruction cannot be committed. To avoid such issues, they employ a write buffer and allow preemption of other commands by a write command. In case of buffer overflow, at least one write needs to finish before further progress can be made. This mechanism avoids write-related hazards.
To reduce shifting overhead, they consider different shifting granularities: (1) the whole RF shifts synchronously, (2) each register in a bank shifts simultaneously but different banks shift independently, and (3) only the MCG corresponding to a selected register is shifted. The third policy incurs lowest shifting overhead and does not require preempting reads to other MCGs on a write access. It also exploits temporal locality of register accesses since, in this policy, an MCG goes out of existing context only if one of the other registers in the bank is accessed before next access to it. They show that the third policy outperforms the second policy. Also, using their technique, the performance with DWM is better than that with S/D memory and is close to that with SRAM.
Moeng et al. [2015] also propose a technique to hide shift delay in DWM RF. They note that instructions stay in operand collector until RF reads are completed. Their technique issues a pre-shift command to the MCG with the requested register when an instruction enters operand collector. If no access is currently being served, then MCG starts pre-shifting to the requested register-context, which hides the shifting latency. They also study exploiting high density of DWM to double the capacity of RF, shared memory, and L1 cache. This improves the performance compared to the SRAM design even further.
Liang and Wang [2016] propose three techniques for improving performance of DWM-based GPU RF that work at the application, compiler, and architecture levels, respectively. They note that high TLP enabled by DWM-based RF causes cache contention, and, hence, high TLP does not translate into high performance. In fact, high TLP leads to more thread blocks in GPU, which may cause large shift distances, resulting in higher access latencies and lower performance. Hence, their application-level technique regulates TLP based on L1 data cache contention and shift overhead.
Cache contention is modeled by profiling application IPC under different TLP values, assuming no shift-overhead. Also, the shift overhead is modeled as a reduction in IPC on going from an ideal DWM (i.e., no shift overhead and no cache misses/contention) to a realistic DWM. Based on these, application performance is modeled as IPC under cache contention minus IPC reduction due to shift overhead.
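The performance model described above amounts to a subtraction followed by a search over TLP levels. The sketch below shows one reading of it; the function name and the dictionary-based interface are hypothetical, and any numbers passed in would come from profiling and are not taken from the paper.

```python
def best_tlp(ipc_contention, ipc_ideal, ipc_realistic):
    """Pick the TLP level with the highest modeled IPC.

    ipc_contention[t]: profiled IPC at TLP t, assuming no shift overhead.
    ipc_ideal[t]:      IPC with an ideal DWM (no shifts, no contention).
    ipc_realistic[t]:  IPC with a realistic DWM.
    Modeled IPC = ipc_contention[t] - (ipc_ideal[t] - ipc_realistic[t]).
    """

    def modeled(tlp):
        shift_loss = ipc_ideal[tlp] - ipc_realistic[tlp]
        return ipc_contention[tlp] - shift_loss

    return max(ipc_contention, key=modeled)
```

With such a model, a TLP level that looks best under cache contention alone can lose to a lower level once the shift penalty is subtracted.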
Their compiler-level technique analyzes the register access trace and aims to map registers that are used frequently together in RF for minimizing shift overhead. Different banks are modeled independently since their access traces are disjoint. This technique works in two steps. In the first step, registers are partitioned into groups where the size of each group is equal to the number of ports. In the second step, the offset of each register is computed to optimize register arrangement within a bank.
Their architecture-level technique seeks to reduce shift overhead by a pre-shifting mechanism. They note that in GPUs, only a few RF banks can be accessed at a time. These banks are referred to as busy banks, and remaining banks are referred to as idle banks. The requests to be served are stored in a queue, and, by seeing the top request in this queue, the next register to be accessed can be accurately ascertained. Based on this, shifting can be initiated in idle banks, which allows hiding their shifting latency. They show that their techniques improve GPU performance significantly.
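The queue-driven pre-shifting of idle banks can be sketched as below. This is an illustration only: the function name and data layout are hypothetical, and the sketch goes slightly beyond the text by looking at the earliest queued request for every bank, not only the request at the top of the queue.

```python
def preshift_idle_banks(queue, busy_banks, head_pos):
    """Align idle banks with the next register they will be asked for.

    queue:      list of (bank, offset) requests in service order.
    busy_banks: set of banks currently being accessed (left untouched).
    head_pos:   dict bank -> current head offset; a new dict is returned.
    """
    new_pos = dict(head_pos)
    seen = set()
    for bank, offset in queue:
        if bank in seen:
            continue          # only the earliest request per bank matters
        seen.add(bank)
        if bank not in busy_banks:
            new_pos[bank] = offset  # shifting overlaps with other banks' work
    return new_pos
```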
Using DWM in Application-Specific Processors
Different memory structures in application-specific processors show different access properties and capacity requirements. We now discuss some works that study these properties to identify the most suitable memory technologies for designing those memory structures. evaluate use of DWM and STT-RAM in domain-specific recognition and mining (RM) processors. In the RM processor, FIFOs constitute the first memory level (64KB size), where streaming read operations happen more frequently than write operations. Hence, DWM is used for designing this memory since DWM incurs higher latency on random accesses but is efficient for serial accesses. The second-level memory (2MB size) sees many random accesses and the number of read operations is significantly larger than the number of write operations. Hence, STT-RAM is used for designing this memory, which reduces leakage power consumption and does not suffer from write overhead due to infrequent write operations. Thus, they match the memory characteristics with architectural requirements of the processor. The area saved due to use of DWM/STT-RAM is utilized for increasing the number of processing elements and the size of FIFO. They show that such an iso-area redesign improves both performance and energy efficiency compared to a baseline CMOS-only design and a drop-in replacement of DWM/STT-RAM.
Chung et al. [2016] note that many DSP programs are characterized by sequential memory access, and, hence, flip-flop-based shift registers and SRAM-based embedded memory consume a large fraction of power and area overheads for these programs. They propose DWM-based embedded memories for designing DSP components, such as FIFO RFs in the FFT processor and bitonic sorter, an input register of a distributed arithmetic-based finite impulse response (FIR) filter, and survivor-path memories and LIFO in the Viterbi decoder. Since these applications only perform sequential memory access, they use only a few read/write heads to reduce DWM area. This is the primary contributor to DWM efficiency for DSP programs compared to SRAM and STT-RAM, which allow random access. Also, to reduce DWM write energy, they provision low-power shift-based write and hide its additional latency using prudent access scheduling and memory architecture. They note that memories with single-input single-output and single-input parallel-output characteristics are better suited for DWM than parallel-input single-output since the latter would require many write heads. They show that the DWM-based design consumes much lower area and power compared to SRAM- and STT-RAM-based designs.
TECHNIQUES FOR IMPROVING PERFORMANCE AND ENERGY EFFICIENCY
We now discuss several techniques for improving the efficiency of DWM-based components.
Using Cache Reconfiguration
Sun et al. [2014] note that organization of the access port has a large effect on the shift operations required in each access. Since an R/W port needs to provide sufficient write current, its size must be large. Hence, for area efficiency, this port needs to be shared between several magnetic domains, which leads to large shift overhead. By comparison, a read-only port is smaller in size, and, hence, many R-ports can be used in a racetrack, which reduces shift overhead for each read access. Using these ports, they design hybrid-port and uniform-port arrays, where the former have both R ports and R/W ports and the latter have only R/W ports. They note that R/W characteristics of different LLC blocks differ considerably. Instruction blocks see a large number of read accesses, whereas data blocks see a nearly balanced number of R/W accesses. Hence, they partition the LLC (4MB) into an instruction region (0.5MB) designed with hybrid-port arrays and a data region (3.5MB) designed with uniform-port arrays.
They further note that different applications have different cache requirements. They organize the bits for each nanowire into different ways of a cache set, and for applications with low cache demand, they use only selected cache ways that are close to access ports. This way-based cache reconfiguration is illustrated in Figure 5, and it helps in reducing the shift overhead. This reconfiguration approach is used only in the data region to reduce its shift overhead since this overhead in the instruction region is already small. The associativity of each set can be independently controlled. The decision for reconfiguration is taken dynamically such that cache capacity is reduced when the miss rate falls below a threshold and vice versa. They show that their techniques improve performance and save energy by virtue of reducing shift operations.
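The dynamic reconfiguration decision can be expressed as a small controller. The sketch below is an assumption-laden illustration, not the authors' mechanism: the function name, the one-way step size, and the use of two thresholds (`low` and `high`) are hypothetical; setting `low == high` recovers the single-threshold rule stated in the text. Ways are assumed to be numbered by increasing distance from the access port, so the active ways are always the nearest ones.

```python
def adjust_active_ways(active_ways, miss_rate, low, high,
                       min_ways=1, max_ways=16):
    """Return the number of ways to keep enabled in the next interval.

    Fewer active ways means only domains near the access port are used,
    which lowers the shift overhead at the cost of capacity.
    """
    if miss_rate < low and active_ways > min_ways:
        return active_ways - 1   # low demand: shrink toward the port
    if miss_rate > high and active_ways < max_ways:
        return active_ways + 1   # high demand: enable a farther way
    return active_ways
```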
S. Mittal
Fig. 6. Set-based cache reconfiguration for reducing DWM shift overhead [Ranjan et al. 2015].
Ranjan et al. [2015] note that DWM facilitates trading off density with access speed by controlling the number of domains in every tape that are actually used for data storage. They present a set-based cache reconfiguration technique that initially packs the maximum number of bits per tape and then dynamically adapts DWM cache size and shift latency depending on the cache access pattern of the application. This set-based cache reconfiguration is illustrated in Figure 6. The decision to perform cache reconfiguration is taken based on four factors, viz., miss latency, shift latency, miss rate, and number of shift operations. For example, increasing cache size generally increases shift penalty but reduces miss penalty, and, thus, reconfiguration is done based on the tradeoff between these factors. After a reconfiguration, the blocks whose mapping has changed are migrated to their new location over a migration time window. At the end of the migration time window, blocks that are still residing in old and incorrect locations are forcibly migrated to the correct new location (on size increase) or are flushed to main memory if dirty (on size reduction). Thus, they use a lazy approach instead of flushing/migrating all incorrectly mapped blocks at the time of reconfiguration. When the effective cache capacity is smaller than the maximum capacity, their technique uses the idle space as a victim cache to reduce the accesses to main memory. They show that their technique improves performance compared to an iso-area SRAM and a DWM cache without reconfiguration.
Using Cache Compression
present a technique to reduce data access and shifting costs in DWM cache. To reduce the number of bits stored in each block, they use lightweight compression, which compresses data with all-zero, repeated bytes and low dynamic range. Compression reduces the number of shift operations required, and, instead of shifting at cache line (512 bit) granularity, they propose shifting at smaller (16-bit) granularity. Hence, accessing a compressed block of, say, 4B (32-bit) size requires only 32 shifts and not 512 shifts. With bit-interleaved organization [Venkatesan et al. 2016], accesses to different cache lines in the same DWM sub-array lead to contention. Their technique does not use the space saved by compression for packing more cache lines in a block, and this allows their technique to use a skewed alignment where the cache lines need not be aligned with the beginning of the storage cell. Thus, a non-overlapping cell alignment can be used to reduce sub-array contention. They show that, at iso-capacity, their technique provides better energy efficiency and performance than both SRAM and STT-RAM caches.
Using Data Migration
propose a DWM-based cache design and management policies. Due to shift overhead of the DWM and small area contribution of tag array, it is designed with STT-RAM, which allows random access. They also propose mapping the same bits from all the ways of a set into one array, and, thus, accessing data of ways located right at R/W ports does not require shifting. They further propose a data management scheme that identifies the frequently accessed cache blocks and places them into physical locations at the R/W port. For most applications, only a few blocks are frequently accessed, and, hence, this scheme reduces shifting overhead. They also compare 1F² and 4F² cell size designs and show that the former achieves better performance and energy efficiency due to small access latency/energy.
Heterogeneous Cell Cache Design
Motaman et al. [2015] note that domain-wall motion depends on the shift current, such that higher current increases the velocity while incurring higher power consumption. A similar tradeoff also exists among write current, latency, and power. They propose a DWM cache design and its management technique, which exploit these tradeoffs to balance performance and energy optimization. Their DWM L2 cache uses three types of shift and write operations. Fast, medium, and slow shift operations use shift currents of 25, 19, and 15 μA, respectively, have 1, 1.5, and 2ns latencies, respectively, and consume 16, 8, and 4mW, respectively. Similarly, fast, medium, and slow writes use write currents of 70, 50, and 40 μA, respectively, and have latencies of 4, 4.63, and 6.2ns, respectively. They record L2 accesses in a given time period (e.g., 200K cycles) and use two thresholds (T1 and T2) to classify the accesses (R) in three intervals, that is, R < T1, T1 < R < T2, and R > T2. In these intervals, write and shift currents are set to low, medium, and high, respectively. They show that their technique provides higher performance and energy efficiency compared to SRAM, STT-RAM, and DWM without any optimization.
note that MTJ-based R/W in spintronic memories leads to larger write energy/latency than CMOS memories. Also, large write current necessitates large access transistors, which threatens reliability and reduces the density advantage of STT-RAM. Further, the tradeoff between requirements of read and write operations imposes strict constraints for STT-RAM design, which leads to access failure under process variation. Compared to MTJ-based write, domain-wall motion-based write offers better scalability and lower write latency/energy [Fukami et al. 2009]. They evaluate such DWM with shift-based write (termed as DWM-SW) bitcell and show that this bitcell is comparable to SRAM in write efficiency and maintains density and low leakage advantages of STT-RAM.
Also, DWM-SW uses decoupled R/W path that performs MTJ-based reads and shift-based writes. This provides scope for separately optimizing reads and writes to improve density and R/W stability and to use lower write current due to shift-based write. Also, it allows the use of voltage-mode sensing and smaller read voltage for reducing read latency and energy. They further propose multi-level DWM-SW (termed as ML-DWM-SW) bitcells, which can store two bits while having same area as DWM-SW. They propose using these DWM bitcells for designing all levels of cache in CPU and GPU and for both tag and data.
Fig. 7. Mapping of LSBs and MSBs to level-1 and level-2 (respectively) of ML-DWM-SW bitcells.
DWM-SW bitcells are comparable or superior to SRAM and STT-RAM on all important metrics, which allows the authors to use them as an alternative to SRAM in designing caches. ML-DWM-SW bitcells present challenges since they have data-dependent R/W energies and need a two-step write operation. For example, the read energy is highest and lowest when all the multi-level bitcells storing a cache block have their MTJs in the (P,P) and (AP,AP) orientations, respectively (P = parallel, AP = anti-parallel). To address these challenges, they propose two schemes: intraword interleaving and cache-access pattern guided bit-encoding. In intraword interleaving, multiple bits stored in the ML-DWM-SW bitcell belong to a single word and not different cache blocks or different words of a single block. This interleaving scheme lowers read energy by reducing the number of bitlines to be charged/discharged in each read access. Also, since the MSBs of a word are updated less frequently than LSBs, MSBs and LSBs are mapped to level-2 and level-1, respectively, which have higher and lower write energies, respectively. This is illustrated in Figure 7. Further, data comparison write is used to reduce the number of two-step writes and the number of writes to level-2. They note that reads constitute a larger fraction of energy in ML-DWM-SW caches, and, hence, they use a bit-encoding scheme that minimizes total read energy. On seeing the access patterns to L1 and L2 caches with SPEC2006 benchmarks, the 2-bit patterns, in decreasing order of frequency, were found to be 00, 01, 10, and 11. Since, in increasing order of energy, the configurations are (AP, AP), (AP, P), (P, AP), and (P,P), they encode 00 with (AP, AP), and so on, to minimize energy consumption. Their experiments show that for both CPU and GPU, their all-DWM cache design provides large reduction in area and energy compared to iso-capacity SRAM and STT-RAM caches.
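The access-pattern guided bit-encoding reduces to pairing the most frequent 2-bit pattern with the lowest-energy MTJ configuration, the next most frequent with the next lowest, and so on. A minimal sketch follows; the function name is hypothetical, and any counts or energy values supplied to it are placeholders chosen by the caller (only their ordering is taken from the text).

```python
def build_encoding(pattern_counts, read_energy):
    """Assign frequent 2-bit patterns to low-energy bitcell configurations.

    pattern_counts: dict pattern -> observed frequency.
    read_energy:    dict configuration -> read energy (any consistent unit).
    Returns a dict pattern -> configuration.
    """
    patterns = sorted(pattern_counts, key=pattern_counts.get, reverse=True)
    configs = sorted(read_energy, key=read_energy.get)
    return dict(zip(patterns, configs))
```

Pairing the sorted frequencies with energies sorted in the opposite order minimizes the frequency-weighted sum, which is why this greedy assignment minimizes expected read energy for a fixed pattern distribution.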
Combining L1 and L2 Caches
Xu et al. [2016] note that in a nanowire with M domains, the domain aligned to the access port does not require shifting and has fixed access latency. In their design, such domains constitute L1 cache, whereas remaining M − 1 domains constitute L2 cache where non-uniform latency due to shifting is acceptable. Thus, they use a single physical storage to combine two logical levels of cache hierarchy (L1 and L2), which saves area and energy. More importantly, their design naturally maintains inclusion and eviction properties. On an L1 miss, track is shifted to the L2 data location, and, thus, the new data get aligned with the port, which is equivalent to its promotion to L1. The data previously in L1 are evicted to L2, and, thus, their design naturally maintains inclusion (if the data were not the LRU, then LRU data are searched and evicted to L2). The tags of both L1 and L2 can be combined such that L2 tags are stored and the L1 tags can be obtained by combining L2 tags with racetrack shift position. Their design allows L1 and L2 associativities to be either the same or to differ. For the case when L1 and L2 block sizes differ, their design presents additional complexity. Also, a maximum of two racetracks may be involved in serving an L2 access, and, hence, if an L1 location is one of these, then their design does not allow simultaneous access to both L1 and L2. Further, on increasing the number of access ports, the density advantage reduces, and domains aligned with access ports need to be considered as L1, which reduces relative size of L2. However, the advantage of increasing the port count is that shift overhead reduces, and the chances of allowing simultaneous access to L1 and L2 increase. Their experiments show that, compared to an iso-capacity two-level (L1/L2) SRAM cache, their design reduces energy consumption significantly without harming performance. 
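The idea of treating the port-aligned domain as L1 and the remaining domains as L2 can be illustrated for a single track with one access port. The sketch below is a toy model with hypothetical names; it ignores tags, LRU handling, differing block sizes, and multiple ports, all of which the design of Xu et al. [2016] has to address.

```python
class UnifiedTrack:
    """One nanowire whose port-aligned domain acts as the L1 location."""

    def __init__(self, num_domains):
        self.num_domains = num_domains
        self.port_domain = 0  # index of the domain currently at the port

    def access(self, domain):
        """Return (level, shifts) for an access to `domain`.

        A domain already at the port is an L1 hit with zero shifts.
        Otherwise the track is shifted, which promotes the requested
        domain to L1 and implicitly demotes the previous one to L2.
        """
        if not 0 <= domain < self.num_domains:
            raise IndexError("domain out of range")
        shifts = abs(domain - self.port_domain)
        level = "L1" if shifts == 0 else "L2"
        self.port_domain = domain
        return level, shifts
```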
Implementing Both Memory and Logic with Domain Wall Devices
Wang et al. [2015] present a distributed in-memory computing technique where most logic functions execute within memory. This mitigates the bandwidth wall and the power wall. They implement both logic and memory using domain-wall nanowire devices. Compared to CMOS-transistor-based logic, domain-wall nanowires allow implementing more complex logic due to their lower area/power overheads. Also, based on the physics of spintronics of domain-wall nanowires and conventional logic synthesis techniques, logic operations such as XOR, addition, and multiplication can be realized. With these and domain-wall lookup tables, a wide range of instructions can be executed. They demonstrate their approach using a neural-network (NN)-based image resolution enhancement kernel and map all machine-learning operations in NN to domain-wall nanowires. Both NN training and processing happen within memory. They show that, compared to DRAM, DWM-based main memory brings significant reduction in leakage and dynamic power. Similarly, domain-wall logic reduces leakage and dynamic energy compared to CMOS-transistor-based logic without affecting performance.
TECHNIQUES FOR IMPROVING RELIABILITY
Aggressive feature size scaling has made reliability a key constraint in the design of modern processors [Mittal and Vetter 2016]. The reliability of DWM may be threatened by imperfect shift operations and process variation, among other factors. We now review techniques for addressing these reliability issues.
Fig. 9. Impact of process variation in DWM and its mitigation by shift current and write current boosting.
Addressing Positional Errors
Zhang et al. [2015b] identify two types of position errors in DWM (refer to Section 2.4 for a background on position errors). A "stop-in-middle" error refers to the event where a domain is not correctly aligned with the access port. In an "out-of-step" error, the domains are over- or under-shifted, leading to an incorrect domain being accessed. For example, a ±K out-of-step error implies that domains are over/under-shifted by K steps. These two errors are illustrated in Figure 8. To alleviate stop-in-middle errors, they perform a shift in two phases. In the first phase, a pulse of high driving current density is applied, and, in the second phase, a sub-threshold shift is made by applying an extra pulse of driving current. This shifting approach, however, increases out-of-step errors.
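As a rough illustration of the two error types defined above, the following Python sketch classifies the outcome of a single shift. It is not taken from Zhang et al. [2015b]: the function names, the use of fractional positions to represent a misaligned domain, and the sign convention (positive K for over-shift) are assumptions made only for this example.

```python
def classify_shift(intended_steps, actual_steps, tol=1e-9):
    """Classify one shift outcome (hypothetical helper, not from the cited work).

    `intended_steps` is an integer number of domain widths; `actual_steps` is
    the distance actually travelled. A fractional `actual_steps` models a
    domain that stopped between two aligned positions.
    """
    nearest = round(actual_steps)
    if abs(actual_steps - nearest) > tol:
        return "stop-in-middle"
    k = nearest - intended_steps  # assumed convention: k > 0 means over-shift
    if k == 0:
        return "correct"
    return f"out-of-step {k:+d}"


def bit_under_port(stripe, intended_index, k):
    """Return the bit actually read when the stripe is off by k whole steps.

    Assumes an over-shift by k exposes the domain k positions past the target.
    """
    return stripe[intended_index + k]


stripe = [1, 0, 0, 1, 0, 0, 1, 0]   # example data, chosen arbitrarily
print(classify_shift(3, 3.0))       # aligned on the intended domain
print(classify_shift(3, 3.5))       # halted between two domains
print(classify_shift(3, 4.0))       # one step too far
print(bit_under_port(stripe, 3, 0), bit_under_port(stripe, 3, 1))
```

In this toy model, a stop-in-middle outcome leaves no well-defined domain under the port, whereas an out-of-step outcome silently returns a neighboring domain's bit, which is why the latter calls for a separate detection mechanism.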
To address out-of-step errors, Zhang et al. [2015b] propose position-ECC (p-ECC). For example, a single error correction, double error detection (SECDED) p-ECC adds guard domains on both sides of the DWM stripe. When the DWM is read, the bits stored in these guard domains are also read, and, based on them, a +1 or −1 error can be detected. Such errors can be corrected by shifting domains one step backward or forward, respectively. Although SECDED p-ECC can detect a +2 error, it cannot correct it, since a +2 error cannot be distinguished from a −2 error. To address this, ECCs with stronger protection can be designed by extending the above-mentioned ideas. They also present strategies to reduce the area overhead of p-ECC by utilizing the overhead region in the DWM stripe. They show that their technique improves MTTF from a few microseconds to 69 years, with negligible performance loss.
Addressing Process Variation
Another work presents techniques to mitigate the impact of process variation in DWM cache. Since process variation degrades the R/W latency of DWM, the authors identify slow read and write bits at the test phase. They note that, in the worst case, the access latency is the sum of the R/W latency and the shift latency. Also, domain-wall shift latency and write latency can be reduced by boosting the shift current and write current, respectively. They use shift boosting to fix bits with worst-case read latency and both shift and write boosting to fix bits with worst-case write latency. This is illustrated in Figure 9. The variation in read latency is less severe than that in write latency, and, hence, shift boosting alone is sufficient to address read latency degradation. Further, since both shift and write boosting consume power, they are used only for worst-case bits to improve performance with a minimal increase in power consumption. They show that their technique saves dynamic energy compared to boosting all bitcells and improves performance compared to a cache without any policy for managing process variation.
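The selective boosting policy described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the `BitTiming` record, the `plan_boosting` function, the latency budgets, and all numeric values are hypothetical and do not come from the cited work. The sketch uses the worst-case model from the text, in which access latency is the sum of the shift latency and the read (or write) latency.

```python
from dataclasses import dataclass


@dataclass
class BitTiming:
    """Per-bit latencies measured at test time (hypothetical record, in ns)."""
    shift_ns: float
    read_ns: float
    write_ns: float


def plan_boosting(bits, read_budget_ns, write_budget_ns):
    """Decide, per bit, which current boosts to enable.

    Slow-read bits get shift boosting only; slow-write bits get both shift
    and write boosting. Bits that meet both budgets are left unboosted.
    """
    plan = []
    for b in bits:
        slow_read = b.shift_ns + b.read_ns > read_budget_ns
        slow_write = b.shift_ns + b.write_ns > write_budget_ns
        plan.append({
            "shift_boost": slow_read or slow_write,
            "write_boost": slow_write,
        })
    return plan


# Arbitrary example values: one nominal bit, one slow-read bit, one slow-write bit.
bits = [
    BitTiming(1.0, 1.0, 1.5),
    BitTiming(1.6, 1.0, 1.3),
    BitTiming(1.0, 1.0, 2.6),
]
plan = plan_boosting(bits, read_budget_ns=2.5, write_budget_ns=3.0)
boosted = sum(p["shift_boost"] or p["write_boost"] for p in plan)
print(plan)
print(f"{boosted} of {len(bits)} bits boosted")
```

Because boosting is applied per bit rather than uniformly, the extra power is spent only where the latency budget would otherwise be violated, which mirrors the energy argument made in the text.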
CONCLUSION AND FUTURE OUTLOOK
With its potential to provide an order-of-magnitude advantage in density compared to existing memory technologies, DWM holds the promise of being the universal memory for next-generation systems. In this article, we presented a survey of techniques for using DWM for designing processor components. We discussed the use of DWM in different processor components of CPU and GPU and also mentioned integration and comparative evaluation of DWM with other memory technologies. We classified the techniques based on several attributes to bring out their similarities and distinctions. We conclude this article with a brief mention of future prospects for DWM.
Architecture-level studies using real systems and/or simulators are crucial for evaluating novel ideas proposed for DWM-based components. Due to the emerging nature of DWM, its actual prototypes may not be economically viable or widely available. In the absence of open-source tools for modeling DWM, researchers obtain DWM parameters from in-house tools or by extrapolating from other works; this approach, however, inhibits reproducibility and is likely to provide inaccurate estimates. Clearly, the availability of open-source tools for modeling DWM and progress in the manufacturing and commercial feasibility of DWM will allow full design space exploration of DWM as well as its integration into product systems.
Exascale computing aims to perform 10^18 computations/second while spending no more than 20MW of power. While traditional memory technologies clearly fall short of the performance and energy efficiency targets of these systems, DWM, as it stands now, is also unlikely to meet these goals. Evidently, synergistic management of DWM across the system stack will be vital for architecting DWM towards realizing these goals. At the device level, exploring novel bitcell designs and array organizations will be essential to match the characteristics of the processor component and meet higher-level optimization goals, such as performance, energy, or area. At the architecture level, write buffers, cache prefetching, out-of-order execution, and so on, can be used to hide the shifting latency. At the application level, language-level annotations can be used to mark non-critical data in error-resilient applications and, using this, partial reads/writes (i.e., approximate storage) can be performed to reduce shifting operations [Mittal 2016c]. Also, using OS and compiler techniques, the memory access pattern of applications can be altered to better suit the requirements of DWM, for example, by favoring sequential memory access.
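To see why a sequential access pattern suits DWM, consider the following minimal shift-count sketch. It assumes a deliberately simplified model that is not drawn from any of the surveyed works: a single stripe with one access port, where moving from domain j to domain i costs |i − j| shift operations. The function name `total_shifts` and the access sequences are hypothetical.

```python
def total_shifts(accesses, start=0):
    """Count shift operations for a sequence of domain indices.

    Simplified model (an assumption for illustration): one stripe, one access
    port, and a cost of |i - j| shifts to move from domain j to domain i.
    """
    position, total = start, 0
    for index in accesses:
        total += abs(index - position)
        position = index
    return total


sequential = list(range(8))             # 0, 1, 2, ..., 7
scattered = [0, 7, 1, 6, 2, 5, 3, 4]    # same domains, alternating ends

print(total_shifts(sequential))  # 7 shifts
print(total_shifts(scattered))   # 28 shifts
```

Both sequences touch the same eight domains, but in this model the scattered order needs four times as many shifts, which is the kind of overhead that OS- or compiler-level reordering of accesses aims to avoid.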
