Abstract: Extreme multi-threading and fast thread switching in modern GPGPUs require a large, power-hungry register file (RF), which quickly becomes one of the major obstacles on the upscaling path of energy-efficient GPGPU computing. In this work, we propose to implement a power-efficient GPGPU RF built on the newly emerged racetrack memory. Racetrack memory has small cell area, low dynamic power, and nonvolatility. Its unique access mechanism, however, results in a long and location-dependent access latency, which offsets the energy saving benefit it introduces and may harm performance. To overcome the adverse impacts of racetrack memory based RF designs, we first propose a register mapping scheme to reduce the average access latency. Based on the register mapping, we develop a racetrack memory aware warp scheduling (RMWS) algorithm to further suppress the access latency. The RMWS design includes a new write buffer structure that improves the scheduling efficiency as well as the energy saving. We also investigate and optimize the design where multiple concurrent RMWS schedulers are employed. Experimental results show that our proposed techniques can keep GPGPU performance similar to the baseline with SRAM based RF while the RF energy is reduced by 48.5 percent.

INTRODUCTION
The general purpose graphics processing unit (GPGPU) has become an important technology to accelerate general purpose applications which exhibit considerable data parallelism. Modern GPGPUs can support tens of thousands of concurrent hardware threads executed in a single instruction multiple threads (SIMT) manner. The extreme multithreading demands a high memory bandwidth, which exerts heavy pressure on memory subsystems. Therefore, GPGPUs frequently switch threads to mitigate the negative impact of memory bandwidth contention and hide the long memory access latency. To support such fast thread switching, a large register file (RF) is often employed: for example, each stream multiprocessor (SM) in NVIDIA GTX 680 [1] has a 256 KB RF, which is much larger than the L1 cache or shared memory. Traditionally, GPGPU RF is built with SRAM technology and can consume up to 16 percent of the power of the whole GPGPU system. It is even comparable to the power consumption of DRAM or the execution units, as shown in Fig. 1.
As technology scales, the high leakage power and the large parametric variability of large on-chip SRAM based memory elements have become a major obstacle impeding GPGPU power efficiency [4] . Substituting traditional memory with emerging memories [5] , [6] appears as a promising approach to combat these scaling challenges in GPGPU designs, e.g., using spin-transfer-torque random access memory (STT-RAM) to implement GPGPU RFs. Compared to their SRAM counterpart, these emerging memories generally offer smaller cell area (or larger capacity with the same footprint), better scalability, and nonvolatility.
Intuitively, the access frequency of RF is much higher than that of other memory components, e.g., on-chip cache, because nearly every instruction must access more than one register. Hence, dynamic energy is expected to contribute a larger portion of the total energy consumption of RF. We profile prevalent GPGPU benchmark suites [2], [7], [8], [9] and observe that the access frequency of RF is nearly 10× to 30× higher than that of L1 cache. The right side of Fig. 1 shows the power breakdown of RF. Here the RF is built on SRAM and the bank-level parameters of the RF are extracted from CACTI [10] enhanced with a GPGPU RF model. The emerging memory candidates, however, all suffer from high write energy overhead and slow write access time. Directly replacing traditional memory cells in GPGPU RF with emerging memory cells may offset the benefits of high density and low leakage power offered by emerging memory technologies.
In this work, we propose to implement GPGPU RF with the emerging racetrack memory (RM) [11]. RM has extremely small cell area [11], [12], low read and write energy, and nonvolatility. Unlike random access memory, however, the access of RM is location-dependent. The sequential access mechanism of RM is similar to that of a tape, and only a limited number of access ports are provided: whenever a particular address is sent to RM, the target memory cells have to be shifted to a nearby access port where their content can be read out. The access latency becomes nonuniform and is prolonged due to the shifting. Hence, we introduce a series of techniques to alleviate the adverse impact of the sequential access of the RM based RF on GPGPU performance:
A register remapping scheme is proposed to reorganize the logical-physical register mapping and reduce the average shift distance in RM accesses. A RM aware warp scheduling (RMWS) algorithm is developed to hide the shift delay by dynamically prioritizing the ready warps based on the current access port location on the RM tracks. To improve the scheduling efficacy of RMWS, a write buffer structure is introduced. The write buffer also saves the energy of the RM based RF by eliminating unnecessary RF accesses. We also investigate the schedule hazard among multiple RMWS schedulers and propose a novel warp-register remapping technique to remove the hazard. Our simulation results show that, compared to a SRAM based GPGPU RF design with a normal warp scheduler, the RM based RF reduces the energy consumption of the RF by 48.5 percent while the system performance is kept at a similar level.
The rest of this paper is organized as follows: Section 2 briefly introduces the background of GPGPU architecture and RM; Section 3 presents an array-level evaluation of RM based RF; Section 4 describes details of the proposed architecture optimizations for RM based RF; Section 5 depicts our experimental setup; Section 6 shows our experimental results and the relevant analysis; Section 7 gives an overview of related work; and Section 8 concludes the paper.
BACKGROUND
GPGPU Architecture
Without loss of generality, we use NVIDIA Fermi [13] as an example to introduce the relevant knowledge of GPGPU architecture. The threads in Fermi are organized as a three-dimensional grid where every element is a three-dimensional cooperative thread array (CTA). Each CTA is dispatched as a whole to a stream multiprocessor for execution. One CTA can contain up to 1,536 threads, in which every 32 threads are packed into one warp. A warp is the basic scheduling element in which all threads execute in SIMT fashion. One SM in Fermi can concurrently run up to 48 warps. Fig. 2 shows two stages, issue and operand collection, in a Fermi SM pipeline. At the issue stage, warps are issued by a warp scheduler: if the selected warp has a valid instruction in its instruction buffer, and the scoreboard confirms that the instruction does not depend on previous instructions (notified by the r field), then it will be issued; otherwise, the scheduler will select another warp based on the scheduling algorithm. The issued instruction is attached with a mask by the SIMT-stack [14] that indicates the active threads within a warp. Fig. 2b shows details of the data path involved in the operand collection stage, which the issued instruction enters next. The instruction is buffered in the collector unit, and the register requests from the instruction are sent to the arbitrator associated with each RF bank. Here the arbitrator resolves the bank conflicts of the pending requests. The GPGPU RF is highly banked to virtually support a multi-port feature. In this work, we assume there are a total of 64 entries in a bank [3]; each entry is 128 bytes, containing 32 32-bit operands, namely, a warp register (see footnote 2); and a single request can read/write one register.
Domain-Wall-Shift-Write Based Racetrack Memory
Fig. 3a shows the schematic of one RM cell based on domain-wall-shift-write [15], [16]. As indicated in the name, write operations of this kind of RM cell rely on magnetic domain wall motion. The RM cell consists of a ferromagnetic wire, an MTJ, and two access transistors. Different from the conventional 1T1J STT-RAM cell [17], the MTJ in the RM cell contains two fixed pinning regions whose magnetization directions are opposite to each other. These two fixed regions are separated by a free region. In the free region, the magnetization direction can be changed by injecting current from the two adjacent pinning regions. In write operations, the WWL (Write Word-Line) transistor is turned on; the cell will be written to '1'/'0' by setting Bit-Line (BL) to high/low and Source-Line (SL) to low/high. During read operations, the Read Word-Line (RWL) transistor is turned on, then the sensing current will flow from the BL to the SL. The decoupled read and write paths can substantially reduce the read disturbance probability and improve the operation reliability of the RM cell. A RM consists of an array of magnetic strips, namely, racetracks. Fig. 3b shows the structure of a racetrack including an access port implemented by the RM cell [11]. The racetrack is partitioned into consecutive magnetic domains which are separated by magnetic domain walls [18]. Each magnetic domain is considered as a RM cell where '0' and '1' are stored as the different magnetization directions, i.e., up and down, respectively. An important feature of RM is that the RM cells can move along the racetrack in either direction when current is injected from different ends of the racetrack. Hence, multiple RM cells can share one access port: the RM cell to be accessed, say, M_i, must move to the slot under the access port. The ferromagnetic layer with fixed magnetization direction and the metal-oxide layer underneath, together with M_i, constitute a magnetic tunneling junction (MTJ) structure. Depending on the relative magnetization directions in the fixed ferromagnetic layer and M_i, i.e., parallel or anti-parallel, the whole MTJ structure demonstrates a low or high resistance state, respectively.
2. We use warp register and register interchangeably.
GPGPU REGISTER FILE BUILT WITH DWSW-RM
Fig. 4 shows the conceptual bank array design using RM (see footnote 3). Multiple rows share the same word-line (WL) connected to the access port, significantly reducing the number of WLs. Consequently, the design of the row decoder becomes much simpler and its area is smaller than that of the traditional one. Note that for each access port, there are two WLs which are responsible for read and write, respectively. A shift driver supplies shift current to the head or tail of the track, depending on the location of the accessed bit on the track.
Due to design complexity concerns, all the tracks in an array share one shift driver and are controlled by the same command; that is, the magnetic domains on all tracks in an array move simultaneously. The magnetic domains on the same row represent a warp register. The arbitrator associated with each bank is augmented with a shift controller that generates the shift pulses based on the incoming requests. The right of Fig. 4 illustrates the inside of a shift controller. The row index generated from the row decoder is fed into the shift controller. The current location of the access port is stored in the location register. By comparing the row index and the location register we can measure the distance between the target register and the access port. Accordingly, the proper shift pulses are emitted from the pulse generator and drive the shifting on the track.
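The comparison performed by the shift controller can be illustrated with a small behavioral sketch. It is a simplification under stated assumptions, not the actual circuit: a single access port per track is modeled, and the class name, method name, and the "head"/"tail" direction labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ShiftController:
    """Toy model of a per-bank shift controller (single access port assumed)."""

    location: int = 0  # location register: row currently aligned with the port

    def access(self, row_index: int) -> tuple[int, str]:
        """Return (number of shift pulses, direction) needed to reach row_index."""
        delta = row_index - self.location
        if delta == 0:
            direction = "none"
        else:
            # Direction labels are illustrative; the real driver injects current
            # at the head or the tail of the track.
            direction = "tail" if delta > 0 else "head"
        # All tracks of the array move together, so one register update suffices.
        self.location = row_index
        return abs(delta), direction


ctrl = ShiftController()
first = ctrl.access(5)   # port starts at row 0, target row 5
second = ctrl.access(2)  # port now at row 5, target row 2
third = ctrl.access(2)   # already aligned, no shift needed
```

One pulse per shifted bit is assumed here, which mirrors the one-cycle-per-bit shift delay used later in the evaluation.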
Fig. 5 indicates that the deployment of access ports is a tradeoff between area, power and read/write performance: increasing the number of access ports on a track can boost the read/write performance of the RM but also requires a more complex peripheral control circuit. As an extreme case, each array can have as few as two WLs across all the tracks, though the read/write performance will be dramatically degraded by the sharply increased shift delay. Fig. 5 shows an array-level design exploration of the RM based RF of one Fermi SM. The parameters of each configuration are extracted from a modified NVSim [19] enhanced with a RM model. We restrict the write pulse to within 1 ns, and adopt the device-level parameters from [11], [15] by carefully scaling them down to the 32 nm technology node. Compared to the SRAM design, the RF using four-port (4P) RM reduces the area and leakage power by about 90 and 93 percent, respectively, while the read/write energy is also reduced by about 42 percent/50 percent due to the smaller area, the shorter routing distance, and the simpler decoding logic. However, the read and write latencies are increased by 70 and 140 percent, respectively, due to the long sensing delay and write pulse width. Here the read and write latencies are measured from the read and write operations at the access port; hence, shift delay is not considered. Note that the read/write latency of the 4P configuration is 0.59/1.26 ns, which can fit into two SM cycles (see footnote 4). Therefore, the prolonged read/write latency has minimal impact on system performance. The result also shows that continuing to increase the number of access ports on each track beyond 16 only results in marginal read/write energy saving, as the power consumption of sense amplifiers and write drivers starts to dominate the total power. A further reduction of read latency/energy can be achieved by using thicker oxide that can improve the TMR of the MTJ structure [16], which is beyond the scope of this paper. Given the longest track used in this work (64 bits), the maximum number of access ports is set to 64. Such a design omits any shifting since every magnetic domain/bit is associated with a private access port, just like RAM.
3. For simplicity, we skip the details of inner bank array organization and assume that one bank consists of only one array.
4. The SM frequency of most real GPGPUs is lower than 1 GHz.
ARCHITECTURE OPTIMIZATIONS FOR RM BASED GPGPU REGISTER FILE
Register Remapping
We observe that the usage ratio of RF during GPGPU execution of popular GPGPU benchmark suites is generally low. Such underutilization can be leveraged to optimize the read/write performance of RM. As we will show in Section 5, on average, only 62.15 percent of the RF is allocated over the simulated benchmarks. This is because the limited availability of the shared memory and the maximum number of CTAs supported by one SM leave the RF not fully utilized. As a result, only about 40 out of 64 entries are used in a bank on average. Fig. 6a shows the original register mapping in two banks [20]. Here we assume there are two warps (W0 and W1) running on a SM and each warp requires eight registers (R0 to R7) interleaved across two banks. Each track has 16 bits and two access ports. Within a bank, the original mapping allocates all eight registers together, for which the maximum shift distance is 8 (supposing the access sequence is register 2 of warp 1 (W1R2) → W1R6 → W0R0 and the access port on the top is selected). Fig. 6b shows our proposed register remapping scheme. This scheme packs the registers around each access port to reduce the maximum shift distance down to 4. As a warp instruction often reads more than one operand, the shift delay may be reduced by accessing the registers distributed around different access ports simultaneously or consecutively. In present-day GPGPU design, the decoder in a RF bank has already been designed to be reconfigurable [20] to fulfill the different register mapping requirements of different applications/CUDA kernels. Hence, the implementation cost incurred by register remapping is negligible.
There are two preconditions in the implementation of register remapping: (1) the hardware knows the register file allocation information and (2) the mapping of the registers of each warp to the banks is fixed. Precondition (1) can be satisfied with the help of compilation: the compiler can deliver the register allocation information to the SM during execution (indeed, this scheme has already been adopted by CUDA and the NVIDIA GPU series [21]). Precondition (2) is guaranteed by the CTA launching scheme: each warp ID, which is used to index the RF, is reusable for future warps. So once a CUDA kernel is launched, its warp register mapping is determined and such a mapping will never change until the next kernel is launched.
RM Aware Warp Scheduling
As discussed in Section 3, although RM can help to reduce the RF area and leakage power, the read/write energy may not be necessarily saved if a long shift delay is required by data accesses (e.g., due to limited access port number). Hence, if we can arrange the RF accesses to be "sequential" then the shift distance between two adjacent RF accesses can be minimized and the overall shift overhead can be reduced.
Given the facts that thread switching in GPGPU happens very frequently, and there are normally dozens of warps pending for scheduling in a SM, we propose to re-arrange the issue order of the warps based on the distance between the access ports and the registers requested by the corresponding warp instructions to generate sequential accesses to the RM based RF. We name this technique as RM aware warp scheduling.
In RMWS, we define the scheduling score as the maximum distance between the requested register locations and the access port for the instruction of a warp. In a scheduling cycle, RMWS examines all running warps and selects the one with the lowest scheduling score to issue. Three pieces of information are needed to obtain the scheduling score: (1) the registers that the warp instruction accesses; (2) the locations of the accessed registers; and (3) the current locations of the access ports.
We use Fig. 6b as an example to illustrate how the scheduler calculates the scheduling score of a warp instruction, assuming that an instruction of W1 accesses R2 and R4 (item 1). The scheduler can get item 1 from the instruction buffer associated with the warp. Here the location of R2/R4 (item 2) is 1/0 bit away from the nearest access port (item 3). The highest value, 1 bit, is set as the scheduling score of W1. Because we evenly distribute the access ports on each track, every access port covers the same number of bits (e.g., 8 bits in Fig. 6b), called a port segment. Note that all banks of a RF have the same port segment. We use the relative location w.r.t. the port segment, namely, the bit-map location (BML), to represent item 2. A BML can be calculated from the register ID and the warp ID attached to a RF request using Equations (1), (2), and (3),
where Reg_warp is the number of warp registers allocated to a warp; Bank is the total number of register banks of a RF; Port_track is the number of access ports on a track; and Offset is the distance between the first empty bit on the top and the first valid bit in a port segment. For example, the BML of W0R0 in Fig. 6b is 2. The bold values in Equations (1), (2), and (3) can be pre-determined when a CUDA kernel is launched. Furthermore, all the arithmetic calculations are fixed-point with 6-bit resolution (see footnote 5); hence, the logic to generate BML has negligible hardware cost and can be easily installed in the scheduler.
As mentioned above, all tracks in a RF bank move simultaneously. Hence, the BMLs of all the registers at every access port are the same. We can define the BML of a bank (i.e., bank BML) as the BML of any register at an access port. Here the bank BML represents the relative location of each access port in its assigned port segment (item 3). The scheduler tracks the bank BML of each RF bank: once an instruction is issued, the generated RF requests are pushed into the pending queue of the arbitrator associated with the RF bank. For a warp instruction that is a candidate for issue, we can compare the BML of the registers to be accessed by this instruction with that of the newest pending request of the bank to obtain the scheduling score of the instruction. As in our former work [12], we do not introduce any automatic back or forth access port adjustment [22] after issuing a RF request, because we found that such adjustment has little performance or energy merit but complicates the shift controller design.
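The definition of the scheduling score can be sketched directly, before any hardware optimization. This is a minimal illustration that assumes BMLs and bank BMLs are already known as plain integers; the function names and the dictionary layout are hypothetical.

```python
def scheduling_score(accesses, bank_bml):
    """Maximum distance between each requested register and the access port.

    accesses: list of (bank, register BML) pairs read by one warp instruction.
    bank_bml: dict mapping a bank to the BML of its newest pending request.
    """
    return max((abs(bml - bank_bml[bank]) for bank, bml in accesses), default=0)


def pick_warp(ready, bank_bml):
    """Select the ready warp whose instruction has the lowest scheduling score."""
    return min(ready, key=lambda warp: scheduling_score(ready[warp], bank_bml))


# Illustrative state: W1 reads two registers that are 1 and 0 bits away from
# their ports, as in the W1 (R2, R4) example; the concrete BML values are made up.
bank_bml = {0: 2, 1: 3}
ready = {
    "W0": [(0, 6), (1, 3)],  # distances 4 and 0
    "W1": [(0, 3), (1, 3)],  # distances 1 and 0
}
w1_score = scheduling_score(ready["W1"], bank_bml)
chosen = pick_warp(ready, bank_bml)
```

A linear `min` search like this is what the bitwise scheduler design described next is meant to avoid in hardware.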
The major cost of RMWS is searching for the warp instruction which has the smallest scheduling score. This is a typical min/max search problem and the cost increases linearly with the number of concurrent warps in an SM [23]. Here we introduce a highly efficient scheduler design which only looks for the minimum scheduling score and avoids many expensive calculations. Fig. 7 depicts an example of how the proposed scheduler works. Similar to Fig. 6b, there are two warps to be scheduled: W0 has an instruction accessing R0 and R5 while W1 has an instruction accessing R2 and R1. At the first step of RMWS, the BML of each register access is preprocessed into a value BML_pp using two shift operations.
Here both shifts are arithmetic shifts. For example, the BML of W0R0 is 0x2. Through the preprocessing in step 1, we obtain the BML_pp of W0R0, which equals 0xE0. At step 2, the BML_pp of W0R0 is XORed with the BML_pp of bank0 corresponding to the newest pending request. At step 3, the result of step 2, 0x18, is logically right shifted until its LSB becomes non-zero. The number of 1's in the obtained result, 0x03 (called the normalized bit distance), represents the distance between W0R0 and its assigned access port (in this case, the distance is 2). The normalized bit distances of the other register accesses can be calculated similarly, except for W1R1: in this case, the output of step 2 is 0x0. The normalized bit distance of W1R1 is directly set to 0x0 and nothing needs to be done at step 3. The normalized bit distances of the registers to be accessed by the same instruction are ORed together at step 4 to produce the scheduling score of the instruction. At step 5, the scheduling scores of all ready instructions form an array. We scan the columns of the array from right to left and perform a bitwise AND of all the bits in the same column. If the result of the bitwise AND of a column is zero, that is, at least one bit in that column is zero, the row whose bit is zero has the minimum scheduling score and the corresponding warp instruction will be scheduled for issue. If the warp cannot be issued due to other constraints (we will discuss this case in Section 4.2.1), the scan will continue from the current column until the warp with the next minimum scheduling score is found. The above RMWS algorithm can be easily implemented in current GPGPU designs. In fact, the widely used scheduling algorithms, e.g., loose round robin (LRR) or greedy-then-oldest (GTO) [2], also adopt a scan-like mechanism for warp selection: both of them select a warp from the warp pool and then check the corresponding instruction buffer. If there is no ready instruction for the selected warp, they will select the next warp until a warp instruction can be issued. Hence, we expect RMWS to have a timing overhead comparable to that of the advanced GTO.
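The five steps can be traced in software with the following sketch. Two points are assumptions rather than details given in the text: the score width is taken as 8 bits to match the 0xE0/0x18/0x03 values of the example, and the preprocessing is modeled as an arithmetic right shift of 0x80 by the BML, a form inferred only from the worked value (BML 0x2 gives 0xE0). All function names are hypothetical.

```python
WIDTH = 8                 # assumed score width (matches the 8-bit example values)
MASK = (1 << WIDTH) - 1


def preprocess(bml):
    """Step 1 (assumed form): arithmetic right shift of 0x80 by BML."""
    return (-(1 << (WIDTH - 1)) >> bml) & MASK


def normalized_distance(reg_bml, bank_bml):
    """Steps 2 and 3: XOR, then logical right shift until the LSB is 1."""
    x = preprocess(reg_bml) ^ preprocess(bank_bml)
    while x and not x & 1:
        x >>= 1
    return x  # a run of ones whose length is the bit distance; 0 if aligned


def instruction_score(accesses):
    """Step 4: OR the normalized distances of all registers of one instruction."""
    score = 0
    for reg_bml, bank_bml in accesses:
        score |= normalized_distance(reg_bml, bank_bml)
    return score


def select_min(scores):
    """Step 5: scan columns from LSB to MSB; the first zero bit marks the minimum."""
    for col in range(WIDTH):
        for row, score in enumerate(scores):
            if not (score >> col) & 1:
                return row
    return None


bank0_bml = 4  # chosen so that preprocess(4) == 0xF8, giving 0xE0 ^ 0xF8 == 0x18
d = normalized_distance(2, bank0_bml)
scores = [instruction_score([(2, 4), (5, 4)]), instruction_score([(3, 4)])]
winner = select_min(scores)
```

Because every normalized distance is a run of ones starting at the LSB, ORing two of them yields the longer run, so step 4 computes a maximum without a comparator, and the first column containing a zero identifies a row with the shortest run.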
We note that RMWS overlooks the possibility that the registers accessed by a warp instruction are located in the same bank but lie on different sides of the assigned access port. However, such a case rarely happens because in GPGPU executions, most warp instructions are arranged to collect their operands from different RF banks to minimize the bank conflicts of register accesses. Although we could extend our RMWS algorithm to consider this scenario, e.g., using circular shift, we decide to tolerate such an inaccuracy because of the low occurrence probability of this scenario: our profiling results show that among all the executed instructions, only 1 percent have more than two registers located in the same RF bank, let alone registers that lie on different sides of the same access port. As we shall show in Section 6.1.1, such a low probability of miscalculating the scheduling score does not visibly affect the efficacy of RMWS.
Write Buffer
RMWS assumes that the service order of the pending RF requests is deterministic because of the first-come-first-serve (FCFS) policy adopted by the arbitrator. Write requests, however, may invalidate this assumption: a write request is generated from the writeback stage of the pipeline but not the issue stage. Hence, inserting a write request into the deterministic read request sequence will introduce uncertainty into the calculation of the scheduling score. In order to remove this uncertainty, we introduce a write buffer to store the incoming write requests and use a piggyback-write policy to dispatch the write requests from the write buffer to the RF: the write operation is performed when the target register is moved to an access port by a previous read or write request. By doing so, RMWS only needs to consider the read requests during the calculation of the scheduling score. When the write buffer overflows, a writeback will be performed regardless of the current location of the registers being accessed in the RF bank. The overflow of the write buffer may harm the efficacy of RMWS. But as we shall show in Section 6.1.1, the probability of write buffer overflow is so low that its impact is marginal.
5. In Fermi a RF bank has at most 64 registers.
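The piggyback-write policy can be sketched as a small behavioral model for one bank. This is only an illustration under assumptions: the class and method names are hypothetical, the capacity of two entries per bank follows the configuration used in the evaluation, and evicting the oldest entry on overflow is an assumed victim policy that the text does not specify.

```python
class PiggybackWriteBuffer:
    """Behavioral model of a per-bank write buffer with piggyback writes."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.pending = {}  # register row -> data waiting to be written to the RF

    def write(self, row, data):
        """Buffer a write request; return a forced (row, data) writeback on overflow."""
        forced = None
        if row not in self.pending and len(self.pending) == self.capacity:
            victim = next(iter(self.pending))  # oldest entry (assumed policy)
            forced = (victim, self.pending.pop(victim))
        self.pending[row] = data  # a later write to the same row replaces the older one
        return forced

    def on_port_at(self, row):
        """Called when a read has moved `row` under the port; drain a matching write."""
        return self.pending.pop(row, None)


wb = PiggybackWriteBuffer(capacity=2)
r1 = wb.write(3, "a")
r2 = wb.write(3, "b")        # same register: no extra entry is consumed
r3 = wb.write(5, "c")
r4 = wb.write(7, "d")        # buffer full: the oldest entry is forced out
drained = wb.on_port_at(5)   # a read to row 5 lets the pending write piggyback
```

Only the forced writeback in `r4` causes a shift that the scheduler did not plan for, which is why overflow is the case that can hurt RMWS.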
In general, the RF access frequency in GPGPU is very high so that the lifetime of a GPGPU register value is short [4]. This characteristic has two positive side effects: first, the data stored in the write buffer is very likely to be accessed by a RF read request before it is written back to the RF (RAW dependency); second, a write request residing in the write buffer may be erased by a later issued instruction which also writes to the same warp register (WAW dependency). By leveraging these two dependencies, the write buffer can filter the read/write requests to the RF and further reduce the RF access energy consumption thanks to the low access cost of the write buffer. To perform the data dependency detection between the write buffer and the issued instructions, the write buffer is divided into two parts, the write buffer data array (WBDA) and the write buffer info table (WBIT), which are deployed in the arbitrator and the scoreboard, respectively. Fig. 8a shows a GPGPU pipeline augmented with RMWS and the write buffer, including the modifications on the following stages:
Issue stage. As shown in Fig. 8a, the RMWS scheduler reads the instruction buffer (step 1) and then generates the scheduling score array (see Fig. 7). The scheduler then picks a ready warp instruction and updates the WBIT in the scoreboard (step 2). WBIT is a set-associative table tracking the utilization of WBDA; the number of ways of a set in WBIT equals the number of WBDA entries associated with a register bank, as shown in Fig. 8b. Each WBIT entry has four fields: V (1 bit) indicates whether the associated WBDA entry is valid; R (1 bit) indicates whether the associated WBDA entry has received the data; warp ID (6 bits) and register ID (6 bits) record the write request; and F (up to 4 bits) is a counter recording how many in-flight read requests will read the corresponding WBDA entry, so as to avoid WAR hazards and help the scoreboard to label the relevant warp instructions as ready in the instruction buffer.
Figs. 8c and 8d depict how the scoreboard processes an incoming warp instruction. The warp ID and register ID fields of an instruction are used to log the write/read request information into a WBIT entry. A warp instruction cannot be issued if: (i) the write buffer is full, (ii) a WAW hazard happens, or (iii) a RAW hazard happens. Condition (i) happens only when all WBDA entries are valid (V = 1) but none of them is ready for writing back to the RF. This may be because the corresponding instructions have not produced the output data yet (R = 0) or the read requests to the particular WBDA entries are still in-flight (F > 0). Conditions (ii) and (iii) never happen because the scoreboard can detect both hazards, preventing the unready warp instruction from being selected by the scheduler. The scoreboard allows a warp instruction to issue if none of the above conditions exists (step 4); otherwise, the scheduler will select another warp instruction (step 3).
Operand collection stage. Whenever a read request retrieves the data from a WBDA entry, the arbitrator notifies the scoreboard to decrease the F field of the WBIT entry associated with that WBDA entry (step 6).
Write back stage. The write request writes the WBDA entry assigned by the scoreboard at the issue stage, and notifies the scoreboard to update the R field of the corresponding WBIT entry (step 7). The scoreboard then checks the instruction buffer and labels the warp instructions as ready if no hazard is detected (step 8). Once the data has been written back from WBDA to RF, the arbitrator notifies the scoreboard to recycle the corresponding entry in WBIT (step 9).
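The WBIT bookkeeping described in these stages can be summarized with a small sketch. It models only the "write buffer full" condition and the R/F updates stated in the text; the Python field and function names are hypothetical, and the hazard detection logic of the scoreboard is intentionally left out.

```python
from dataclasses import dataclass


@dataclass
class WBITEntry:
    valid: bool = False      # V: the associated WBDA entry is in use
    ready: bool = False      # R: the WBDA entry has received its data
    warp_id: int = 0         # 6-bit warp ID of the write request
    reg_id: int = 0          # 6-bit register ID of the write request
    inflight_reads: int = 0  # F: in-flight reads that will use this entry

    def can_write_back(self):
        """An entry may be drained to the RF once its data arrived and no reader is pending."""
        return self.valid and self.ready and self.inflight_reads == 0


def write_buffer_full(entries):
    """All entries are valid but none of them is ready for writing back."""
    return all(e.valid for e in entries) and not any(e.can_write_back() for e in entries)


# Two entries of one bank: one still waiting for data (R = 0), one still being read (F > 0).
bank_entries = [
    WBITEntry(valid=True, ready=False, warp_id=1, reg_id=2),
    WBITEntry(valid=True, ready=True, warp_id=0, reg_id=5, inflight_reads=1),
]
full_before = write_buffer_full(bank_entries)
bank_entries[1].inflight_reads -= 1   # operand collection: a read was served
full_after = write_buffer_full(bank_entries)
```

Once the F counter of the second entry reaches zero, that entry can be written back and recycled, so the bank can accept a new write request again.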
If we assume there are two WBDA entries for each bank and a total of 16 banks in a SM, the WBIT costs 72 bytes and the WBDA costs 4 KB (128 bytes per entry). The incurred performance and energy overheads are very low, though they are still considered in our evaluations.
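These storage figures follow from the field widths listed for a WBIT entry, as the short calculation below shows. The constant names are illustrative; the F counter is taken at its maximum width of 4 bits.

```python
BANKS = 16
WBDA_ENTRIES_PER_BANK = 2
WBDA_ENTRY_BYTES = 128
# Field widths of one WBIT entry, in bits.
WBIT_FIELD_BITS = {"V": 1, "R": 1, "warp_id": 6, "reg_id": 6, "F": 4}

entries = BANKS * WBDA_ENTRIES_PER_BANK                      # 32 entries in total
wbit_bytes = entries * sum(WBIT_FIELD_BITS.values()) // 8    # 32 * 18 bits
wbda_bytes = entries * WBDA_ENTRY_BYTES                      # 32 * 128 bytes

print(wbit_bytes, wbda_bytes)
```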
Support for Multiple Warp Schedulers
Using multiple independent warp schedulers in each SM [13] , [23] allows the GPGPU to achieve near-to-peak hardware performance as well as the reduction in the design complexity of each scheduler. However, with the introduction of RM based RF and RMWS, increasing the number of schedulers in a SM may potentially degrade the scheduling efficiency because the local optimization at each RMWS scheduler does not necessarily generate a global optimal outcome. We refer to such inefficiency caused by multiple schedulers as schedule hazard. Fig. 9 gives an example illustrating the schedule hazard between two RMWS schedulers of a RF bank: scheduler 0 should dispatch R2 at cycle 1 to achieve a global optimal solution by considering the track movement triggered by scheduler 1. However, it falsely picks up R1 which offers a local optimal solution for scheduler 0 only.
Schedule hazard comes from the inconsistency of the track movement directions requested by multiple simultaneously scheduled warp instructions. In order to eliminate the schedule hazard and keep a concise scheduler design, we introduce warp-register remapping, which is shown in Fig. 10. Compared to the original register remapping in Fig. 6b, all registers of a single warp are mapped to the same bank by warp-register remapping. Each scheduler can work independently without interfering with the others.
To implement warp-register remapping, we divide the RF banks into equal-sized, non-overlapping subsets, each of which is bound to a warp scheduler. The warps allocated to a scheduler access all their operands from the subset of RF banks associated with that scheduler. As such, each scheduler can exclusively emit RF requests to its privately owned RF banks and leverage RMWS to produce optimal scheduling without any interference from schedule hazard. Even though the registers accessed by a warp now span fewer banks, the bank conflict incurred by warp-register remapping is expected to be mild, since for each bank, the number of warps it serves also decreases.
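The partitioning can be sketched as follows. The sketch assumes 16 banks and two schedulers, assigns warps to schedulers by warp ID modulo the scheduler count, and interleaves registers across the banks of a subset; these choices and all function names are illustrative assumptions, not details given in the text.

```python
def banks_of_scheduler(scheduler, num_banks=16, num_schedulers=2):
    """Equal-sized, non-overlapping subset of RF banks owned by one scheduler."""
    if num_banks % num_schedulers:
        raise ValueError("banks must divide evenly among schedulers")
    size = num_banks // num_schedulers
    return range(scheduler * size, (scheduler + 1) * size)


def bank_of_register(warp_id, reg_id, num_banks=16, num_schedulers=2):
    """Bank holding register reg_id of warp warp_id under warp-register remapping."""
    scheduler = warp_id % num_schedulers          # assumed warp-to-scheduler binding
    banks = banks_of_scheduler(scheduler, num_banks, num_schedulers)
    return banks[reg_id % len(banks)]             # assumed interleaving inside the subset


# Banks touched by warps of scheduler 0 (even IDs) and scheduler 1 (odd IDs).
touched0 = {bank_of_register(w, r) for w in (0, 2, 4) for r in range(20)}
touched1 = {bank_of_register(w, r) for w in (1, 3, 5) for r in range(20)}
```

Since the two sets of banks never overlap, a shift triggered by one scheduler cannot change the bank BML that the other scheduler relies on.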
EXPERIMENT METHODOLOGY
Applications
We construct a diverse application set from [2], [7], [8], [9] to evaluate our proposed GPU RF architecture. All applications are fully simulated except for KM, FWT and RD, which are simulated for only the first 2 billion instructions. The detailed characteristics of the applications used in this paper are summarized in Table 1.
Simulation Platform
We use GPGPU-Sim [2] as our simulation platform, which has been modified with all the proposed architectural optimizations. A Fermi-like SM [13] is adopted and simulated by GPGPU-Sim. The simulator configuration is depicted in Table 2. To accurately simulate register remapping, we configure GPGPU-Sim to run PTXPlus code, which exactly exposes the register allocation to hardware. We extract the performance statistics from GPGPU-Sim and also generate the detailed access statistics of the RM based RF for energy consumption measurement.
We choose the widely used GTO as the basic scheduler in our GPU configuration. Although there are many other scheduling algorithms [4], [24], [25], [26], [27], [28], most of them focus on improving the performance of particular classes of applications, e.g., memory-intensive applications. RMWS, in contrast, is designed for more generic scenarios and can be easily incorporated into other schedulers. For example, for a two-level scheduler [4], we can replace the LRR at each level with RMWS. In this work, we do not consider integrated solutions combining RMWS with other schedulers and leave them to future research.
The design parameters of RFs implemented with different memories are listed in Table 3; all parameters are generated by a modified NVSim [19] at the 32 nm technology node. For the 4P RM, we adopt the device parameters from [11], [15] and the memory cell area data from [12]. The SM frequency is set to 700 MHz, which implies that the RF runs at 1,400 MHz [2]. The read latency of the 4P RM fits into one cycle, while the write latency is two cycles. We conservatively assume that shifting one bit on a track takes one cycle. The actual shift current density is determined by the target shift velocity [29] and the length of one magnetic domain. We use the parameters from [29] to estimate the shift energy of the 4P RM while also taking the shift driver overhead into account. We also add one extra cycle to the write-back stage for maintaining consistency in the write buffer. All the above timing and energy overheads are included in our simulations.
A register bank in a Fermi SM consists of 2,048 32-bit registers; there are 64 entries (each entry has 32 32-bit registers) within a bank. In the 4P RM RF design, a bank includes 1,024 tracks, each of which has 64 bits. As there are four ports on a track, the maximum shift distance is 15 bits. The write buffer size is set to 32 entries, and every two entries of the write buffer are associated with a bank.
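The bank organization and timing parameters above can be cross-checked with a short calculation. The Python sketch below uses only the numbers quoted in this section; the constant and function names are illustrative. Two points are assumptions rather than statements from the text: the ports are taken to divide a track into equal segments, each served by its own port, and a shift is taken to be fully serialized with the read or write that follows it.

```python
# Parameters quoted in the text.
REGS_PER_BANK = 2048          # 32-bit registers per bank
REG_BITS = 32
TRACKS_PER_BANK = 1024
BITS_PER_TRACK = 64
PORTS_PER_TRACK = 4
READ_CYCLES = 1               # 4P RM read latency
WRITE_CYCLES = 2              # 4P RM write latency
SHIFT_CYCLES_PER_BIT = 1      # conservative assumption stated in the text
WRITE_BUFFER_ENTRIES = 32
WB_ENTRIES_PER_BANK = 2


def max_shift_distance(bits_per_track: int, ports: int) -> int:
    """Worst-case shift, assuming equal segments with one port each."""
    return bits_per_track // ports - 1


def worst_case_latency(access_cycles: int) -> int:
    """Assumes the shift and the access are serialized (no overlap)."""
    shift = max_shift_distance(BITS_PER_TRACK, PORTS_PER_TRACK)
    return shift * SHIFT_CYCLES_PER_BIT + access_cycles


sram_bits = REGS_PER_BANK * REG_BITS          # capacity of an SRAM bank
rm_bits = TRACKS_PER_BANK * BITS_PER_TRACK    # capacity of an RM bank
banks_covered = WRITE_BUFFER_ENTRIES // WB_ENTRIES_PER_BANK

print("bank capacity (bits):", sram_bits, rm_bits)
print("max shift distance:", max_shift_distance(BITS_PER_TRACK, PORTS_PER_TRACK))
print("worst-case read/write cycles:",
      worst_case_latency(READ_CYCLES), worst_case_latency(WRITE_CYCLES))
print("banks covered by the write buffer:", banks_covered)
```

Under these assumptions the SRAM and RM banks hold the same number of bits, the 15-bit maximum shift distance quoted in the text is reproduced, and a 32-entry write buffer with two entries per bank corresponds to 16 banks.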
RESULT
In this section, we first evaluate the performance and energy of the GPGPU with one scheduler per SM, followed by an exploration of different design parameters. Finally, we present the evaluation of the multi-scheduler design.
Results of Single Scheduler Design
Performance
Fig. 11 shows the performance of different design choices. The baseline RF design is built with SRAM; GTO is used as the default warp scheduler. Directly deploying the 4P RM based RF with GTO (4P+GTO) degrades the overall performance by 4.8 percent w.r.t. the GTO baseline. Register remapping reduces the average shift distance and limits the performance degradation to within 3.4 percent. Combined with register remapping, the performance of RMWS is within 0.3 percent of the baseline. We also evaluate the performance of register file cache (RFC) [30]. As indicated by its name, RFC introduces a small cache for the RF in order to reduce the accesses to the RF. The function of the write buffer in RMWS is a subset of the function of RFC; consequently, the implementation overhead of the write buffer is less than that of RFC. We observe an 8.9 percent performance degradation with RFC: it relies on a two-level scheduler [30], which performs worse than GTO, even though RFC avoids a considerable number of RF accesses.
To analyze the GPGPU performance in detail, we divide the applications into three categories based on the sensitivity of their performance to RF access delay when GTO is employed:
C1 is sensitive to a prolonged RF access delay. With 4P+GTO, the geometric mean (GMEAN) of C1's performance is reduced by about 11.7 percent. Most applications in C1 are compute-intensive and exhibit extremely high data parallelism.
C2 is mildly sensitive to RF access delay: the performance degradation of its applications is restricted to within 2.8 percent. Some applications in C2 are memory-intensive, and their performance is limited by the available memory bandwidth; the performance of the others is constrained by limited innate parallelism. As a result, the performance impact of RF access delay is limited.
C3 even improves slightly, by 2.1 percent, after employing the RM based RF. This improvement mainly comes from side effects on runtime behavior: the RF access delay can positively affect pipeline execution, e.g., fewer bubble stalls due to fewer structural/data hazards, and the cache performance may also improve due to the different warp execution order [25], [26].
Fig. 12 shows the accumulated percentage of scheduling scores of issued warp instructions. Nearly 39 percent of warp instructions in C1 are issued with a scheduling score of 3 or above; for C2 and C3, this percentage decreases to 27 percent. This again explains why, in 4P+GTO, the performance variations of C2 and C3 are lower than that of C1. After applying RMWS, the number of instructions issued with a scheduling score of 3 or above drops significantly in all three categories, especially in C1. As a result, we observe a dramatic performance improvement of C1 in RMWS. Across all applications, RMWS reduces the number of warp instructions issued with a scheduling score of 3 or above by about 13 percent on average w.r.t. 4P+GTO, demonstrating the effectiveness and applicability of RMWS.
Fig. 13 depicts the efficacy of the write buffer. By resolving the RAW and WAW dependencies, the write buffer avoids about 33.3 percent of RF reads (WB read) and about 31.5 percent of RF writes (WB voided) in RMWS. Thanks to piggyback-write, 63.8 percent of total write requests to the RF are performed without any shifting: 36.1 percent of them are written directly to the RF (Reg dirct. write) because the target register is already at the access port; the remaining 27.7 percent are stored in the write buffer and activated only when their target registers move to the access ports (WB writeback). The percentage of RF writes triggered by write buffer overflow (WB overflow) is lower than 5 percent.
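The write categories reported for Fig. 13 can be made concrete with a toy model. The Python sketch below is an illustration under stated assumptions, not the hardware design evaluated in the paper: it models a single bank with a two-entry buffer, represents the RF as a dictionary, evicts the oldest pending entry on overflow, and relies on the caller to say whether a register is currently under an access port. The class name `WriteBufferSketch` and its method names are hypothetical.

```python
from collections import OrderedDict


class WriteBufferSketch:
    """Toy single-bank write buffer; the RF is modeled as a plain dict."""

    def __init__(self, capacity: int = 2):
        self.capacity = capacity
        self.pending = OrderedDict()   # reg -> value waiting for write-back
        self.rf = {}
        self.stats = {"wb_read": 0, "rf_read": 0, "wb_voided": 0,
                      "direct_write": 0, "wb_writeback": 0, "wb_overflow": 0}

    def read(self, reg):
        if reg in self.pending:        # RAW: serve the read from the buffer
            self.stats["wb_read"] += 1
            return self.pending[reg]
        self.stats["rf_read"] += 1
        return self.rf.get(reg)

    def write(self, reg, value, at_port: bool):
        if reg in self.pending:        # WAW: older write never reaches the RF
            self.stats["wb_voided"] += 1
            del self.pending[reg]
        if at_port:                    # target already under a port: no shift
            self.rf[reg] = value
            self.stats["direct_write"] += 1
            return
        if len(self.pending) >= self.capacity:
            # Overflow: force the oldest entry out (assumed eviction policy).
            old_reg, old_val = self.pending.popitem(last=False)
            self.rf[old_reg] = old_val
            self.stats["wb_overflow"] += 1
        self.pending[reg] = value

    def port_aligned(self, reg):
        """Piggyback-write: another request's shift brought reg under a port."""
        if reg in self.pending:
            self.rf[reg] = self.pending.pop(reg)
            self.stats["wb_writeback"] += 1


wb = WriteBufferSketch(capacity=2)
wb.write("r1", 10, at_port=False)   # buffered
first = wb.read("r1")               # RAW hit in the buffer
wb.write("r1", 11, at_port=False)   # WAW: value 10 is voided
wb.write("r2", 20, at_port=True)    # direct write, no shift
wb.write("r3", 30, at_port=False)   # buffered
wb.write("r4", 40, at_port=False)   # overflow evicts r1
wb.port_aligned("r3")               # piggyback write-back of r3
second = wb.read("r2")              # served by the RF
print(wb.stats)
```

In this model, the three write paths that avoid a dedicated shift (direct write, voided write, and piggyback write-back) are counted separately from overflow evictions, mirroring the breakdown used in the text.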
Two other measures of the efficacy of RMWS are the reduction in the waiting cycles of RF requests and the average shift distance associated with RF requests, as depicted in Fig. 13. Compared to 4P+GTO, the waiting cycles of RF requests in register remapping/4P+RFC/RMWS are reduced by 8.3 percent/16.3 percent/22.6 percent, respectively. The reduction in waiting cycles indicates that shift-delay-induced RF access conflicts are alleviated, leading to substantial performance improvement in C1. C2 and C3, however, achieve only marginal speedup because they are insensitive to RF access delay.
Energy
As shown in Fig. 15, the introduction of RM achieves significant RF energy saving. Applying RM (4P+GTO) eliminates almost the entire leakage energy consumption. The dynamic energy directly consumed by the read and write operations of RM cells is also reduced by 39.1 percent on average. The shifting on track, however, introduces 40.3 percent energy overhead. In total, 4P+GTO saves 13.4 percent of the energy compared to the baseline. Register remapping saves a further 11.8 percent due to the reduction in shift energy. 4P+RFC reduces the RF energy consumption by 39.7 percent; this considerable saving arises because the RFC filters a large number of RF accesses. RMWS achieves 13.4 percent more (48.5 percent in total) average energy saving on top of register remapping. The energy saving of RMWS mainly comes from the filtering of unnecessary RF requests by the write buffer as well as the further reduction in shift energy. As shown in Fig. 14, the number of shifts performed in RMWS decreases by 52.7 percent compared to 4P+GTO. The energy consumption of the write buffer is only about 3 percent of the total RF energy consumption.
Exploration of Access Port Placement
Fig. 16 shows the performance and energy exploration results of RMWS when the number of access ports on a track varies from 1 to 64. As discussed in Section 3, adding access ports increases energy dissipation while decreasing shift overhead. The overall performance keeps improving as the number of access ports increases, because the scheduling decisions made by RMWS gradually approach those of GTO. The energy consumption, however, climbs sharply as the number of access ports on a track increases, for four reasons: (1) the dynamic energy increases due to more complicated peripheral circuitry and longer interconnects over a larger RF area; (2) more writes reach the RF bank due to the improved availability of the access ports; (3) the write buffer becomes less capable of resolving RAW/WAW dependencies when piggyback-write is applied, which directly increases RF writes and reads; and (4) the extra access ports introduce additional leakage energy. We could not obtain the energy consumption of the 1P and 2P designs from NVSim because no practical layout design can be found for such irregular array structures. Fig. 16 shows that the 4P RM design achieves the highest performance as well as the lowest energy consumption among all the options.
Exploration of Write Buffer Size
Increasing the number of entries in the write buffer to 32 can further achieve over 99 percent of the baseline performance. Continuing to increase the number of entries, however, yields only very marginal performance enhancement. As expected, increasing the write buffer size incurs some energy overhead. For example, the energy consumption of the RF design doubles when the number of entries in the write buffer increases from 32 to 256.
We note that the energy consumption of the 32-entry write buffer design is lower than that of the 16-entry design because the 32-entry design filters more RF access requests and encounters fewer write buffer overflows. Hence, the 32-entry write buffer design achieves the best trade-off between performance and energy consumption.
Results of Multi-Scheduler
Performance
Fig. 18 shows the performance of different multi-scheduler designs. The Dual-RMWS design decreases the performance by 6.4 percent compared to Dual-GTO across all three types of applications; here, Dual-GTO uses the SRAM based RF. As previously discussed, the schedule hazard among multiple RMWS schedulers harms the potential performance. After combining Dual-RMWS with warp-register remapping, the GPGPU performance improves by 4.7 percent over Dual-RMWS and comes within 1.9 percent of the performance of Dual-GTO. This implies that warp-register remapping effectively suppresses schedule hazard and reduces the number of shifts performed by RF requests. Intuitively, as more RMWS schedulers are introduced in an SM, the probability of schedule hazard climbs. Quad-RMWS further degrades the performance, by 7.2 percent w.r.t. Quad-GTO. Again, warp-register remapping salvages 5.1 percent performance on top of Quad-RMWS and reaches 97.8 percent of the performance of Quad-GTO.
RELATED WORK
Power-Efficient GPGPU
Improving power efficiency is one of the main focuses of GPGPU research and development in both industry and academia. For example, the NVIDIA Kepler architecture triples the number of cores at a lower shader frequency and includes compilation support for energy-efficient warp scheduling to achieve better performance per watt [23]. Traditional low-power techniques, e.g., DVFS [3] and power gating [3], [31], have also been utilized to reduce both the dynamic and leakage power consumption of GPGPU. Special efforts [32], [33], [34] on modifying the microarchitecture or pipeline, e.g., exploiting value structures during execution, have also been proposed to address the power issue. Nonetheless, systematic analyses [3], [35] of GPGPU power consumption showed that the RF is one of the major factors affecting GPGPU power efficiency. The power efficiency of the GPGPU RF can also be improved through architecture optimizations.
Gebhart et al. proposed RFC to minimize RF accesses with two-level scheduling [4], [30], reducing both leakage and dynamic energy consumption. The authors further introduced a unified on-chip memory combining L1 cache, shared memory, and RF with significantly enhanced power efficiency and performance [36]. Unlike RFC, our write buffer design piggybacks on the temporal locality within the pipeline execution (i.e., WAW/RAW) and does not require any special scheduling policies.
Yu et al. first proposed using a new memory technology, i.e., eDRAM, to build the GPGPU RF [37]. The authors designed an RF context aware scheduler to maintain the issue fairness of warps [37]. In our design, RMWS aims to minimize the shift delay of RM without putting special focus on fairness. Interestingly, our simulations show that RMWS naturally exhibits good issue fairness, so that its performance is very close to that of GTO, a fair scheduler design. In [38] and [39], Jing et al. developed an opportunistic/compiler-assisted refreshing scheme to retain the data in volatile eDRAM cells. Due to the non-volatility of RM, such a costly refreshing scheme can be safely removed in our proposed RF design to achieve significant standby power reduction.
Goswami et al. exploited the application of nonvolatile STT-RAM in GPGPU architecture as on-chip memory [40]. As mentioned in Section 1, STT-RAM possesses near-zero leakage power but very high dynamic power, which is actually the main challenge in GPGPU RF designs. Hence, early write termination [41] is utilized to minimize the dynamic power consumption of STT-RAM based on-chip memory [40]. In our proposed RF design, the dynamic power is naturally reduced by the energy-efficient write mechanism of RM, while the write buffer also filters unnecessary RF accesses.
We note that traditional memory power management schemes can also be applied to reduce the RF power consumption. In [42], Abdel-Majeed et al. introduced a drowsy RF design to save the leakage power consumed over the long period between two successive accesses. The dynamic power of the RF can also be minimized by masking the RF accesses from inactive threads within a warp. These solutions are orthogonal to our proposed techniques and can be combined with them to further improve the energy efficiency of the GPGPU RF and its peripheral circuits.
Domain Wall Motion and Its Applications
Domain wall motion was predicted in [43] and has drawn increasing attention as a promising candidate for future storage [44] and logic devices [45]. A storage device utilizing current-driven domain wall motion, i.e., racetrack memory [46], has been fabricated with IBM 90 nm technology [47] and widely studied as an on-/off-chip memory component. Venkatesan et al. [22] first demonstrated using racetrack memory to build the last-level cache (LLC) of a CPU. They also proposed several scheduling policies to process the cache requests. Sun et al. [12] proposed a very dense racetrack memory based LLC design where multiple tracks are placed on top of each access port and the access ports are carefully placed to minimize the wasted area. Venkatesan et al. [11] further proposed a domain-wall-shift-write based RM that has lower programming power; they then also introduced an RM based cache hierarchy for GPGPU [48]. Recent work [49] also employs this kind of RM to architect the GPU RF with a focus on performance. Following a similar design motivation, we use RM to build the GPGPU RF in this work. In a GPGPU, the generation of RF requests can be controlled by switching warps, leaving a natural optimization space for warp scheduling to reduce shift overhead in RMWS.
CONCLUSION
We propose an RM based GPGPU RF design that significantly improves the energy and area efficiency of the GPGPU RF w.r.t. the conventional SRAM design. To overcome the negative performance impact induced by the inherent sequential access of RM, we also propose to dynamically reorganize the register mapping to reduce the shift delay in RM accesses. An efficient RM aware warp scheduling scheme, including a newly introduced write buffer, is designed to hide the long RF access latency. A warp-register reorganization scheme is developed to eliminate the schedule hazard among multiple schedulers. After combining all the proposed techniques, we achieve more than 48 percent RF energy saving w.r.t. the SRAM based RF design while keeping similar performance.
Mengjie Mao received the BS and MS degrees in computer science from South China University of Technology and University of Science and Technology of China, respectively, and the PhD degree from the University of Pittsburgh, in 2016. He is currently a senior software engineer at The MathWorks Inc.
Wujie Wen received the MS degree from Tsinghua University and the PhD degree from the University of Pittsburgh, in 2010 and 2015, respectively, both in electronic engineering. He is currently an assistant professor in the ECE Department, Florida International University, Miami, Florida. Before joining academia, he worked with AMD and Broadcom on GPU and wireless communication chip designs. His current research interests include emerging memory, VLSI circuit/chip design, neuromorphic computing, and hardware security. He received the 49th DAC A. Richard Newton Graduate Scholarship, the prestigious PhD scholarship (one awardee per year) in the Electronic Design Automation society, the 2014 Bronze Medal of ACM SIGDA SRC in ICCAD, a 2014 DAC best paper candidate nomination, the 2015 DAC PhD forum best poster presentation, and a 2016 DATE best paper candidate nomination. He is a member of the IEEE.
Yaojun Zhang received the BS and MS degrees from the Department of Electrical Engineering, Shanghai Jiaotong University, Shanghai, China, in 2008 and the University of Pittsburgh, Pittsburgh, Pennsylvania, in 2010, respectively, both in electrical engineering. He is currently working toward the PhD degree in the ECE Department, University of Pittsburgh. His research mainly focuses on emerging non-volatile memory design for high performance, low power and scalable computer architectures, statistical reliability analysis of memory technology, and customized array and cell level circuit design for STT-RAM technology. He is a member of the IEEE.
