The region that includes the register file is a hot spot in high-performance cores that limits the clock frequency. Although multibanking drastically reduces the area and energy consumption of the register files of superscalar processor cores, it suffers from low IPC due to bank conflicts. Our skewed multistaging drastically reduces not the bank conflict probability but the pipeline disturbance probability by means of a second read stage. The evaluation results show that, compared with NORCS, which is the latest research on a register file for area and energy efficiency, the proposed register file with 18 banks achieves 39.9% and 66.4% reductions in circuit area and energy consumption, respectively, while maintaining a relative IPC of 97.5%.
Introduction
According to the published results of SPEC CPU [1] , since the 2000s, the best scores improved by more than 20% per year even without an increase in the clock frequency [2] . The primary factor behind this improvement is an increase in the width of cores. Recently, 8-issue cores, such as the IBM POWER8, and Intel Haswell and Skylake, have come onto the market [3] - [6] .
Such wide cores, however, suffer from increased area and energy consumption of the register file. Wide cores require a large number of registers proportional to the number of in-flight instructions. Multithreaded cores require a number of registers proportional to the number of threads. Besides, the circuit area of a register file composed of a RAM is proportional to the square of the number of its ports [7]-[9]. Figure 1 shows a die photograph of the AMD Bulldozer processor, which is the best documented of recent processors [10]-[12]. The integer core of the processor is a moderate-sized, non-multithreaded 4-issue one. Nevertheless, as shown in this figure, the 96-entry integer register file
Manuscript received September 30, 2016. Manuscript publicized January 11, 2017.
† The authors are with Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, 113-8656 Japan.
†† The author is with Department of Informatics, School of Multidisciplinary Sciences, SOKENDAI (Graduate University for Advanced Studies), Kanagawa-ken, 240-0193 Japan.
††† The author is with Graduate School of Engineering, Nagoya University, Nagoya-shi, 464-8603 Japan.
†††† The author is with Information Systems Architecture Research Division, National Institute of Informatics, Tokyo, 101-8430 Japan.
a) E-mail: yamadaju@mtl.t.u-tokyo.ac.jp
DOI: 10.1587/transinf.2016EDP7414
Energy consumption and heat are two of the most serious bottlenecks in recent cores [13]. The energy consumption of a RAM is proportional to its circuit area and access frequency [7]-[9]. While L1D is accessed only once by load/store instructions, the register file is accessed more than once for read and once for write by almost all instructions. As a result, a register file with an area comparable to that of L1D consumes much more energy than L1D. The region that includes the register file is a hot spot in a core that limits the clock frequency and the scale of the core.
This hot spot problem is becoming more serious, because downscaling the operating voltage is becoming more difficult. In fact, the voltage range available for DVFS is shrinking with each new process. In addition, the register file is one of the modules in a core that are most sensitive to low voltage [14]. Therefore, architectural solutions to the energy and heat problems are more important than ever.
Reducing Register File Ports
Because the area and energy consumption of a register file are proportional to the square of the number of its ports, reducing the ports is an effective way to downscale the register file.
In fact, the Bulldozer core shown in Fig. 1 composes the 8-read + 4-write register file of a replicated pair of 4-read + 4-write RAMs to reduce the ports from 12 to 8, though the number of cells is doubled. Such replication is widely used in recent processors [3], [12], [15]-[18].
A register cache is a drastic method of reducing the
Therefore, some techniques have been proposed to reduce bank conflicts without increasing the banks. Some techniques have been proposed for banks with fewer ports, but with considerable overhead (Sect. 2.3), as summarized below:
Redundant Access Elimination eliminates bank accesses whose operands are provided by the bypass or by another access to the same operand [23], [24]. However, this elimination alone does not sufficiently reduce the accesses (Sect. 5.2).
The Register Access Queues schedule register accesses so that no bank conflict occurs. However, the total scheduler logic is tripled, because each instruction is divided into one execution and two register read operations [25].
Register Multi-mapping allocates two registers in different banks, and selectively reads one of them so that no bank conflict occurs. However, the register file and the register management logic are doubled [26].
Delayed Register Allocation allocates registers immediately before the writeback stage so that no bank conflict occurs on writeback. However, it requires an extra mapping table one quarter the size of the original register file [27].
Proposal
This paper proposes skewed multistaging for a multibanked register file. In this method, accesses that cannot obtain their operands because of bank conflicts still have a second chance in the second stage. As a result, an acceptable IPC degradation of approximately 2.5% is achieved, while the drastic reduction in area and energy consumption of a plain multibanked register file is inherited. As far as we know, our proposal is the only realistic register file system composed of minimum single-port cells. The rest of this paper is organized as follows: Sect. 2 introduces existing techniques including a plain multibanked register file. Then, Sect. 3 describes our system. Section 4 gives a mathematical model to explain the low IPC degradation of the proposal. Sections 5 and 6 show the evaluation results on IPC, area, and energy consumption. Related work not mentioned in the other sections is summarized in Sect. 7.
Existing Techniques
This section introduces three types of existing techniques in three subsections for three different purposes. Section 2.1 introduces a register cache system as the latest proposal on a register file system for area and energy efficiency, which is compared with our proposal in Sects. 5 and 6. Section 2.2 shows a possible plain multibanked register file before going into our proposal in Sect. 3. Then, Sect. 2.3 introduces some techniques to reduce bank conflicts of multibanked register files mentioned in Sect. 1 to show the difficulties in reducing bank conflicts.
NORCS [20]-[22]
A register cache can also reduce the register file area and energy consumption by reducing the number of ports [19]. Compared with the original register file, the register cache is smaller because it has only 4 to 8 entries; the main register file is smaller because it has only a few ports.
However, conventional register cache systems suffer from low IPC due to register cache misses. The backend pipeline is stalled when any of the register accesses in a cycle causes a register cache miss. If the register cache miss rate per access is 5% and the number of accesses per cycle is 3, the stall probability is as high as $1 - (1 - 0.05)^3 = 1 - 0.857 = 14.3\%$.
To reduce this probability, Shioya, et al. proposed the non-latency-oriented register cache system (NORCS). This is the latest proposal on the register file for area and energy efficiency [20] , and researchers in NVIDIA adopted this idea for their GPUs [21] , [22] .
Structure and Pipeline
As shown in Fig. 2 (middle) , NORCS has almost the same structure as conventional register cache systems except for the write buffer. The main difference is their pipelined behavior.
The pipeline of a conventional register cache system does not have a stage to read the main register file, in the same manner that usual pipelines have stages to read L1D but not the main memory. Conversely, the NORCS pipeline has dedicated stages to read the main register file, and all the instructions pass through these stages whether they hit or miss the register cache.
The NORCS pipeline is disturbed when the number of register cache misses in a single cycle exceeds the number of read ports of the main register file. With the same 3 accesses per cycle and register cache miss rate of 5%, the pipeline with a 2-read-port main register file is disturbed only if all 3 accesses miss the register cache, and the stall probability is reduced from 14.3% to $0.05^3 = 0.0125\%$.
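The two probabilities quoted above can be reproduced with a short calculation. The following is a minimal sketch, assuming that misses are independent with a fixed per-access miss rate; the function names are hypothetical and do not come from the NORCS papers.

```python
from math import comb

def conventional_stall_probability(miss_rate: float, accesses: int) -> float:
    """Conventional register cache: the pipeline stalls if any access misses."""
    return 1.0 - (1.0 - miss_rate) ** accesses

def norcs_disturbance_probability(miss_rate: float, accesses: int, read_ports: int) -> float:
    """NORCS: the pipeline is disturbed only if the misses exceed the read ports."""
    return sum(
        comb(accesses, k) * miss_rate ** k * (1.0 - miss_rate) ** (accesses - k)
        for k in range(read_ports + 1, accesses + 1)
    )

conventional = conventional_stall_probability(0.05, 3)
norcs = norcs_disturbance_probability(0.05, 3, 2)
print(f"conventional: {conventional:.1%}, NORCS: {norcs:.4%}")
```

With a 5% miss rate, 3 accesses per cycle, and 2 read ports, the sketch gives approximately 14.3% and 0.0125%, matching the figures in the text.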
Write Buffer
The results of instructions are written in parallel to the register cache, and to the main register file through the write buffer. The purpose of the NORCS write buffer is to reduce the write ports of the main register file by averaging the traffic from the execution units.
Plain Multibanked Register File
Multibanking is a technique typically used for the main memory of vector processors, and there is no standard implementation of a multibanked register file. Therefore, this subsection devotes several pages to show a possible plain multibanked register file before going into our proposal in Sect. 3. 
Structure of Multibanked Register File
Figure 2 (lower) shows the datapath of a multibanked register file. Figure 3 adds the control to the left half.
Datapaths
A multibanked register file has read and write switches for any-to-any routing between the execution units and banks.
As described in Sect. 1, the banks are more than an order of magnitude smaller than the original full-port register file in area.
Circuit Size of Data Switches
Contrary to expectation, these switches are also more than an order of magnitude smaller than the full-port register file. Thus, we give an intuitive explanation of the circuit size of the switches before the quantitative evaluation in Sect. 6. The circuit size of these switches can be estimated via a 64-bit r-read + w-write memory with only 1 entry. This 1-entry memory works as a 64-bit any-to-any switch by writing a 64-bit word to any of the w write ports, and reading it from any of the r read ports. This 1-entry r-read + w-write memory is two orders of magnitude smaller than an r-read + w-write register file with a hundred entries.
The read and write switches are a few times larger than this memory because they are not r-read+w-write, but r-read + b-write and b-read + w-write, respectively, where b is the number of banks and b > r = 2w. Finally, these switches are more than an order of magnitude smaller than the r-read+w-write register file.
The any-to-any routing and memory functions are integrated in a full-port register file, whereas they are distributed into the switches and banks in a multibanked register file. It is safe to say that a multibanked register file is smaller because of this function distribution, at the risk of bank conflicts.
Control
The physical register number from the instruction issue port is used as the concatenation of the bank number and intra-bank number fields, which are 4 to 5 bits wide.
The system has the arbiters and the intra-bank number routing switches. The bank number field of the register number is decoded and distributed to the arbiters. Then, the intra-bank number field is routed to the bank through the switches controlled by the arbitration result.
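The control flow described above can be sketched in software as follows. This is only an illustration under stated assumptions: the 4-bit intra-bank field, the placement of the bank number in the upper bits, the lowest-port-first priority, and all names are hypothetical choices, not details taken from the paper's circuit.

```python
from typing import Dict, List, Optional, Tuple

INTRA_BANK_BITS = 4  # assumed width of the intra-bank number field

def split_register_number(reg: int) -> Tuple[int, int]:
    """Return (bank number, intra-bank number) of a physical register number."""
    return reg >> INTRA_BANK_BITS, reg & ((1 << INTRA_BANK_BITS) - 1)

def arbitrate(requests: List[Optional[int]]) -> Tuple[Dict[int, int], List[int]]:
    """Fixed-priority arbitration: the lowest-numbered port wins each bank.

    requests[p] is the register number requested by port p (None if idle).
    Returns (grants, losers), where grants maps a bank to its winning port
    and losers lists the ports that lost a bank conflict.
    """
    grants: Dict[int, int] = {}
    losers: List[int] = []
    for port, reg in enumerate(requests):
        if reg is None:
            continue
        bank, _ = split_register_number(reg)
        if bank in grants:
            losers.append(port)   # bank already granted to a higher-priority port
        else:
            grants[bank] = port
    return grants, losers

# Ports 0 and 2 both target bank 1 (registers 0x13 and 0x17); port 1 targets bank 2.
grants, losers = arbitrate([0x13, 0x25, 0x17, None])
print(grants, losers)
```

In this example, port 0 wins bank 1, port 1 wins bank 2, and port 2 loses the conflict on bank 1.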
Circuit Size of Arbiters and Register Number Switches
The arbiters and the register number switches are even smaller than the 64-bit datapath described above. The arbiter is equivalent to the select logic of an instruction scheduler that selects, with fixed priority, one out of as many instructions as there are register file banks. Thus, its latency is a fraction of the half-cycle time usually allocated to the select logic that selects two or more out of 64 or more instructions. Note that the arbiters work in parallel with one another.
The intra-bank register number is 4-bit wide, and the register number routing switches are approximately 4/64 of the read/write switches for 64-bit data in area.
As shown by the pipeline registers in the middle of Fig. 3 , one cycle is assigned to the arbitration and register number routing throughout this paper.
Register Number for Multibanked Register File
The physical register number for a multibanked register file is similar to but different from the address for a multibanked main memory.
Randomness of Register Number
In a multibanked main memory, consecutive addresses reside in different banks to prevent bank conflicts in particular on continuous access. In contrast, it is meaningless to arrange the register numbers for the banks, because the register numbers are randomized by the register renaming as detailed below.
We assume instructions $I_{pred}$ and $I_{succ}$ have the same logical register as their destinations, that is, $I_{succ}$ is to overwrite the result of $I_{pred}$. A physical register allocated to the destination of $I_{pred}$ is freed when (not $I_{pred}$ but) $I_{succ}$ is committed [28]. In short, physical registers are allocated in order, and freed out of order. As a result, the sequence of register numbers in the free list is randomly shuffled after sufficient cycles have passed since initialization.
The cycle-accurate simulator in Sect. 5 confirms that this shuffling actually randomizes the bank accesses.
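The shuffling effect can be illustrated with a toy renaming model. The sketch below is not the simulator used in the paper; it assumes uniformly random logical destinations, a fixed in-flight window, and in-order commit, and all names and sizes are hypothetical. It requires more free physical registers than in-flight instructions.

```python
import random
from collections import deque

def simulate_free_list(num_logical=32, num_physical=128, window=64,
                       num_instructions=10_000, seed=0):
    """Toy register renaming: allocate from the head of the free list, and free
    the register of I_pred when the overwriting I_succ commits."""
    rng = random.Random(seed)
    rename_map = list(range(num_logical))              # logical -> physical
    free_list = deque(range(num_logical, num_physical))
    pending_free = deque()   # one entry per in-flight instruction, in program order
    for _ in range(num_instructions):
        if len(pending_free) == window:
            # The oldest instruction (I_succ) commits and frees the register
            # that was allocated to its predecessor (I_pred).
            free_list.append(pending_free.popleft())
        dest = rng.randrange(num_logical)              # random logical destination
        new_phys = free_list.popleft()                 # allocated in free-list order
        pending_free.append(rename_map[dest])          # old mapping, freed at commit
        rename_map[dest] = new_phys
    return list(free_list), rename_map, list(pending_free)

free_list, rename_map, pending = simulate_free_list()
print(free_list)
```

Although the free list starts out in ascending order, the printed list is no longer sorted after the run, because the lifetime of each physical register depends on when its logical register happens to be overwritten.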
Number of Banks
The number of banks of a multibanked main memory should be a power of 2 to avoid a complex operation such as division to obtain the bank number from the address [29]. A multibanked register file can choose the number of banks more freely. This is because the physical register number is not a consecutive sequence number but a unique identifier, i.e., a tag; thus, the physical register number can be the concatenation of the bank and intra-bank numbers. A number of banks that is not a power of 2 slightly decreases the utility of the bank number field. For the same reason, a prime number of banks does not have a direct effect in reducing bank conflicts for a multibanked register file.
Operand Bypass
The multibanked register file can reduce the bypass network to one that serves only instructions executed back-to-back, as in a register file with half-cycle latency, because the bank ports partially substitute for the original bypass. Figure 4 shows the pipeline with a multibanked register file to explain operand passing. In this figure, the result of $I_p$ is passed to $I_3$, $I_2$, and $I_1$ as follows:
Bank: Because of the short latency of the banks, $I_3$ can read the result of $I_p$ through the bank.
Bank port: Because the bank is 1-read/write, when the result appears at the bank port to be written, $I_2$ can read it through the read switch (Fig. 2 (lower)).
Bypass: Consequently, only $I_1$, which is executed back-to-back with $I_p$, must receive the result through the usual bypass in the execution unit.
Pipeline Disturbance on Bank Conflict
A bank conflict causes pipeline disturbance. The situation of a bank conflict is similar to but different from an L1D miss or a register cache miss.
Selective Delay Problem
We are sometimes asked if selective delay of the instructions that caused a bank conflict is possible without disturbing the entire pipeline. It is unrealistic, as detailed below.
The positional relation among all the instructions in the backend pipeline must be preserved as issued, because everything has been arranged so that the issued instructions flow through the backend pipeline as issued. To realize selective delay, a kind of post-scheduler is required to rearrange the instructions flowing through the backend pipeline [20] .
Rescheduling and Stalling
The following two ways are known to solve the pipeline disturbance:
Rescheduling: The responsible instruction and its dependent instructions are canceled once and rescheduled.
Stalling: While the entire backend pipeline is stalled, the missed data is written to the proper register; then, the pipeline can be restarted as if there had not been a miss.
L1D Miss
Most processors adopt rescheduling for L1D misses. Edmondson et al. also adopted rescheduling for the Alpha 21164 processor at an early date after a detailed discussion on the trade-off between rescheduling and stalling [30] .
They pointed out that stall logic could create critical paths because the write enable terminal of all the pipeline registers must be immediately turned off.
Register Cache Miss and Bank Conflict
However, almost all studies on register caches and multibanked register files adopted stalling [19] , [20] , [26] , [27] .
This difference primarily depends on the length of the miss latency compared with the issue latency, which determines the penalty of rescheduling. The latency on an L1D miss is comparable with the issue latency, while that on a register cache miss is usually one cycle and much shorter than the issue latency. Shioya et al. evaluated both rescheduling and stalling, and concluded that stalling is advantageous for register cache misses [20].
Stalling is more advantageous for multibanked systems because the latency on a bank conflict is equal to or shorter than that on a register cache miss [26] , [27] .
Critical Paths for Pipeline Stall
Although stall logic can, in general, create critical paths, our proposal can fortunately detect a stall condition 2 cycles before the pipeline is actually stalled, as detailed in Sect. 3.1. This prevents the stall logic from creating critical paths.
Techniques to Reduce Bank Conflict
As mentioned in Sect. 1, some techniques have been proposed to reduce the bank conflict probability of a plain multibanked register file without increasing the banks but with complex mechanisms.
Register Access Queue [25]
Hironaka et al. proposed scheduling queues for register accesses. An instruction is divided into one execution and two register read operations. They are dispatched to the separate queues, and scheduled so that no bank conflict occurs.
However, the total scheduling logic is tripled for one execution and two register read operations. Moreover, the register access queues are more complex than usual instruction queues. To schedule register read operations so that no bank conflict occurs, the register numbers of ready instructions must be associatively compared with one another in the queues.
Additionally, it is difficult to predict safe combinations of accesses that do not cause bank conflicts. Learned safe combinations are unlikely to remain free of bank conflicts, because the register numbers are randomized as described in Sect. 2.2.2.
Register Multi-mapping [26]
Duong et al. proposed a technique that allocates two registers in different banks to an instruction. The instruction writes the result into both the registers. The pipeline is stalled if read accesses to both the registers fail because of double conflicts.
However, a naive implementation of double mapping also doubles the registers. The register management logic, such as the register mapping table, the active and free lists, is also doubled.
Delayed Register Allocation [26], [27]
In addition to read accesses, write accesses also cause bank conflicts. Park et al. proposed the delayed register allocation [27]. The register multi-mapping described above adopted the same technique in combination [26].
This technique allocates a virtual tag to an instruction in the rename stage to resolve dependencies among instructions; then, it allocates an actual physical register immediately before the writeback stage so that no bank conflict occurs on writeback.
However, an extra mapping table is required to map the virtual tag to the register. This table is almost equivalent to a full-port register file with a 1/4 bitwidth, mainly because the tag is roughly 1/4 of the register in bitwidth.
Interface of Multibanked Register File System
As described in Sect. 2.2, a plain multibanked register file has basically the same interface as a full-port register file; that is, given a register number, it completes read/write operations in specified cycles, except for the pipeline disturbance. The internal division of the register number into the bank and intra-bank numbers is transparent to the outside of the system, and no change is needed for the instruction scheduling or register management logic, such as the instruction scheduling queues, reorder buffer, register mapping table, and active and free lists of the registers.
However, this does not hold true for the three techniques in Sect. 2.3. The register access queues schedule register accesses before they are issued. The multi-mapping and delayed allocation change how the registers are mapped. As a result, a considerable amount of overhead is imposed on parts of the core other than the register file system.
In contrast, the transparency of a plain system is inherited by our proposal described in the next section.
Proposed Technique
The proposed skewed multistaged multibanked register file reduces not the bank conflict probability but the pipeline disturbance probability by means of the second stage. In this section, skewed multistaging is detailed in Sects. 3.1 to 3.3. We present a mathematical model for a quantitative explanation of the low disturbance probability in Sect. 4. Figure 5 shows the pipelined behavior of plain and multistaged pipelines. In this figure, the stages to read the register file are divided into upper and lower halves to indicate that an instruction has two source operands. The blank stages indicate that the instructions do not have the corresponding source operands. This figure shows the rarest case, where all the accesses are to bank 0, to prevent the diagram from being unnecessarily long.
Skewed Multistaged Pipeline
In this section, we denote the access of the instruction $I_i$, from the $p$-th operand issue port, to the $b$-th bank as $i_{p,b}$. In Fig. 5, $I_1$ has two source operands, and they are denoted as $1_{0,0}$ and $1_{1,0}$.
Because all the accesses are to bank 0, the plain pipeline shown in the upper half of the figure is stalled every time an instruction has two source operands. For example, $1_{1,0}$ fails to read the bank due to the bank conflict with $1_{0,0}$, as denoted by the blocked sign. The pipeline is stalled in $C_3$, when $1_{1,0}$ reads bank 0.
Skewed Multistaged Pipeline Behavior
In contrast, the multistaged pipeline has two stages, RR1 and RR2, for reading the register file banks, to reduce not bank conflicts but pipeline stalls as follows:
• $1_{0,0}$ wins bank 0, and reads it in RR1 in $C_2$.
• As in the plain pipeline, $1_{1,0}$ loses bank 0 to $1_{0,0}$. However, in the multistaged pipeline, the losing $1_{1,0}$ retries in $C_2$; then, it successfully reads the bank in RR2 in $C_3$ without stalling the pipeline. When the losing $1_{1,0}$ reads the bank in $C_3$, the winning $1_{0,0}$ passes through this stage with the source operand obtained in the previous cycle, as denoted by the dashed box in the figure. As a result, both the source operands are provided at the same time for exec in $C_4$.
• The accesses to a bank are served in FCFS manner. $3_{1,0}$, which lost in $C_3$, is given a higher priority than the newly arriving $4_{0,0}$ in $C_4$, and hence never loses.
• In contrast, $5_{1,0}$ loses twice in $C_5$ and $C_6$, resulting in a pipeline stall in $C_8$.
In this figure, the multistaged pipeline finishes 1 cycle faster than the plain one, owing to fewer stalled cycles, though the former is 1 stage deeper than the latter. Whereas stalls directly prolong the execution time, an extra pipeline stage only does so by prolonging the penalty on infrequent mispredictions.
Skewed Multistaged Pipeline Structure
Figure 7 shows the unique structure of the skewed multistaged pipeline that realizes the above-described behavior. For simplicity, this figure extracts two read ports and one bank from the $r$ read and $w$ write ports and $b$ banks.
In the middle of this figure, there are two physical stages: Arbiter and Reg# SW, and Bank and Read SW. These two physical stages are skewed and shared by three virtual stages. The accesses from the issue ports $P_0$ and $P_1$ correspond to $1_{0,0}$ and $1_{1,0}$ in Fig. 5 (lower), and they follow the solid arrows as follows:
• In $C_1$, $1_{0,0}$ wins bank 0, and the two physical stages are allocated to the first and second virtual stages. On the contrary, $1_{1,0}$ loses, and proceeds to the pipeline register denoted as $rn_2$.
• In $C_2$, because $1_{0,0}$ goes to the next stage, $1_{1,0}$ can win the bank, and the two physical stages are allocated to the second and third virtual stages. In this cycle, $1_{0,0}$ reads bank 0, and the read operand is written to the pipeline latch denoted as $d_1$, to make a pair with the operand of $1_{1,0}$ that will be read in the next cycle, which preserves the pipelined behavior.
In this manner, the two physical stages are dynamically allocated to the first and second, or to the second and third, of the three virtual stages depending on whether the accesses win or lose the bank. From another perspective, in each of the three virtual stages, the physical partial circuit that performs the actual processing dynamically varies with winning or losing. This is quite unusual for conventional pipelines. In general, a stage means a specific physical module. It is possible that a shared module appears in multiple stages; for example, an I/D unified cache appears in the fetch and memory stages. However, it is unusual for a module to move to the neighboring stages.
Cycle-by-Cycle Behavior
Figure 6 shows the cycle-by-cycle behavior of the skewed pipeline. In this figure, $rn_1$, $rn_2$, $rn_x$, $d_1$, and $d_2$ are the pipeline latches in Fig. 7. The behavior in this figure is the same as that in Fig. 5 (lower) and Fig. 7, and the above explanation applies exactly to this figure. Thus, we do not repeat the explanation, but we note the following points about this figure:
• Multiple accesses are serialized, and at most one access resides in the Bank column. This condition satisfies the resource restriction of the banks.
• All the accesses that appear in the $rn_1$ column in $C_c$ also appear in the $d_2$ column in $C_{c+3}$ (except for stalled cycles), where the difference from $C_c$ to $C_{c+3}$ comes from the number of virtual stages. This condition ensures the pipelined behavior.
In particular, the pair of $5_{0,0}$ and $5_{1,0}$, which appears in $rn_1$ in $C_5$, also appears in $d_2$ in $C_9$ because of the stalled cycle in $C_8$. In $C_8$, $5_{0,0}$ has already arrived at $d_2$; it then stays there until $C_9$ because of the stall, when $5_{1,0}$ catches up with $5_{0,0}$ in $d_2$.
Pipeline Stall
As explained in Sect. 2.2.4, stalling is more advantageous than rescheduling for multibanked systems.
Stall Condition
The multistaged pipeline is stalled if a bank has a total of 3 or more accesses in a cycle. In Fig. 5 (lower) and Fig. 6, this condition is met in $C_5$, as denoted by the dashed box.
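The effect of this stall condition can be illustrated with a small Monte Carlo experiment. The sketch below is a toy model, not the cycle-accurate simulator used for the evaluation: it assumes a fixed number of accesses per cycle spread uniformly over the banks, counts a waiting access from the previous cycle toward the total of its bank, and assumes that a stall drains all waiting accesses. The parameter values and names are hypothetical.

```python
import random

def simulate_stall_rates(num_banks=10, accesses_per_cycle=3,
                         num_cycles=100_000, seed=1):
    """Return (plain, multistaged) fractions of cycles with a pipeline disturbance."""
    rng = random.Random(seed)
    waiting = [0] * num_banks   # losers carried over from the previous cycle (0 or 1)
    plain_stalls = 0
    multistage_stalls = 0
    for _ in range(num_cycles):
        arrivals = [0] * num_banks
        for _ in range(accesses_per_cycle):
            arrivals[rng.randrange(num_banks)] += 1
        # Plain pipeline: any bank with 2 or more accesses disturbs the pipeline.
        if max(arrivals) >= 2:
            plain_stalls += 1
        # Multistaged pipeline: stall if any bank has 3 or more accesses in total.
        totals = [w + a for w, a in zip(waiting, arrivals)]
        if max(totals) >= 3:
            multistage_stalls += 1
            waiting = [0] * num_banks   # assumption: the stall drains the waiting accesses
        else:
            # Each bank serves one access; at most one loser waits for the next cycle.
            waiting = [max(t - 1, 0) for t in totals]
    return plain_stalls / num_cycles, multistage_stalls / num_cycles

plain_rate, multistage_rate = simulate_stall_rates()
print(f"plain: {plain_rate:.1%}, multistaged: {multistage_rate:.1%}")
```

Under these assumptions, the multistaged stall rate comes out more than an order of magnitude below the plain one, in line with the qualitative claim of this section.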
Critical Paths for Stall
As mentioned in Sect. 2.2.4, in general, stall logic can create critical paths; fortunately, this does not hold true for the multistaged pipeline. In Fig. 5, a stall condition is detected in the middle of $C_5$; then, the pipeline is actually stalled in $C_8$. Thus, the tree to distribute the stall signal can take longer than 2 cycles. Consequently, it is practically impossible for the logic to create critical paths.
Request Aggregation
If two or more instructions in the scheduler have the same source operand, it is probable that they are woken up and issued at the same time, and then, will cause a bank conflict for that operand. This problem of accesses to the same register is specific to multibanked register files. Designers of multibanked main memories inevitably consider consecutive or stride accesses, but they have hardly considered accesses to the same address.
Tseng et al. proposed to share the read port among the read requests to the same register [23], [24]. We also adopted request aggregation to the same register. As the name of their technique, read sharing, suggests, Tseng et al. aggregate only read accesses. In contrast, we aggregate both read and write accesses. Because the bank is 1-read/write, data to be written can be read from the same port as shown in Sect. 2.2.3.
When two or more accesses to the same register are detected, the access with the highest priority requests the bank for the others. When this aggregated request is granted, all of the accesses receive the grant signals for the bank. After that, the following processes are automatically performed depending on whether the accesses are read or write:
Read and Read: When two or more read requests are aggregated, the read switch, controlled by the grant signals, duplicates the data read from the bank.
Write and Read: Aggregation of a write and a read request realizes the operand passing via the bank port described in Sect. 2.2.3.
These two processes are combined for one write and two or more read accesses.
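The aggregation described above can be sketched as a grouping step in front of the arbiters. This is a minimal software illustration, assuming that each access is a (kind, register number) pair and that a write, if present, acts as the representative of its group; the data representation and names are hypothetical and do not describe the comparator circuit itself.

```python
from typing import Dict, List, Tuple

Access = Tuple[str, int]  # (kind, physical register number); kind is "W" or "R"

def aggregate_requests(accesses: List[Access]) -> List[Tuple[Access, List[Access]]]:
    """Group accesses to the same register behind one representative request.

    Writes are given higher priority than reads, so a write (if any) becomes the
    representative, and the aggregated reads receive its data at the bank port.
    """
    # Stable sort: writes first, original order preserved otherwise.
    ordered = sorted(accesses, key=lambda access: 0 if access[0] == "W" else 1)
    groups: Dict[int, List[Access]] = {}
    for access in ordered:
        groups.setdefault(access[1], []).append(access)
    # The first member requests the bank on behalf of the others.
    return [(members[0], members[1:]) for members in groups.values()]

requests = [("R", 7), ("R", 12), ("W", 7), ("R", 7)]
aggregated = aggregate_requests(requests)
for representative, followers in aggregated:
    print(representative, "serves", followers)
```

Here, four accesses are reduced to two bank requests: the write to register 7 also serves the two reads of register 7, and the read of register 12 is issued alone.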
Comparator Array
For this request aggregation, an array of comparators shown in Fig. 8 is placed before the arbiters. As shown in this figure, the write accesses are given higher priority than the read accesses to ensure read-after-write dependency. Thus, the comparators can be omitted or added without affecting the correctness of the behavior. In this figure and the evaluation in Sects. 5 and 6, the comparators are provided only for the newly arriving accesses, and the match result is reused in the second stage when aggregated requests are not granted. Instead, extra comparators can be added between newly arriving and losing accesses with an IPC increase of 1% or less.
Stall Probability
The bank of a multibanked register file system can be modeled as a waiting queue, where the server is the bank itself, and the customers are accesses to the bank. This section presents analytic solutions for the stall probabilities based on queueing theory.
M/D/1/1 Queue
Each of the banks (not the whole system) of a plain multibanked system corresponds to the M/D/1/1 queue, as detailed below:
M: The arrival process can be assumed to be Markovian because the bank accesses (customers) are randomized by register renaming (Sect. 2.2.2).
D: A customer (access) is served by the server (bank) in a deterministic time of 1 cycle in a pipelined manner.
1: The number of servers (banks) is 1.
1: The capacity is 1, where the capacity counts both the customer being served and those in the waiting room. That is, the M/D/1/1 queue has no waiting room.
Because of no waiting room, if two customers arrive at the M/D/1/1 queue in a single cycle, the lost customer leaves the queue. This corresponds to a bank conflict and the resulting pipeline disturbance.
M/D/1/2 Queue [31]
The M/D/1/2 queue, which has a waiting room for only one customer, largely reduces this leaving probability. Even if two customers arrive at the M/D/1/2 queue in a single cycle, the lost customer can wait in the waiting room and be served in the next cycle.
For example, in the M/D/1/2 queue, if 2 customers arrive in the state 0 (i.e., 0 customers in the system), the first is served while the second waits in the waiting room; thus, the queue moves to the state 1 (1 customer). Then, if 1 customer arrives, the queue stays in the state 1, because the service time is 1 cycle.
Transition and Stationary Probabilities
For the ease of understanding the calculation, the probabilities are calculated assuming the number of banks b = 10 and the number of register accesses per cycle is fixed to n = 3.
In this case, the transition probability is given by the binomial distribution $x_k = {}_3C_k \, (1/10)^k \, (9/10)^{3-k}$. For example, $x_2 = {}_3C_2 \, (1/10)^2 \, (9/10)^1 = (3 \times 9)/10^3 = 2.7\%$, resulting in the label "2 (2.7%)".
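The arrival probabilities used as transition labels can be tabulated directly from the binomial distribution. The following is a minimal sketch for b = 10 banks and n = 3 accesses per cycle; the function name is a hypothetical choice.

```python
from math import comb

def arrival_probabilities(num_banks: int, accesses_per_cycle: int):
    """x[k] = probability that exactly k of the accesses in a cycle target one given bank."""
    p = 1.0 / num_banks
    return [
        comb(accesses_per_cycle, k) * p ** k * (1.0 - p) ** (accesses_per_cycle - k)
        for k in range(accesses_per_cycle + 1)
    ]

x = arrival_probabilities(10, 3)
for k, probability in enumerate(x):
    print(f"x_{k} = {probability:.1%}")
```

The four values are approximately 72.9%, 24.3%, 2.7%, and 0.1%, which are the percentages that appear in the calculations of this section.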
The dashed arrows represent the state transitions with leaving customers, each results in a pipeline disturbance. The dashed arrows return to the state 0, which means disturbed cycles are not counted in the stationary probabilities.
The stationary probabilities of the M/D/1/2 queues are calculated from the chain. Note that the stationary probability of the state 1 is as low as 3.4%, intuitively because the state 1 is hard to come (2.7%) and easy to go (72.9 + 2.7 + 0.1 = 1 − 24.3 = 75.7%). to cause a bank disturbance from 2 to 3. As a result, the disturbance probability in the state 0 is decreased from 2.7 + 0.1% to 0.1%. State 1 In the state 1, a disturbance occurs when 2 or more customers arrive as with the M/D/1/1 queue. However, the disturbance probability in the state 1 is multiplied by the stationary probability, and decreased from 2.8% to 2.8% × 3.4% = 0.0952%.
In total, the disturbance probability per bank p is decreased from 2.8% to 0.1 + 0.0952 = 0.1952%. Then, the overall pipeline disturbance probability $P = 1 - (1 - p)^b$ is decreased from $1 - (1 - 0.028)^{10} \approx 24.7\%$ to $1 - (1 - 0.001952)^{10} \approx 1.93\%$. A waiting room for only one customer thus has an effect equivalent to increasing the number of banks by a factor of b. The simulation results in Sect. 5 show P = 2.5% when b = 18 for the SPEC benchmark.
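The calculation in this subsection can be sketched as follows. The function names are hypothetical. The sketch follows the text in not weighting the state-0 term by its stationary probability, and it carries the unrounded stationary probability of the state 1, so the last digits differ slightly from the hand calculation, which rounds to 3.4%.

```python
from math import comb

def arrival_probs(n, b):
    # Binomial arrival model: uniform, independent bank selection (assumption).
    p = 1.0 / b
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

def per_bank_disturbance(n, b, waiting_room):
    """Disturbance probability p of one bank.

    Without a waiting room (M/D/1/1), 2 or more arrivals cause a disturbance.
    With a one-customer waiting room (M/D/1/2), state 0 tolerates 2 arrivals,
    and state 1 is disturbed by 2 or more arrivals.
    """
    x = arrival_probs(n, b)
    two_or_more = sum(x[2:])
    if not waiting_room:
        return two_or_more
    three_or_more = sum(x[3:])
    # Balance equation of the two-state chain: pi1 = pi0 * x2 + pi1 * x1,
    # with pi0 + pi1 = 1 (disturbed cycles return to state 0).
    pi1 = x[2] / (1 - x[1] + x[2])
    return three_or_more + pi1 * two_or_more

def pipeline_disturbance(p, b):
    # At least one of the b banks is disturbed in a cycle.
    return 1 - (1 - p) ** b

for room in (False, True):
    p = per_bank_disturbance(n=3, b=10, waiting_room=room)
    P = pipeline_disturbance(p, b=10)
    print(f"waiting_room={room}: p = {p:.4%}, P = {P:.2%}")
```

The independence of banks assumed in `pipeline_disturbance` mirrors the formula for P in the text; it is an approximation, because accesses to different banks in the same cycle are not strictly independent.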
Evaluation of IPC

Evaluation Environment
Benchmark
We used all 29 programs of the SPEC CPU 2006 benchmark with the ref data sets [1]. The programs were compiled with gcc 4.2.2 -O3. We evaluated 1G instructions after skipping the first 10G instructions.
Simulator
We used the Onikiri 2 [32] simulator, which was also used to evaluate NORCS [20] . This simulator is fully cycle accurate, that is, it reproduces the behavior of instructions in each stage in the correct cycles. The simulator executes instructions in the correct execute stages, and verifies the results with those of an on-line emulator in the commit stage. Thus, the behavior on mispredictions is also accurately reproduced. The simulator also reproduces the fact that register renaming actually randomizes the accessed registers (Sect. 2.2.2).
Number of Operands
First, we counted the number of operands of each program to better understand the relationship between the numbers of operands and register file ports.
Some source operands are provided by the bypass or by the request aggregation without consuming the register file ports (Sect. 3.3). The other source operands consume the read ports and can cause bank conflicts. Table 1 classifies these types of accesses †. Figure 10 shows the numbers of ports used by these types of accesses per cycle for 10 representative programs and the average of all the 29 programs. Some programs such as h264ref and bzip2 consume approximately 2 read and 2 write ports per cycle; however, this does not mean that a 4-port register file is sufficient. Figure 11 shows the cumulative distribution of port usage for h264ref, which is the program with the largest number of operands. In this graph, R(-) + W at the point of 4 ports is 52.4%, which means that only 52.4% of the execution cycles are covered by a 4-port register file. Conversely, this curve reaches 97.8% at the point of 9 ports, which means that it is not practically difficult to reduce the number of ports to about 9.
† Some techniques, such as delayed allocation, exclude write accesses; however, we do not show the evaluation results because the overhead is too large to compare with the others, as mentioned in Sect. 2.3.
Evaluated Models
Table 2 shows the evaluated models and their default configurations. We chose as the defaults the minimum configurations with which MStage, NORCS, and RPort show an average relative IPC of more than 0.97.
Baseline Model
The baseline core has a full-port register file composed of a replicated pair of RAMs. This replication is widely used in recent cores such as the Bulldozer core in Sect. 1 [3] , [12] , [15] - [18] . Table 3 gives its configuration, which follows modern 8-issue cores such as the IBM POWER7/8, and Intel Haswell and Skylake processors [3] - [6] .
Reduced Port Model
As described in Sect. 5.2, it is not practically difficult to reduce the number of ports from 15 to 9 or less. In fact, commercial processors adopt ad-hoc methods to reduce the number of ports; however, they are difficult to generalize.
Thus, we evaluated the RPort model, which has a b-port register file (b < 15), for reference. Starting from the Plain model (Fig. 2 (lower)), the b banks of 1-read/write RAM are replaced with one b-read/write RAM. The model is free from bank conflicts; instead, the backend pipeline is stalled when more than b operands are accessed in a single cycle. For a fair comparison with MStage, we applied the request aggregation (Sect. 3.3) to this model with no overhead in area or in energy consumption.
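The stall condition of the RPort model can be illustrated with a short sketch. The function name and the per-cycle operand counts below are hypothetical and are not measured data; the sketch also ignores the fact that accesses delayed by a stall shift the demand of later cycles, which a cycle-accurate simulator accounts for.

```python
def stall_fraction(demand_per_cycle, ports):
    """Fraction of cycles in which more operands are requested than
    the register file has ports (the RPort stall condition)."""
    if not demand_per_cycle:
        return 0.0
    stalled = sum(1 for d in demand_per_cycle if d > ports)
    return stalled / len(demand_per_cycle)

# Hypothetical per-cycle operand counts, for illustration only.
trace = [2, 4, 9, 3, 11, 5, 0, 7, 10, 4]
for ports in (4, 9, 15):
    print(f"{ports} ports: {stall_fraction(trace, ports):.0%} of cycles stalled")
```

One minus this fraction is the quantity plotted by a cumulative distribution of port usage such as Fig. 11.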
Register Latencies
As described in Sect. 2.2.1, we assumed that the arbitration and register number routing of the multibanked models take one cycle, which is denoted as "a/r: 1" in Table 2 .
Unfortunately, the register file latency is not documented for recent cores [33] . We assumed that the latency of the baseline core is 3 cycles, and that of some models is reduced to 2.
However, this difference of one cycle has an insignificant effect on the IPC of recent cores with highly accurate predictors. In this evaluation, the average IPC decreased by 1.7% because of this one-cycle increase.
These latencies are partially verified in Sect. 6.4. 
Relative IPC and Stalled Cycles
Relative IPC
Figure 12 shows the relative IPC of the models with different configurations, averaged over the 29 programs in SPEC CPU 2006. In this graph, four bars are shown for multibanked models with different numbers of banks, for RPort with different numbers of ports, and for NORCS with a 12/16-entry register cache and a 2/3-read + 3-write main register file. Regarding NORCS, we evaluated many other configurations, e.g., a main register file with fewer write ports, and selected these four as representatives so that they can prove that the default configuration is the best. We evaluated the number of banks in multiples of 6 based on the layout constraint derived in Sect. 6.2. As described in Sect. 2.2.2, the number of banks not being a power of 2 does not have a significant impact.
While Plain cannot achieve sufficient IPC even with 24 banks, MStage achieves a relative IPC as high as 97.5% with 18 banks.
Figure 13 shows the relative IPC of the models with the default configurations shown in Table 2 for all the 29 programs in SPEC CPU 2006. The programs are arranged in descending order of the number of R(-)+W accesses, which is shown as the curve in this graph. This number is that for the integer register file or that for the floating-point register file, whichever is greater; the floating-point value is chosen only for lbm (Fig. 10).
We chose the default configurations so that MStage, RPort, and NORCS show an average relative IPC of more than 0.97. However, all of them show a relative IPC as low as 0.9 for programs with a large number of register file accesses, such as h264ref.
Relative IPC and Bank Conflicts
The first graph in Fig. 14 shows the number of stalls per cycle caused by bank conflicts, i.e., the pipeline disturbance probability caused by bank conflicts, for 10 representative programs and the average of all the 29 programs. The second graph is extracted from Fig. 13 for comparison with the first.
Compared with Plain, MStage reduces this disturbance probability from 0.120 to 0.025, which is the actual effect of the M/D/1/2 queue.
As a whole, R(-)+W accesses per cycle, stalled cycles caused by bank conflicts, and the IPC show strong correlation. It is safe to say that the IPC of these models is primarily determined by the number of register file accesses per cycle of a program.
Request Aggregation
The third graph in Fig. 14 shows the relative IPC of MStage with and without the bypass and request aggregation. In this graph, three bars are shown for each of the programs, and the differences between the first and second, and the second and third, show improvement by the bypass and by request aggregation, respectively.
As shown in this graph, these two techniques have the same level of impact on IPC. However, as shown in Fig. 10, the accesses excluded by the request aggregation are quite few. This phenomenon occurs because the bypass and request aggregation have an indirect and a direct effect, respectively, as described below.
Even if the accesses that can be excluded by the bypass are not excluded, they do not necessarily cause bank conflicts. They increase the number of accesses; and then, stochastically increase the stalled cycles.
In contrast, if the accesses that can be excluded by request aggregation are not excluded, they increase the stalled cycles with probability 1. Thus, we can conclude that request aggregation is as important as the bypass for multibanked register files.
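The difference between the direct and the indirect effect can be sketched as follows. The sketch assumes that request aggregation merges same-cycle reads of the same register into one bank access, that each bank serves one access per cycle, and that registers are interleaved across banks by register number; the function names and register numbers are hypothetical.

```python
from collections import Counter

def bank_of(reg, banks):
    # Assumption: registers are interleaved across banks by register number.
    return reg % banks

def conflicts(read_regs, banks, aggregate):
    """Number of accesses that cannot be served in this cycle,
    given that each bank serves one access per cycle."""
    # Aggregation merges duplicate requests for the same register.
    regs = set(read_regs) if aggregate else list(read_regs)
    per_bank = Counter(bank_of(r, banks) for r in regs)
    return sum(count - 1 for count in per_bank.values())

# Two reads of register 5 always map to the same bank:
# without aggregation they conflict with probability 1.
print(conflicts([5, 5, 12], banks=18, aggregate=False))  # 1
print(conflicts([5, 5, 12], banks=18, aggregate=True))   # 0

# Different registers conflict only if they happen to share a bank
# (here 5 and 23 both map to bank 5), which aggregation cannot remove.
print(conflicts([5, 23, 12], banks=18, aggregate=True))  # 1
```

Accesses removed by the bypass resemble the second case: leaving them in only raises the number of accesses, so they cause conflicts stochastically rather than with certainty.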
Evaluation of Area and Energy
Evaluation Methodology
FreePDK15 and CACTI
We used FreePDK15, a predictive process design kit for 15nm FinFET technology [34], and the NanGate FreePDK15 Open Cell Library [35].
Because this library does not include RAMs or switches, we used CACTI [9], [36], [37] to evaluate them. CACTI calculates the RAM area from the numbers of vertical and horizontal wires, and the RAM energy from the capacitance of the transistors and wires charged and discharged in read and write operations. We evaluated RAMs and switches using the formulas of CACTI with the parameters of FreePDK15.
RAM and Switch Cells
Table 4 shows the areas of RAM cells calculated in this manner. Because the areas of small cells strongly depend on the designers' efforts, we investigated recent research on small-port memory cells [38], [39], and verified that the values are quite consistent.
Logic Synthesis and Place-and-Route
We described the entire systems of Plain and MStage in SystemVerilog.
Then, we synthesized and placed-and-routed the description with Cadence Encounter v10.13, including RTL Compiler v10.10, using the standard cells in the FreePDK15 library. In the place-and-route, the RAMs and switches are treated as large cells whose parameters are estimated with CACTI.
Regarding the baseline, NORCS, and RPort models, only the RAMs and switches estimated with CACTI are simply added without consideration for layout constraint.
Layout
Figure 15 shows the place-and-route results of the Plain and MStage integer register files. This figure also shows the shapes of the datapaths of the baseline, NORCS, and RPort integer register files for reference.
8-bit Slice
Because each of the banks requires a decoder and a buffer, we adopted an 8-bit-slice design for the multibanked models to reduce the overhead to 1/8. The width (the vertical direction in these figures) of the 8-bit slice is determined by the width of the operand buses between the execution units and the register file systems. These buses are {2 (read) + 1 (write)} × 5 (units) = 15 tracks of 8-bit wire bundles. These wires are routed in an upper layer, whose pitch is twice that of the lower layers used for RAMs [11] and switches. Thus, 15 × 2 = 30 tracks can be used within this width for the RAMs and switches. The widths of the datapath shapes of the three models in Fig. 15 are also determined in this manner.
Banks
In this 30-track width, 6 register file banks are arranged. This is the reason why the number of banks of the multibanked models is a multiple of 6. We cannot freely adjust the width and height of the RAM cell because they are almost completely determined by the number of bit- and word-lines.
Switches
The heights (the horizontal direction) of the switches are determined by the number of routing control lines which run vertically through the eight 8-bit slices. Thus, the read and write switches cannot overlap with each other in the horizontal direction.
The height of the layout is thus determined by the sum of the heights of 3 banks, a read switch, and a write switch.
Control
Figure 16 shows the breakdown of the area of the control circuit. The arbiters and the array of comparators for request aggregation occupy 25.0% and 13.6% of the control circuit, respectively. A significant part of the rest is the decoder circuits that generate the request signals for the arbiters.
There remains much room for optimization. For example, the arbiters could be made considerably smaller if implemented with dynamic logic.
Area
Figure 17 shows the relative area and energy consumption of the integer and floating-point register files. The Plain and MStage areas include dead spaces produced by layout constraint. The energy is calculated using the access counts produced by the simulation in Sect. 5.
The areas of the multibanked models are considerably smaller than those of the other models with the default configurations. As the register file bank areas are reduced, those of the switches and control logic become relatively large. In particular, the switch areas increase with the square of the number of banks. A horizontal line at 2.7% shows an ideal register file composed of 1-read/write cells without overheads. The area of MStage with 18 banks is approximately 6.8 times as large as that of the ideal register file.
As the number of registers increases, the register file areas become dominant. Thus, MStage with 1-read/write cells is more advantageous in heavily multithreaded cores with several times more registers.
Latency
Because CACTI does not evaluate small RAMs, and FreePDK15 has not yet provided the HSPICE model, we measured the length of the access paths to verify the assumption in Table 2. We assumed that the latency to read one of the register file banks through the read switch is a third of that to read the baseline register file. As the area is reduced, the total length of the access paths of the former is reduced to 31.2% of that of the latter. Thus, it is safe to say that the latency assumption in Table 2 does not favor the multibanked models.
Energy Consumption
As shown in Fig. 17, the result for energy consumption is basically consistent with that for the area, except that the energy of the register file banks is reduced in inverse proportion to the number of banks, because only accessed banks consume dynamic energy. In contrast, the energy of the switches increases with the square of the number of banks. As a result, the total energy consumption is minimized at 18 banks.
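Why these two opposing trends produce a minimum can be seen from a first-order model. The coefficients α (bank energy) and β (switch energy) are hypothetical symbols introduced here, not values from the evaluation, and the model ignores the control logic and the layout constraint that restricts b to multiples of 6:

```latex
E(b) \approx \frac{\alpha}{b} + \beta b^{2},
\qquad
\frac{dE}{db} = -\frac{\alpha}{b^{2}} + 2\beta b = 0
\;\Longrightarrow\;
b^{*} = \left(\frac{\alpha}{2\beta}\right)^{1/3}.
```

Under this model the energy falls for b below b* and rises above it; the minimum at 18 banks reported above is the measured result in Fig. 17, not a value derived from this formula.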
Area and Energy Efficiency
The graphs in Fig. 18 show the relative IPC with respect to the relative area and energy consumption. The graphs are simply derived from the graphs in Figs. 12 and 17 to show the trade-off between IPC and area, and between IPC and energy consumption. For a technique that reduces area and energy while keeping IPC, it is important to plot a point within the region close to the top of the graphs and as close to the y-axis as possible.
In each of the graphs, the points for MStage, NORCS, and RPort with their default configurations (denoted by circles) are located within the region where the average relative IPC is more than 0.97, from left to right in this order, which proves that MStage reduces more area and energy than NORCS and RPort while keeping the same level of IPC. Compared with NORCS (with a 16-entry register cache and a 3-read+3-write main register file), MStage (with 18 banks) achieves a 39.9% and 66.4% reduction in area and energy consumption, respectively.
Comparison with DVFS
For reference, the dashed curve in the right graph is plotted for DVFS assuming that the percentage of the register file to the whole core in energy consumption is 25% [13] .
MStage, RPort, and NORCS outperform DVFS. However, it should be noted that these techniques do not conflict with DVFS; that is, a core that adopts these techniques can also utilize DVFS, and they reduce energy consumption and heat when the core is operating at the highest voltage. Additionally, as mentioned in Sect. 1, downscaling the voltage, in particular that of the register file, is becoming more difficult in recent technologies [14].
Related Work
This section reviews related work not mentioned in the other sections.
Number of Operands
Instructions that have two source operands are rare (see Sect. 5.2). Utilizing this fact, Kim et al. reduced the ports of the wakeup logic and register file [40], and Sangireddy reduced those of the register map table [41].
Distributed Register Files
Clustered or tiled architectures have distributed register files [42] - [49]. If a register file is distributed to a group of i execution units, the number of its ports is reduced to 3i + α, where α is the number of additional ports for communication. Aggressive distribution with i = 1 incurs a certain level of IPC degradation depending on the accuracy of instruction steering.
Conclusion
The region including the register file is a hot spot of a core that limits the clock frequency and the scale of the core. Although a multibanked register file drastically reduces its area and energy consumption to mitigate the hot spot problem, conventional implementations suffer from low IPC because of bank conflicts. Our skewed multistaging transforms each bank from an M/D/1/1 queue into an M/D/1/2 queue, and reduces the pipeline disturbance probability caused by bank conflicts by more than an order of magnitude.
The evaluation results show that, compared with NORCS [20], which is the latest proposal to reduce the area and energy consumption of a register file, the proposed system achieves a 39.9% and 66.4% reduction in area and energy consumption, respectively.
The proposed technique is also important for many architects because it is applicable to many of the other components of a superscalar core. This technique will greatly reduce the area and energy consumption of the entire superscalar processor core. 
