Abstract: Upcoming applications for always-on, tiny, interconnected devices demand two things of processor cores at once: aggressively low power consumption and enhanced performance. We propose the implementation of a novel superscalar low-power processor core that operates at a low supply voltage. The core implements an intra-core low-power microarchitecture with minimal performance degradation in the instruction fetch, branch prediction, scheduling, and execution units. The inter-core lockstep not only detects malfunctions during low-voltage operation but also carries out software-based recovery. The chip incorporates a pair of cores, high-speed memory, and peripheral interfaces, and is implemented with a 65nm node. The processor core consumes only 24mW at 350MHz and 0.68V, resulting in power efficiency of 80 µW/MHz. The operating frequency of the core reaches 850MHz at 1.2V.
Introduction
Emerging mobile device markets, such as the internet of things (IoT) and wearable devices, create ever-increasing demand for low-power processor cores. While the core at the heart of microcontroller units shows power consumption of several milliwatts, performance is limited to tens of millions of operations per second. Application processors in smartphones or tablets operate at several gigahertz to run heavy operating systems, such as Android, drawing several thousand milliwatts. The processor core in the era of the IoT and wearable devices is forced to provide performance of several thousand operations per second with a power budget comparable to that of microcontrollers for continuous activation over days or months.
Commercial application processor cores that pursue only performance maximization contain many complex fill-rate-sustaining structures (FRS), including register renaming, reorder buffers, and instruction fetch speculation units [1]. A deeply pipelined processor core focusing on throughput enhancement necessitates implementation of a power-hungry FRS to sustain the instruction fill rate in the pipeline stages of the core. However, the technology scaling wall in nodes over 14nm puts limits on fanout-of-4 (FO4) delay reduction. The lower bound on logic delay imposes a performance limit on the FRS, as well as on the instruction sequencing and execution units. This indicates that a more complex FRS yields a much smaller performance return while sacrificing power efficiency. Re-architecting of the processor core is required for the upcoming era of low-power devices.
A low supply voltage to the processor core is one of the driving factors for achieving extremely low-power devices. Sub-threshold operation of application-specific logic and near-threshold operation of an Intel architecture, 32-bit (IA-32) core show the prospective power benefits of low-voltage operation [2, 3]. Low-voltage operation of the processor core incurs two limiting factors. First, a processor core operating at a low supply voltage inherently exhibits degraded performance caused by the increased FO4 delay. The increased delay pushes the operating frequency to the lower bound, resulting in low-performance devices. Second, the static and dynamic variation in deep-submicron technology, combined with reduced noise immunity in transistors, causes instability in processor core operations.
The manufacturing-induced static parameter variation in the circuits of a processor core changes from die to die. Dynamic variation occurs during processor operation through changing environmental conditions, such as temperature, voltage fluctuation, irregular current induction, and varying workload distributions in the time domain. A performance monitoring circuit prevents operational failure of the processor core. Jain et al. [3] employed variation-aware pruning of standard cells for reliable low-voltage operation. Das et al. [4] proposed the shadow latch with meta-detector to identify circuit failures. A performance monitor, such as the tunable replica circuits presented by Raychowdhury et al. [5], is a widely adopted technique to enable time-domain scaling of operating voltage and frequency.
In this paper, we propose a novel superscalar dual-core processor called Aldebaran for high-performance low-power applications. We designed the microarchitecture and Register-Transfer-Level (RTL) code of the Aldebaran core from the ground up to achieve high performance while minimizing power consumption. The contributions of this paper are summarized as follows. First, we designed a dual-issue superscalar processor core with an intra-core low-power architecture using a ground-up approach. Second, a novel inter-core lockstep architecture for reliable low-voltage operation is proposed. Periodic activation of an inter-core lockstep enables the Aldebaran core to sustain stable operation under dynamic variations, such as environmental and workload condition changes accompanied by low supply voltage. This paper is organized as follows. Section 2 discusses the internal low-power microarchitecture of the Aldebaran core, and Section 3 presents the inter-core lockstep architecture. Chip implementation and measurement results of Aldebaran integrating cores, memory and peripheral interface intellectual property (IP) are presented in Section 4, with concluding remarks in Section 5.
Intra-Core Low-Power Architecture
Aldebaran is a 32-bit superscalar processor whose internal microarchitecture is built for low-voltage high-performance operation. The overall microarchitecture is shown in Fig. 1 with simplified architectural details. The 13-stage pipeline is composed of three sub-cores: instruction fetch and branch prediction (IF), decode with queues (DEC), and execution (EX).
Acronyms and full names of hardware units in Fig. 1 are summarized in Table 1. The IF sub-core accesses the instruction cache to fill the instruction queue (IQ). The DEC sub-core is composed of dual-rail decode data paths, where each path reads an instruction from the IQ, decodes it, and stores the decoded signals in the execution queue (EQ). The EX sub-core fetches decoded instructions from the EQ, reads operands from the register file, and executes instruction operations in parallel with two-way integer units, the load-store unit, and the floating-point unit.
The virtual address (VA) computation in the IF sub-core produces the virtual address of instructions to be fetched. The four-way set-associative instruction cache (I$) is composed of i-Tag (tag memory) and i-Data (data memory), which contains the cache lines. The virtual address is fed to i-Tag to check if the requested cache line exists in i-Data. The instruction address translation look-aside buffer (iTLB) translates the virtual instruction address into the physical address. The cache line whose address matches the physical address is fetched into the instruction queue if i-Tag confirms a cache hit. The sequence of I$ access is repeated at every cycle in a pipelined way. I$ access accounts for a large portion of the power consumption, as bit line charging and discharging in memory blocks is done during every access.
The branch prediction architecture, which is composed of the branch target buffer (BTB) and the branch predictor (BP), is another source of power consumption. The BTB stores branch target addresses in two-way 58-bit 256-entry memory. The BP is implemented with a two-bit 4096-entry memory array where two bits indicate the taken history for each target address.
The branch pre-decision (BD) unit minimizes the access frequency of I$ as well as that of the BTB and the BP. The internal architecture of the BD is shown in Fig. 2. The cache line read from i-Data contains eight 32-bit instructions. In Fig. 2, i0 ~ i7 are instructions read from i-Data on the same cache line. In the Aldebaran core, the three upper bits of an instruction are sufficient to identify a branch instruction. The pre-decoder unit attached to each instruction locates the branch instruction by decoding each of the upper three bits. The leading branch detector identifies where the first branch instruction is. If the branch instruction is at i5, the IQ fetches all of i0 ~ i4 into some of its 24 instruction slots.
In a subsequent cycle, the branch instruction i5 and the following delayed instruction, i6, are sent to the BTB and the BP for branch prediction. These instructions are also put into the following slots in the IQ.
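As a rough illustration of a two-bit, 4096-entry predictor array, the sketch below models each entry as a saturating counter. The source only states that two bits record the taken history; the saturating-counter update rule, the initial counter value, and the address-to-index mapping used here are assumptions for illustration, and the class and method names are hypothetical.

```python
BP_ENTRIES = 4096  # from the text: two-bit, 4096-entry memory array


class TwoBitPredictor:
    """Hypothetical 2-bit saturating-counter predictor (update rule assumed)."""

    def __init__(self, entries=BP_ENTRIES):
        self.entries = entries
        # Counter values 0..3; every entry starts at 1 ("weakly not taken").
        self.table = [1] * entries

    def _index(self, pc):
        # Assumed indexing: drop the two byte-offset bits of a 32-bit
        # instruction address, then wrap to the table size.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        # Counter values 2 and 3 predict "taken".
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        # Move the counter toward 3 on taken, toward 0 on not taken.
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

With this scheme a single not-taken outcome does not flip a strongly taken entry, which is the usual motivation for keeping two bits per entry rather than one.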
Simultaneous single-cycle fetching of instructions in the same cache line enables sporadic activation of I$, resulting in power reduction. The leading branch detection in the BD also relieves the IQ of the logic overhead of storing instruction addresses, as all instructions fetched into the IQ are actually executed if branch prediction is correct. Upon a branch mis-prediction found during the execution stage, the IQ is cleared. Branch instruction identification in the BD also allows sporadic activation of the BTB and the BP to save power. Memory accesses in the BTB and the BP occur only when there is a branch instruction in the cache line fetched in the previous cycle.
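The branch pre-decision flow can be sketched as follows. This is a minimal model, not the authors' logic: the three-bit pattern `BRANCH_TOP3` is a hypothetical encoding (the source does not give the actual bit pattern), and the function names are invented for illustration. A cache line is modeled as a list of eight 32-bit integers.

```python
BRANCH_TOP3 = 0b110  # hypothetical encoding; the source does not give the pattern


def is_branch(instr):
    # Pre-decoder: inspect only the upper three bits of a 32-bit instruction.
    return (instr >> 29) & 0b111 == BRANCH_TOP3


def leading_branch(line):
    # Leading branch detector: position of the first branch, or None.
    for pos, instr in enumerate(line):
        if is_branch(instr):
            return pos
    return None


def split_fetch(line):
    """Return (first_cycle, second_cycle, needs_prediction).

    first_cycle: instructions before the leading branch, fetched into the IQ.
    second_cycle: the branch and the instruction right after it, which are
    sent to the BTB and the BP. If the branch is the last instruction of the
    line, second_cycle holds only the branch.
    """
    pos = leading_branch(line)
    if pos is None:
        # No branch in the line: the BTB and the BP stay idle.
        return list(line), [], False
    return list(line[:pos]), list(line[pos:pos + 2]), True
```

For a line whose first branch sits at i5, `split_fetch` returns i0 ~ i4 for the first cycle and i5, i6 for the second, matching the example in the text.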
Decoder-up (DU) and decoder-down (DD) pop, at most, one instruction each from the IQ, decode them, and store the decoded information in the EQ, which is composed of eight slots. The scheduler (SH) schedules the dispatch timing and the sequence of decoded instructions in the EQ.
The in-order dispatching of instructions into the EX sub-core avoids the complex interconnection wires required in reservation stations or register renaming units, resulting in area and power savings.
Instruction flow through the pipeline is stalled by hazards caused by register dependencies among instructions. The separation of the Aldebaran core into the IF, DEC, and EX sub-cores enables independent control of instruction flow. The IQ between the IF and the DEC, and the EQ between the DEC and the EX, implement this separation. Flow control independence in the Aldebaran core adjusts the performance during low-voltage operation, as shown in Fig. 3. The critical path in the Aldebaran core lies on the units composed of data cache (D$) address computation, the dTLB for physical address translation, and access to D$. As the operating voltage decreases, the increased delay on the critical path may cause malfunction of the core. In the Aldebaran core, compensation for the increased delay is done by setting up a processor state register (PSR). A specific bit in the PSR stalls the EX sub-core for a single cycle only when a load-store instruction is executed under low supply voltage. The performance degradation from this single-cycle stall is negligible because the stall is applied only when needed and the sub-cores retain independent flow control.
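The cost of the PSR-controlled stall can be estimated with a simple cycle count. The sketch below is a deliberately simplified model, assuming one instruction per cycle and no other hazards; the function name and the instruction-kind labels are hypothetical and do not come from the source.

```python
def ex_cycles(instr_kinds, low_voltage_stall):
    """Count EX cycles for a sequence of instruction kinds.

    Simplified model: one instruction per cycle, no other hazards.
    low_voltage_stall plays the role of the PSR bit described above.
    """
    cycles = 0
    for kind in instr_kinds:
        cycles += 1
        if low_voltage_stall and kind in ("load", "store"):
            # One extra cycle to cover the D$ address / dTLB / D$ access path.
            cycles += 1
    return cycles
```

In this model the overhead equals the number of load-store instructions, so a stream with few memory accesses pays only a small penalty when the bit is set.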
Inter-Core Lockstep for Reliability
The increased delay caused by lowering the supply voltage may lead to incorrect operation of the processor core. A pre-constructed table denoting the frequency-voltage relation is useful in controlling voltage and frequency for a required workload performance. However, static and dynamic variation in operating conditions alters the frequency-voltage relation from the initial voltage-frequency table. Research by Bowman et al. [7] and Karpuzcu et al. [8] presents techniques to detect timing failures while increasing the operating frequency of the processor. The error detection sequential circuits and tunable replica circuits report timing violation occurrences. Upon the occurrence of timing errors, the core restarts from the instruction already safely completed, and the adjustable clock distribution network sustains continuous correct functionality.
Inter-core lockstep is a technique to ensure correct operation of the Aldebaran core for a given operating voltage. The dual-core on the Aldebaran chip is for parallel workload execution. An appropriate configuration of cores enables the inter-core lockstep, with the architectural view shown in Fig. 4.
We implemented a per-core voltage domain where the power grid of the core is separated from the other core. Core-L is the 'first' core with supply voltage of VDD-L, and Core-R is the 'second' core with VDD-R. The lockstep controller (LC) compares both input and output signals of the register file (REG) from Core-L with voltage-shifted signals from Core-R. The mismatches in the signal output indicate that the behavior of Core-L deviates from that of Core-R. This acts as evidence that incorrect functionality has occurred during the operation.
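The comparison performed by the lockstep controller can be sketched as a cycle-by-cycle check of register file activity from the two cores. The event representation (tuples of access type, register index, and value) and the function name are assumptions for illustration; the actual LC compares hardware signals, not software traces.

```python
def first_mismatch(trace_l, trace_r):
    """Return the index of the first differing event, or None if none differ.

    Each trace is a list of register-file events in execution order, e.g.
    ("w", reg_index, value). Only the common prefix of the two traces is
    compared.
    """
    for cycle, (ev_l, ev_r) in enumerate(zip(trace_l, trace_r)):
        if ev_l != ev_r:
            return cycle
    return None
```

A non-None result corresponds to the failure report that the LC raises when the behavior of Core-L deviates from that of Core-R.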
The inter-core lockstep ensures reliable low-voltage operation with dynamic voltage adjustment. It exploits the actual core and workload to search for the optimal supply voltage. After power-on reset of each core, the power manager of the operating system (OS) assigns nominal operating voltage (HV), 1.1V, to each core. An OS kernel thread starts in Core-L at HV. The OS thread scheduler may assign another thread or workload to Core-R. The thread assignment to the core initiates the voltage adjustment process shown in Fig. 5.
During normal operation, each core processes its own thread at its pre-assigned operating voltage. The OS then assigns a new thread to Core-R. The OS triggers a trap routine in Core-L to store the current state of Core-L in the stack. The trap routine assigns HV to Core-L while it sets the lowest voltage (LV) for Core-R. It then assigns the same thread to both Core-L and Core-R so that the LC can detect any mismatch as thread execution progresses. Low operating voltage combined with dynamic variation of operating conditions may cause incorrect operation in Core-R. The result of incorrect operation appears as 'failure detection' in the LC. Upon failure detection, the LC reports the failure to the trap (interrupt) controller (TRAP). The trap routine stores the current program counter (PC) indicating the progress of the core, clears the core pipelines to restart the cores, and increases Core-R's operating voltage by ΔV. ΔV is on the order of 10mV. After the operating voltage is increased, the lockstep controller again checks for mismatches between signal outputs. If no failure is detected for a pre-determined time, thread execution continues in Core-R while Core-L reverts to the thread it was running before the voltage assignment process. The voltage assignment process is run periodically by the OS.
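The voltage adjustment process described above amounts to a stepwise upward search from LV. The sketch below is a minimal model under stated assumptions: `passes_lockstep` is a hypothetical callback standing in for "the LC reports no mismatch for the pre-determined time at this voltage", voltages are integers in millivolts, and the fallback to HV when no lower voltage passes is an assumption rather than something the source states.

```python
def find_operating_voltage(passes_lockstep, lv_mv, hv_mv, delta_mv=10):
    """Raise Core-R's voltage from LV in steps of delta_mv until lockstep passes.

    passes_lockstep(v_mv) -> bool is a hypothetical stand-in for running the
    shared thread on both cores and observing no LC failure at v_mv.
    Returns (voltage_mv, number_of_failures).
    """
    v = lv_mv
    failures = 0
    while v < hv_mv:
        if passes_lockstep(v):
            return v, failures
        # Failure detected: the trap routine restarts the cores and adds delta V.
        failures += 1
        v += delta_mv
    # Assumed fallback: settle at the nominal voltage if nothing lower passed.
    return hv_mv, failures
```

For example, with a core that happens to run correctly from 700mV upward, a search started at a hypothetical LV of 650mV with HV at 1100mV settles at 700mV after five failed attempts.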
Implementation
The architecture of the Aldebaran chip is shown in The Aldebaran chip, fabricated with a 65nm node shown in Fig. 7, occupies an area of 2.4mm × 3.3mm, excluding pads. The area of the core is 2.65mm² including all sub-cores, 32KB I$, 32KB D$, iTLB (instruction translation look-aside buffer), and dTLB (data address translation look-aside buffer). The voltage level shifters around Core-R handle signal transfers between per-core voltage domains. The power management integrated circuit (PMIC) outside of the chip supplies the separated power grid of each core with a separate power source. The core is able to adjust the operating voltage in units of 10mV by finely controlling the PMIC through the general purpose input/output (GPIO) interface. The power consumption breakdown obtained from the static power estimation tool (Synopsys PT-PX) shows that leakage power accounts for 57% of overall power consumption. This implies that the compact area of the Aldebaran core, which is about two-thirds of that of commercial cores, effectively reduces power consumption. While the architecture concentrates on saving power, it shows a normalized CoreMark [6] score of 2.0, which is comparable to that of commercial cores (1.8~2.3). In summary, the Aldebaran architecture obtains comparable performance while it achieves sufficiently high power efficiency.
The actual chip measurement in terms of operating frequency and power consumption versus operating voltage of the core is shown in Fig. 8 . The operating frequency rated on the left y axis in log scale linearly 
Conclusion
We propose the design and implementation of a 32-bit superscalar processor: the Aldebaran core. This intra-core low-power architecture achieves 0.08mW/MHz power efficiency at 0.68V. The novel inter-core lockstep with dynamic voltage adjustment ensures reliable low-voltage operation under static and dynamic variations in operating conditions. 
