Data-parallel applications, such as data analytics, machine learning, and scientific computing, are placing an ever-growing demand on floating-point operations per second on emerging systems. With increasing integration density, the quest for energy efficiency becomes the number one design concern. While dedicated accelerators provide high energy efficiency, they are over-specialized and hard to adjust to algorithmic changes. We propose an architectural concept that tackles the issue of achieving extreme energy efficiency while still maintaining high flexibility as a general-purpose compute engine. The key idea is to pair a tiny 10 kGE control core, called Snitch, with a double-precision FPU to adjust the compute to control ratio. While minimizing non-FPU area and achieving high floating-point utilization have traditionally been a trade-off, with Snitch we achieve both by enhancing the ISA with two minimally intrusive extensions: stream semantic registers (SSR) and a floating-point repetition instruction (FREP). SSRs allow the core to implicitly encode load/store instructions as register reads/writes, eliding many explicit memory instructions. The FREP extension decouples the floating-point and integer pipelines by sequencing instructions from a micro-loop buffer. These ISA extensions significantly reduce the pressure on the core and free it up for other tasks, making Snitch and its FPU effectively dual-issue at a minimal incremental cost of 3.2%. The two low-overhead ISA extensions make Snitch more flexible than a contemporary vector processor lane, achieving a 2× energy-efficiency improvement. We have evaluated the proposed core and ISA extensions on an octa-core cluster in 22 nm technology. We achieve more than 5× multi-core speed-up and a 3.5× gain in energy efficiency on several parallel microkernels.
INTRODUCTION
The ever-increasing demand for floating-point performance in scientific computing, machine learning, big data analytics, and human-computer interaction is dominating the requirements for next-generation computer systems [1]. The paramount design goal in satisfying this demand for computing resources is energy efficiency: shrinking feature sizes allow us to pack billions of transistors in dies as large as 600 mm². The high transistor density makes it impossible to switch all of them at the same time at high speed, as the consumed power, in the form of heat, cannot dissipate into the environment fast enough. These effects lead to a phenomenon called dark (dim) silicon [2] and a utilization wall [3], where only parts of the system can be operated simultaneously and at full speed. Ultimately, designers have to be more careful than ever to spend energy only on logic that contributes to solving the problem at hand.
For this reason, we see an explosion in the number of accelerators solely dedicated to solving one particular problem efficiently. Unfortunately, the optimization space is limited and, with the end of technology scaling, will converge toward a near-optimal hardware architecture for a given problem [4]. Furthermore, algorithms evolve rapidly, thereby making domain-specific architectures inefficient or even obsolete [5]. On the other end of the spectrum, we find fully programmable systems such as graphics processing units (GPUs) and even more general-purpose units such as central processing units (CPUs). The programmability and flexibility of those systems incur significant overhead and make them less energy efficient. Furthermore, CPUs and GPUs (to a lesser degree) are affected by the Von Neumann bottleneck, meaning that the rate at which instructions and data can travel between memory and the processor limits the architecture. Dedicated hardware, such as caching, multi-threading, and super-scalar out-of-order processor pipelines [6], is necessary to mitigate these effects. All these mitigation techniques aim to increase the utilization of the compute resource, in this case, the FPU. Still, they achieve this goal at the price of much-increased hardware complexity, which in turn decreases efficiency, because a smaller part of the silicon budget remains dedicated to compute units. For example, in the open-source, out-of-order BOOM CPU [7], only 63 kGE, less than 2.7 % of the core's overall area (estimated on a post-synthesis netlist in 22 nm), is spent on the FPU. Even larger CPUs such as Intel's Nehalem architecture show similar compute-per-area efficiency with around 6 % for all execution units [8].

Figure 1: Energy per operation of an application-class RISC-V processor, Ariane [9]. Two load instructions, one floating-point accumulate instruction, and one branch instruction make up the inner-most loop kernel; in total, one loop operation consumes 317 pJ, of which only 28 pJ are spent on the actual computation.
Design Goal: Area and Energy Efficiency
To give the reader a quantitative intuition on the severe efficiency limits affecting programmable architectures, let us consider the simple kernel of a (double-precision) dot product (z = a · b) in Figure 1 (b,c) and the corresponding energies per instruction type in Figure 1 (a) for a 64-bit application-class RISC-V processor, as reported in [9] in a state-of-the-art 22 nm technology. The kernel is made up of five instructions; four of them perform bookkeeping tasks such as moving data into the local register file (RF), on which arithmetic instructions can operate, and looping over all n elements of the input vectors. In total, the energy used for performing an element multiplication and addition in this setting is 317 pJ. The only "useful" work in this kernel is performed by the FPU, which accounts for 28 pJ. The rest of the energy (289 pJ) is spent on auxiliary tasks. Even this short kernel gives us an immediate intuition on where energy efficiency is lost: FPU utilization is low (17 %), mostly due to load, store, and loop management instructions.
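For reference, a plain C rendering of this kernel is sketched below; it is an illustrative sketch, not the exact code measured in [9], and a compiler would lower the loop body to roughly the five instructions discussed above (two loads, one fused multiply-add, and loop/pointer bookkeeping with a branch).

```c
// Baseline double-precision dot product z = a . b over n elements.
// Each loop iteration issues two loads, one FMA, loop bookkeeping, and a
// backward branch -- only the FMA performs "useful" floating-point work.
double dot_product(const double *a, const double *b, long n) {
    double z = 0.0;
    for (long i = 0; i < n; ++i) {
        z += a[i] * b[i];  // compiles to a single fmadd.d on RISC-V
    }
    return z;
}
```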
Existing Mitigation Techniques and Architectures
Techniques and architectures exist that try to mitigate the efficiency issue highlighted above.
• Instruction set architecture (ISA) extensions and compiler techniques: The compiler can statically unroll known loop bounds. While this helps to amortize the overhead of loop management instructions, it increases the pressure on the instruction cache. Post-increment load and store instructions can accelerate pointer bumping within a loop [10]. For an efficient implementation, they require a second write port into the RF, therefore increasing the implementation cost. Single instruction multiple data (SIMD) extensions, such as Streaming SIMD extensions (SSE)/advanced vector extensions (AVX) [11] in x86 or the NEON Media Processing Engine (NEON) [12] in Advanced RISC Machines (ARM) processors, perform a single instruction on a fixed number of data items in a parallel fashion, thereby reducing the total loop count and amortizing the loop overhead per computation. Unfortunately, wide SIMD data-paths are inflexible when elements need to be re-shuffled, as dedicated shuffle operations are needed to bring the data into a SIMD-amenable form.
• Vector architectures: Cray-style [13] vector units such as the scalable vector extension (SVE) [14] and the RISC-V vector extension [15] operate on larger chunks of data in the form of vectors. Explicit load and store instructions, as well as the more inflexible linear access pattern inherent to the representation as a vector, make such systems perform poorly on short vectors [15]. Moreover, vector architectures require complex hardware to shuffle data coming from memory appropriately into the vector register file.
• GPUs: Single instruction multiple thread (SIMT) architectures such as NVIDIA's V100 [16] GPU use multiple parallel scalar threads that execute the same instructions. Hardware scheduling of threads hides memory latency. Subgroups of threads operate in lock-step and access memory resources at the same time. Coalescing units bundle the memory traffic to make accesses into (main) memory more efficient. The hardware to manage threads, however, is quite complex and comes at a cost that offsets the energy efficiency of GPUs. The thread scheduler needs to swap different thread contexts on the same streaming multiprocessor (SM) whenever it detects a stalling thread (group) waiting for memory loads to return or due to different outcomes of branches (branch divergence). This means that the SM must keep a very large number of thread contexts (including the relatively large RF) in local static random-access memories (SRAMs) [17]. SRAM accesses incur a higher energy cost than reads from flip-flop-based memories and enforce a word-wise access granularity. To overcome these limitations, GPUs offer operand caches in which software can cache operands and results for later reuse, which further decreases area and energy efficiency. For example, NVIDIA's Volta architecture offers two 64-bit read ports on its register file per thread; in order to sustain a three-operand fused multiply-add (FMA) instruction, it needs to source one operand from one of its operand caches [17].
Contributions
The solutions we propose here to solve the problems outlined above are the following:
1) A general-purpose, single-stage, single-issue core, called Snitch, tuned for utmost energy efficiency. It aims to maximize the compute/control ratio (making the FPU the dominant part of the design) while avoiding the overheads of deep pipelines and dynamic scheduling.
2) An ISA extension, originally proposed by Schuiki et al. [18], called stream semantic register (SSR). This extension accelerates data-oblivious [19] problems by providing efficient semantics for reading from and writing to memory. Load and store instructions which follow affine access patterns (streams) are implicitly mapped to register reads/writes. SSRs effectively elide all explicit memory operations. Semantically, they are comparable to vector operations as they operate on vectors (tensors) without the explicit need for load and store instructions. We have enhanced the SSR implementation by providing shadow registers to overlap configuration and computation.
3) A second ISA extension, the floating-point repetition instruction (FREP), which controls an FPU sequencer. The FPU and the integer core in the proposed system are fully decoupled and only synchronize with explicit move instructions between the two subsystems. The FPU sequencer is situated on the offloading path from the integer core to the FPU. It provides a small, configurable-size loop buffer from which it can sequence floating-point instructions in a configurable manner. The loop buffer frees the integer core from issuing instructions to the FPU, leaving it available for other control tasks. This makes the single-issue, in-order core pseudo dual-issue, enabling it to overlap independent integer and floating-point instructions. Furthermore, the loop buffer eliminates the need for loop instructions in the code and reduces the pressure on the instruction fetch path.
While minimizing non-FPU area and achieving high floating-point utilization have traditionally been a trade-off, we can eliminate the need to compromise: our extensions have negligible area cost and boost FPU utilization significantly. Our Snitch core achieves the same clock frequency and higher flexibility, and is 2× more area- and energy-efficient than a conventional vector processor lane. From the design and implementation viewpoint, the contributions of this work are:
1) A fully programmable, shared-memory, multi-core system tuned for utmost energy efficiency by using a tiny integer core attached to a double-precision FPU, achieving 3.5× higher energy efficiency and 4.5× better FPU utilization on small matrices than the current state of the art.
2) An implementation of the SSR [18] enhanced with shadow registers that allow overlapping loop setup with ongoing operations using the FREP extension, enabling the use of our SSR and FREP extensions on more irregular kernels such as the Fast Fourier Transform (FFT), achieving speed-ups of 4.7× in the single-core case and close to 3× in the parallel octa-core case for the FFT benchmark.
3) A decoupled FPU and integer core architecture featuring a loop buffer that can independently service the FPU while the integer core is busy with control tasks. This extension, together with the SSR, makes the small integer core pseudo dual-issue at a minimal incremental area cost of less than 7 % for the core complex and 3.2 % on the cluster level including memories.
The rest of the paper is organized as follows: Section 2 describes the proposed architecture and ISA extensions, Section 3 offers more details on the programming model of the system and the ISA extensions, and Section 4 presents the experimental setup, evaluation, and comparison to other systems. The last sections conclude the presented work and outline future research directions.

ARCHITECTURE

Figure 2 depicts the microarchitecture of the proposed system. The smallest unit of repetition is a Snitch core complex (CC). It contains the integer core and the FPU subsystem. The CC is repeated N times to form a Snitch Hive. Cores of a Hive share an integer multiply/divide unit and an L1 instruction cache. M Hives make up a Snitch Cluster that shares a TCDM acting as a software-managed L1 cache. K clusters share last-level memory via a crossbar.
Snitch Core Complex
The smallest unit of repetition is a Snitch CC. It contains an RV32IMAFD (RV32G) RISC-V core and can be configured with or without support for the proposed ISA extensions. Depending on the technology and desired speed targets of the design, the offloading request, response, and the load/store interface to the TCDM can be fully decoupled, increasing the design's clock frequency at the expense of one additional cycle of latency.
Integer Core
The foundation of the system is an ultra-small (9 kGE to 20 kGE) and energy-efficient 32-bit integer RISC-V compute unit, which we call Snitch. Snitch implements the entire (mandatory) integer base (RV32I). As the design of the CPU is dominated by its register file (RF) implementation, we alternatively also support the embedded profile (E), which only provides 15 registers. In addition, the RF can be implemented based on either D-latches or D-flip-flops. Each Snitch has a dedicated instruction fetch port, a data port with an independent valid-then-ready [20] decoupled request and response path, and a generic accelerator offloading interface. The accelerator interface has full support for offloading an entire 32-bit RISC-V instruction, and we re-use the same RISC-V instruction encoding. This saves energy in the core's decoding logic as only a few bits need to be decoded to decide whether or not to offload an instruction. The interface has two independent decoupled channels: one for offloading an operation with up to three operands, and a back-channel for writing back the result of the offloaded operation.
The core is a single-stage, single-issue, in-order design. Integer instructions with all of their operands available (no data dependencies present) can be fetched, decoded, executed, and written back in the same cycle. We chose this design point to maximize energy efficiency and keep the area of the design at a minimum. The core keeps track of all 31 registers (the zero register is not writable, hence it does not need dedicated tracking) using a single bit per register in a scoreboard. There are three classes of instructions that need special handling: 2.1.1.1 ALU instructions: These are never a source of stalling, as the arithmetic logic unit (ALU) is fully combinational and executes its instruction in a single cycle.
To foster the re-use of the ALU, it also performs comparisons for branches, calculates CSR masks, and performs address calculations for load/store instructions. 2.1.1.2 Load/Store instructions: Load/store instructions execute as soon as all operands are available and the memory subsystem can process a new request. The data port of the core can exert back-pressure onto the load/store subsystem. Furthermore, the load store unit (LSU) keeps track of issued load instructions and performs realignment and, where necessary, sign-extension. The core can have a configurable number of outstanding load instructions. Store instructions are considered fire-and-forget from a core perspective. The memory subsystem needs to maintain issue order as the core expects load values to arrive in order.
In addition to regular loads and stores, the LSU can also issue atomic memory operations and load-reserved/store-conditional (LR/SC) as defined by the RISC-V atomic memory operation specification. From a core perspective, the only difference is that the core also sends the atomic operation type to the memory subsystem alongside the address and data. We provide additional signaling to accomplish that.
2.1.1.3 Accelerator/special function unit instructions: Off-loaded instructions can execute as soon as all operands are available and the accelerator interface can accept a new offloading request. We distinguish three types of instructions: 1) Both destination and source operands are in the integer RF, such as integer multiplication and division; Snitch's scoreboard keeps track of the destination operand. 2) Source operands are in the integer RF, and the receiving unit maintains the destination register; an example is a move from the integer to the floating-point RF. 3) Both destination and source operands are outside the integer RF, such as any floating-point compute instruction (e.g., FMA).
We offload floating-point instructions to the core-private floating-point subsystem (FP-SS) (Section 2.1.2). As most of the floating-point instructions operate on a separate floating-point RF, we can easily decouple the floating-point logic from the integer logic. The RISC-V ISA specifies explicit move instructions from and to the floating-point RF, which makes this ISA particularly amenable to such an implementation. Decoupling the FP-SS from the integer core makes it possible to buffer and sequence floating-point instructions into the FP-SS. This is discussed in detail in Section 2.5.
The second compelling use-case of the accelerator interface is to share expensive but uncommonly used resources. We provide a hardware implementation of the multiplication and division instructions of RISC-V (M). This includes a fully pipelined 32-bit multiplier and a 32-bit bit-serial integer divider with preliminary operand shifting for an early-out division. All cores of a Hive share such a hardware multiply/divide unit. By controlling the number of cores per Hive, the designer can adjust the sharing ratio. Sharing is independent of the functionality, and many other resources could be shared as well, for example, a bit-manipulation ALU.
As the RF only contains a single write port, the three sources mentioned above contend for it in a priority-arbitrated fashion: single-cycle instructions have priority over results from the LSU, which in turn have priority over write-backs from the accelerator interface. This makes it possible to interleave results when an integer instruction, such as a branch, does not need to write back. Requests to the memory subsystem are only issued if there is space available to store the load result. Hence, cores cannot block each other with outstanding requests to the memory hierarchy. The integer core has priority on the register file to reduce the amount of logic necessary to retire a single-cycle instruction.
The Snitch integer core is formally verified against the ISA specification using the open-source RISC-V formal framework [21] .
FPU Subsystem
The FP-SS bundles an IEEE-754 compliant floating-point unit (FPU) with a 32 × 64 bit RF. The FP-SS has its own dedicated scoreboard in which, similarly to the integer core, it keeps track of all registers. The FPU is parameterizable in supported precision and operation latency [22]. All floating-point operations are fully pipelined (with possibly different pipeline depths); operations without dependencies can be issued back to back. In addition to the FPU, the FP-SS contains a separate LSU dedicated to loading and storing floating-point data from/to the floating-point RF; the address calculation is performed in the integer core, which significantly reduces the area of this LSU. Furthermore, the FP-SS contains two SSRs which map, upon activation through a CSR write, registers ft0 and ft1 to memory streams. The architecture of the streamers is depicted in Figure 3 and described in more detail in Section 2.4.
Snitch Hive
A Hive contains a configurable number of core complexes that share an instruction cache and a hardware multiply/divide unit.
Each core has a small, private, fully set-associative L0 instruction cache from which it can fetch instructions in a single cycle. A miss in the L0 cache generates a refill request to the shared L1 instruction cache. If the cache-line is present, it is served from the data array of the L1 cache. If it also misses in the L1 cache, a refill request is generated and sent to the backing memory. Multiple requests to the same cache-line coalesce into a single refill request, which serves all pending requests. The L1 cache refills using an Advanced eXtensible Interface (AXI) burst-based protocol from the cluster crossbar.
The Snitch Hive serves another vital purpose: it provides a suitable boundary for separating physical design concerns. All signals crossing the design boundary are fully decoupled, so pipeline registers can be inserted to ease timing closure at the boundaries of the design. The possibility to make a Hive the unit of repetition (a macro which is synthesized and placed and routed separately) allows for assembling larger clusters containing many more cores.
Snitch Cluster
One or more Hives make up a cluster. Hives connect to the TCDM crossbar, which attaches to a banked, shared memory; the instruction refill port connects to the AXI cluster crossbar, through which the cluster shares peripherals and communicates with other clusters. The cluster crossbar provides both slave and master ports, which makes it possible to access the data of other clusters.
Tightly Coupled Data Memory (TCDM)
Core data requests are passed through an address decoder. Requests to a specific (configurable) memory range are routed towards the TCDM, and all other requests are forwarded to the cluster crossbar. In its current implementation, the TCDM crossbar is a fully connected, purely combinational interconnect. Other interconnect strategies can easily be implemented and will offer different scalability and conflict trade-offs. In order to reduce the effects of banking conflicts, we employ a banking factor of two, i.e., for each initiator port (two per core), we use two memory banks.
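The exact address-to-bank mapping is not spelled out here; a common choice for such banked scratchpads is word-level interleaving, which the sketch below illustrates under that assumption (64-bit words, 32 banks for the octa-core configuration evaluated later). The constant names and the mapping itself are assumptions for illustration.

```c
#include <stdint.h>

// Hypothetical word-interleaved TCDM address-to-bank mapping (an assumption,
// not taken from the paper): consecutive 64-bit words fall into consecutive
// banks, so unit-stride streams from different cores spread over all banks.
#define TCDM_BANKS      32u  // 16 initiator ports x banking factor 2
#define TCDM_WORD_BYTES 8u   // 64-bit wide banks assumed

static inline uint32_t tcdm_bank(uint32_t addr) {
    return (addr / TCDM_WORD_BYTES) % TCDM_BANKS;   // which bank is accessed
}

static inline uint32_t tcdm_row(uint32_t addr) {
    return (addr / TCDM_WORD_BYTES) / TCDM_BANKS;   // row within that bank
}
```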
We resolve atomic memory operations and LR/SC issued by the cores in a dedicated unit in front of each memory bank. The unit consists of a simple finite-state machine (FSM) that reads the operands from the underlying SRAM, uses its local ALU in the next cycle to perform the required operation, and finally writes the result back to its memory bank.
Cluster Peripherals
The cluster peripherals are used by software to obtain information about the underlying hardware. Read-only registers provide information on the TCDM start and end address, the number of cores per cluster, and performance counters such as effective FPU utilization, cycle count, and TCDM bank conflicts. Writable registers include a few scratch registers and a wake-up register, which triggers an inter-processor interrupt (IPI).
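As an example of how software might use these peripherals, the sketch below wakes a core through the memory-mapped wake-up register. The base address, register offset, and the convention of writing the target core's ID are invented for illustration; only the existence of a wake-up register triggering an IPI is stated in the text.

```c
#include <stdint.h>

// Hypothetical cluster-peripheral memory map (placeholder addresses).
#define CLUSTER_PERIPH_BASE 0x40000000u
#define WAKE_UP_REG_OFFSET  0x0020u

// Wake up a sleeping core by writing its ID into the wake-up register,
// which triggers an inter-processor interrupt (IPI).
static inline void cluster_wake_up(uint32_t core_id) {
    volatile uint32_t *wake_up =
        (volatile uint32_t *)(uintptr_t)(CLUSTER_PERIPH_BASE + WAKE_UP_REG_OFFSET);
    *wake_up = core_id;
}
```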
Stream Semantic Register (SSR)
The SSR extension was first proposed by Schuiki et al. [18], [23]. This hardware extension allows the programmer to configure up to two memory streams with an affine address pattern of dimension N. The dimension N depends on the number of available loops (see Figure 3) and can be parameterized. Streamers are configurable using memory-mapped input/output (IO). Each streamer is only configurable by the integer core controlling the FP-SS. No other core can write the core-private configuration registers.
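For an N-dimensional stream configured with a base address and per-dimension strides and bounds (cf. Section 3.1), the generated addresses follow the usual affine form; the notation below is ours, not the paper's.

```latex
\mathrm{addr}(i_0, \dots, i_{N-1}) \;=\; \mathrm{base} \;+\; \sum_{k=0}^{N-1} i_k \cdot \mathrm{stride}_k,
\qquad 0 \le i_k < \mathrm{bound}_k
```

In the usual nested-loop interpretation, the innermost index advances on every register read or write of the associated SSR and carries into the next dimension when its bound is reached.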
The SSR module logically wraps around the floating-point RF. When activated by a write to a CSR, operations on the RF are intercepted iff the operands correspond to either ft0 or ft1 (which map to SSR lane 0 and lane 1, respectively). The reads or writes are redirected towards an internal queue. The core communicates with the SSR lane via a two-phase handshake: the core signals a valid request by pulling its read or write valid signal high; if data is available in the internal queue, the respective SSR lane signals readiness; finally, if the core decides to consume the register element, it pulls its done signal high. For this work, we have extended the SSR's configuration scheme [18] by adding shadow registers into which the core can already push the configuration of the next memory stream while streaming is still in progress. This allows overlapping loop-bound calculation with actual computation when using the FREP extension.
FPU Sequencer
The FPU sequencer, depicted in Figure 4 , is located at the off-loading interface between integer core and FP-SS. It can be configured using the frep instruction that provides the following information:
• is_outer: 1 bit indicating whether the whole kernel (consisting of max_inst instructions) or each individual instruction is repeated.
• max_inst: 4-bit immediate (up to 16 values), indicating that the next max_inst instructions should be sequenced.
• max_rep: register identifier that holds the number of iterations (up to 2^32 iterations).
• stagger_mask: 4 bits, one for each operand (rs1, rs2, rs3, rd). If a bit is set, the corresponding operand is staggered.
• stagger_count: 3 bits, indicating for how many iterations the stagger should increment before it wraps again (up to 2^3 = 8).

The frep instruction marks the beginning of a floating-point kernel which should be repeated, see Figure 5 (a). It indicates how many subsequent instructions are stored in the loop buffer, and how often and in which way (operand staggering, repetition mode) each instruction is going to be repeated. To illustrate this, we give two examples in Figure 5 (d). The first example sequences a block of two instructions a total of four times. The second example sequences two instructions three times each; here, the sequencing mode is inner, meaning that each instruction is sequenced three times before the sequencer steps to the next instruction in the block.
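The repetition modes can be summarized by a small software model of the sequencer's issue order, shown below; it follows the field description above, omits operand staggering (Section 3.3), and is an illustration rather than a description of the RTL.

```c
#include <stdio.h>

// Minimal software model of the FREP sequencer's issue order.
// is_outer = 1: repeat the whole block of max_inst instructions max_rep times.
// is_outer = 0: repeat each instruction max_rep times before moving to the next.
void frep_issue_order(int is_outer, int max_inst, int max_rep) {
    if (is_outer) {
        for (int rep = 0; rep < max_rep; ++rep)
            for (int inst = 0; inst < max_inst; ++inst)
                printf("issue instruction %d (iteration %d)\n", inst, rep);
    } else {
        for (int inst = 0; inst < max_inst; ++inst)
            for (int rep = 0; rep < max_rep; ++rep)
                printf("issue instruction %d (iteration %d)\n", inst, rep);
    }
}

// The two examples of Figure 5 (d): a block of two instructions sequenced four
// times, and two instructions sequenced three times each in inner mode.
// frep_issue_order(1, 2, 4);
// frep_issue_order(0, 2, 3);
```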
A particular challenge with floating-point instructions is that, in most cases, the FPU is pipelined, meaning that the computationally expensive floating-point operations have a latency of several cycles. If the sequencer sequences a short loop with data dependencies amongst its operands, the FP-SS will stall on these dependencies, thereby deteriorating performance, effective FPU utilization, and energy efficiency.
To mitigate the effects of stalling, the sequencer can modify the register operands selected by a stagger mask, incrementing them according to a stagger count. Figure 5 (c, d) demonstrates the sequencer's staggering capabilities. The first example (c) staggers the destination register and the second source register a total of two times. The second example only staggers the first source register a total of three times.
PROGRAMMING
Changing environments and requirements require a programmable system. To avoid overspecialization, we propose a system that is composed of many programmable and highly energy-efficient processing elements by leveraging widely applicable ISA extensions. At its foundation, the proposed system is a general-purpose RISC-V-based multi-core system. The system has no private data caches but offers a fast, energy-efficient, and high-throughput software-managed TCDM as an alternative. It can be efficiently programmed using a RISC-V toolchain, see Figure 6 (a). The hardware provides atomic memory operations as defined by RISC-V for efficient multi-core programs.
SSRs
We provide a small, header-only software library to program the SSRs efficiently. The programmer chooses the dimension of the stream and selects the appropriate library function; for each dimension, the programmer provides a stride, a bound, and a base address to configure the streamer. Finally, a write to the SSR CSR activates the stream semantics on registers ft0 and ft1. After the streaming operation finishes, the same CSR is cleared to deactivate the extension. The whole programming sequence for an example kernel is depicted in Figure 6 (c). The dot product kernel illustrates the speed-up of the SSR extension over the baseline implementation: the vanilla RISC-V implementation executes a total of six instructions in its innermost loop, of which three are integer and three are floating-point instructions, see Figure 6 (b). The SSR-enhanced version, on the other hand, elides all loads and only needs to track one loop counter to determine the loop termination condition. This saves three instructions and provides a 2× speed-up. The loop setup overhead is slightly higher; a detailed analysis can be found in the original SSR paper [18]. For this system, we have enhanced the SSRs with shadow registers for the loop configuration. The integer core can, therefore, already set up the next loop iteration and store the configuration in the shadow registers while the current iteration is still in progress. When the current iteration finishes, the SSR configuration logic automatically starts iterating over the new configuration.
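The programming sequence can be sketched as follows. The identifiers (ssr_loop_1d, ssr_enable, ssr_disable, SSR_DM0/SSR_DM1) are hypothetical stand-ins for the header-only library mentioned above, and the inline assembly reflects the general idea rather than the exact generated code.

```c
// Illustrative 1D-SSR dot product; library function names are placeholders.
double ssr_dot_product(const double *a, const double *b, long n) {
    ssr_loop_1d(SSR_DM0, n, sizeof(double), a);  // stream a[] into register ft0
    ssr_loop_1d(SSR_DM1, n, sizeof(double), b);  // stream b[] into register ft1
    ssr_enable();                                // CSR write: activate stream semantics

    double acc = 0.0;
    for (long i = 0; i < n; ++i) {
        // Every read of ft0/ft1 implicitly pops the next stream element,
        // so the loop body contains a single floating-point instruction.
        asm volatile("fmadd.d %0, ft0, ft1, %0" : "+f"(acc) : : "memory");
    }

    ssr_disable();                               // CSR write: clear stream semantics
    return acc;
}
```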
FPU Sequencer
The frep instruction configures the FPU sequencer to automatically repeat and autonomously issue the next n floating-point instructions to the FPU. This completely elides all loop instructions in the innermost loop, as the branch decision and loop counting are pushed into the sequencer hardware. For the dot product example, this leaves only one instruction in the innermost loop and provides a speed-up of 6× compared to the baseline and a 3× improvement over the plain SSR version of the kernel, see Figure 6 (f).

Pseudo Dual Issue

As the FPU sequencer frees the integer core from issuing instructions to the FP-SS, the integer core can continue executing floating-point-independent code. This makes the core pseudo dual-issue, see Figure 6 (f). For the same dot product kernel, we have also listed the corresponding RISC-V vector assembly as a comparison point, see Figure 7. Depending on the hardware's maximum vector length (VL) and the problem size, software needs to perform a strip-mine loop over the input data. For each iteration, the setvl instruction saves the number of elements processed by subsequent vector instructions into its destination register. The integer core performs bookkeeping and pointer arithmetic for each iteration. Of the ten instructions of the strip-mine loop, only five execute on the vector unit, of which only two perform arithmetic operations.

Figure 7: The same dot product kernel as in Figure 6 in RISC-V vector assembly [24]. The vector code is written independently of the vector length; software breaks the input problem size n down to VL in a strip-mine loop. Of the ten instructions in the strip-mine loop, five execute on the integer core while the other half execute on the vector unit.
Operand Staggering
The complex floating-point operations performed by the FPU require pipelining to achieve reasonable clock frequencies. Pipelining, on the other hand, increases the latency of floating-point instructions, which makes it impossible for one floating-point instruction to directly re-use the result of the previous instruction without stalling the pipeline. Depending on the speed target, we expect between two and four pipeline stages; the next operation would therefore need to wait the same number of cycles until its operand becomes available. Some of these stalls can be hidden by executing independent floating-point operations in the meantime. This technique requires partial unrolling of the kernel. To combine this efficiently with the FREP extension, we provide an option for the sequencer to stagger its operands. The staggering logic automatically increases the operand names of the issued instruction by one. The frep instruction takes an additional stagger mask and stagger count. The mask defines which registers should be staggered and contains one bit for each of the three source operands and the destination operand, four bits in total. If the corresponding bit is set, the FPU sequencer increases the register name by one on each iteration until the stagger count has been reached; the register name then wraps around again. The anatomy of the frep instruction, including a sample trace with staggering enabled, can be seen in Figure 5 (a).
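In software terms, staggering the destination and third source operand of an accumulating FMA is equivalent to the manual unrolling sketched below; the choice of four partial sums is ours, picked to match a four-deep FPU pipeline, and the hardware achieves the same effect by rotating register names instead of duplicating code.

```c
// Software view of a 4-way staggered accumulation: four independent partial
// sums break the FMA dependency chain; the FREP hardware obtains the same
// effect by incrementing (staggering) the rd/rs3 register name each iteration.
double staggered_dot_product(const double *a, const double *b, long n) {
    double acc[4] = {0.0, 0.0, 0.0, 0.0};          // stands in for, e.g., ft3..ft6
    for (long i = 0; i < n; ++i)
        acc[i % 4] += a[i] * b[i];                 // "staggered" accumulator
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);  // final reduction
}
```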
Software
The SSR and FREP extensions can be conveniently used with the provided header-only C library in an intrinsic-like style, similar to the RISC-V vector intrinsics currently under development [25]. Furthermore, a first Low Level Virtual Machine (LLVM) compiler prototype shows that automatic code generation for the SSR setup is feasible [18].
RESULTS
We have synthesized, placed, and routed an eight-core configuration with 128 KiB of TCDM and 8 KiB of instruction cache using SYNOPSYS DESIGN COMPILER 2017.09 and CADENCE INNOVUS 17.11 in a modern GLOBALFOUNDRIES 22 nm FDX technology. The floorplan of this cluster is depicted in Figure 8. For synthesis, we constrained the design to close timing at 1 GHz in worst-case conditions (SSG, 0.72 V, −40 °C). The subsequent place-and-route step was constrained to 0.7 GHz. Sign-off static timing analysis (STA) using SYNOPSYS PRIMETIME 2016.12 showed that the design runs at 755 MHz in worst-case conditions and 1.06 GHz in typical conditions (TT, 0.8 V, 25 °C).

Table footnotes: * pseudo-dual-issue behavior with an IPC higher than one. † Reduction of FPU utilization because of SSR setup and frequent resynchronization between FFT stages; we still show a speed-up of 2.8× (see Figure 12).
Microkernels
To evaluate the performance, power, and energy-efficiency of the architecture, we have implemented a set of different data-oblivious parallel benchmarks, where the control flow only depends on a constant number of program parameters. We selected four complementary kernels:
• dot product: A simple dot product implementation that calculates the scalar product of two arrays of length n.
• ReLU: This kernel applies a rectified linear unit (ReLU) to the elements of an array of length n.
• Matrix multiplication: A chunked implementation of a matrix multiplication of size n × n.
• FFT: Implementation of a parallel FFT algorithm of size n.

For each kernel, we provide a baseline C implementation (without auto-vectorization or special intrinsics), an implementation which makes use of SSRs, and one which combines SSRs and FREP. Speed-ups were measured in a cycle-accurate register transfer level (RTL) simulation.
Single-Core
Performance
The single pipeline stage of the core lets it achieve a very high IPC of close to one for most kernels. The only effective source of stalls is the memory interface, when a load-use dependency is present or when the load result contends for the single write port of the core's RF. The proposed ISA extensions, SSR and FREP, reduce the number of explicit load and store instructions as well as the branching overhead. For the above-mentioned microkernels, we report single-core speed-ups of over 6× on certain benchmarks (Figure 9). The single-core case presents an idealized execution environment as there is no contention on the shared TCDM. We observe interesting effects: the matrix multiplication benchmark achieves an IPC of more than one by overlapping the computation of one block with the SSR setup of the next block.
In Table 1 we track four metrics, the first being the total IPC. For the baseline case, this metric is interesting: due to the single pipeline stage and the tightly coupled memory subsystem, we achieve an IPC of one for almost every kernel in the single-core case. For the multi-core system, contention on the memory interface slightly limits the attainable IPC. This ensures a fair baseline for further evaluating our ISA extensions.
The single-issue nature of the baseline core limits the maximum achievable FPU utilization, as we need to explicitly move data from memory into the core's register file; it ranges from 0.14 to 0.36 depending on the benchmark. At the same time, we see a very high core utilization as the integer core is busy supplying the FPU with instructions.
The introduction of SSRs relaxes these constraints, as all loads and stores are translated into implicitly encoded register reads and writes. We can see a positive effect on execution time as we no longer use an issue slot (cycle) of the integer core for loads/stores. The high Snitch utilization shows that the integer core is still busy issuing arithmetic floating-point instructions to the FPU.
Finally, with the introduction of FREP, we significantly reduce the pressure on the integer core. The integer core only issues the floating-point operations once into the FREP buffer, from which they are sequenced multiple times to the FP-SS. We observe a very low integer core utilization of between 0.03 and 0.19. As we free the integer core from issuing floating-point instructions on every cycle, we can easily keep the FPU busy. This results in a very high FPU utilization of 0.57 to 0.93. A high FPU utilization, in turn, means high energy efficiency. For the single-core case, we see an improvement in speed-up (see Figure 9) and FPU utilization for all microkernels. The FFT benchmark shows a reduction in IPC as more frequent SSR setup and load-use dependencies insert stall cycles, which result in pipeline bubbles.

Figure 12: Multi-core speed-up for an octa-core cluster for each microkernel and enabled extension. We achieve speed-ups from 1.87× to 5.28×.
Area
The integer core ISA is configurable to be either RV32I or RV32E. Both support the same instructions but differ in the size of the RF: while RV32I comes with 32 general-purpose integer registers, RV32E only provides 16. As the CPU design is heavily dominated by the RF (see Figure 10), this design choice has a significant influence on the core's area. Furthermore, as mentioned in Section 2.1.1, we provide a latch-based and a flip-flop-based RF implementation; the former is 50 % smaller in area, while the latter can be used if latches are not available in the standard-cell library. Moreover, performance counters can be enabled separately, which adds approximately 2 kGE in area. Altogether this makes the core configurable from 9 kGE (RV32E, latch-based RF without performance counters) up to 21 kGE (RV32I, flip-flop-based RF with performance counters), see Figure 11. The SSR hardware consumes 16 kGE to implement address generation and control logic as well as load data buffering. This puts it at 12 % of the FP-SS and 8.5 % of the CC. The FREP extension, configured with 16 entries, takes up 13 kGE, which is 7 % of the FP-SS's area and 3.2 % of the overall system on chip (SoC).
Multi-Core
Performance
For the multi-core performance evaluations, we have instantiated an eight-core cluster with 8 KiB of instruction cache and 128 KiB of TCDM memory (see Figure 8). We have parallelized our kernels to distribute work evenly over all cores. As can be seen in Figure 12, we achieve speed-ups from 1.87× to 5.28× depending on the benchmark. As in the single-core case, we can use the proposed SSR and FREP extensions to elide explicit loads/stores and control flow instructions. In contrast to the single-core case (Figure 9), we observe a slight reduction in speed-up as operand values are potentially (temporarily) unavailable due to contention on the shared TCDM (SRAM bank conflicts), as well as the effects of Amdahl's law. Nevertheless, we observe a quasi-linear speed-up when scaling up to eight cores per cluster (Table 2). Furthermore, we achieve over 94 % FPU utilization for matrices of size 128 × 128. As can be seen in Table 3, we significantly outperform existing vector processors on small matrix multiplication problems, by a factor of 4.5. On larger problems we show equal or better performance.

Table 3: Normalized achieved performance between compute-equivalent Snitch cluster, Ara [15], and Hwacha [26] instances for a matrix multiplication with different n × n problem sizes.
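The text does not list the parallelization code; for these regular kernels, a straightforward even split of the iteration space over the cluster's cores is sufficient, as sketched below with hypothetical runtime calls for the core index, core count, and cluster barrier.

```c
// Even work split across the cluster's cores (snrt_* names are placeholders
// for a core-ID/barrier runtime, not an API defined in the paper).
void parallel_relu(double *x, long n) {
    long cores = snrt_cluster_core_num();       // e.g., 8
    long id    = snrt_cluster_core_idx();       // 0 .. cores-1
    long chunk = (n + cores - 1) / cores;       // ceiling division
    long begin = id * chunk;
    long end   = (begin + chunk < n) ? begin + chunk : n;

    for (long i = begin; i < end; ++i)
        x[i] = (x[i] > 0.0) ? x[i] : 0.0;       // ReLU on this core's slice

    snrt_cluster_barrier();                     // synchronize before results are reused
}
```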
The FFT benchmark demonstrates that the proposed ISA extensions are also applicable to less regular problems such as the FFT. While we see a decreased FPU utilization in the multi-core system (Table 2), we observe a total speed-up of 2.8×. The decreased FPU utilization is attributable to the less linear access pattern and the higher core synchronization frequency at each FFT stage, which in turn leads to higher contention as cores are forced to start fetching from the same memory bank at the same time upon each (re-)synchronization.
Area
While the impact of the FREP extension is confined to the CC, the SSR extension also has a cluster-level impact. With SSRs enabled, each core has two ports into the TCDM, increasing the area of the fully connected interconnect. In the selected implementation of an eight-core cluster, we have 16 request ports and 32 memory banks (providing a banking factor of two). With 155 kGE, the TCDM interconnect occupies 5 % of the overall area. The complexity of the crossbar scales with the product of its master and slave ports. We have estimated the complexity of a crossbar with 32 request ports and 64 banks to be around 630 kGE, and that of one with 64 request ports and 128 banks to be around 2.5 MGE.
Energy Efficiency and Power
We have selected a 32 × 32 matrix multiplication benchmark running on a post-layout netlist to give an indicative power break-down of the system's components (Figure 13). For the given benchmark, the cluster consumes a total of 171 mW, of which 63 % are consumed in the CCs, 5 % in the interconnect, and 22 % in the SRAM banks of the TCDM. 42 % of the energy is spent in the actual FPUs on the computation, while the integer control cores only use 1 % of the overall power.

Figure 13: Hierarchical power distribution estimates obtained using SYNOPSYS PRIMETIME 2016.12 at 1 GHz and 25 °C on a 32 × 32 matrix multiplication kernel using the proposed SSR and FREP extensions. All integer cores combined use only 1 % of the overall power. The hardware for the SSRs and the FREP extension uses less than 4 % and 1 % of the total power, respectively.
The additional hardware for SSR and FREP only makes up a fraction of the overall power consumption, less than 4 % and 1 %, respectively. What is particularly interesting is that the instruction cache only consumes 4.8 mW or 4 % of the total cluster power. This is due to the FREP extension servicing the FPU from its local loop buffer, and the Snitch integer core exhibiting a very low activity that can mostly be served from its L0 instruction cache, which has been implemented as a flip-flop-based memory and can hence be read and written much more cheaply, energy-wise, than SRAMs. The total power of all micro-benchmarks is given in Figure 14. As we only see a marginal increase in power for the given benchmarks but a significant improvement in execution speed and a high FPU utilization, we observe a corresponding increase in energy efficiency: Figure 15 shows a 1.9× to 3.2× increase in energy efficiency compared to the baseline. The system achieves an absolute peak energy efficiency of close to 80 DPGflop/s/W and 104 SPGflop/s/W.
To put the absolute energy efficiency into perspective, we estimated the achievable peak energy efficiency in 22 nm. Every architecture, even a highly specialized accelerator, must at least perform two loads and one FMA instruction for each element. We can, therefore, estimate an energy-efficiency upper bound of 120 DPGflop/s/W. Snitch achieves more than 66 % of this theoretical peak efficiency.
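For reference, the bound follows from charging every element with the minimum work any architecture must perform, two loads and one FMA (two flops); the notation is ours, and the per-operation energies stem from the authors' 22 nm characterization, which is not reproduced here.

```latex
\eta_{\mathrm{peak}} \;=\; \frac{2\,\mathrm{flop}}{2\,E_{\mathrm{load}} + E_{\mathrm{FMA}}}
```

With the quoted bound of 120 DPGflop/s/W, this corresponds to an energy budget of 2 E_load + E_FMA ≈ 2 / (120 × 10⁹) J ≈ 16.7 pJ per element.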
RELATED WORK
The problem of keeping FPU utilization high has been the subject of extensive architecture research. The most prominent and widely used techniques encompass super-scalar (out-of-order) general-purpose CPUs, (Cray-style) vector architectures, and general-purpose computing on GPUs. While these architectures promise to deliver high performance, they do not target energy efficiency as their primary design goal.
Vector Architectures
Cray-style vector architectures are enjoying renewed popularity, with ARM providing its SVE [14] and RISC-V actively developing a vector extension [24]. An early but complete implementation of the RISC-V vector extension in 22 nm, called Ara, has been realized by Cavalcante et al. [15]. The same technology node and configuration size allow for a direct comparison with our architecture. As a comparison point, we chose an eight-lane configuration that delivers a peak of 16 DPflop/cycle, equal to the octa-core cluster we have presented in the evaluation section. The vector architecture accelerates programs that work on vectored data by providing a single instruction which operates on (parts of) the vector. The instruction front-end of the attached core feeds the vector unit special vector instructions that can then independently operate on chunks of data from the vector register file. The vector register file is similar in size and access latency to the TCDM in a Snitch cluster. However, in stark contrast to the vector register file, our system allows accessing individual elements of the TCDM as it is byte-wise addressable. The vector architecture compensates for this by providing dedicated shuffle instructions, which, in contrast, consume precious instruction bandwidth and issue slots. As a consequence, the scalar core needs to issue many instructions to the vector unit, which potentially bottlenecks the instruction front-end; the vector architecture hence performs poorly on smaller and finer-granular problems (see Table 3). On smaller matrix multiplication problems, our architecture significantly outperforms the Ara vector architecture, by a factor of 4.5, as our TCDM interconnect and byte-wise access to the TCDM provide implicit shuffle semantics. On increasing problem sizes, the vector architecture catches up in performance, but we retain superiority even for larger problem sizes (see Table 3).
The rigid, linear access pattern imposed by the nature of vectors presents yet another problem: to compensate for the lack of access semantics into the register file, additional ISA extensions such as 2D and tensor extensions are needed to encode the more complicated access patterns. As the shape of the computation is encoded in the instruction, this significantly bloats the encoding space, which in turn makes the instruction front-end and decoding logic more complex and hence more energy-inefficient. In contrast, the SSR and FREP extensions provide up to four access dimensions in their current implementation. With the implicit encoding of loads/stores into register reads/writes, no new instructions are needed, and the instruction front-end and decoding logic are identical to those of the scalar core. Table 4 compares several figures of merit between Ara and a same-size Snitch system. Both systems offer the same number of floating-point operations per cycle at a comparable clock frequency. On the chosen problem size of a 32 × 32 matrix multiplication, our system offers more than 1.5× the sustained floating-point operations at twice the energy efficiency: almost 80 Gflop/s/W compared to 40 Gflop/s/W for Ara. Most of the energy efficiency gains come from the higher area efficiency and the much higher compute/control ratio. A comparable architecture to Ara is Hwacha [26], which suffers from similar limitations.
GPUs
GPUs have completely penetrated the market of general-purpose computing with their superior capability to accelerate the dense linear algebra kernels most prominently found in machine-learning applications. The key idea of General-Purpose Computation on Graphics Processing Units (GPGPU) is to oversubscribe the compute units using multiple parallel threads that can be dynamically scheduled by hardware to hide access latencies to memory. We have estimated the energy efficiency of an NVIDIA GPU using a Tegra Xavier SoC [27] development kit. The board allows for direct power measurements on the supply rails of both the GPU and the CPU. The Tegra SoC contains a Volta-based [31] GPU consisting of eight SMs, each of which in turn contains 32 double- and 64 single-precision FPUs. Each SM contains four execution units, each managing eight double-precision and 16 single-precision FPUs, which share a common register file and an instruction cache. Hence, such a quadrant is directly comparable to one Snitch cluster as presented here. Clock speeds of 1 GHz for Snitch and 1.38 GHz for the Volta SM are comparable, keeping in mind that the SM has been manufactured in a more advanced technology, see Table 4. In a high-level comparison, the Snitch system surpasses the SM in terms of energy efficiency by almost a factor of 2 on single-precision workloads. This comparison does not take technology scaling into consideration, which would further shift the result in favor of Snitch.
Super-scalar CPUs
The Tegra Xavier SoC also offers an eight-core cluster of NVIDIA's ARMv8 implementation called Carmel. The Carmel CPU is a 10-issue, super-scalar CPU including support for ARM's SIMD extension NEON. Each core contains two 128-bit SIMD FPUs that are fracturable into either two 64-bit, four 32-bit, or eight 16-bit units, offering a total of 8 double-precision flop/cycle, hence comparable to the presented octa-core Snitch cluster. The processor runs at a substantially higher clock frequency of 2.27 GHz at the expense of a much deeper pipeline, which in turn requires the processor to hide pipeline stalls by exploiting instruction-level parallelism (ILP) in the form of super-scalar execution, and a steep memory hierarchy to mitigate the effects of high memory latency. The increased hardware cost reduces the attainable area efficiency to only 1.26 DPGflop/s/mm². The losses in area efficiency have a direct influence on the energy efficiency of the system; again not accounting for technology scaling, we show more than a 10× improvement in energy efficiency for FP32 and 15× for FP64.
Recent developments in high-performance chips, such as Fujitsu's A64FX [32], clearly demonstrate that energy efficiency is becoming the number one design concern. The new Green500 [33] winner achieves 16.876 DPGflop/s/W system-level energy efficiency (including cooling, boards, and power supplies). Unfortunately, as we do not have access to such a system for detailed measurements, we cannot draw any meaningful direct comparisons.
CONCLUSION
We present a general-purpose computing system tuned for the highest possible energy efficiency on double-precision floating-point arithmetic. The system can be programmed using a standard RISC-V toolchain and offers an implementation of the RISC-V atomic extension (A) for efficient multi-core programming. We outperform existing state-of-the-art systems in energy efficiency by 3.5× by leveraging several ideas.
Small and efficient integer core: We aim to maximize the compute/control ratio by providing a small and agile integer core that can perform single-cycle control flow decisions and integer arithmetic, and combine it with a large FPU. The FP-SS decouples the integer/control flow from the floating-point operations; it operates on its own register file and provides its own FP LSU.
ISA extensions: We provide two minimal-impact ISA extensions, SSRs and FREP. The first makes it possible to set up a four-dimensional stream to memory from which the core can simply read/write using two designated register names. The FREP extension complements the SSR extension by further decoupling the issuing of floating-point instructions to the FP-SS: the integer core pushes RISC-V instructions into the previously configured loop buffer, which subsequently sequences those instructions to the FPU. This has two beneficial side-effects. First, while the micro-loop buffer feeds the FPU with instructions, the integer core is free to do auxiliary tasks, such as configuring direct memory access (DMA) transfers. Second, it relieves the pressure on the instruction cache, thereby saving energy.
Scratchpad memories: Explicit scratchpad memories instead of hardware-managed caches enable deterministic data placement and avoid suboptimal cache replacement strategies. The TCDM is shared amongst the cores of a cluster, making data sharing significantly more energy efficient as no cache coherence protocol is necessary.
The system achieves a speed-up of up to 5× on data-oblivious kernels while still being fully programmable and not over-specialized on one problem domain. The flexibility offered by the small, Turing-complete integer control unit makes it possible to adapt to a plethora of problems. Furthermore, we have shown that eight cores per cluster provide a good trade-off between speed-up and interconnect complexity. A future extension of the proposed SSR hardware could target improved efficiency for sparse linear algebra problems. Furthermore, extended benchmarking and improvements in the compiler infrastructure are exciting future research directions.
