The demand for floating-point compute power is ever growing. The domains of big data, machine learning, and scientific computing require a wide precision range and high operational intensity. The sheer number of operations, paired with the increased power density implied by technology scaling, makes it more important than ever to achieve maximum energy-efficiency for floating-point operations. In this work, we present Kosmodrom, our novel silicon solution in Globalfoundries 22 nm Fully-Depleted Silicon on Insulator (FD-SOI), which offers a multi-dimensional approach to trading off performance, energy-efficiency and power consumption. A variable-precision, dual-core RISC-V system together with a specialized floating-point accelerator forms the architectural basis. Different implementation strategies and standard cell flavors provide optimal solutions for different operating conditions, while supply voltage and forward body bias (FBB) enable a dynamic trade-off during operation. We provide a unique insight into the impact of a multitude of tuning parameters on reaching the optimal operating point on the power-performance surface. Kosmodrom achieves a peak energy-efficiency of 260 Gflop/s/W and up to 28 Gflop/s peak performance within a 6.2 mW to 400 mW power envelope.
I. INTRODUCTION
Computing has become much more than just an augmenter of science; it advances all areas of science and engineering [1] . The demand for more powerful computers is ever growing, and the High Performance Computing (HPC) community is targeting exascale machines (1 Eflop/s) by 2020 [2] . As Dennard scaling has ceased to be valid and Moore's law is slowing down, new innovations in HPC architectures are necessary. The main avenue of innovation for new HPC architectures will be energy-efficiency, a trend which can already be observed in the industry [3] , [4] . Higher energy-efficiency makes it possible to overcome the limits of Thermal Design Power (TDP) and to integrate more computational power into the same chip. Furthermore, as less heat is dissipated, less cooling power is needed, which is one of the dominating cost factors in data centers [5] .
For domains such as big data, machine learning and scientific computing, the dominant HPC workloads are IEEE-754 floating-point (FP) operations. Emerging trends, like machine learning (ML), exhibit a very regular and dense compute pattern which can be efficiently accelerated. Companies like Google and NVIDIA are building specialized hardware accelerators such as the Tensor Processing Unit (TPU) [6] and the TensorCores deployed in NVIDIA's V100 graphics cards [7] , which are solely dedicated to accelerating the regular parts of ML payloads in data centers. While accelerators can achieve very high energy-efficiency, they impose a challenge on the programming model. To circumvent this problem, vendors usually provide a framework, such as TensorFlow [8] , which makes the domain-specific languages and programming frameworks of accelerators transparent to the programmer. The main drawback of this model is the high specialization to the underlying problem, nowadays usually ML, which makes the accelerator unsuitable for other tasks. Another angle of attack to achieve higher energy-efficiency is to trade off FP precision for energy savings. Different applications place different requirements on the underlying FP precision. The emerging transprecision computing paradigm [9] aims at providing a complete framework for leveraging this trade-off, requiring a flexible and energy-proportional implementation in hardware.
Last but not least, technology will be a fundamental consideration for achieving the next level in energy-efficiency. Supply voltage and body bias (BB) voltage can act as additional tuning knobs to set the desired operating point and to dynamically trade off performance, energy-efficiency, and power. The characteristics of the transistor (e.g. threshold voltage and leakage power) will in turn have an impact on architectural decisions.
In this paper, we present Kosmodrom, which addresses these challenges and explores the manifold of different tuning knobs. Our contributions in particular are: 1) A novel architecture with a wide energy, performance and power trade-off surface of its computation engines, demonstrated on a GLOBALFOUNDRIES 22 nm FDX research demonstrator. 2) Detailed chip measurement results of the above silicon implementation under different operating conditions. 3) A first rigorous analysis and discussion of the trade-offs in architecture, implementation and operating conditions, and a quantification of the achievable core computational efficiency.
II. ARCHITECTURE
Kosmodrom contains three different processing engines which have each been tuned to tackle a particular set of FP problems. Two application-class RISC-V Ariane cores [10] take care of the general-purpose payload while a dedicated network training accelerator (NTX) [11] has been specifically designed for oblivious kernels such as Deep Neural Network training, scientific computing stencils, and general linear algebra workloads. The units share 1.25 MiB of L2 memory via a 64 bit AXI bus and a set of peripherals such as debug infrastructure, an on-chip BB generator and a Universal Asynchronous Receiver Transmitter (UART). The high-level floorplan of the chip and a block diagram are depicted in Fig. 1 . Each core and the NTX can be individually clocked and powered.
A. General-purpose Cores
The cores are general-purpose (RV64GC) 64 bit, 6-stage, in-order issue, out-of-order execute RISC-V cores. Each core contains 16 KiB of instruction cache, 32 KiB of data cache and a transprecision floating-point unit (TP-FPU). This FPU offers support for IEEE-754 double-, single- and half-precision FP formats, as well as two custom 16 bit and 8 bit formats. The two custom formats were explored in [12] and are henceforth called FP16alt and FP8.
The TP-FPU offers all standard RISC-V FP operations on the five formats, as well as single instruction multiple data (SIMD) operations for formats narrower than 64 bit. Conversions among all FP formats, including operations for packing and converting vectors, are provided. Operations on narrow FP formats come with super-linear energy-proportionality, i.e. it is energetically cheaper to operate on a vector of two 32 bit FP data than on one 64 bit datum. This is achieved through coarse-grained clock and data gating of mutually exclusive execution paths, rather than by reusing wide hardware units for narrow formats. To this end, the operational units for the different formats are implemented as separate, parallel blocks that can be turned off if unused, and are serviced by a single set of issue and output arbitration circuitry.
On Kosmodrom we provide two different flavors of the same core implemented in different cell libraries and tuned for different operating conditions:
1) Ariane High Performance (AHP): Tuned for high-performance applications. The L1 caches are implemented from single-ported, high-performance static random-access memories (SRAMs) and the standard cells used are 8-track, low and super-low threshold voltage transistors with gate lengths of 20 nm, 24 nm and 28 nm. The nominal supply voltage is 0.8 V.
2) Ariane Low Power (ALP): Tuned for light, single-threaded applications. The L1 caches are implemented from single-ported, low-power SRAMs and the standard cells used are 7.5-track, low-power, low and super-low threshold voltage transistors with gate lengths of 28 nm, 32 nm and 36 nm. The nominal supply voltage is 0.5 V.
B. Specialized Accelerator
NTX is a processing cluster dedicated to accelerating oblivious kernels such as Deep Neural Network training, scientific computing stencils, and general linear algebra workloads. It consists of one RISC-V core, eight NTX streaming co-processors, and a DMA engine for data movement. All units operate directly on 64 kB of shared tightly-coupled data memory via an interconnect with high aggregate bandwidth. NTX tackles the von Neumann bottleneck by moving data transfers and computation out of the RISC-V processor's instruction stream and into dedicated units (DMA and NTXs). As such, the processor merely orchestrates data movement and computation without itself being a bottleneck. Due to the highly regular nature of ML workloads, the RISC-V processor can focus on calculating addresses, dimensions, and computation schedules. Double buffering allows data transfers, NTX computation, and control tasks to fully overlap, leading to high FPU utilization and efficiency. In the worst-case corner (SSG, 0.72 V, −40/125 °C) the NTX co-processors and the data memory operate at 1.25 GHz, while the processor core and DMA engine are designed to operate at 625 MHz. This configuration allows the RISC-V core to "effectively issue" 32 floating-point operations, 16 local, and 4 global 32 bit memory accesses per cycle. The overall cluster achieves a peak compute performance of 20 Gflop/s and a peak bandwidth of 5 GB/s under these conditions. The NTX co-processor itself uses a custom 300+ bit partial carry-save fused multiply-accumulate (FMAC) data path compatible with the IEEE 754 32 bit floating-point format. This wide accumulator allows for a 1.7× lower Root Mean Squared Error than a conventional 32 bit FPU on long accumulations such as those found in DNN training. This gives the architecture a lot of flexibility and makes it highly suited for tasks beyond machine learning, such as the stencil operations and linear algebra commonly found in scientific computing.
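The peak compute figure quoted above can be cross-checked with simple arithmetic. The following is a minimal sketch, assuming that each FMAC counts as two floating-point operations per cycle (the usual convention, but not stated explicitly in the text); all names are illustrative.

```python
# Sanity check of the NTX cluster peak-throughput figures quoted in the text.
# Assumption (not stated in the text): one FMAC counts as 2 flop per cycle.

NUM_NTX = 8          # streaming co-processors per cluster (from the text)
F_NTX_HZ = 1.25e9    # co-processor clock in the worst-case corner (from the text)
F_CORE_HZ = 625e6    # RISC-V core and DMA clock in the same corner (from the text)
FLOP_PER_FMAC = 2    # assumed: one multiply plus one add


def peak_flops(num_units: int, freq_hz: float, flop_per_cycle: int) -> float:
    """Peak throughput in flop/s for num_units units issuing every cycle."""
    return num_units * freq_hz * flop_per_cycle


peak = peak_flops(NUM_NTX, F_NTX_HZ, FLOP_PER_FMAC)
# Operations that the slower control core "effectively issues" per core cycle.
flop_per_core_cycle = peak / F_CORE_HZ

print(f"peak: {peak / 1e9:.1f} Gflop/s")
print(f"flop per core cycle: {flop_per_core_cycle:.0f}")
```

Under this assumption the sketch yields 20 Gflop/s and 32 floating-point operations per core cycle, consistent with the numbers stated in the text.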
III. RESULTS
The design has been synthesized using SYNOPSYS DESIGN COMPILER 2017.09. The back-end design flow has been realized using CADENCE INNOVUS 17.11. The three designs have been individually hardened and full scan-chain insertion was performed on all three macros. Measurement results were obtained on the test chips using an ADVANTEST 9300 industry-grade Application-Specific Integrated Circuit (ASIC) tester.
The demonstrator allows us to explore and utilize the following trade-offs:
1) Core vs. Accelerator: We provide two general-purpose cores which can be used to run general multi-precision workloads or control-intensive tasks such as an operating system (OS). Specialized workloads such as ML can be off-loaded much more efficiently to the NTX. The NTX provides more than 6× the efficiency and more than 18× the performance of the general-purpose RISC-V cores. We provide a detailed energy and performance characterization of all three sub-designs in Fig. 6 .
2) Cell Library (Technology): AHP and ALP have been manufactured in two different cell libraries specifically tuned for different operating points. The AHP macro was designed for high-performance applications using fast, short-channel transistors, while the ALP was tuned for minimizing power consumption under light, single-threaded workloads that require low operating frequency. Hence, a slower cell library with lower leakage and long-channel transistors was used, which achieves higher energy-efficiency at lower voltage (where leakage power becomes a dominant factor). The gains in leakage are biased in our analysis due to the number of available cells in each library, which is significantly smaller for the 7.5-track library (2261 available cells) than for the 8-track library (7224 available cells). As a matter of fact, this increases the area of the ALP. Normalizing leakage power at low V_DD by area shows the advantage of reduced leakage: the ALP only consumes 0.36 mW/mm², which is significantly less than the 1.25 mW/mm² of the AHP (see Table I ).
3) Body Bias Voltage: The flip-well transistors of GLOBALFOUNDRIES 22 nm FDX used in Kosmodrom allow for forward body bias (FBB). Increasing FBB lowers the threshold voltage, which in turn increases the switching speed of the transistor at the expense of higher leakage power (see Fig. 5 ). The achieved speed-up is depicted in Fig. 4 . As can be seen in Fig. 5 (right), leakage power increases exponentially with FBB voltage. FBB can be used for fine-grained speed boost on a per-core basis 1 . Furthermore, Fig. 5 (right) shows the advantage of the longer-channel transistors used in the ALP compared to the AHP: at 1.4 V FBB, they show approximately 7× less leakage power. Our measurements hence confirm that FBB can be used for (i) frequency centering at low voltage for the ALP processor with an affordable increase in leakage, and (ii) an extra frequency boost at high voltage for the AHP processor, when performance is essential and leakage is not a major concern.
4) Supply Voltage: The AHP was implemented using a high-performance 8-track library and achieves 885 MHz at 0.8 V nominal. It can be boosted up to 1.6 GHz at 1.2 V, see Fig. 3 . The NTX has been implemented with the same libraries as the AHP and is carefully pipelined to achieve 2× the frequency of the AHP core at the nominal 0.8 V.
In boost mode it achieves 2 GHz, limited by the maximum on-chip frequency achievable by our frequency locked loop (FLL) clock generator. In contrast, the ALP processor was tuned for always-on background tasks that do not require high performance. Hence, the implementation targets high energy-efficiency at low frequency, with low operating voltage and leakage. The 7.5-track library and the constraints used in implementation minimize the number of short-channel and low-threshold cells, thereby greatly reducing leakage. The nominal design point for the ALP is 0.5 V, where it achieves 175 MHz. At low supply voltages (0.5 V) the ALP remains efficient thanks to its low-leakage implementation (0.36 mW/mm² vs. 1.25 mW/mm² for the AHP), while both cores achieve the same speed of around 175 MHz. At higher voltages the ALP is less efficient than the AHP; hence the two cores have complementary strengths and together they span a very wide range of operating conditions with high overall energy-efficiency.
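To put the AHP boost figures in perspective, a first-order estimate can be derived from the standard CMOS dynamic-power model. This is a sketch under stated assumptions, not a measured result: it ignores leakage and assumes that the switched capacitance C and the activity factor α stay constant between the two operating points. The symbols E_dyn, P_dyn, α and C are introduced here for illustration only.

```latex
% First-order dynamic energy and power (leakage ignored, \alpha and C constant)
E_{\mathrm{dyn}} \propto \alpha C V_{DD}^{2}, \qquad
P_{\mathrm{dyn}} \propto \alpha C V_{DD}^{2} f

% AHP: 0.8 V / 885 MHz (nominal)  ->  1.2 V / 1.6 GHz (boost)
\frac{E_{\mathrm{dyn}}(1.2\,\mathrm{V})}{E_{\mathrm{dyn}}(0.8\,\mathrm{V})}
  = \left(\frac{1.2}{0.8}\right)^{2} = 2.25, \qquad
\frac{f(1.2\,\mathrm{V})}{f(0.8\,\mathrm{V})}
  = \frac{1600\,\mathrm{MHz}}{885\,\mathrm{MHz}} \approx 1.81

\frac{P_{\mathrm{dyn}}(1.2\,\mathrm{V})}{P_{\mathrm{dyn}}(0.8\,\mathrm{V})}
  \approx 2.25 \times 1.81 \approx 4.07
```

Under these assumptions, the boost buys roughly 1.8× the frequency for about 2.25× the dynamic energy per operation and about 4× the dynamic power, which illustrates why the boost operating point trades efficiency for performance.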
5) The FP Precision and Energy Trade-Off: The TP-FPU present in the Ariane core allows for trading off FP precision for gains in instruction energy [13] . The energy cost of FP operations in the TP-FPU is superlinearly proportional to the width of the data format, as shown in Fig. 2 . Operations on narrow FP formats also take fewer cycles to complete, leading to faster completion for applications that cannot optimally fill the execution unit pipelines. Furthermore, leveraging SIMD-style vectors for narrow FP operands yields higher hardware throughput and thus more work done in the same number of application cycles. As such, leveraging narrow FP formats on the general-purpose cores can significantly improve both energy to solution and time to solution, by up to 7.94× and 7.6×, respectively, for FP8 workloads [13] .
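The throughput side of this trade-off can be illustrated by counting SIMD lanes per register. The sketch below assumes 64 bit FP registers (as on an RV64 core with double-precision support) and an FP64 scalar baseline for the quoted time-to-solution gain; both assumptions, and all names, are illustrative rather than taken from the text.

```python
# SIMD lane counts for the five TP-FPU formats, and how close the reported
# FP8 time-to-solution gain comes to the lane-count bound.
# Assumptions (not stated in the text): 64 bit FP registers and an FP64
# scalar baseline for the quoted improvement factor.

REGISTER_BITS = 64
FORMAT_BITS = {"FP64": 64, "FP32": 32, "FP16": 16, "FP16alt": 16, "FP8": 8}

REPORTED_FP8_TIME_GAIN = 7.6  # time-to-solution improvement quoted in the text


def simd_lanes(fmt: str) -> int:
    """Number of operands of format fmt that fit in one FP register."""
    return REGISTER_BITS // FORMAT_BITS[fmt]


lanes = {fmt: simd_lanes(fmt) for fmt in FORMAT_BITS}
# Fraction of the lane-count bound reached by the reported FP8 speed-up.
fp8_fraction_of_bound = REPORTED_FP8_TIME_GAIN / lanes["FP8"]

for fmt, n in lanes.items():
    print(f"{fmt:8s} {n} lane(s)")
print(f"FP8 speed-up reaches {fp8_fraction_of_bound:.0%} of the lane-count bound")
```

Under these assumptions, the reported 7.6× speed-up corresponds to 95% of the eight-lane bound for FP8.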
A. Summary
Overall, the five tuning parameters discussed above allow us to explore a wide range of operating points, from low power to high performance and high efficiency. The provided heterogeneity allows us to process workloads efficiently without excessive overspecialization. A typical usage scenario would be for the AHP to run an OS. The high speed of the processor would allow for accelerating the highly sequential code found in OSs and single-threaded applications. The ALP can be used for non-critical OS or application threads, and in very light load conditions where threads can be run at very low frequency. The absolute efficiencies of AHP and ALP are very similar, as the transistor characteristics mainly differ in the reduced leakage power due to the longer-channel transistors found in the ALP. Finally, the NTX is used to efficiently accelerate stream-based workloads, i.e. highly data-parallel workloads with non-data-dependent memory access sequences, such as the dense linear algebra routines used in NN training and inference, and stencil computations. Thanks to its extremely high energy-efficiency, the NTX can also be run at higher speed than the AHP before hitting thermal limits.
1 (using V_DD is not affordable due to the cost of voltage regulators and level shifters), when execution of a sequential thread on a core is timing-critical. It can also be used for ensuring constant operating frequency vs. temperature and process variation. Fig. 4 demonstrates that a frequency tuning range of 44%, 27%, 23% is achievable on AHP, ALP and NTX respectively.
B. Comparison to State of the Art
In Table II we compare our three processing engines (AHP, ALP and NTX) to leading industry-strength architectures such as the ARM Cortex A53 [14] and another open-source RISC-V core called Rocket [15] . For both the ALP and the AHP we achieve higher energy-efficiencies. The area efficiency is slightly worse compared to Rocket, as we include a 32 KiB data cache compared to Rocket's 16 KiB data cache. The area difference between ALP and AHP is an artifact of the less mature cell library used for the implementation of the ALP (2261 cells available vs. 7224 in the more mature library used for the AHP). Similarly, the energy-efficiency of the ALP is penalized, as the increased area also implies larger total leakage power and reduces the efficiency at low voltages. We expect a much clearer energy-efficiency offset in favor of the ALP at lower voltage with future versions of the 7.5-track library that provide a similar number of library cells for implementation. Furthermore, we compare the NTX to a Tesla V100, over which we achieve a 2× gain in energy-efficiency. Accounting for technology scaling, our gain would be even higher.
In summary, Kosmodrom achieves a peak energy-efficiency of 260 Gflop/s/W and up to 28 Gflop/s peak performance within a 6.2 mW to 400 mW power envelope thanks to the NTX accelerator for 32 bit and the Ariane cores for full transprecision workloads, demonstrating record-breaking core efficiencies. Our contribution is a detailed power, performance, and efficiency trade-off analysis. The entire architecture has been open-sourced and is available for download 2 . All processing engines will be fully integrated into larger systems with a complete memory hierarchy and high-speed interfaces in upcoming developments planned within the European Processor Initiative.
