Radio Resource Management (RRM) in 5G mobile communication is a challenging problem for which Recurrent Neural Networks (RNN) have shown promising results. Accelerating the compute-intensive RNN inference is therefore of utmost importance. Programmable solutions are desirable for effective 5G-RRM to cope with the rapidly evolving landscape of RNN variations. In this paper, we investigate RNN inference acceleration by tuning both the instruction set and microarchitecture of a micro-controller-class open-source RISC-V core. We couple HW extensions with software optimizations to achieve an overall improvement in throughput and energy efficiency of 15× and 10×, respectively, w.r.t. the baseline core on a wide range of RNNs used in various RRM tasks.
I. INTRODUCTION
Radio Resource Management is challenging as it aims at achieving maximum utilization of the limited publicly available frequency bands [1] under highly heterogeneous traffic (e.g., tiny sensor nodes vs. mobile routers) and rapidly varying radio signal propagation conditions. Notably, RRM tasks have to be executed within milliseconds, which excludes compute-intensive algorithms [2]. Presently, 5G applications impose strict new requirements on radio communication systems: 1) very high reliability and low latency for autonomous vehicles, 2) very high bandwidth for video telephony and virtual reality, and 3) massive machine-to-machine communication for the Internet of (Every)-things. These challenging requirements call for extending the existing cellular network with more antennas, improving antenna efficiency, and more effective RRM. Therefore, more advanced allocation algorithms are required to distribute limited resources (e.g., frequency bands, transmit power, data rates) to mobile clients efficiently.
Typically, RRM problems have been modeled with full observability and solved as convex problems with traditional optimization approaches. Exhaustive search methods led to very high computation costs [3], and sub-optimal solutions based on Lagrangian relaxation, iterative distribution optimization, and other heuristic approaches had convergence issues and lacked guarantees [3]. Traditional methods like the weighted sum-rate MSE algorithm [4] and fractional programming [5] are iterative, and most of them need to perform complex operations (e.g., matrix inversion or SVD) in every single iteration.
It is, therefore, extremely challenging to push these methods to the throughput and scale required for 5G-RRM. Recently, neural networks have gained increasing attention for 5G-RRM. At the physical layer, RNNs have been used to compensate for imperfections and nonlinearities and for collision detection in the RF domain [6], [7]. This is becoming even more important for high-frequency communication, where absorption starts to depend strongly on the environment, and for ultra-dense cell networks, where cross-tier interference has to be compensated [8]. At the data-link layer, which is responsible for resource allocation including dynamic resource scheduling of frequency bands, dynamic range, and handover control, classic multi-layer perceptrons [9]-[12], (recurrent) Long Short-Term Memories (LSTM) [13], [14], and Convolutional Neural Networks [15] have been used. Reinforcement learning-based deep Q-Learning networks [16] have been used for several typical RRM problems like dynamic spectrum access utilization [9], [14], [17], power level selection [9], [10], [12], rate control [10], and time-slotted optimization [11].
These networks are less computationally demanding than classical RRM algorithms, but they are far from trivial. Specialized and efficient stand-alone neural network accelerators have been presented recently [18]. Nevertheless, hardwired RNN accelerators cannot cope with the flexibility requirements found in a typical RRM setting, as base stations typically stay in the field for a very long time, while RRM algorithms are rapidly evolving. To retain flexibility, FPGA-based acceleration has been explored for RNN inference. For instance, LSTM acceleration on FPGAs achieving up to 13 GMAC/s/W has been presented in [19], [20]. To further increase efficiency, compression techniques (e.g., block-circulant weight matrices, pruning with zero-skipping [19], [20]) have been applied, and a top (effective) energy efficiency of 82 GMAC/s/W on a Xilinx Zynq-7100 FPGA has been presented [20]. Nevertheless, these compression schemes have not yet been proven to work for the networks used in the RRM field, and FPGAs have a cost envelope that is not compatible with massive and dense deployment, as required in 5G networks.
To address these intertwined flexibility, efficiency, and cost challenges, we propose to enhance the open and royalty-free RISC-V ISA and leverage the availability of high-quality open-source cores based on this widely supported ISA. We demonstrate (and open-source) a micro-controller-class RISC-V core with RNN enhancements for RRM acceleration, and we couple hardware extensions with software optimization. We achieve an energy efficiency of 218 GMAC/s/W and a throughput of 566 MMAC/s, which is an improvement of 10× and 15×, respectively, over the baseline open-source core. Such an order-of-magnitude boost is obtained thanks to data reuse with output feature map tiling (1.9×), adding custom activation instructions (13% within LSTMs), merging load and compute (1.13×/1.7×), and input FM tiling (5%).
The proposed extensions maintain backward compatibility with the baseline RISC-V ISA, and have a very small overhead (3.4%) in area and no increase in the longest path. Improvements are consistently achieved over a quite diverse set of RNNs used for various RRM tasks, thereby confirming the flexibility of our approach.
II. RELATED WORKS
A. ML Compute Platforms
With the machine learning revolution, a variety of ML compute platforms have been presented in industry and academia, spanning from high-performance server accelerators (e.g., Google's TPU cores) to embedded platforms (e.g., Nvidia Jetson Xavier) to stand-alone application-specific accelerators [18]. We are not aware of any RNN acceleration engine targeting RRM applications. General-purpose processors have been extended with new matrix and vector extensions to handle the common compute patterns in neural networks. In the Advanced Vector Extensions AVX-512 of the x86 ISA, Intel added the VNNIW instruction extension, which includes 16×32-bit SIMD vector operations for efficient convolution kernels in half-precision float (FP16) with accumulation in single-precision float (FP32) and, since Cascade Lake (2019), the fixed-point version (VNNI) with 8-bit (e.g., VPDPBUSD) and 16-bit (e.g., VPDPWSSD) vector products with 32-bit accumulation [21]. The AArch64 Neon extensions in the ARMv8-A processor series provide special SIMD instructions for sum-dot-products (e.g., BFDOT) and 2×2 matrix-matrix multiplications (e.g., BFMMLA) with 2-way SIMD in the brain floating-point format bfloat16. Recently, ARM presented the M-profile Vector Extension MVE (Helium) for their embedded processor family Cortex-M. Helium instructions feature computations in various SIMD formats (INT8/16/32, FP16/32), hardware loops, and interleaved post-increment loads/stores [22]. However, Intel typically focuses on the high-performance, high-cost processor market, and the Helium extensions are not yet available in HW implementations.
Besides ISA extensions, highly optimized SW kernels have been developed to exploit these instructions. These include parallel SIMD computations (e.g., 16-bit [23], 8-bit [24]) and data reuse with appropriate tiling. Tiling helps to reduce data loads from memory by reusing data held in the local register file. Output FM tiling (OFM), where several outputs are calculated in parallel and input FM loads can be shared, has been commonly used (e.g., [23], [24]). Furthermore, convolutional layers can be reformulated as matrix-matrix multiplications with the im2col technique [25]. This allows tiling both the input and output FM spatially in m × n-sized tiles and thus reduces the number of loads from O(mn) to O(m + n), as both weights and input FM pixels can be reused. Previous work has mainly focused on and reported results for CNNs [23], [24]. Still, this two-dimensional tiling cannot be applied to (non-convolutional) LSTMs and linear layers, which are the main network kernels used in RRM applications.
Neural networks are commonly trained in floating-point format. Recently, however, it has been shown that integer-aware training allows the use of more energy- and area-efficient fixed-point arithmetic without any significant accuracy drop, especially with 16-bit quantization [26], but even with eight and fewer bits [27]. Finally, RNNs use transcendental activation functions, which are computationally complex. Four approaches have previously been used to accelerate the computation of these functions: piecewise linear approximation (PLA) [23], low-order Taylor series expansion (e.g., 2nd order [28]), LUT with adaptive value granularity [29], or a small neural network [30]. We use a PLA approach, but differently from previous work, we exploit the symmetry property of tanh and sigmoid, we take into account fixed-point quantization, and we evaluate in detail the error introduced by different numbers of interpolation intervals, rather than selecting a high number of intervals (i.e., 128 in [23]).
B. RISC-V and RI5CY
The RISC-V ISA [31] has recently become the de facto standard among open-source and free instruction set architectures. RISC-V provides plenty of encoding space for extensions and is therefore suitable for application-driven processor customization while maintaining compatibility with the baseline ISA. In this work, we rely on RI5CY [32], a high-quality, silicon-proven, open-source core supporting the standard RISC-V RV32IMFC ISA (including integer, integer multiplication, single-precision floating-point, and compressed instructions). Additionally, RI5CY supports the Xpulp ISA extensions featuring extended fixed-point support (e.g., on-the-fly re-quantization and saturation), SIMD instructions, post-increment stores and loads, and hardware loops.
C. Benchmark Suite and Neural Networks
We have selected an application benchmark consisting of 10 neural networks which have been presented recently in the RRM domain. These networks differ in network type (fully-connected neural layers [2], [3], [9], [11], [12], [17], [33], Long Short-Term Memories [13], [14], Convolutional Neural Networks [15]), learning method (supervised [2], [13], [15], [33], reinforcement-based [9], [11], [12], [14], [17], unsupervised [3]), application (cellular networks [3], [13], peer-to-peer communication [14], wireless communication systems [11], [12], [15], [17], [33], wired communication [2]), and optimization metric (throughput [2], [3], [11]-[15], [17], [33], fairness [13], [14], latency [9], energy efficiency [15]). A detailed description of the networks can be found in the project report [34].
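Three main ML kernels are used within these networks: fully-connected layers (or Multi-Layer Perceptron, MLP), Long Short-Term Memories (LSTM), and Convolutional Neural Network (CNN) layers. A fully-connected layer connects all inputs (neurons) $x \in \mathbb{R}^m$ to all outputs (neurons) $o \in \mathbb{R}^n$ and is described by the following matrix-vector multiplication with the corresponding weight matrix $W \in \mathbb{R}^{n \times m}$: $o = b + Wx$. LSTMs are recurrent networks able to learn time series and are described by m input neurons, n internal memory cells $c_t$, n hidden states $h_t$, and the corresponding matrix-vector multiplications, point-wise vector-vector additions and multiplications (i.e., Hadamard product $a \circ b = (a_i \cdot b_i)_i$), and point-wise application of sigmoid and hyperbolic tangent activation functions. In the standard LSTM formulation these read:
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$h_t = o_t \circ \tanh(c_t)$$
Here the weight matrices $W_f, W_i, W_o, W_c \in \mathbb{R}^{n \times m}$ and $U_f, U_i, U_o, U_c \in \mathbb{R}^{n \times n}$ and the bias vectors $b_f, b_i, b_o, b_c \in \mathbb{R}^n$ are the learned parameters.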
Finally, CNN layers exploit the translation invariance in the data (e.g., in images) and map n input channels $i_n \in \mathbb{R}^{h_{im,in} \times w_{im,in}}$ of size $h_{im,in} \times w_{im,in}$ to k output channel maps of size $h_{im,out} \times w_{im,out}$ by applying $h_k \times b_k$-sized convolution filters $w_{k,n} \in \mathbb{R}^{h_k \times b_k}$ to every input channel for every output channel.
III. HW/SW EXTENSION AND OPTIMIZATIONS
A. Baseline Implementation (SW)
We have developed a straightforward implementation (e.g., organizing matrix-vector multiplication as a doubly nested loop over all inputs and outputs) of all required network kernels in C, where weights and data values are encoded in 16-bit fixed-point format (i.e., Q3.12). This format offers a good compromise between accuracy/robustness and energy efficiency/throughput, and most importantly does not require the fixed-point-aware retraining that would be necessary for smaller bit-widths. The C implementation is compiled with standard GCC 7.1.1 for the RISC-V RV32IMFC ISA and was run on the RI5CY core. The instruction count for the entire benchmark suite is shown in Tab. Ia and is used as the baseline for further comparisons.
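The baseline kernel structure can be illustrated with a short behavioural model. This is a sketch only, written in Python rather than the C used in the paper; the helper names `to_q312` and `fc_layer_q312`, the bias alignment, and the saturation on write-back are assumptions made for illustration and are not taken from the original implementation.

```python
FRAC_BITS = 12  # Q3.12: 1 sign bit, 3 integer bits, 12 fractional bits


def to_q312(x):
    """Quantize a float to a 16-bit Q3.12 integer with saturation (assumed helper)."""
    q = int(round(x * (1 << FRAC_BITS)))
    return max(-32768, min(32767, q))


def fc_layer_q312(weights, bias, x):
    """Compute o = b + W x as a doubly nested loop over outputs and inputs.

    All arguments hold Q3.12 integers. Products are accumulated at full
    precision (24 fractional bits) and re-quantized once per output.
    """
    out = []
    for co in range(len(weights)):
        acc = bias[co] << FRAC_BITS          # align bias to the accumulator format
        for ci in range(len(x)):
            acc += weights[co][ci] * x[ci]   # one MAC, two loads in the real kernel
        y = acc >> FRAC_BITS                 # back to Q3.12
        out.append(max(-32768, min(32767, y)))
    return out
```

Every MAC in the inner loop needs one weight and one input value, which is the load traffic that the later optimizations reduce.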
B. SIMD, HWL and post-increment load (HW)
As a first optimization step, we rewrote each kernel to exploit the Xpulp extensions as much as possible. The 16-bit data (weights and inputs) are packed into the packed SIMD vector format (i.e., v2s), allowing the compiler to map every two subsequent input FMs p(2c_i) and p(2c_i + 1) and the corresponding weights w(c_o, 2c_i) and w(c_o, 2c_i + 1) to two MACs using a single pv.sdotsp.h instruction, without the need for custom intrinsics.
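The effect of the packed sum-dot-product can be sketched with a small behavioural model. The helper names `pack_v2s` and `sdotsp_h` are hypothetical, and the model is an assumption about the arithmetic only: Python integers are unbounded, so wrap-around of the 32-bit accumulator is not modelled.

```python
def pack_v2s(lo, hi):
    """Pack two signed 16-bit values into one 32-bit word (low half first)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def _s16(v):
    """Interpret the low 16 bits of v as a signed 16-bit value."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def sdotsp_h(acc, a, b):
    """Model of a 16-bit packed sum-dot-product: two MACs on one word pair."""
    return acc + _s16(a) * _s16(b) + _s16(a >> 16) * _s16(b >> 16)
```

With this packing, a row of 2k input values is processed in k accumulate steps instead of 2k.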
The next optimization is to reduce the overhead of loop control instructions in the small loop bodies typical of such operations by using the hardware loops that are part of the Xpulp extensions. The hardware loop does not use any additional instructions during loop execution, but requires a loop setup instruction (i.e., lp.setup) to set three registers: a loop counter (rB), the loop start (PC+4), and the loop end (PC+rA). When the PC reaches the loop end, the controller decrements the loop counter and jumps back to the loop start until the loop counter reaches zero. The final optimization is to take advantage of the post-increment load-word instruction (i.e., lw!) to increment the data pointer for weights and input feature maps while executing the load, thereby saving a separate addi instruction. Combining these three techniques results in a 4.4× reduction in the number of instructions executed w.r.t. the unmodified RISC-V IMC baseline, as can be seen in Tab. Ib.
C. Output Feature Map Tiling (SW)
To compute one MAC, two loads from memory are needed: one for the weight and one for the value of the corresponding input neuron. Fortunately, the loaded input value can be reused for several outputs. The output features are therefore organized in tiles of N output channels, and the contribution of the current input neuron is calculated for all output neurons of the tile. These partial sums can be kept in registers and are not written back to memory until all input activations have been weighted and accumulated. Algorithm 1 gives an overview of the implementation and scheduling of the output FM tiling. The load of one input FM can thus be shared by N pl.sdotsp instructions (each executing 2 MAC operations on 16-bit operands), and thus just O(1 + 1/N) loads are needed per compute operation. This holds until the available registers are exhausted and data has to be spilled to the stack; furthermore, the compiler can hide the load latency by rearranging the instructions. Previous work has shown that the tiling can be extended to the feature level in the case of a convolutional layer if the input feature map is rearranged and replicated (i.e., im2col) such that the convolution becomes a matrix-matrix multiplication [23], [24].
In this paper, we focus mainly on the optimizations for LSTMs and MLPs, as these network kernels are mostly used in the selected RRM benchmark suite and have not been discussed in previous work. As can be seen in Tab. Ic, the optimal tiling brings an additional improvement of 1.89× on the RRM benchmark.

Algorithm 1. Output FM tiling (excerpt)
    …
    for all input channels i_l ∈ c_in do
        temp_in = Mem(i_l)
        # unroll following loop
        for all output channels o_k in tile c_out do
            w = Mem(w_{o_k, i_l})
            …
        end for
    end for
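The load sharing behind output FM tiling can be made concrete with a small model that counts memory loads. This is a sketch under stated assumptions: `fc_layer_ofm_tiled` and its `tile` parameter are hypothetical names, plain integers stand in for packed Q3.12 words, and register pressure (spilling to the stack) is not modelled.

```python
def fc_layer_ofm_tiled(weights, x, tile=4):
    """Matrix-vector product with output FM tiling; also counts memory loads."""
    n_out, n_in = len(weights), len(x)
    out = [0] * n_out
    loads = 0
    for o0 in range(0, n_out, tile):
        o1 = min(o0 + tile, n_out)
        acc = [0] * (o1 - o0)            # partial sums stay "in registers"
        for ci in range(n_in):
            xi = x[ci]                   # one input load shared by the whole tile
            loads += 1
            for k in range(o1 - o0):     # unrolled in the real kernel
                acc[k] += weights[o0 + k][ci] * xi
                loads += 1               # one weight load per MAC
        out[o0:o1] = acc                 # written back once per tile
    return out, loads
```

For an 8×6 layer, a tile of four outputs needs 60 loads for 48 MACs (1 + 1/4 per MAC), whereas a tile of one needs 96 (two per MAC).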
The results are shown in Tab. Ic and Fig. 3. For most of the networks, the execution cycles improve by between 1.79× [11] and 1.87× [17], but networks with small FMs suffer from high overhead and therefore see less speedup (1.07× [33] and 1.30× [14]).
Overall, we obtain a speedup of 15× over the RISC-V IMC baseline thanks to: 4.4× from using SIMD and HWL from the Xpulp extension, 1.9× from OFM tiling, 1.7× from merging load and compute, and 4.7% from IFM tiling.
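As a quick consistency check on these figures, and assuming the individual gains compose multiplicatively (with the 4.7% IFM tiling gain written as a factor of 1.047), the product reproduces the overall speedup:

```latex
\[
4.4 \times 1.9 \times 1.7 \times 1.047 \approx 14.9 \approx 15
\]
```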
D. Tanh and Sigmoid Extension (HW)
Sigmoid and hyperbolic tangent are common activation functions in neural networks and are used in LSTMs. The piecewise linear approximation technique can be implemented for these functions in SW, with the number of cycles growing with the required precision. This can be a major contribution to the overall calculation in LSTM-based networks. For example, the calculation of tanh/sig requires 10.3% in [13] and 33.6% in [14] of the overall computation cycles. We therefore introduce two single-cycle instructions, pl.tanh rD, rA and pl.sig rD, rA. The underlying functions have the following useful properties: 1) They are continuous and smooth (i.e., their derivatives are continuous, too); thus, the error is bounded for a fixed interval in a Taylor series expansion even for degree one (i.e., tanh(x_0 + ε) ≈ tanh(x_0) + tanh'(x_0) · ε). 2) The functions converge fast to either 0, 1, or −1, so interpolation is needed only on a limited range of numbers. 3) Both functions are symmetric around 0 (i.e., tanh(−x) = −tanh(x) and sig(−x) = 1 − sig(x)); thus, just the positive number range needs to be interpolated, and the negative range can be derived from the positive values.
Alg. 2 shows the pseudo-code that was used for the hardware implementation of the proposed interpolation. First, we choose the number of intervals M and the size of every interval 2^N, so that the interpolation range is ±M · 2^N. For both functions f ∈ {tanh, sig}, two M-entry LUTs lut_m_f[·] and lut_q_f[·] are defined. Then the absolute value is calculated (line 2) and the index is obtained by a right shift of the absolute value by N places. If the index is M or larger, the input is considered to be in the convergence area and one of {−1, 0, 1} is returned. Otherwise, the value is calculated by linear approximation within the selected interval id (line 8), sign-inverted for negative values (line 9), and subtracted from 1 for negative values in the sigmoid case (line 10).
We evaluate the proposed piecewise linear approximation with different numbers of intervals and interpolation ranges, taking into account that fixed-point operations in the Q3.12 format are used. The result of this evaluation is illustrated in Fig. 2. For the actual implementation, we have selected an interpolation range of [−4, 4] and 2^5 = 32 intervals, which produces an MSE of 9.81 · 10^−7 and a maximum error of ±3.8 · 10^−4 when compared to the full-precision hyperbolic tangent function. Evaluation on the quantized RNN benchmarks shows no deterioration of the end-to-end error when replacing the activation function with our proposed interpolation, which is not surprising as neural networks are known to be robust against noise. This extension reduces the cycle count from 51.2 to 44.5 kcycles within the LSTM networks [13], [14], resulting in a 13.0% improvement.
Fig. 1. RNN RISC-V core with extensions to the RI5CY core [32] in blue and datapath for the pl.sdotsp instruction marked in bold.
E. Load and Compute VLIW instruction (HW)
Analyzing the cycle counts in Tab. Ic, we can see that the lw! and pv.sdotsp.h instructions dominate. We therefore introduce a new instruction, pl.sdotsp.h, which combines these two: it calculates a 16-bit packed SIMD sum-dot-product while, in the same instruction, the next weight is loaded from memory by the load/store unit (LSU) and the address is incremented for the next data access (i.e., the next weight of the corresponding output channel). To avoid a 2-cycle latency, and thus unnecessary stalling, the loaded data is stored in two special-purpose registers (SPRs), which are written and read in an alternating way (using the pl.sdotsp.h.0 and pl.sdotsp.h.1 instructions). The data from the SPR is multiplexed as the second operand (OpA) to the multiplier calculating the sum-dot-product. Data hazards are avoided by stalling the pipeline in case of a missing grant from memory, exploiting exactly the same signals and control strategy used for standard load words.
Tab. II shows the assembly with output FM tiles of four with (right) and without (left) the extension. In lines 1-2, the SPRs are pre-loaded with the first two weights before the actual main loop. In line 4, the input FM is loaded, which is used for the following MAC computation. As can be seen in Tab. Id, the cycle count can effectively be reduced by 1.7×.
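The alternating use of the two special-purpose registers can be sketched with a behavioural model. The exact instruction semantics are an assumption here (compute with the weight already held in SPR k, then refill SPR k from memory and advance the pointer); the class `MergedLoadMac`, the function `tile_dot`, and the weight stream layout are hypothetical and chosen only to illustrate the pre-load and ping-pong scheme.

```python
class MergedLoadMac:
    """Toy model of a merged load-and-compute instruction with two SPRs."""

    def __init__(self, weight_stream):
        self.mem = weight_stream            # list of (lo, hi) 16-bit weight pairs
        self.spr = [(0, 0), (0, 0)]

    def sdotsp_h(self, k, acc, ptr, x):
        """acc += x . SPR[k]; then SPR[k] <- mem[ptr]; returns (acc, ptr + 1)."""
        w = self.spr[k]
        acc += w[0] * x[0] + w[1] * x[1]
        self.spr[k] = self.mem[ptr] if ptr < len(self.mem) else (0, 0)
        return acc, ptr + 1


def tile_dot(weight_stream, inputs, tile=4):
    """Accumulate one output tile; weights are streamed in consumption order."""
    core = MergedLoadMac(weight_stream)
    ptr = 0
    # Pre-load both SPRs with the first two weights (zero input, result unused).
    _, ptr = core.sdotsp_h(0, 0, ptr, (0, 0))
    _, ptr = core.sdotsp_h(1, 0, ptr, (0, 0))
    acc = [0] * tile
    step = 0
    for x in inputs:                        # one packed input pair per iteration
        for k in range(tile):
            acc[k], ptr = core.sdotsp_h(step & 1, acc[k], ptr, x)
            step += 1
    return acc
```

In this model every compute step also fetches a weight for a later step, so no separate load instruction appears in the inner loop; only the input pair is loaded explicitly.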
Due to the latency of the load word and the dependency of the following instructions on it, a bubble is inserted in line 5. This can be further optimized by loading two input data words (= four input channels) and calculating the result for all the output channels, doubling the number of pl.sdotsp.h instructions in the innermost loop. However, the gains, as seen in Tab. Ie, are a rather modest 1.05× (or 4.9%), since loads and stores from the stack increase by 1.4× as more registers are needed. Fig. 3 shows the benefits relative to the RI5CY baseline for the output FM tiling, the instruction extensions, and the input feature map tiling. For most of the networks the input FM tiling has a positive effect, but a few networks (i.e., networks with small feature sizes) even need more cycles due to the increased stack operations.
IV. CORE IMPLEMENTATION RESULTS
The extended RI5CY core was implemented in Globalfoundries 22 nm FDX technology using an 8-track low-threshold (LVT) standard cell library. It has been synthesized with Synopsys Design Compiler 18.06, the back-end flow has been done with Cadence Innovus 18.11, and power estimates are obtained by running gate-level simulations using Modelsim Questa v2019.1 with back-annotated delays from the final layout. When compared to a standard RI5CY core (RV32-IMCXpulp), the new instructions result in a very small circuit area overhead of 2.3 kGE (or 3.4% of the core area). Furthermore, the critical path of the core remains unchanged (between load-store unit and memory in the write-back stage), and the core operates at 380 MHz at 0.65 V under typical conditions at room temperature.
The enhanced core excels in energy efficiency. When compared to the same core executing the relevant RNN benchmarks with standard RISC-V RV32-IMC instructions, the enhanced core is on average 15× faster. It performs 566 MMAC/s (instead of 21 MMAC/s). When the core is using the extensions, the power consumption rises from 1.73 mW to 2.61 mW (51% total increase). While the decoder contributes only marginally more power (≈5 µW), the higher power consumption is mainly due to the higher utilization of the compute units (ALU and MAC unit, i.e., 0.57 mW/33% of the total power), the increased GPR usage (0.16 mW/9%), and the higher use of the load-store unit (0.05 mW/3%). However, the overall energy efficiency at 218 GMAC/s/W shows a 10× improvement.
V. CONCLUSION
We presented the first RISC-V core design optimized for RRM applications using machine learning approaches based on RNNs. The core achieves order-of-magnitude performance (15×) and energy efficiency (10×) improvements over the baseline RISC-V ISA on a wide range of RNN flavors used in RRM. These results are obtained thanks to a synergistic combination of software and hardware optimizations, which only marginally increase area cost and do not affect operating frequency. It is essential to note that the proposed optimizations do not impact numerical precision; hence, labor-intensive quantization-aware retraining is not needed. The enhanced RISC-V core achieves 566 MMAC/s and 218 GMAC/s/W (on 16-bit data types) in 22 nm FDX technology at 0.65 V, thereby providing a fully programmable and efficient open-source IP for future systems-on-chip for 5G Radio Resource Management.
VI. ACKNOWLEDGMENTS
This work was funded by Huawei Technologies Sweden AB. The authors would like to thank the PULP community for providing a comprehensive and open-source RISC-V platform.
