Reducing hardware overhead of neural networks for faster or lower power inference and training is an active area of research. Uniform quantization using integer multiply-add has been thoroughly investigated, but it requires learning many quantization parameters, fine-tuning training or other prerequisites. Little effort has been made to improve floating point relative to this baseline; it remains energy inefficient, and word size reduction yields drastic loss in needed dynamic range. We improve floating point to be more energy efficient than equivalent bit width integer hardware on a 28 nm ASIC process while retaining accuracy in 8 bits with a novel hybrid log multiply/linear add, Kulisch accumulation and tapered encodings from Gustafson's posit format. With no network retraining, and drop-in replacement of all math and float32 parameters via round-to-nearest-even only, this open-sourced 8-bit log float is within 0.9% top-1 and 0.2% top-5 accuracy of the original float32 ResNet-50 CNN model on ImageNet. Unlike int8 quantization, it is still a general purpose floating point arithmetic, interpretable out-of-the-box. Our 8/38-bit log float multiply-add is synthesized and power profiled at 28 nm at 0.96× the power and 1.12× the area of 8/32-bit integer multiply-add. In 16 bits, our log float multiply-add is 0.59× the power and 0.68× the area of IEEE 754 float16 fused multiply-add, maintaining the same significand precision and dynamic range, proving useful for training ASICs as well.
Introduction
Reducing the computational complexity of neural networks (NNs) while maintaining accuracy encompasses a long line of research in NN design, training and inference. Different computer arithmetic primitives have been considered, including fixed-point [21], uniform quantization via 8 bit integer [15], ternary [20] and binary/low-bit representations [29, 3, 1]. Some are implemented efficiently on CPU/GPU ISAs [35, 33], while others demand custom hardware [10]. Instead of developing quantization techniques increasingly divorced from the original implementation, we seek to improve floating point itself, and let word size reduction yield efficiency for us. Floating point is historically known to be up to 10× less energy efficient in hardware implementations than integer math [14]. Typical implementation is encumbered with IEEE 754 standard compliance [37], demanding specific forms such as fused multiply-add (FMA) that we will show to be inefficient and imprecise. Memory movement (SRAM/DRAM/flip-flops) dominates power consumption; word bit length reduction thus provides obvious advantages beyond just reducing adder and multiplier area.
We explore encodings to better capture dynamic range with acceptable precision in smaller word sizes, and more efficient summation and multiplication (Sections 3-5), for a reduction in chip power and area. Significant inspiration for our work is found in logarithmic number systems (LNS) [2] and the work of Miyashita et al. [24] that finds logarithmic quantizers better suited to data distributions in NNs, and alternative visions of floating point from Gustafson [11, 12] and Kulisch [19] . We sidestep prior LNS design issues with numerical approximation and repurpose ideas from Gustafson and Kulisch, producing a general-purpose arithmetic that is effective on CNNs [13] without quantization tinkering or re-training (Section 7), and can be as efficient as integer math in hardware (Section 8).
Floating point variants for NNs
There are few studies on NNs for floating point variants beyond those provided for in CPU/GPU ISAs. [4] shows a kind of 8 bit floating point for communicating gradients, but this is not used for general computation. Flexpoint [17] and the Brainwave NPU [6] use variants of block floating point [36] , representing data as a collection of significands with a shared exponent. This requires controlled dynamic range variation and increased management cost, but saves on data movement and hardware resources. For going to 8 bits in our work, we seek to improve the encoding and hardware for a reasonable tradeoff between dynamic range and precision, with less machinery needed in software.
For different precisions, [5] shows reduced-precision floating point for training smaller networks on MNIST and CIFAR-10, with (6, 5) floating point (in (e, s) notation, e exponent bits and s significand fraction bits) without denormal significands being comparable to float32 on these examples. (8, 7) bfloat16 is available on Google's TPUv2 [9]. This form maintains the same normalized exponent range as float32, except with reduced precision and smaller multipliers. However, the forms of encoding and computation for many of these variants are not substantially different from implementations available with common ISAs, hardened FPGA IP, and the like. We will seek to improve the encoding, precision and computation efficiency of floating point to find a solution that is quite different in practice from standard (e, s) floating point.
Space-efficient encodings
IEEE 754-style fixed width field encodings are not optimal for most data distributions seen in practice; float32 maintains the same significand precision at $10^{-10}$ as at $10^{10}$. Straightforward implementation of this design in 8 bits will result in sizable space encoding NaNs, ~6% for (4, 3) float. Denormals use similar space and are expensive in hardware [26]; not implementing them restricts the dynamic range of the type (Table 1). Tapered floating point can solve this problem: within a fixed-sized word, exponent and significand field size varies, with a third field indicating relative size. To quote Morris (1971): "users of floating-point numbers are seldom, if ever, concerned simultaneously with loss of accuracy and with overflow. If this is so, then the range of possible representation can be extended [with tapering] to an extreme degree and the slight loss of accuracy will be unnoticed." [25]
A more efficient representation for tapered floating point is the recent posit format by Gustafson [12]. It has no explicit size field; the exponent is encoded using a Golomb-Rice prefix-free code [8, 22], with the exponent e encoded as a Golomb-Rice quotient and remainder (q, r) with q in unary and r in binary (in posit terminology, q is the regime). Remainder encoding size is defined by the exponent scale s, where $2^s$ is the Golomb-Rice divisor. Any space not used by the exponent encoding is used by the significand, which unlike IEEE 754 always has a leading 1; gradual underflow (and overflow) is handled by tapering. A posit number system is characterized by (N, s), where N is the word length in bits and s is the exponent scale. The minimum and maximum positive finite numbers in (N, s) are $f_{\min} = 2^{-(N-2)2^s}$ and $f_{\max} = 2^{(N-2)2^s}$. The number line is represented much as the projective reals, with a single point at ±∞ bounding $-f_{\max}$ and $f_{\max}$. ±∞ and 0 have special encodings; there is no NaN. The number system allows any choice of N ≥ 3 and 0 ≤ s ≤ N − 3.
s controls the dynamic range achievable; e.g., for the 8-bit (8, 5) posit, $f_{\max} = 2^{192}$ is larger than $f_{\max}$ in float32. (8, 0) and (8, 1) are more reasonable values to choose for 8-bit floating point representations, with $f_{\max}$ of 64 and 4096 respectively. Precision is maximized in the range $\pm[2^{-(s+1)}, 2^{s+1})$ with N − 3 − s significand fraction bits, tapering to no fraction bits at $\pm f_{\max}$.
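To make the regime/remainder layout concrete, the following is a minimal Python sketch of an (N, s) posit decoder. It assumes the usual posit conventions (two's complement negation for negative words, truncated exponent bits read as zeros); the function name `decode_posit` is illustrative and is not taken from the authors' implementation.

```python
import math

def decode_posit(word: int, n: int, s: int) -> float:
    """Decode an n-bit posit with exponent scale s into a Python float (sketch)."""
    mask = (1 << n) - 1
    word &= mask
    if word == 0:
        return 0.0
    if word == 1 << (n - 1):
        return math.inf                      # the single point at +/- infinity
    negative = bool(word >> (n - 1))
    if negative:
        word = (-word) & mask                # assumed: two's complement negation
    # Bits after the sign bit, most significant first.
    bits = [(word >> i) & 1 for i in range(n - 2, -1, -1)]
    first = bits[0]
    run = 1
    while run < len(bits) and bits[run] == first:
        run += 1
    q = run - 1 if first == 1 else -run      # unary Golomb-Rice quotient (regime)
    rest = bits[run + 1:]                    # skip the terminating bit, if present
    exp_bits = rest[:s]
    r = 0
    for b in exp_bits:
        r = (r << 1) | b
    r <<= s - len(exp_bits)                  # exponent bits cut off by the word end are 0
    frac_bits = rest[s:]
    frac = 0
    for b in frac_bits:
        frac = (frac << 1) | b
    significand = 1.0 + frac / (1 << len(frac_bits))   # implicit leading 1
    value = math.ldexp(significand, q * (1 << s) + r)
    return -value if negative else value

# Largest and smallest positive (8, 1) posits.
f_max = decode_posit(0x7F, 8, 1)
f_min = decode_posit(0x01, 8, 1)
```

Under these assumptions the extreme encodings reproduce the $f_{\max}$ values quoted in the text for (8, 0), (8, 1) and (8, 5), and the fraction field shrinks to zero bits as the regime run grows.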
Accumulator efficiency and precision
A sum of scalar products $\sum_i a_i b_i$ is a frequent operation in linear algebra. For CNNs like ResNet-50 [13], we accumulate up to 4,608 (2d convolution with k = 3 × 3, $c_{in}$ = 512) such products.
Integer addition is associative (excepting overflow); the order of operations does not matter and thus it allows for error-free parallelization. In typical accelerator use, the accumulation type is 32 bits. Typical floating point addition is notorious for its lack of associativity; this presents problems with reproducibility, parallelization and rounding error [26]. Facilities such as fused multiply-add (FMA) that perform a sum and product $c + a_i b_i$ with a single rounding can reduce error and further pipeline operations when computing sums of products. Such machinery cannot avoid the rounding error involved with tiny (8-bit) floating point types; the accumulator can become larger in magnitude than the product being accumulated into it, and the significand words no longer overlap as needed even with rounding (yielding $c + ab = c$); slightly increasing the accumulator size only defers this problem.
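The $c + ab = c$ case can be reproduced with a toy model. The sketch below rounds to a fixed number of significand bits with an unbounded exponent; the helper `round_sig` and the choice of 4 significand bits (as in a (4, 3) float) are assumptions made for illustration, not part of the paper's hardware.

```python
import math

def round_sig(x: float, p: int) -> float:
    """Round x to p significand bits, round-to-nearest-even, unbounded exponent."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    scaled = m * (1 << p)           # exact: scaling by a power of two
    return math.ldexp(round(scaled), e - p)   # Python's round() is ties-to-even

P = 4                               # assumed toy significand width

# A single fused multiply-add where the product is swamped by the accumulator.
c, a, b = 16.0, 0.5, 1.0
swamped = round_sig(c + a * b, P)   # equals c

# Repeated accumulation of 0.5 stalls once the accumulator spacing reaches 1.
acc = 0.0
for _ in range(100):
    acc = round_sig(acc + 0.5 * 1.0, P)
stalled = acc
```

In this model the running sum stops growing at 8.0 even though the exact sum is 50, which is the behavior the paragraph above attributes to small floating point accumulators.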
There is a more efficient and precise method than FMA available. A Kulisch accumulator [19] is a fixed point register that is wide enough to contain both the largest and smallest possible scalar product of floating point values $\pm(f_{\max}^2 + f_{\min}^2)$. It provides associative, error-free calculation (excepting a single, final rounding) of a sum of scalar floating point products; a float significand to be accumulated is shifted based on exponent to align with the accumulator for the sum. Final rounding to floating point is performed after all sums are made. A similar operation known as Auflaufenlassen was available in Konrad Zuse's Z3 as early as 1941 [18], though it is not found in modern computers.
We will term this operation of summing scalar products in a Kulisch accumulator exact multiply add (EMA). For an inner product, given a rounding function $r(\cdot)$ with the argument evaluated at infinite precision, EMA calculates $r(\sum_i a_i b_i)$, whereas FMA calculates $r(a_n b_n + r(a_{n-1} b_{n-1} + r(\cdots + r(a_1 b_1 + 0) \cdots)))$. Both EMA and FMA can be implemented for any floating point type. Gustafson proposed Kulisch accumulators to be standard for posits, terming them quires.
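The difference between the two formulas can be sketched in Python. This is an illustrative model only: `round_sig` stands in for $r(\cdot)$ with an assumed 4-bit significand, the Kulisch register is modeled as an unbounded Python integer with an assumed 24 fractional bits, and the names `fma_chain` and `ema` are hypothetical.

```python
import math
from fractions import Fraction

def round_sig(x: float, p: int) -> float:
    """Stand-in for r(.): round to p significand bits, ties-to-even."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)
    return math.ldexp(round(m * (1 << p)), e - p)

def fma_chain(a, b, p):
    """r(a_n b_n + r(... + r(a_1 b_1 + 0))): one rounding per multiply-add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_sig(acc + x * y, p)   # float64 is exact for these small inputs
    return acc

def ema(a, b, p, frac_bits=24):
    """r(sum_i a_i b_i): exact fixed-point accumulation, then a single rounding."""
    acc = 0                               # integer register in units of 2**-frac_bits
    for x, y in zip(a, b):
        prod = Fraction(x) * Fraction(y) * (1 << frac_bits)
        assert prod.denominator == 1, "product below accumulator resolution"
        acc += prod.numerator             # integer addition is associative
    return round_sig(acc / (1 << frac_bits), p)

ones = [1.0, 1.0, 1.0]
order_1 = [16.0, 0.5, -16.0]
order_2 = [16.0, -16.0, 0.5]
fma_results = (fma_chain(order_1, ones, 4), fma_chain(order_2, ones, 4))
ema_results = (ema(order_1, ones, 4), ema(order_2, ones, 4))
```

With these inputs the chained rounding returns different answers for the two orderings, while the fixed-point accumulation returns 0.5 for both, since only the final conversion rounds.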
Depending upon float dynamic range, EMA can be considerably more efficient than FMA in hardware. FMA must mutually align the addend c and the product ab, including renormalization logic for subtraction cancellation, and the proper alignment cannot be computed until fairly late in the process. Extra machinery to reduce latency such as the leading zero (LZ) anticipator or three path architectures has been invented [28]. If multiply-add needs to be pipelined for timing closure, EMA knows upfront the location of the floating point of c needed in alignment (as it is fixed), and can thus accumulate a new product into it every clock cycle, while an FMA must hold onto the starting value of the accumulator c until later in the process, increasing the pipeline non-combinational area and often requiring greater use of an external register file (for multiple accumulators $c_i$ in concurrent use) and effective "loop unrolling" at software level to fill all pipeline slots. The rounding performed on every FMA requires additional logic, and rounding error can still compound greatly across repeated sums.
Multiplier efficiency
Floating point with EMA is still expensive, as it adds shifter, LZ counter, rounding and other logic. Integer MAC and float FMA/EMA both involve multiplication of fixed-point values; for int8/32 MAC this multiply is 63.4% of the combinational power in our analysis at 28 nm (Section 8).
A logarithmic number system (LNS) [16] 
As values x ≤ 0 are outside the log domain, sign and zero are handled separately [31], as is ±∞. We encode base-2 (B = 2) log numbers with a sign bit and a signed fixed-point number of the form m.f, which represents the linear domain value $\pm 2^{(m + \sum_i f_i / 2^i)}$. For add/sub, without loss of generality, order j ≤ i, so that $\log_2(2^i \pm 2^j) = i + \sigma_\pm(j - i)$ with $\sigma_\pm(x) = \log_2(1 \pm 2^x)$; this is the historical weak point of a LNS, as implementations use costly LUTs or piecewise linear approximation of $\sigma_\pm(x)$. This can be more expensive than hardware multipliers. The approximation $\log_2(1 + x) \approx x$ for x ∈ [0, 1] could also be used [24], but this adds significant error, especially with repeated sums. To convert a linear domain value back to log domain, we map g ∈ [0, 1) to $q(g) = \log_2(1 + g)$. g is a linear domain fixed-point fraction; to control the size of the LUT we only consider β bits via rounding of g. q(r(g, β)) is similarly rounded to a desired γ bits; note that this latter rounding is log domain. r(q(r(g, β)), γ) is then a $(2^\beta \times \gamma)$-bit LUT. We also choose $\alpha \ge f_{\text{bits}} + 1$, β ≥ α, $\gamma = f_{\text{bits}}$ to ensure that log-to-linear-to-log conversion of f is the identity, or f = r(q(r(r(p(f), α), β)), γ).
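The round-trip identity can be checked numerically. The sketch below assumes the log-to-linear map is $p(f) = 2^f - 1$ on the fraction (this definition is an assumption here, as it is not stated in the surrounding text), takes r(x, n) to mean rounding to n fractional bits, and uses $f_{\text{bits}} = 4$, α = β = 5, γ = 4 as example parameters.

```python
import math

def r(x: float, bits: int) -> float:
    """Round x to `bits` fractional bits (nearest, ties-to-even)."""
    scale = 1 << bits
    return round(x * scale) / scale

def p(f: float) -> float:
    """Assumed log-to-linear fraction map: 2**f - 1 for f in [0, 1)."""
    return 2.0 ** f - 1.0

def q(g: float) -> float:
    """Linear-to-log fraction map from the text: log2(1 + g) for g in [0, 1)."""
    return math.log2(1.0 + g)

def roundtrip(f: float, alpha: int, beta: int, gamma: int) -> float:
    return r(q(r(r(p(f), alpha), beta)), gamma)

F_BITS, ALPHA, BETA, GAMMA = 4, 5, 5, 4      # example parameters (assumed)

fracs = [k / (1 << F_BITS) for k in range(1 << F_BITS)]
identity_holds = all(roundtrip(f, ALPHA, BETA, GAMMA) == f for f in fracs)

# The same maps as integer lookup tables, closer to a hardware view:
# P_LUT has 2**F_BITS entries of ALPHA bits, Q_LUT has 2**BETA entries.
P_LUT = [round(p(k / (1 << F_BITS)) * (1 << ALPHA)) for k in range(1 << F_BITS)]
Q_LUT = [round(q(j / (1 << BETA)) * (1 << GAMMA)) for j in range(1 << BETA)]
lut_identity_holds = all(Q_LUT[P_LUT[k]] == k for k in range(1 << F_BITS))
```

For these parameters the identity holds for every 4-bit fraction: the error introduced by rounding p(f) to 5 bits stays below half of the log-domain spacing $2^{-4}$ after passing through q.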
We will name this (somewhat inaccurately) exact log-linear multiply-add (ELMA). The log product and linear sum are each exact, but the log product is not represented exactly by r(p(f)) as this requires infinite precision, unlike EMA which is exact except for a final rounding. The intermediate log product avoids overflow or underflow with an extra bit for the product's m. If a linear-to-log mapping is desired (returning a log number after summation), there is also loss via r(q(g)).
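As a rough software model of the ELMA data flow, the sketch below multiplies in the log domain with an integer add, converts each product to the linear domain through a small p LUT, and sums into a fixed-point integer accumulator. Everything here is an assumption for illustration: the map $p(f) = 2^f - 1$, the parameters `F_BITS`, `ALPHA` and `ACC_FRAC`, the untapered (sign, exponent) tuple encoding, and the function names. It is not the authors' SystemVerilog design.

```python
import math

F_BITS = 4       # log-domain fraction bits (assumed)
ALPHA = 5        # linear fraction bits kept after the p(f) lookup (assumed)
ACC_FRAC = 32    # fractional bits of the fixed-point accumulator (assumed ample)

# p(f) = 2**f - 1 rounded to ALPHA bits, indexed by the F_BITS-bit log fraction.
P_LUT = [round((2.0 ** (k / (1 << F_BITS)) - 1.0) * (1 << ALPHA))
         for k in range(1 << F_BITS)]

def to_log(x: float):
    """Encode x as (sign, e) meaning sign * 2**(e / 2**F_BITS); None encodes zero."""
    if x == 0.0:
        return None
    sign = -1 if x < 0 else 1
    return sign, round(math.log2(abs(x)) * (1 << F_BITS))

def elma_dot(xs, ys) -> float:
    """Inner product of log-encoded vectors: log multiply, linear accumulate."""
    acc = 0                                   # integer in units of 2**-ACC_FRAC
    for x, y in zip(xs, ys):
        if x is None or y is None:            # zero is handled outside the log domain
            continue
        sign = x[0] * y[0]
        e = x[1] + y[1]                       # log-domain multiply is an integer add
        m = e >> F_BITS                       # floor split of the product into m.f
        f = e & ((1 << F_BITS) - 1)
        sig = (1 << ALPHA) + P_LUT[f]         # linear significand 1 + r(p(f), ALPHA)
        shift = m + ACC_FRAC - ALPHA          # align with the fixed binary point
        assert shift >= 0, "product below accumulator resolution"
        acc += sign * (sig << shift)          # exact linear-domain sum
    return acc / (1 << ACC_FRAC)

xs = [to_log(v) for v in (2.0, 0.5, -4.0)]
ys = [to_log(v) for v in (4.0, 0.5, 0.25)]
result = elma_dot(xs, ys)                     # 8 + 0.25 - 1
```

In this model the only approximation inside the loop is the ALPHA-bit rounding of p(f), so each accumulated product carries a relative error of at most $2^{-(\alpha+1)}$ and the sum itself adds no further error.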
Combining log-to-linear mapping with Kulisch accumulation makes log domain multiply-add efficient and reasonably accurate. Small p and q LUTs reduce well in combinational logic. They are practical for 16-bit types too, as compression can be used to reduce the size. For larger types they are impractical, as α, β, γ need to scale with $2^{f_{\text{bits}}}$, at which point $\sigma_\pm$ is a better strategy. As with FMA, repeated summation via $\sigma_\pm$ is subject to magnitude difference error (e.g., the $c + ab = c$ case). Our approximation introduces error with r(p(f)) and r(q(g)), but mitigates repeated summation error and is immune to magnitude differences. This tradeoff seems acceptable in practice (Section 7).
An 8-bit log number by default suffers from the same problem as 8-bit IEEE-style floating point; the dynamic range is limited by the fixed point encoding. We can use the same tapering as used in (N, s) posit for m.f log numbers. m is encoded as an exponent, and f as a floating point significand. f min and f max are then exactly the same for posit-tapered base-2 log or linear domain values. Setting γ = f bits (which is at maximum (N − 3 − s) for posits) introduces additional tapering rounding error, as subsequent rounding in encoding is performed outside regimes of maximum precision. γ is increased up to 3 bits (guard, round and sticky bits in typical round-to-nearest-even) to improve accuracy here. This encoding we will refer to as (N, s, α, β, γ) log (posit tapered). We can similarly choose to encode log numbers using an IEEE 754 format (with biased exponents, NaN representations etc.); we use this for our ELMA comparison against float16 FMA in Section 8.
Additional hardware details
To make EMA/ELMA more energy efficient, we restrict accumulator range to $[f_{\min}^2, f_{\max}]$; handling temporary underflow rather than overflow is more important in our experience. Kulisch accumulator conversion back to log or linear N-bit types uses a LZ counter and shifter but can be substantially amortized in two ways. First, many sums are performed, with final conversion done only once per inner product. Energy for the majority of work is thus lower than MAC/FMA (Section 8); increased area for increased energy efficiency is generally useful in the era of "dark silicon" [32], or conversion module instances can be rationed (limiting throughput) and/or clock gated. Second, structures with local operand reuse (e.g., systolic arrays, fixed-function convolvers) naturally require fewer converter instances, reducing area (discussion in Section 8 as well). EMA and FMA accuracy are the same for a single sum $c + ab$; our power advantage would disappear in this domain, but the vast majority of flops/ops in NNs require repeated rather than singular sums. Note that int8/32 usage itself requires some conversion back to int8 in the end that we do not evaluate.
FPGA experiments
Our implementation is in SystemVerilog for ASIC evaluation, built into an FPGA design with Intel FPGA OpenCL RTL integration support, with rudimentary PyTorch [27] integration. Source code is available at github.com/facebookresearch/deepfloat. We evaluate (N, s) posit and (N, s, α, β, γ) log arithmetic on the ResNet-50 CNN [13] with the ImageNet ILSVRC12 validation set [30] . We use float32 trained parameters from the PyTorch model zoo, with batch normalization fused into preceding affine layers [15] . float32 parameters and network input are converted to our formats via round-to-nearest-even; no other adjustment of these values is performed. When converting into or out of a Kulisch accumulator, we can add a small exponent bias factor, adjusting the input exponent by m, or the output exponent by n. This is effectively free (a small adder). No changes are made to any activations except for such a bias of n = −4 at the last (fully connected) layer to recenter unnormalized log probabilities from around 16.0 to 1.0. Without this we have an additional loss in top-1 of around 0.5-1%, with little change to top-5. If the Kulisch accumulator itself can be directly considered for top-k comparison, this avoids the need as well. All math is replaced with the corresponding posit or log versions; average pooling is via division of the Kulisch accumulator.
Our results are in Table 2 , along with two int8/32 quantization comparisons. (8, 0) linear posit has insufficient dynamic range to work; activations are quickly rounded to zero. Our (8, 1, 5, 5, 7) log result remains very close to (8, 1) linear posit. The int8/32 results listed do not start from the same float32 parameters as our trained network, so they are not directly comparable. They use training with simulated quantization [15] and KL-divergence calibration with sampled activations [23] , whereas we perform math in the usual way in our log or linear domain arithmetic after rounding input and parameters. We obtain reasonably similar precision without retraining, sampling activations or learning quantization parameters, while retaining general floating point representations in 8 bits.
ASIC evaluation
We use Synopsys Design Compiler and PrimeTime PX with a commercially available 28 nm library, target clock 500 MHz. Process corners are SS@-40 °C for synthesis and TT@25 °C for power analysis at 0.81 V. Table 3 investigates multiply-add PEs, and as a proxy for an accelerator design, a 32×32 matrix multiplication systolic array with these PEs. The float16 FMA is Synopsys DesignWare dw_fp_mac. We accumulate to the C matrix in place (stationary C), shifting out values upon completion. The int8/32 array outputs unprocessed int32; for ELMA, Kulisch accumulators are shifted across the PEs for C output and converted to 8 bit log at the boundary via 32 conversion/encoder modules. The 1024 PEs within do not include these (as discussed in Section 6). 64 posit taper decoders are included for where A and B are passed as input. Power analysis uses testbench waves for 128-d vectors with elements drawn from N(0, 1); int8 quantization has a max of 2σ. PEs evaluate a variety of these inner products, and the systolic arrays a variety of GEMMs with these vectors.
ELMA saves 90.9 µW over int8/32 on multiplication, but loses 68.3 µW on the add. ELMA noncombinational demands are higher with additional state required (Kulisch and decoded log numbers), but could be reduced by not handling underflow all the way to $f_{\min}^2$. Despite the larger Kulisch adder, effectively only 6 bits are summed (with carry) each cycle versus up to 16 with int8/32; strategies for 500+ bit Kulisch accumulators [34] might work in this small regime to further take advantage of this. Our 16-bit ELMA α = 11 p(f) combinational LUT is 386 µm² despite compression, now a significant portion of the design. Larger α likely needs a compiled ROM or explicit compute of p(f).
A more in-depth analysis for our work would need to determine a Pareto frontier between frequency/latency, per-operation energy, area, pipeline depth, math implementation and accuracy similar to the Galal et al. FPU generator work [7] , to see precisely in what regimes ELMA is advantageous. We provide our limited analysis here, however rough, to help motivate future investigation.
Conclusions
DNNs are resilient to many forms of numerical tinkering; they allow re-evaluation of design decisions made long ago at the bottom of the hardware stack with reduced fear of failure. The design space of hardware real number representations is indeed quite large and underexplored [22], as is the opportunity to improve hardware efficiency and software simplicity with alternative designs and judicious use of numerical approximation. Log domain representations, posits, Kulisch accumulation and combinations such as ELMA show that floating point efficiency and applicability can be substantially improved upon. We plan on continuing investigation of this arithmetic design space at the hardware level with DNN training, and on general numerical algorithms in the future.
