16,085 research outputs found
Flexible Multiple-Precision Fused Arithmetic Units for Efficient Deep Learning Computation
Deep Learning has achieved great success in recent years. In many fields of applications, such as computer vision, biomedical analysis, and natural language processing, deep learning can achieve a performance that is even better than human-level. However, behind this superior performance is the expensive hardware cost required to implement deep learning operations. Deep learning operations are both computation intensive and memory intensive. Many research works in the literature focused on improving the efficiency of deep learning operations. In this thesis, special focus is put on improving deep learning computation and several efficient arithmetic unit architectures are proposed and optimized for deep learning computation. The contents of this thesis can be divided into three parts: (1) the optimization of general-purpose arithmetic units for deep learning computation; (2) the design of deep learning specific arithmetic units; (3) the optimization of deep learning computation using 3D memory architecture.
Deep learning models are usually trained on graphics processing unit (GPU) and the computations are done with single-precision floating-point numbers. However, recent works proved that deep learning computation can be accomplished with low precision numbers. The half-precision numbers are becoming more and more popular in deep learning computation due to their lower hardware cost compared to the single-precision numbers. In conventional floating-point arithmetic units, single-precision and beyond are well supported to achieve a better precision. However, for deep learning computation, since the computations are intensive, low precision computation is desired to achieve better throughput. As the popularity of half-precision raises, half-precision operations are also need to be supported. Moreover, the deep learning computation contains many dot-product operations and therefore, the support of mixed-precision dot-product operations can be explored in a multiple-precision architecture. In this thesis, a multiple-precision fused multiply-add (FMA) architecture is proposed. It supports half/single/double/quadruple-precision FMA operations. In addition, it also supports 2-term mixed-precision dot-product operations. Compared to the conventional multiple-precision FMA architecture, the newly added half-precision support and mixed-precision dot-product only bring minor resource overhead. The proposed FMA can be used as general-purpose arithmetic unit. Due to the support of parallel half-precision computations and mixed-precision dot-product computations, it is especially suitable for deep learning computation.
For the design of deep learning specific computation unit, more optimizations can be performed. First, a fixed-point and floating-point merged multiply-accumulate (MAC) unit is proposed. As deep learning computation can be accomplished with low precision number formats, the support of high precision floating-point operations can be eliminated. In this design, the half-precision floating-point format is supported to provide a large dynamic range to handle small gradients for deep learning training. For deep learning inference, 8-bit fixed-point 2-term dot-product computation is supported. Second, a flexible multiple-precision MAC unit architecture is proposed. The proposed MAC unit supports both fixed-point operations and floating-point operations. For floating-point format, the proposed unit supports one 16-bit MAC operation or sum of two 8-bit multiplications plus a 16-bit addend. To make the proposed MAC unit more versatile, the bit-width of exponent and mantissa can be flexibly exchanged. By setting the bit-width of exponent to zero, the proposed MAC unit also supports fixed-point operations. For fixed-point format, the proposed unit supports one 16-bit MAC or sum of two 8-bit multiplications plus a 16-bit addend. Moreover, the proposed unit can be further divided to support sum of four 4-bit multiplications plus a 16-bit addend. At the lowest precision, the proposed MAC unit supports accumulating of eight 1-bit logic AND operations to enable the support of binary neural networks. Finally, a MAC architecture based on the posit format, a promising numerical format in deep learning computation, is proposed to facilitate the use of posit format in deep learning computation.
In addition to the above mention arithmetic units, an improved hybrid memory cube (HMC) architecture is proposed for weight-sharing deep neural network processing. By modifying the HMC instruction set and HMC logic layer, the major part of the deep learning computation can be accomplished inside memory. The proposed design reduces the memory bandwidth requirements and thus reduces the energy consumed by memory data transfer
AIR: Adaptive Dynamic Precision Iterative Refinement
In high performance computing, applications often require very accurate solutions while minimizing runtimes and power consumption. Improving the ratio of the number of logic gates implementing floating point arithmetic operations to the total number of logic gates enables greater efficiency, potentially with higher performance and lower power consumption. Software executing on the fixed hardware in Von-Neuman architectures faces limitations on improving this ratio, since processors require extensive supporting logic to fetch and decode instructions while employing arithmetic units with statically defined precision. This dissertation explores novel approaches to improve computing architectures for linear system applications not only by designing application-specific hardware but also by optimizing precision by applying adaptive dynamic precision iterative refinement (AIR). This dissertation shows that AIR is numerically stable and well behaved. Theoretically, AIR can produce up to 3 times speedup over mixed precision iterative refinement on FPGAs. Implementing an AIR prototype for the refinement procedure on a Xilinx XC6VSX475T FPGA results in an estimated around 0.5, 8, and 2 times improvement for the time-, clock-, and energy-based performance per iteration compared to mixed precision iterative refinement on the Nvidia Tesla C2075 GPU, when a user requires a prescribed accuracy between single and double precision. AIR using FPGAs can produce beyond double precision accuracy effectively, while CPUs or GPUs need software help causing substantial overhead
Mixing Low-Precision Formats in Multiply-Accumulate Units for DNN Training
International audienceThe most compute-intensive stage of deep neural network (DNN) training is matrix multiplication where the multiply-accumulate (MAC) operator is key. To reduce training costs, we consider using low-precision arithmetic for MAC operations. While low-precision training has been investigated in prior work, the focus has been on reducing the number of bits in weights or activations without compromising accuracy. In contrast, the focus in this paper is on implementation details beyond weight or activation width that affect area and accuracy. In particular, we investigate the impact of fixed-versus floating-point representations, multiplier rounding, and floatingpoint exceptional value support. Results suggest that (1) lowprecision floating-point is more area-effective than fixed-point for multiplication, (2) standard IEEE-754 rules for subnormals, NaNs, and intermediate rounding serve little to no value in terms of accuracy but contribute significantly to area, (3) lowprecision MACs require an adaptive loss-scaling step during training to compensate for limited representation range, and (4) fixed-point is more area-effective for accumulation, but the cost of format conversion and downstream logic can swamp the savings. Finally, we note that future work should investigate accumulation structures beyond the MAC level to achieve further gains
Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ordinary differential equations
Although double-precision floating-point arithmetic currently dominates
high-performance computing, there is increasing interest in smaller and simpler
arithmetic types. The main reasons are potential improvements in energy
efficiency and memory footprint and bandwidth. However, simply switching to
lower-precision types typically results in increased numerical errors. We
investigate approaches to improving the accuracy of reduced-precision
fixed-point arithmetic types, using examples in an important domain for
numerical computation in neuroscience: the solution of Ordinary Differential
Equations (ODEs). The Izhikevich neuron model is used to demonstrate that
rounding has an important role in producing accurate spike timings from
explicit ODE solution algorithms. In particular, fixed-point arithmetic with
stochastic rounding consistently results in smaller errors compared to single
precision floating-point and fixed-point arithmetic with round-to-nearest
across a range of neuron behaviours and ODE solvers. A computationally much
cheaper alternative is also investigated, inspired by the concept of dither
that is a widely understood mechanism for providing resolution below the least
significant bit (LSB) in digital signal processing. These results will have
implications for the solution of ODEs in other subject areas, and should also
be directly relevant to the huge range of practical problems that are
represented by Partial Differential Equations (PDEs).Comment: Submitted to Philosophical Transactions of the Royal Society
- …