Vector processing-aware advanced clock-gating techniques for low-power fused multiply-add
The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need a retailoring for the mobile market that they are entering now. Floating-point (FP) fused multiply-add (FMA), being a functional unit with high power consumption, deserves special attention. Although clock gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques ensure power savings without jeopardizing the timing. We evaluate the proposed techniques using both synthetic and “real-world” application-based benchmarking. Using vector masking and vector multilane-aware clock gating, we report power reductions of up to 52%, assuming an active VFU operating at peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector FP instructions. Finally, when evaluating all techniques together, using “real-world” benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion. The research leading to these results has received funding from the RoMoL ERC Advanced Grant GA 321253 and is supported in part by the European Union (FEDER funds) under contract TTIN2015-65316-P.
The work of I. Ratkovic was supported by an FPU research grant from the Spanish MECD. Peer Reviewed. Postprint (author's final draft)
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
The rising popularity of intelligent mobile devices and the daunting
computational cost of deep learning-based models call for efficient and
accurate on-device inference schemes. We propose a quantization scheme that
allows inference to be carried out using integer-only arithmetic, which can be
implemented more efficiently than floating point inference on commonly
available integer-only hardware. We also co-design a training procedure to
preserve end-to-end model accuracy post quantization. As a result, the proposed
quantization scheme improves the tradeoff between accuracy and on-device
latency. The improvements are significant even on MobileNets, a model family
known for run-time efficiency, and are demonstrated in ImageNet classification
and COCO detection on popular CPUs. Comment: 14 pages, 12 figures
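The integer-only idea in this abstract can be illustrated with a minimal sketch of affine quantization, where a real value r is approximated as r ≈ S * (q - Z) for an integer q, a real scale S, and an integer zero-point Z. The function names (`choose_qparams`, `quantize`, `dequantize`, `int_dot`) and the 8-bit unsigned range are assumptions made for illustration, not code from the paper. The final rescale is done in floating point here for brevity, which is a simplification of a fully integer pipeline.

```python
def choose_qparams(rmin, rmax, num_bits=8):
    """Pick a scale S and zero-point Z so that r ~= S * (q - Z)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that real zero maps to an exact integer.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range
    zero_point = round(qmin - rmin / scale)
    zero_point = min(max(zero_point, qmin), qmax)
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Map real values to clamped unsigned integers."""
    qmax = 2 ** num_bits - 1
    return [min(max(round(v / scale) + zero_point, 0), qmax) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map quantized integers back to approximate real values."""
    return [scale * (q - zero_point) for q in q_values]


def int_dot(qa, za, qb, zb):
    """Integer-only accumulation of (qa_i - za) * (qb_i - zb)."""
    return sum((x - za) * (y - zb) for x, y in zip(qa, qb))


a = [-1.0, 0.0, 0.5, 2.0]
b = [0.25, -0.75, 1.5, 1.0]
sa, za = choose_qparams(min(a), max(a))
sb, zb = choose_qparams(min(b), max(b))
qa, qb = quantize(a, sa, za), quantize(b, sb, zb)

# The only non-integer step is the single multiplication by sa * sb.
approx = sa * sb * int_dot(qa, za, qb, zb)
exact = sum(x * y for x, y in zip(a, b))
print(approx, exact)
```

The inner loop of `int_dot` touches only integers, which is the property that makes this kind of scheme attractive on integer-only hardware.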
An efficient multiple precision floating-point Multiply-Add Fused unit
Multiply-Add Fused (MAF) units play a key role in the processor's performance for a variety of applications. The objective of this paper is to present a multi-functional, multiple precision floating-point MAF unit. The proposed MAF is reconfigurable and able to execute one quadruple precision MAF instruction, two double precision instructions, or four single precision instructions in parallel. The MAF architecture features a dual-path organization that reduces the latency of the floating-point add (FADD) instruction and uses the minimum number of operating components to keep the area low. The proposed MAF design was implemented on a 65 nm silicon process, achieving a maximum operating frequency of 293.5 MHz at 381 mW power.
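As a small illustration of what "fused" means numerically, the following sketch compares a separate multiply and add (two roundings) with an emulated fused multiply-add (one rounding) in double precision. The helper names `fma_emulated` and `mul_then_add` are hypothetical, and the emulation uses exact rational arithmetic instead of modeling any hardware datapath. It handles finite inputs only.

```python
from fractions import Fraction


def fma_emulated(a, b, c):
    """Compute a*b + c exactly, then round once to double (finite inputs only)."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def mul_then_add(a, b, c):
    """Unfused version: the product is rounded before the addition."""
    return a * b + c


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# The exact product is 1 - 2**-60, which is not representable in double precision.
unfused = mul_then_add(a, b, c)
fused = fma_emulated(a, b, c)
print(unfused, fused)
```

In the unfused path the product rounds to 1.0 and the small term is lost. The fused path keeps it because rounding happens only after the addition.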
Reduced-Precision Floating-Point Arithmetic in Systolic Arrays with Skewed Pipelines
The acceleration of deep-learning kernels in hardware relies on matrix
multiplications that are executed efficiently on Systolic Arrays (SA). To
effectively trade off deep-learning training/inference quality with hardware
cost, SA accelerators employ reduced-precision Floating-Point (FP) arithmetic.
In this work, we demonstrate the need for new pipeline organizations to reduce
latency and improve energy efficiency of reduced-precision FP operators for the
chained multiply-add operation imposed by the structure of the SA. The proposed
skewed pipeline design reorganizes the pipelined operation of the FP
multiply-add units to enable new forwarding paths for the exponent logic, which
allow for parallel execution of the pipeline stages of consecutive PEs. As a
result, the latency of the matrix multiplication operation within the SA is
significantly reduced with minimal hardware cost, thereby yielding an energy
reduction of 8% and 11% for the examined state-of-the-art CNNs. Comment: Accepted at IEEE International Conference on Artificial Intelligence
Circuits and Systems (AICAS) 202
Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi
Intel Xeon Phi is a recently released high-performance coprocessor which
features 61 cores each supporting 4 hardware threads with 512-bit wide SIMD
registers, achieving a peak theoretical performance of 1 Tflop/s in double
precision. Many scientific applications, such as linear solvers, eigensolvers,
and graph mining algorithms, involve operations on large sparse matrices. The
core of most of these applications involves the multiplication of a large,
sparse matrix with a dense vector (SpMV). In this paper, we investigate the
performance of the Xeon Phi coprocessor for SpMV. We first provide a
comprehensive introduction to this new architecture and analyze its peak
performance with a number of microbenchmarks. Although the design of a Xeon
Phi core is not much different from that of the cores in modern processors,
its large number of cores and hyperthreading capability allow many applications
to saturate the available memory bandwidth, which is not the case for many
cutting-edge processors. Yet our performance studies show that it is the
memory latency, not the bandwidth, that creates a bottleneck for SpMV on this
architecture. Finally, our experiments show that Xeon Phi's sparse kernel
performance is very promising and even better than that of cutting-edge
general-purpose processors and GPUs.
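For readers unfamiliar with SpMV, here is a minimal sketch of the kernel, assuming the matrix is stored in compressed sparse row (CSR) form. The abstract does not name a storage format, so CSR, the function name `spmv_csr`, and the small example matrix are all assumptions for illustration. This is a plain scalar loop, not the vectorized or threaded code a Xeon Phi study would use.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Return y = A @ x for a sparse matrix A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # x is read through col_idx, so its access pattern is irregular.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Example 3x3 matrix:
# [[4, 0, 1],
#  [0, 2, 0],
#  [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]

y = spmv_csr(row_ptr, col_idx, values, x)
print(y)
```

The indirect read `x[col_idx[k]]` is the data-dependent access in this loop, which is one reason memory behavior tends to dominate SpMV performance.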