7,430 research outputs found
Vector processing-aware advanced clock-gating techniques for low-power fused multiply-add
The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need a retailoring for the mobile market that they are entering now. Floating-point (FP) fused multiply-add (FMA), being a functional unit with high power consumption, deserves special attention. Although clock gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques ensure power savings without jeopardizing the timing. We evaluate the proposed techniques using both synthetic and “real-world” application-based benchmarking. Using vector masking and vector multilane-aware clock gating, we report power reductions of up to 52%, assuming active VFU operating at the peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector FP instructions. Finally, when evaluating all techniques together, using “real-world” benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion.The research leading to these results has received funding from the RoMoL ERC Advanced Grant GA 321253 and is supported in part by the European Union (FEDER funds) under contract TTIN2015-65316-P.
The work of I. Ratkovic was supported by a FPU research grant from the Spanish MECD.Peer ReviewedPostprint (author's final draft
Training deep neural networks with low precision multiplications
Multipliers are the most space and power-hungry arithmetic operators of the
digital implementation of deep neural networks. We train a set of
state-of-the-art neural networks (Maxout networks) on three benchmark datasets:
MNIST, CIFAR-10 and SVHN. They are trained with three distinct formats:
floating point, fixed point and dynamic fixed point. For each of those datasets
and for each of those formats, we assess the impact of the precision of the
multiplications on the final error after training. We find that very low
precision is sufficient not just for running trained networks but also for
training them. For example, it is possible to train Maxout networks with 10
bits multiplications.Comment: 10 pages, 5 figures, Accepted as a workshop contribution at ICLR 201
Formal Verification of an Iterative Low-Power x86 Floating-Point Multiplier with Redundant Feedback
We present the formal verification of a low-power x86 floating-point
multiplier. The multiplier operates iteratively and feeds back intermediate
results in redundant representation. It supports x87 and SSE instructions in
various precisions and can block the issuing of new instructions. The design
has been optimized for low-power operation and has not been constrained by the
formal verification effort. Additional improvements for the implementation were
identified through formal verification. The formal verification of the design
also incorporates the implementation of clock-gating and control logic. The
core of the verification effort was based on ACL2 theorem proving.
Additionally, model checking has been used to verify some properties of the
floating-point scheduler that are relevant for the correct operation of the
unit.Comment: In Proceedings ACL2 2011, arXiv:1110.447
A general framework for efficient FPGA implementation of matrix product
Original article can be found at: http://www.medjcn.com/ Copyright Softmotor LimitedHigh performance systems are required by the developers for fast processing of computationally intensive applications. Reconfigurable hardware devices in the form of Filed-Programmable Gate Arrays (FPGAs) have been proposed as viable system building blocks in the construction of high performance systems at an economical price. Given the importance and the use of matrix algorithms in scientific computing applications, they seem ideal candidates to harness and exploit the advantages offered by FPGAs. In this paper, a system for matrix algorithm cores generation is described. The system provides a catalog of efficient user-customizable cores, designed for FPGA implementation, ranging in three different matrix algorithm categories: (i) matrix operations, (ii) matrix transforms and (iii) matrix decomposition. The generated core can be either a general purpose or a specific application core. The methodology used in the design and implementation of two specific image processing application cores is presented. The first core is a fully pipelined matrix multiplier for colour space conversion based on distributed arithmetic principles while the second one is a parallel floating-point matrix multiplier designed for 3D affine transformations.Peer reviewe
- …