10,360 research outputs found

    Vector processing-aware advanced clock-gating techniques for low-power fused multiply-add

    Get PDF
    The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need a retailoring for the mobile market that they are entering now. Floating-point (FP) fused multiply-add (FMA), being a functional unit with high power consumption, deserves special attention. Although clock gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques ensure power savings without jeopardizing the timing. We evaluate the proposed techniques using both synthetic and “real-world” application-based benchmarking. Using vector masking and vector multilane-aware clock gating, we report power reductions of up to 52%, assuming active VFU operating at the peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector FP instructions. Finally, when evaluating all techniques together, using “real-world” benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion.The research leading to these results has received funding from the RoMoL ERC Advanced Grant GA 321253 and is supported in part by the European Union (FEDER funds) under contract TTIN2015-65316-P. The work of I. Ratkovic was supported by a FPU research grant from the Spanish MECD.Peer ReviewedPostprint (author's final draft

    Formal Verification of an Iterative Low-Power x86 Floating-Point Multiplier with Redundant Feedback

    Full text link
    We present the formal verification of a low-power x86 floating-point multiplier. The multiplier operates iteratively and feeds back intermediate results in redundant representation. It supports x87 and SSE instructions in various precisions and can block the issuing of new instructions. The design has been optimized for low-power operation and has not been constrained by the formal verification effort. Additional improvements for the implementation were identified through formal verification. The formal verification of the design also incorporates the implementation of clock-gating and control logic. The core of the verification effort was based on ACL2 theorem proving. Additionally, model checking has been used to verify some properties of the floating-point scheduler that are relevant for the correct operation of the unit.Comment: In Proceedings ACL2 2011, arXiv:1110.447

    An efficient hardware architecture for a neural network activation function generator

    Get PDF
    This paper proposes an efficient hardware architecture for a function generator suitable for an artificial neural network (ANN). A spline-based approximation function is designed that provides a good trade-off between accuracy and silicon area, whilst also being inherently scalable and adaptable for numerous activation functions. This has been achieved by using a minimax polynomial and through optimal placement of the approximating polynomials based on the results of a genetic algorithm. The approximation error of the proposed method compares favourably to all related research in this field. Efficient hardware multiplication circuitry is used in the implementation, which reduces the area overhead and increases the throughput

    Optimising Sparse Matrix Vector multiplication for large scale FEM problems on FPGA

    Get PDF
    Sparse Matrix Vector multiplication (SpMV) is an important kernel in many scientific applications. In this work we propose an architecture and an automated customisation method to detect and optimise the architecture for block diagonal sparse matrices. We evaluate the proposed approach in the context of the spectral/hp Finite Element Method, using the local matrix assembly approach. This problem leads to a large sparse system of linear equations with block diagonal matrix which is typically solved using an iterative method such as the Preconditioned Conjugate Gradient. The efficiency of the proposed architecture combined with the effectiveness of the proposed customisation method reduces BRAM resource utilisation by as much as 10 times, while achieving identical throughput with existing state of the art designs and requiring minimal development effort from the end user. In the context of the Finite Element Method, our approach enables the solution of larger problems than previously possible, enabling the applicability of FPGAs to more interesting HPC problems

    Training deep neural networks with low precision multiplications

    Full text link
    Multipliers are the most space and power-hungry arithmetic operators of the digital implementation of deep neural networks. We train a set of state-of-the-art neural networks (Maxout networks) on three benchmark datasets: MNIST, CIFAR-10 and SVHN. They are trained with three distinct formats: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those formats, we assess the impact of the precision of the multiplications on the final error after training. We find that very low precision is sufficient not just for running trained networks but also for training them. For example, it is possible to train Maxout networks with 10 bits multiplications.Comment: 10 pages, 5 figures, Accepted as a workshop contribution at ICLR 201
    corecore