95,320 research outputs found

    Look-Up mAI GeMM: Increasing AI GeMMs Performance by Nearly 2.5x via msGeMM

    Full text link
    AI models are increasing in size and recent advancement in the community has shown that unlike HPC applications where double precision datatype are required, lower-precision datatypes such as fp8 or int4 are sufficient to bring the same model quality both for training and inference. Following these trends, GPU vendors such as NVIDIA and AMD have added hardware support for fp16, fp8 and int8 GeMM operations with an exceptional performance via Tensor Cores. However, this paper proposes a new algorithm called msGeMM which shows that AI models with low-precision datatypes can run with ~2.5x fewer multiplication and add instructions. Efficient implementation of this algorithm requires special CUDA cores with the ability to add elements from a small look-up table at the rate of Tensor Cores

    Implementation of bioinspired algorithms on the neuromorphic VLSI system SpiNNaker 2

    Get PDF
    It is believed that neuromorphic hardware will accelerate neuroscience research and enable the next generation edge AI. On the other hand, brain-inspired algorithms are supposed to work efficiently on neuromorphic hardware. But both processes don't happen automatically. To efficiently bring together hardware and algorithm, optimizations are necessary based on the understanding of both sides. In this work, software frameworks and optimizations for efficient implementation of neural network-based algorithms on SpiNNaker 2 are proposed, resulting in optimized power consumption, memory footprint and computation time. In particular, first, a software framework including power management strategies is proposed to apply dynamic voltage and frequency scaling (DVFS) to the simulation of spiking neural networks, which is also the first-ever software framework running a neural network on SpiNNaker 2. The result shows the power consumption is reduced by 60.7% in the synfire chain benchmark. Second, numerical optimizations and data structure optimizations lead to an efficient implementation of reward-based synaptic sampling, which is one of the most complex plasticity algorithms ever implemented on neuromorphic hardware. The results show a reduction of computation time by a factor of 2 and energy consumption by 62%. Third, software optimizations are proposed which effectively exploit the efficiency of the multiply-accumulate array and the flexibility of the ARM core, which results in, when compared with Loihi, 3 times faster inference speed and 5 times lower energy consumption in a keyword spotting benchmark, and faster inference speed and lower energy consumption for adaptive control benchmark in high dimensional cases. The results of this work demonstrate the potential of SpiNNaker 2, explore its range of applications and also provide feedback for the design of the next generation neuromorphic hardware

    Revisiting LFSMs

    Full text link
    Linear Finite State Machines (LFSMs) are particular primitives widely used in information theory, coding theory and cryptography. Among those linear automata, a particular case of study is Linear Feedback Shift Registers (LFSRs) used in many cryptographic applications such as design of stream ciphers or pseudo-random generation. LFSRs could be seen as particular LFSMs without inputs. In this paper, we first recall the description of LFSMs using traditional matrices representation. Then, we introduce a new matrices representation with polynomial fractional coefficients. This new representation leads to sparse representations and implementations. As direct applications, we focus our work on the Windmill LFSRs case, used for example in the E0 stream cipher and on other general applications that use this new representation. In a second part, a new design criterion called diffusion delay for LFSRs is introduced and well compared with existing related notions. This criterion represents the diffusion capacity of an LFSR. Thus, using the matrices representation, we present a new algorithm to randomly pick LFSRs with good properties (including the new one) and sparse descriptions dedicated to hardware and software designs. We present some examples of LFSRs generated using our algorithm to show the relevance of our approach.Comment: Submitted to IEEE-I

    An Efficient hardware implementation of the tate pairing in characteristic three

    Get PDF
    DL systems with bilinear structure recently became an important base for cryptographic protocols such as identity-based encryption (IBE). Since the main computational task is the evaluation of the bilinear pairings over elliptic curves, known to be prohibitively expensive, efficient implementations are required to render them applicable in real life scenarios. We present an efficient accelerator for computing the Tate Pairing in characteristic 3, using the Modified Duursma-Lee algorithm. Our accelerator shows that it is possible to improve the area-time product by 12 times on FPGA, compared to estimated values from one of the best known hardware architecture [6] implemented on the same type of FPGA. Also the computation time is improved upto 16 times compared to software applications reported in [17]. In addition, we present the result of an ASIC implementation of the algorithm, which is the first hitherto
    corecore