Look-Up mAI GeMM: Increasing AI GeMMs Performance by Nearly 2.5x via msGeMM
AI models are increasing in size, and recent advances in the community have
shown that, unlike HPC applications where double-precision datatypes are
required, lower-precision datatypes such as fp8 or int4 are sufficient to reach
the same model quality for both training and inference. Following these trends,
GPU vendors such as NVIDIA and AMD have added hardware support for fp16, fp8
and int8 GeMM operations with exceptional performance via Tensor Cores.
However, this paper proposes a new algorithm called msGeMM which shows that AI
models with low-precision datatypes can run with ~2.5x fewer multiplication and
add instructions. Efficient implementation of this algorithm requires special
CUDA cores with the ability to add elements from a small look-up table at the
rate of Tensor Cores.
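The look-up-table idea mentioned in this abstract can be illustrated with a generic sketch. It is not taken from the paper and is not the msGeMM algorithm itself: the function names, the group size `d`, and the int4-style value range are all assumptions chosen for illustration. When weights take only a few distinct values, the dot product of each small chunk of activations with every possible weight tuple can be tabulated once, after which each matrix row needs only table look-ups and additions.

```python
import itertools


def build_tables(x, values, d):
    """For each chunk of d activations, tabulate its dot product with
    every possible d-tuple of weight values (hypothetical helper)."""
    assert len(x) % d == 0, "vector length must be a multiple of d"
    tables = []
    for start in range(0, len(x), d):
        chunk = x[start:start + d]
        table = {
            combo: sum(w * a for w, a in zip(combo, chunk))
            for combo in itertools.product(values, repeat=d)
        }
        tables.append(table)
    return tables


def lut_matvec(W, x, values, d=2):
    """Matrix-vector product using only look-ups and additions per row.
    The multiplications happen once, while the tables are built."""
    tables = build_tables(x, values, d)
    out = []
    for row in W:
        acc = 0
        for k, table in enumerate(tables):
            acc += table[tuple(row[k * d:(k + 1) * d])]
        out.append(acc)
    return out


def naive_matvec(W, x):
    """Reference implementation with one multiply per weight."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


if __name__ == "__main__":
    int4_values = list(range(-8, 8))  # assumed signed 4-bit weight range
    W = [[1, -2, 3, 7], [0, -8, 5, 2]]
    x = [2, 3, -1, 4]
    print(lut_matvec(W, x, int4_values), naive_matvec(W, x))
```

In this sketch the table cost is shared by all rows of `W`, so any saving only appears when the matrix has many rows relative to the table size; how the paper arrives at its ~2.5x figure is not reproduced here.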
Implementation of bioinspired algorithms on the neuromorphic VLSI system SpiNNaker 2
It is believed that neuromorphic hardware will accelerate neuroscience research and enable the next generation of edge AI. Conversely, brain-inspired algorithms are expected to work efficiently on neuromorphic hardware. Neither happens automatically, however. To bring hardware and algorithm together efficiently, optimizations are necessary that are based on an understanding of both sides. In this work, software frameworks and optimizations for efficient implementation of neural network-based algorithms on SpiNNaker 2 are proposed, resulting in optimized power consumption, memory footprint and computation time. In particular, first, a software framework including power management strategies is proposed to apply dynamic voltage and frequency scaling (DVFS) to the simulation of spiking neural networks, which is also the first-ever software framework running a neural network on SpiNNaker 2. The result shows that power consumption is reduced by 60.7% in the synfire chain benchmark. Second, numerical optimizations and data structure optimizations lead to an efficient implementation of reward-based synaptic sampling, which is one of the most complex plasticity algorithms ever implemented on neuromorphic hardware. The results show a reduction of computation time by a factor of 2 and of energy consumption by 62%. Third, software optimizations are proposed which effectively exploit the efficiency of the multiply-accumulate array and the flexibility of the ARM core. Compared with Loihi, this results in 3 times faster inference speed and 5 times lower energy consumption in a keyword spotting benchmark, and in faster inference speed and lower energy consumption for an adaptive control benchmark in high-dimensional cases. The results of this work demonstrate the potential of SpiNNaker 2, explore its range of applications and also provide feedback for the design of the next generation of neuromorphic hardware.
Revisiting LFSMs
Linear Finite State Machines (LFSMs) are particular primitives widely used in
information theory, coding theory and cryptography. Among those linear
automata, a particular case of study is Linear Feedback Shift Registers (LFSRs)
used in many cryptographic applications such as design of stream ciphers or
pseudo-random generation. LFSRs could be seen as particular LFSMs without
inputs.
In this paper, we first recall the description of LFSMs using the traditional
matrix representation. Then, we introduce a new matrix representation with
polynomial fractional coefficients. This new representation leads to sparse
representations and implementations. As direct applications, we focus our work
on the Windmill LFSRs case, used for example in the E0 stream cipher and on
other general applications that use this new representation.
In a second part, a new design criterion for LFSRs, called diffusion delay, is
introduced and compared with existing related notions. This criterion
represents the diffusion capacity of an LFSR. Then, using the matrix
representation, we present a new algorithm to randomly pick LFSRs with good
properties (including the new one) and sparse descriptions dedicated to
hardware and software designs. We present some examples of LFSRs generated
using our algorithm to show the relevance of our approach.
Comment: Submitted to IEEE-I
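As a small illustration of the traditional matrix representation recalled in this abstract, the sketch below writes a Fibonacci LFSR both as a shift register and as a linear map over GF(2). It does not implement the paper's new representation with polynomial fractional coefficients, nor its diffusion delay criterion. The function names and the example recurrence s[n+4] = s[n+1] + s[n] (characteristic polynomial x^4 + x + 1) are assumptions chosen for illustration.

```python
def lfsr_step(state, taps):
    """One shift of a Fibonacci LFSR: the new bit is the XOR of the
    tapped positions, and the oldest bit is shifted out."""
    feedback = 0
    for t in taps:
        feedback ^= state[t]
    return state[1:] + [feedback]


def transition_matrix(n, taps):
    """Matrix A over GF(2) such that next_state = A * state (mod 2)."""
    A = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        A[i][i + 1] = 1          # shift: new position i takes old bit i+1
    for t in taps:
        A[n - 1][t] = 1          # last row computes the feedback bit
    return A


def matrix_step(A, state):
    """Apply the transition matrix to the state vector over GF(2)."""
    return [sum(a * s for a, s in zip(row, state)) % 2 for row in A]


def period(state, step):
    """Number of steps until the state recurs. Assumes the update is
    invertible (tap at position 0), so the start state is reached again."""
    current, count = step(state), 1
    while current != state:
        current = step(current)
        count += 1
    return count


if __name__ == "__main__":
    taps = [0, 1]                # s[n+4] = s[n] XOR s[n+1]
    A = transition_matrix(4, taps)
    start = [1, 0, 0, 0]
    print(lfsr_step(start, taps), matrix_step(A, start))
    print(period(start, lambda s: matrix_step(A, s)))
```

Both update functions produce the same state sequence, which is what makes an LFSR a special case of an LFSM without inputs: the whole register is described by a single matrix acting on the state.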
An efficient hardware implementation of the Tate pairing in characteristic three
DL systems with bilinear structure recently became an important base for cryptographic protocols such as identity-based encryption (IBE). Since the main computational task is the evaluation of bilinear pairings over elliptic curves, which is known to be prohibitively expensive, efficient implementations are required to make them applicable in real-life scenarios. We present an efficient accelerator for computing the Tate pairing in characteristic 3, using the Modified Duursma-Lee algorithm. Our accelerator shows that it is possible to improve the area-time product by 12 times on FPGA, compared to estimated values from one of the best known hardware architectures [6] implemented on the same type of FPGA. The computation time is also improved by up to 16 times compared to software applications reported in [17]. In addition, we present the result of an ASIC implementation of the algorithm, which is the first hitherto
- …