Open-Source GEMM Hardware Kernels Generator: Toward Numerically-Tailored Computations
Many scientific computing problems can be reduced to Matrix-Matrix
Multiplications (MMM), making the General Matrix Multiply (GEMM) kernels in the
Basic Linear Algebra Subprograms (BLAS) of interest to the high-performance
computing community. However, these workloads have a wide range of numerical
requirements. Ill-conditioned linear systems require high-precision arithmetic
to ensure correct and reproducible results. In contrast, emerging workloads
such as deep neural networks, which can have millions to billions of
parameters, have shown resilience to arithmetic tinkering and precision
lowering.
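As a purely illustrative sketch of the precision trade-off described above (not code from the generator itself, which produces hardware kernels), the following Python/NumPy snippet runs the same GEMM at two precisions and measures the discrepancy:

import numpy as np

def gemm(A, B, dtype):
    # Naive C = A @ B with operands cast to the requested precision.
    return A.astype(dtype) @ B.astype(dtype)

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64))
B = rng.standard_normal((64, 64))

ref = gemm(A, B, np.float64)   # high-precision reference
low = gemm(A, B, np.float16)   # reduced-precision variant
print("max abs error fp16 vs fp64:", np.abs(ref - low.astype(np.float64)).max())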
Dynamic Power Consumption of the Full Posit Processing Unit: Analysis and Experiments
Since its introduction in 2017, the Posit™ format for representing real numbers has attracted a
lot of interest as an alternative to the IEEE 754 floating-point representation. Several hardware
implementations of arithmetic operations between posit numbers have also been proposed in recent
years. In this work, we analyze the dynamic power consumption of the Full Posit Processing Unit
(FPPU) recently developed at the University of Pisa. Experimental results show that we can model
the dynamic power consumption of the FPPU with an acceptable approximation error ranging from 2.84%
(32-bit FPPU) to 7.32% (8-bit FPPU). Furthermore, from the synthesis of the power monitoring
unit alongside the FPPU, we demonstrate that the additional power module has an area cost that
ranges from ∼5% (32-bit FPPU) to ∼30% (8-bit FPPU) of the total unit area occupation.
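For readers unfamiliar with the posit format named above, here is a minimal, illustrative decoder in Python following the standard sign/regime/exponent/fraction decomposition; it is an assumption for exposition only and is unrelated to the FPPU's actual implementation or bit widths:

def decode_posit(bits, n, es):
    # Decode an n-bit posit with es exponent bits into a Python float.
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")                 # NaR (Not a Real)
    sign = -1.0 if bits >> (n - 1) else 1.0
    if sign < 0:
        bits = (-bits) & mask               # negate via two's complement
    body = bits & ((1 << (n - 1)) - 1)      # drop the sign bit
    first = (body >> (n - 2)) & 1           # regime: run of identical bits
    run, i = 0, n - 2
    while i >= 0 and ((body >> i) & 1) == first:
        run += 1
        i -= 1
    k = run - 1 if first else -run
    i -= 1                                  # skip the regime terminator bit
    e = 0
    for _ in range(es):                     # exponent: up to es bits, zero-padded
        e = (e << 1) | ((body >> i) & 1 if i >= 0 else 0)
        i -= 1
    frac_bits = max(i + 1, 0)               # fraction: whatever bits remain
    frac = body & ((1 << frac_bits) - 1)
    return sign * 2.0 ** (k * (1 << es) + e) * (1 + frac / (1 << frac_bits))

print(decode_posit(0b01000000, 8, 2))       # 1.0
print(decode_posit(0b01100000, 8, 2))       # 16.0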
FP8 Formats for Deep Learning
FP8 is a natural progression for accelerating deep learning training and
inference beyond the 16-bit formats common in modern processors. In this paper
we propose an 8-bit floating point (FP8) binary interchange format consisting
of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit
exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for
representation of special values, E4M3's dynamic range is extended by not
representing infinities and having only one mantissa bit-pattern for NaNs. We
demonstrate the efficacy of the FP8 format on a variety of image and language
tasks, effectively matching the result quality achieved by 16-bit training
sessions. Our study covers the main modern neural network architectures - CNNs,
RNNs, and Transformer-based models, leaving all the hyperparameters unchanged
from the 16-bit baseline training sessions. Our training experiments include
large, up to 175B parameter, language models. We also examine FP8
post-training-quantization of language models trained using 16-bit formats that
resisted fixed-point int8 quantization.
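As an illustrative sketch only (not the paper's reference code), the following Python decoder follows the encoding rules stated above (E5M2 keeps IEEE 754 conventions for special values, while E4M3 drops infinities and reserves a single mantissa pattern per sign for NaN), with the standard biases of 7 for E4M3 and 15 for E5M2 assumed for these field widths:

def decode_fp8(byte, exp_bits, man_bits, bias, e4m3_style):
    # Decode one FP8 byte given its exponent/mantissa split and bias.
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    if exp == (1 << exp_bits) - 1:              # all-ones exponent
        if e4m3_style:
            # E4M3: no infinities; only the all-ones mantissa encodes NaN,
            # the rest of this exponent row holds ordinary normal numbers.
            if man == (1 << man_bits) - 1:
                return float("nan")
        else:
            # E5M2 follows IEEE 754: inf when mantissa is zero, NaN otherwise.
            return sign * float("inf") if man == 0 else float("nan")
    if exp == 0:                                # zero and subnormals
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

decode_e4m3 = lambda b: decode_fp8(b, 4, 3, 7, True)
decode_e5m2 = lambda b: decode_fp8(b, 5, 2, 15, False)

print(decode_e4m3(0x7E))   # largest finite E4M3 value: 448.0
print(decode_e5m2(0x7B))   # largest finite E5M2 value: 57344.0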