The Effects of Approximate Multiplication on Convolutional Neural Networks
This paper analyzes the effects of approximate multiplication when performing
inferences on deep convolutional neural networks (CNNs). The approximate
multiplication can reduce the cost of the underlying circuits so that CNN
inferences can be performed more efficiently in hardware accelerators. The
study identifies the critical factors in the convolution, fully-connected, and
batch normalization layers that allow more accurate CNN predictions despite the
errors from approximate multiplication. The same factors also provide an
arithmetic explanation of why bfloat16 multiplication performs well on CNNs.
The experiments are performed with recognized network architectures to show
that the approximate multipliers can produce predictions that are nearly as
accurate as the FP32 references, without additional training. For example, the
ResNet and Inception-v4 models with Mitch-6 multiplication produce Top-5
errors within 0.2% of the FP32 references. A brief cost
comparison of Mitch-6 against bfloat16 is presented, where a MAC operation
saves up to 80% of energy compared to the bfloat16 arithmetic. The most
far-reaching contribution of this paper is the analytical justification that
multiplications can be approximated while additions need to be exact in CNN MAC
operations.
Comment: 12 pages, 11 figures, 4 tables, accepted for publication in the IEEE Transactions on Emerging Topics in Computing.
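To make this point concrete, below is a minimal NumPy sketch of a log-based approximate multiplier in the spirit of Mitchell's algorithm (the family that Mitch-w designs build on; the actual Mitch-w operand truncation and hardware details are not modelled), used inside a dot product where only the multiplications are approximate while the accumulation stays exact.

```python
import numpy as np

def mitchell_mul(a: float, b: float) -> float:
    """Approximate a*b with Mitchell's log-based multiplication.

    Each nonzero operand x is written as 2**k * (1 + f) with 0 <= f < 1, so
    log2(x) is approximated by k + f; the product is recovered by adding the
    two approximate logarithms and applying the inverse mapping.
    """
    if a == 0.0 or b == 0.0:
        return 0.0
    sign = np.sign(a) * np.sign(b)
    a, b = abs(a), abs(b)
    ka, kb = int(np.floor(np.log2(a))), int(np.floor(np.log2(b)))
    fa, fb = a / 2.0 ** ka - 1.0, b / 2.0 ** kb - 1.0   # fractional parts
    s = fa + fb
    if s < 1.0:
        return sign * 2.0 ** (ka + kb) * (1.0 + s)
    return sign * 2.0 ** (ka + kb + 1) * s

def approx_dot(x, w):
    """MAC loop: approximate multiplications, exact accumulation."""
    acc = 0.0
    for xi, wi in zip(x, w):
        acc += mitchell_mul(xi, wi)     # only the multiply is approximated
    return acc

rng = np.random.default_rng(0)
x, w = rng.standard_normal(256), rng.standard_normal(256)
print(approx_dot(x, w), float(x @ w))   # approximate vs. exact dot product
```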
Optimizing Self-Organizing Maps for Bacterial Genome Identification on Parallel Ultra-Low-Power Platforms
Pathogenic bacteria significantly threaten human health, highlighting the need for precise and efficient methods
for swiftly identifying bacterial species. This paper addresses the challenges associated with performing genomics computations for pathogen identification on embedded systems with limited computational power. We propose an optimized implementation of Self-Organizing Maps (SOMs) targeting a parallel ultra-low-power platform based on the RISC-V instruction set architecture.
We propose two mapping methods for implementing the SOM algorithm on a parallel cluster, coupled with software techniques to improve the throughput. Orthogonally to parallelization, we investigate the impact of smaller-than-32-bit floating-point formats (smallFloats) on energy savings, precision, and performance.
Our experimental results show that all smallFloat formats achieve 100% classification accuracy. The parallel variants achieve a speed-up of 1.98×, 3.79×, and 6.83× on 2, 4, and 8 cores, respectively. Compared with a 16-bit fixed-point implementation on a coarse-grain reconfigurable architecture (CGRA), the FP8 implementation achieves, on average, 1.42× better energy efficiency, a 1.51× speedup, and a 50% smaller memory footprint. Furthermore, FP8 vectorization increases the average speed-up by 2.5×.
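For context, here is a minimal NumPy sketch of one on-line SOM update with the map weights held in float16 as a stand-in for the smallFloat formats; the feature extraction, the parallel RISC-V mapping, and the FP8 arithmetic of the paper are not modelled, and the input vector below is purely illustrative.

```python
import numpy as np

def bmu(weights, x):
    """Best Matching Unit: index of the map neuron closest to input x."""
    d = np.sum((weights - x) ** 2, axis=-1)      # squared Euclidean distances
    return np.unravel_index(np.argmin(d), d.shape)

def som_step(weights, x, lr=0.1, sigma=1.0):
    """One on-line SOM update: pull the BMU and its neighbours toward x."""
    rows, cols, _ = weights.shape
    bi, bj = bmu(weights, x)
    ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    h = np.exp(-((ii - bi) ** 2 + (jj - bj) ** 2) / (2 * sigma ** 2))
    weights += (lr * h[..., None] * (x - weights)).astype(weights.dtype)
    return weights

dtype = np.float16                       # stand-in for a smallFloat format
rng = np.random.default_rng(0)
W = rng.random((8, 8, 16), dtype=np.float32).astype(dtype)  # 8x8 map, 16-d inputs
x = rng.random(16).astype(dtype)         # illustrative feature vector
W = som_step(W, x)
print(bmu(W, x))                         # neuron used to label the sample
```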
Large-Scale Discrete Fourier Transform on TPUs
In this work, we present two parallel algorithms for the large-scale discrete
Fourier transform (DFT) on Tensor Processing Unit (TPU) clusters. The two
parallel algorithms are associated with two formulations of DFT: one is based
on the Kronecker product, to be specific, dense matrix multiplications between
the input data and the Vandermonde matrix, denoted as KDFT in this work; the
other is based on the famous Cooley-Tukey algorithm and phase adjustment,
denoted as FFT in this work. Both KDFT and FFT formulations take full advantage
of TPU's strength in matrix multiplications. The KDFT formulation allows direct
use of nonuniform inputs without an additional step. In the two parallel
algorithms, the same strategy of data decomposition is applied to the input
data. Through the data decomposition, the dense matrix multiplications in KDFT
and FFT are kept local within TPU cores, which can be performed completely in
parallel. The communication among TPU cores is achieved through the one-shuffle
scheme in both parallel algorithms, with which sending and receiving data takes
place simultaneously between two neighboring cores and along the same direction
on the interconnect network. The one-shuffle scheme is designed for the
interconnect topology of TPU clusters, minimizing the time required by the
communication among TPU cores. Both KDFT and FFT are implemented in TensorFlow.
The three-dimensional complex DFT is performed on an example problem with a full TPU Pod: the run time of KDFT is 12.66
seconds and that of FFT is 8.3 seconds. Scaling analysis is provided to
demonstrate the high parallel efficiency of the two DFT implementations on
TPUs.
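As a small illustration of the KDFT formulation, the NumPy sketch below (not the paper's TensorFlow/TPU implementation) expresses a 2-D DFT purely as dense multiplications with the Vandermonde matrix and checks it against an FFT; the data decomposition, nonuniform inputs, and one-shuffle communication scheme are not modelled.

```python
import numpy as np

def dft_matrix(n: int) -> np.ndarray:
    """Vandermonde (DFT) matrix W with W[j, k] = exp(-2*pi*i*j*k/n)."""
    j = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(j, j) / n)

n = 512
x = np.random.default_rng(0).standard_normal((n, n))  # one 2-D slab of data
W = dft_matrix(n)

# The DFT along each axis is expressed purely as dense matrix multiplications,
# the operation TPU matrix units are optimized for.
X_kdft = W @ x @ W.T
X_fft = np.fft.fft2(x)
print(np.allclose(X_kdft, X_fft))                     # True
```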
Reduced Precision Floating-Point Optimization for Deep Neural Network On-Device Learning on MicroControllers
Enabling On-Device Learning (ODL) for Ultra-Low-Power Micro-Controller Units
(MCUs) is a key step for post-deployment adaptation and fine-tuning of Deep
Neural Network (DNN) models in future TinyML applications. This paper tackles
this challenge by introducing a novel reduced precision optimization technique
for ODL primitives on MCU-class devices, leveraging state-of-the-art
advancements in RISC-V RV32 architectures with support for vectorized 16-bit
floating-point (FP16) Single-Instruction Multiple-Data (SIMD) operations. Our
approach for the Forward and Backward steps of the Back-Propagation training
algorithm is composed of specialized shape transform operators and Matrix
Multiplication (MM) kernels, accelerated with parallelization and loop
unrolling. When evaluated on a single training step of a 2D Convolution layer,
the SIMD-optimized FP16 primitives are up to 1.72x faster than the
FP32 baseline on a RISC-V-based 8+1-core MCU. An average computing efficiency
of 3.11 Multiply and Accumulate operations per clock cycle (MAC/clk) and 0.81
MAC/clk is measured for the end-to-end training tasks of a ResNet8 and a DS-CNN
for Image Classification and Keyword Spotting, respectively -- requiring 17.1
ms and 6.4 ms on the target platform to compute a training step on a single
sample. Overall, our approach is more than two orders of magnitude faster
than existing ODL software frameworks for single-core MCUs and outperforms
previous FP32 parallel implementations by 1.6x on a Continual Learning
setup.
Comment: Pre-print version submitted to Elsevier's Future Generation Computer Systems journal. For the associated open-source release, see https://github.com/pulp-platform/pulp-trainlib.
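For intuition, here is a minimal NumPy sketch of the matrix-multiplication form that the forward and backward steps reduce to (convolutions are lowered to this shape by im2col-style transforms), with all tensors held in FP16; the paper's kernels are hand-optimized C with RV32 FP16 SIMD intrinsics, loop unrolling, and multi-core parallelization, none of which appears here.

```python
import numpy as np

def linear_fwd(x, w):
    """Forward pass of a linear layer: (batch, in) @ (in, out)."""
    return x @ w

def linear_bwd(x, w, grad_out):
    """Backward pass: gradients w.r.t. the weights and the layer input."""
    grad_w = x.T @ grad_out            # weight gradient
    grad_x = grad_out @ w.T            # gradient passed to the previous layer
    return grad_w, grad_x

dtype = np.float16                     # stand-in for the FP16 SIMD data path
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64)).astype(dtype)
w = rng.standard_normal((64, 32)).astype(dtype)

y = linear_fwd(x, w)
grad_w, grad_x = linear_bwd(x, w, np.ones_like(y))   # dummy upstream gradient
w -= dtype(0.01) * grad_w              # plain SGD step, entirely in FP16
```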
Accelerating DNN Training With Photonics: A Residue Number System-Based Design
Photonic computing is a compelling avenue for performing highly efficient
matrix multiplication, a crucial operation in Deep Neural Networks (DNNs).
While this method has shown great success in DNN inference, meeting the high
precision demands of DNN training proves challenging due to the precision
limitations imposed by costly data converters and the analog noise inherent in
photonic hardware. This paper proposes Mirage, a photonic DNN training
accelerator that overcomes the precision challenges in photonic hardware using
the Residue Number System (RNS). RNS is a numeral system based on modular
arithmetic, allowing us to perform high-precision operations via
multiple low-precision modular operations. In this work, we present a novel
micro-architecture and dataflow for an RNS-based photonic tensor core
performing modular arithmetic in the analog domain. By combining RNS and
photonics, Mirage provides high energy efficiency without compromising
precision and can successfully train state-of-the-art DNNs achieving accuracy
comparable to FP32 training. Our study shows that on average across several
DNNs, when compared to systolic arrays, Mirage achieves faster training and
lower EDP in an iso-energy scenario, and consumes lower power with comparable
or better EDP in an iso-area scenario.
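Below is a minimal integer sketch of the RNS idea Mirage relies on: a value is carried as residues modulo small pairwise coprime moduli, multiplication is performed independently per modulus, and the Chinese Remainder Theorem recovers the result. The moduli are illustrative; Mirage performs the modular arithmetic in the analog photonic domain.

```python
import math

MODULI = (251, 241, 239)                 # pairwise coprime; product bounds the result
M = math.prod(MODULI)

def to_rns(x: int):
    """Represent x by its residues modulo each small modulus."""
    return tuple(x % m for m in MODULI)

def rns_mul(a, b):
    """High-precision multiply done as independent low-precision multiplies."""
    return tuple((ra * rb) % m for ra, rb, m in zip(a, b, MODULI))

def from_rns(res):
    """Chinese Remainder Theorem reconstruction to an integer in [0, M)."""
    x = 0
    for r, m in zip(res, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)     # pow(Mi, -1, m): modular inverse
    return x % M

a, b = 1234, 5678
product = from_rns(rns_mul(to_rns(a), to_rns(b)))
assert product == (a * b) % M            # exact as long as the result fits in [0, M)
print(product)                           # 7006652 = 1234 * 5678
```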
A single-source C++20 HLS flow for function evaluation on FPGA and beyond
This paper presents a framework to reuse the intelligence of RTL generators in a single-source HLS setting. The framework is illustrated by a C++ fixed-point library for generating mathematical function evaluators. A compiler flow from C++20 to Vivado IPs has been developed to make the library usable with Vitis HLS. This flow is demonstrated on two applications: an adder for the logarithmic number system, and additive sound synthesis. These experiments show that the approach makes it easy to tune the precision of the types used in the application. They also demonstrate the ability to generate arbitrary function evaluators at the required precision.
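The library itself consists of C++20 templates consumed by Vitis HLS; purely as a language-agnostic illustration of tuning an evaluator's precision through its fixed-point types, here is a small Python sketch of a table-based function evaluator whose accuracy is set by its address and fraction bit widths (the names and parameters are hypothetical, not the library's API).

```python
import math

def to_fixed(x: float, frac_bits: int) -> int:
    """Quantize x to a signed fixed-point integer with frac_bits fraction bits."""
    return round(x * (1 << frac_bits))

def from_fixed(q: int, frac_bits: int) -> float:
    return q / (1 << frac_bits)

def make_table_evaluator(fn, lo, hi, addr_bits, frac_bits):
    """Tabulate fn on [lo, hi) with 2**addr_bits entries stored in fixed point."""
    n = 1 << addr_bits
    step = (hi - lo) / n
    table = [to_fixed(fn(lo + (i + 0.5) * step), frac_bits) for i in range(n)]
    def evaluate(x: float) -> float:
        i = min(n - 1, max(0, int((x - lo) / step)))
        return from_fixed(table[i], frac_bits)
    return evaluate

# Precision is tuned simply by changing addr_bits / frac_bits.
sin_eval = make_table_evaluator(math.sin, 0.0, math.pi / 2, addr_bits=8, frac_bits=12)
print(sin_eval(0.7), math.sin(0.7))      # table approximation vs. reference
```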
DGEMM on Integer Matrix Multiplication Unit
Deep learning hardware achieves high throughput and low power consumption by
reducing computing precision and specializing in matrix multiplication. For
machine learning inference, fixed-point value computation is commonplace, where
the input and output values and the model parameters are quantized. Thus, many
processors are now equipped with fast integer matrix multiplication units
(IMMUs). It is of significant interest to find a way to harness these IMMUs to
improve the performance of HPC applications while maintaining accuracy. We
focus on the Ozaki scheme, which computes a high-precision matrix
multiplication by using lower-precision computing units, and show the
advantages and disadvantages of using IMMUs. The experiment using integer Tensor
Cores shows that we can compute double-precision matrix multiplication faster
than cuBLAS and an existing Ozaki scheme implementation on FP16 Tensor Cores on
NVIDIA consumer GPUs. Furthermore, we demonstrate accelerating a quantum
circuit simulation by up to 4.33x while maintaining FP64 accuracy.
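For intuition, here is a hedged NumPy sketch of the splitting idea behind the Ozaki scheme: each matrix is split into integer slices whose pairwise products are exact, the slice products are computed as integer GEMMs (the role the IMMU plays), and the results are recombined. The real scheme's error-free splitting constants, accumulation order, and Tensor Core execution differ in detail.

```python
import numpy as np

def split_int_slices(A, n_slices=4, bits=7, axis=1):
    """Split A so that A ~= sum_p slices[p] * 2.0**shifts[p], with each slice an
    integer holding `bits` mantissa bits, so slice-by-slice products are exact."""
    amax = np.max(np.abs(A), axis=axis, keepdims=True)
    exp = np.floor(np.log2(amax)) + 1            # leading exponent per row/column
    slices, shifts, rem = [], [], A.astype(np.float64)
    for p in range(n_slices):
        shift = exp - bits * (p + 1)             # weight of this slice
        s = np.floor(rem / 2.0 ** shift)         # next `bits` bits of the mantissa
        slices.append(s.astype(np.int64))
        shifts.append(shift)
        rem = rem - s * 2.0 ** shift
    return slices, shifts

def ozaki_style_matmul(A, B, n_slices=4, bits=7):
    """Emulate a higher-precision GEMM from exact low-precision integer GEMMs."""
    As, sa = split_int_slices(A, n_slices, bits, axis=1)   # row-wise scaling
    Bs, sb = split_int_slices(B, n_slices, bits, axis=0)   # column-wise scaling
    C = np.zeros((A.shape[0], B.shape[1]))
    for p in range(n_slices):
        for q in range(n_slices):
            C += (As[p] @ Bs[q]) * 2.0 ** (sa[p] + sb[q])  # exact integer GEMM
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
print(np.max(np.abs(ozaki_style_matmul(A, B) - A @ B)))   # small; shrinks with more slices
```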