An Application-Specific VLIW Processor with Vector Instruction Set for CNN Acceleration
In recent years, neural networks have surpassed classical algorithms in areas
such as object recognition, e.g. in the well-known ImageNet challenge. As a
result, great effort is being put into developing fast and efficient
accelerators, especially for Convolutional Neural Networks (CNNs). In this work we present ConvAix, a fully C-programmable processor that, unlike many existing architectures, does not rely on a hard-wired array of multiply-and-accumulate (MAC) units. Instead, it maps computations onto independent vector lanes, making use of a carefully designed vector instruction set. The presented processor targets latency-sensitive applications and is capable of executing up to 192 MAC operations per cycle. ConvAix operates at a target clock frequency of 400 MHz in 28 nm CMOS, thereby offering state-of-the-art performance along with ample flexibility within its target domain. Simulation results for several 2D convolutional layers from well-known CNNs (AlexNet, VGG-16) show an average ALU utilization of 72.5% using vector instructions with 16-bit fixed-point arithmetic. Compared to other well-known but less flexible designs, ConvAix offers a competitive energy efficiency of up to 497 GOP/s/W while surpassing them in both area efficiency and processing speed.
Comment: Accepted for publication in the proceedings of the 2019 IEEE International Symposium on Circuits and Systems (ISCAS)
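Since the abstract describes mapping MAC-heavy convolution work onto independent vector lanes with 16-bit fixed-point arithmetic, a minimal C sketch of that computation pattern may help. The lane count, the Q8.8 fixed-point format, and all identifiers here are illustrative assumptions, not details of the ConvAix ISA; a vectorizing compiler for such a processor would map the inner lane loop onto hardware vector lanes.

```c
/* Minimal sketch of a 16-bit fixed-point dot product structured for
 * vector lanes. LANES and the Q8.8 format are assumptions for
 * illustration, not ConvAix's actual parameters. */
#include <stdint.h>
#include <stdio.h>

#define LANES 16  /* assumed vector width */
#define Q     8   /* assumed Q8.8 fractional bits */

/* Dot product of two int16 vectors; n must be a multiple of LANES.
 * Each lane keeps its own 32-bit accumulator, mirroring independent
 * hardware lanes; the final reduction and rescale happen once. */
static int32_t fixed_dot(const int16_t *a, const int16_t *b, int n) {
    int32_t acc[LANES] = {0};
    for (int i = 0; i < n; i += LANES)      /* one iteration = one vector op */
        for (int l = 0; l < LANES; l++)     /* independent lanes */
            acc[l] += (int32_t)a[i + l] * b[i + l];
    int32_t sum = 0;
    for (int l = 0; l < LANES; l++) sum += acc[l];
    return sum >> Q;                        /* rescale product back to Q8.8 */
}

int main(void) {
    int16_t a[64], b[64];
    for (int i = 0; i < 64; i++) { a[i] = 1 << Q; b[i] = 2 << Q; } /* 1.0, 2.0 */
    printf("%d\n", (int)(fixed_dot(a, b, 64) >> Q)); /* expect 128 = 64 * 2.0 */
    return 0;
}
```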
Stochastic Configuration Machines: FPGA Implementation
Neural networks for industrial applications generally have additional
constraints such as response speed, memory size and power usage. Randomized
learners can address some of these issues. However, hardware solutions can
provide better resource reduction whilst maintaining the model's performance.
Stochastic configuration networks (SCNs) are a prime choice in industrial
applications due to their merits and feasibility for data modelling. Stochastic Configuration Machines (SCMs) extend this idea with a focus on reducing memory requirements: the randomized weights are limited to a binary value with one scalar per node, and a mechanism model is used to improve learning performance and the interpretability of results. This paper implements SCM models on a field-programmable gate array (FPGA) and introduces binary-coded inputs to the algorithm. Results are reported for two benchmark and two industrial datasets, covering SCMs with both single-layer and deep architectures.
Comment: 19 pages, 9 figures, 8 tables
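To make the memory argument concrete, here is a minimal C sketch of the binary-weight-plus-scalar node evaluation the abstract describes: each hidden node stores one bit per input plus a single scalar, instead of a full real-valued weight vector. The bit-packing scheme, the tanh activation, and the names (`scm_node`) are assumptions for illustration, not the paper's FPGA design.

```c
/* Sketch of an SCM-style node: weights restricted to {-1, +1} are packed
 * into a bit mask, so 32 weights fit in 4 bytes plus one scalar. */
#include <stdint.h>
#include <math.h>
#include <stdio.h>

/* Evaluate one node for n <= 32 inputs: bit i of `mask` selects +x[i] or
 * -x[i], `scale` is the per-node scalar; tanh activation is assumed. */
static double scm_node(uint32_t mask, double scale, const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += ((mask >> i) & 1u) ? x[i] : -x[i];  /* bit -> +1 / -1 weight */
    return tanh(scale * s);
}

int main(void) {
    double x[4] = {0.5, -0.25, 1.0, 0.75};
    /* mask 0b1010 encodes weights (-1, +1, -1, +1) */
    printf("%f\n", scm_node(0xAu, 0.8, x, 4)); /* tanh(0.8 * -1.0) */
    return 0;
}
```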
A RISC-V Matrix Multiplier Using Systolic Arrays
Many modern applications can be addressed with machine learning, which involves training a computer on large amounts of data without direct programmer guidance. Conventional computers typically rely on general-purpose central processing units, though more specialized tasks can take advantage of highly parallel hardware such as graphics processing units. In pursuit of the performance needed by increasingly complex machine learning models, researchers in both academia and industry are turning to field-programmable gate arrays and application-specific integrated circuits. Various implementations, both theoretical and practical, exist across a wide variety of designs. This work presents a custom design that uses systolic arrays and builds on the existing RISC-V Instruction Set Architecture to accelerate matrix calculations, with example performance measured on the MNIST dataset.
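As a concrete picture of how a systolic array multiplies matrices, the following self-contained C program simulates a small output-stationary array cycle by cycle: A operands flow rightward, B operands flow downward, and each processing element (PE) accumulates its own output in place. The skewed feeding schedule is the standard textbook arrangement, assumed here rather than taken from the RTL of the design described above.

```c
/* Cycle-level simulation of an output-stationary systolic array
 * computing C = A * B for N x N matrices. */
#include <stdio.h>

#define N 3  /* array (and matrix) dimension for the demo */

int main(void) {
    int A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    int B[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    int a_reg[N][N] = {{0}}, b_reg[N][N] = {{0}}, C[N][N] = {{0}};

    for (int t = 0; t <= 3 * (N - 1); t++) {   /* 3N-2 cycles drain the array */
        /* Shift A operands one PE to the right, B operands one PE down. */
        for (int i = 0; i < N; i++)
            for (int j = N - 1; j > 0; j--) a_reg[i][j] = a_reg[i][j - 1];
        for (int j = 0; j < N; j++)
            for (int i = N - 1; i > 0; i--) b_reg[i][j] = b_reg[i - 1][j];
        /* Skewed injection: row i receives A[i][t-i], column j B[t-j][j],
         * so A[i][k] and B[k][j] meet at PE(i,j) exactly at cycle i+j+k. */
        for (int i = 0; i < N; i++)
            a_reg[i][0] = (t - i >= 0 && t - i < N) ? A[i][t - i] : 0;
        for (int j = 0; j < N; j++)
            b_reg[0][j] = (t - j >= 0 && t - j < N) ? B[t - j][j] : 0;
        /* Every PE performs one MAC per cycle on the operands streaming by. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] += a_reg[i][j] * b_reg[i][j];
    }
    for (int i = 0; i < N; i++, puts(""))
        for (int j = 0; j < N; j++) printf("%4d", C[i][j]);  /* top-left: 30 */
    return 0;
}
```

The output-stationary choice means partial sums never move, which keeps the per-PE wiring minimal; the price is the skewed input schedule and the 3N-2 cycle latency visible in the loop bound.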