Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs
Using FPGAs to accelerate ConvNets has attracted significant attention in
recent years. However, FPGA accelerator design has not leveraged the latest
progress of ConvNets. As a result, key application characteristics such as
frames-per-second (FPS) are ignored in favor of simply counting GOPs, and
results on accuracy, which is critical to application success, are often not
even reported. In this work, we adopt an algorithm-hardware co-design approach
to develop a ConvNet accelerator called Synetgy and a novel ConvNet model
called DiracDeltaNet. Both the accelerator and ConvNet are tailored
to FPGA requirements. DiracDeltaNet, as the name suggests, is a ConvNet with
only 1×1 convolutions, while spatial convolutions are replaced by more
efficient shift operations. DiracDeltaNet achieves competitive accuracy on
ImageNet (88.7% top-5), but with 42× fewer parameters and 48×
fewer OPs than VGG16. We further quantize DiracDeltaNet's weights and
activations to 4 bits, with less than 1% accuracy loss. These quantizations
exploit well the nature of FPGA hardware. In short, DiracDeltaNet's small model
size, low computational OP count, low precision and simplified operators allow
us to co-design a highly customized computing unit for an FPGA. We implement
the computing units for DiracDeltaNet on an Ultra96 SoC system through
high-level synthesis. Our accelerator's final top-5 accuracy of 88.1% on
ImageNet is higher than that of all previously reported embedded FPGA
accelerators. In addition, the accelerator reaches an inference speed of 66.3
FPS on the ImageNet classification task, surpassing prior works with similar
accuracy by at least 11.6×.
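The building block described above (per-channel shift operations feeding 1×1 convolutions, with low-precision weights) can be sketched in NumPy. This is a minimal illustration under stated assumptions: the function names, the per-channel direction assignment, and the uniform symmetric quantization scheme are illustrative choices, not the paper's actual code.

```python
import numpy as np

def shift(x, directions):
    """Replace a spatial (e.g. 3x3) convolution's data gathering with
    per-channel shifts: each channel of x (C, H, W) moves one pixel in
    its assigned direction, zero-padded at the borders."""
    out = np.zeros_like(x)
    for c, (dy, dx) in enumerate(directions):
        out[c] = np.roll(x[c], (dy, dx), axis=(0, 1))
        # zero the wrapped-around rows/columns so the shift is zero-padded
        if dy > 0:   out[c, :dy, :] = 0
        elif dy < 0: out[c, dy:, :] = 0
        if dx > 0:   out[c, :, :dx] = 0
        elif dx < 0: out[c, :, dx:] = 0
    return out

def conv1x1(x, w):
    """Pointwise (1x1) convolution: a per-pixel matrix multiply.
    x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)."""
    c_out, c_in = w.shape
    return (w @ x.reshape(c_in, -1)).reshape(c_out, *x.shape[1:])

def quantize(v, bits=4):
    """Uniform symmetric quantization to `bits` bits (one of several
    possible schemes; the paper's exact scheme may differ)."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit signed values
    scale = float(np.abs(v).max()) / qmax
    if scale == 0.0:
        scale = 1.0
    return np.clip(np.round(v / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))           # 4 channels, 8x8 feature map
dirs = [(0, 0), (1, 0), (-1, 0), (0, 1)]     # one shift direction per channel
w = quantize(rng.standard_normal((6, 4)))    # 4-bit weights
y = conv1x1(shift(x, dirs), w)               # shift -> 1x1 conv block
assert y.shape == (6, 8, 8)
```

Shifts move data without any multiplications, so on an FPGA they reduce to wiring and addressing rather than DSP usage, which is one reason they pair well with 1×1 convolutions in a hardware-tailored ConvNet.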
Very Low Power Neural Network FPGA Accelerators for Tag-Less Remote Person Identification Using Capacitive Sensors
Human detection, identification, and monitoring are essential for many applications that aim to make indoor environments smarter, since that is where most people spend much of their time (at home, in the office, in transportation, or in public spaces). Capacitive sensors can meet stringent privacy, power, cost, and unobtrusiveness requirements and do not rely on wearables or specific human interactions, but they may need significant on-board data processing to increase their performance. We comparatively analyze, in terms of overall processing time and energy, several data-processing implementations of multilayer perceptron neural networks (NNs) on board capacitive sensors. The NN architecture, optimized using augmented experimental data, consists of six 17-bit inputs, two hidden layers with eight neurons each, and one four-bit output. For the software (SW) NN implementation, we use two STMicroelectronics STM32 low-power ARM microcontrollers (MCUs): one MCU optimized for power and one for performance. For the hardware (HW) implementations, we use four ultra-low-power field-programmable gate arrays (FPGAs) with different sizes, dedicated computation blocks, and data communication interfaces (one FPGA from the Lattice iCE40 family and three FPGAs from the Microsemi IGLOO family). Our shortest SW implementation latency is 54.4 µs and the lowest energy per inference is 990 nJ, while the shortest HW implementation latency is 1.99 µs and the lowest energy is 39 nJ (including the data transfer between MCU and FPGA). The FPGAs' active power ranges between 6.24 and 34.7 mW, while their static power is between 79 and 277 µW. They compare very favorably with the static power consumption of Xilinx and Altera low-power device families, which is around 40 mW. The experimental results show that NN inferences offloaded to external FPGAs have lower latency and energy than SW ones (even when using HW multipliers), and that the FPGAs with dedicated computational blocks (multiply-accumulate) perform best.
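The 6-8-8-1 topology described in the abstract is small enough to sketch directly. The sketch below uses floating point, random weights, and a ReLU activation purely for illustration; the actual implementations use trained parameters in fixed-point arithmetic (17-bit inputs, 4-bit output).

```python
import numpy as np

# Layer sizes from the abstract: six inputs, two hidden layers of eight
# neurons, one output. Weights, biases, and the ReLU activation are
# illustrative assumptions, not the deployed network's parameters.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((8, 6)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((8, 8)), rng.standard_normal(8)
W3, b3 = rng.standard_normal((1, 8)), rng.standard_normal(1)

def relu(v):
    return np.maximum(v, 0.0)

def mlp_infer(x):
    """One forward pass of the 6-8-8-1 multilayer perceptron."""
    h1 = relu(W1 @ x + b1)
    h2 = relu(W2 @ h1 + b2)
    return W3 @ h2 + b3

y = mlp_infer(rng.standard_normal(6))   # six capacitive-sensor features
assert y.shape == (1,)

# Multiply-accumulate count per inference: 6*8 + 8*8 + 8*1 = 120 MACs.
macs = 6 * 8 + 8 * 8 + 8 * 1
```

With roughly 120 multiply-accumulates per inference, the workload is dominated by MAC operations, which is consistent with the finding that the FPGAs with dedicated multiply-accumulate blocks perform best.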
Survey and Benchmarking of Machine Learning Accelerators
Advances in multicore processors and accelerators have opened the floodgates
to greater exploration and application of machine learning techniques to a
variety of applications. These advances, along with breakdowns of several
trends including Moore's Law, have prompted an explosion of processors and
accelerators that promise even greater computational and machine learning
capabilities. These processors and accelerators are coming in many forms, from
CPUs and GPUs to ASICs, FPGAs, and dataflow accelerators. This paper surveys
the current state of these processors and accelerators that have been publicly
announced with performance and power consumption numbers. The performance and
power values are plotted on a scatter graph and a number of dimensions and
observations from the trends on this plot are discussed and analyzed. For
instance, there are interesting trends in the plot regarding power consumption,
numerical precision, and inference versus training. We then select and
benchmark two commercially-available low size, weight, and power (SWaP)
accelerators as these processors are the most interesting for embedded and
mobile machine learning inference applications that are most applicable to the
DoD and other SWaP constrained users. We determine how they actually perform
with real-world images and neural network models, compare those results to the
reported performance and power consumption values and evaluate them against an
Intel CPU that is used in some embedded applications.
Comment: 9 pages, 3 figures, IEEE-HPEC conference, Waltham, MA, September 24-26, 201
Low power and high performance heterogeneous computing on FPGAs
The abstract is in the attachment.