6 research outputs found

    Fast convolutional neural networks on FPGAs with hls4ml

    Get PDF
    We introduce an automated tool for deploying ultra-low-latency, low-power deep neural networks with convolutional layers on FPGAs. By extending the hls4ml library, we demonstrate an inference latency of 5 μs using convolutional architectures, targeting microsecond-latency applications like those at the CERN Large Hadron Collider. Considering benchmark models trained on the Street View House Numbers dataset, we demonstrate various methods for model compression in order to fit the computational constraints of a typical FPGA device used in trigger and data acquisition systems of particle detectors. In particular, we discuss pruning and quantization-aware training, and demonstrate how resource utilization can be significantly reduced with little to no loss in model accuracy. We show that the FPGA critical resource consumption can be reduced by 97% with zero loss in model accuracy, and by 99% when tolerating a 6% accuracy degradation. Comment: 18 pages, 18 figures, 4 tables
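    The two compression methods discussed above can be sketched in plain Python. This is a minimal illustration, not the hls4ml API: the function names, the 40% sparsity target, and the ap_fixed<8,3>-style format are assumptions chosen for the example.

```python
# Sketch of magnitude-based pruning and fixed-point quantization,
# the two compression methods named in the abstract.
# Helper names and the 8-bit fixed-point format are illustrative assumptions.

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity` fraction are zero."""
    flat = sorted(abs(w) for w in weights)
    k = int(len(flat) * sparsity)
    threshold = flat[k - 1] if k > 0 else -1.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

def quantize_fixed(w, total_bits=8, int_bits=3):
    """Round to a signed fixed-point grid, e.g. an ap_fixed<8,3>-like type on an FPGA."""
    frac_bits = total_bits - int_bits
    scale = 2 ** frac_bits
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    q = max(lo, min(hi, round(w * scale)))
    return q / scale

weights = [0.81, -0.02, 0.30, 0.005, -0.55]
pruned = prune_by_magnitude(weights, sparsity=0.4)
quantized = [quantize_fixed(w) for w in pruned]
print(pruned)      # smallest 40% of weights zeroed
print(quantized)   # surviving weights snapped to a 1/32 grid
```

    Pruned weights cost no multipliers on the FPGA, and narrower fixed-point types shrink each remaining multiplier, which is why the two techniques compound in the reported 97-99% resource reductions.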

    Tiny Neural Networks for Environmental Predictions: an integrated approach with Miosix

    No full text
    Collecting vast amounts of data and performing complex calculations to feed modern Numerical Weather Prediction (NWP) algorithms requires centralizing intelligence in some of the most powerful, energy- and resource-hungry supercomputers in the world. This is due to the chaotic, complex nature of the atmosphere, whose interpretation requires virtually unlimited computing and storage resources. With Machine Learning (ML) techniques, a statistical approach can be designed to perform weather forecasting. Moreover, the recently growing interest in Edge Computing and Tiny Intelligent architectures is prompting a shift towards the deployment of ML algorithms on Tiny Embedded Systems (ES). This paper describes how Deep but Tiny Neural Networks (DTNN) can be designed to be parsimonious and automatically converted into an STM32 microcontroller-optimized C library through the X-CUBE-AI toolchain; we propose integrating the obtained library with Miosix, a Real-Time Operating System (RTOS) tailored for resource-constrained and tiny processors, which is an enabling factor for system scalability and multitasking. With our experiments we demonstrate that it is possible to deploy a DTNN, with a FLASH and RAM occupation of 45.5 KByte and 480 Byte respectively, for atmospheric pressure forecasting in an affordable, cost-effective system. We deployed the system in a real context, obtaining the same prediction quality as the same DNN model deployed on the cloud, but with the advantage of processing all the data needed for the prediction close to the environmental sensors, avoiding raw-data traffic to the cloud.
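    A quick way to judge whether a tiny network fits a FLASH/RAM budget like the 45.5 KByte / 480 Byte figures above is to count parameters and activation buffers. The layer widths and the accounting below are hypothetical assumptions for illustration, not the paper's model or X-CUBE-AI's actual report.

```python
# Back-of-the-envelope footprint check for a stack of dense layers,
# assuming float32 weights stored in FLASH and that RAM must hold the
# input and output activation buffers of one layer at a time.
# The layer widths are hypothetical, not the model from the paper.

def footprint(layer_widths, bytes_per_value=4):
    """Return (flash_bytes, ram_bytes) for a stack of dense layers."""
    flash = 0
    for n_in, n_out in zip(layer_widths, layer_widths[1:]):
        flash += (n_in * n_out + n_out) * bytes_per_value  # weights + biases
    # RAM peak: the largest pair of adjacent activation buffers coexisting
    ram = max(a + b for a, b in zip(layer_widths, layer_widths[1:])) * bytes_per_value
    return flash, ram

flash, ram = footprint([24, 16, 8, 1])
print(f"FLASH: {flash} B, RAM: {ram} B")
```

    This kind of estimate makes clear why the paper's model fits comfortably on an STM32-class microcontroller: a few small dense layers cost kilobytes of FLASH and only hundreds of bytes of activation RAM.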

    Real-time semantic segmentation on FPGAs for autonomous vehicles with hls4ml

    Get PDF
    In this paper, we investigate how field-programmable gate arrays can serve as hardware accelerators for real-time semantic segmentation tasks relevant to autonomous driving. Considering compressed versions of the ENet convolutional neural network architecture, we demonstrate a fully-on-chip deployment with a latency of 4.9 ms per image, using less than 30% of the available resources on a Xilinx ZCU102 evaluation board. The latency is reduced to 3 ms per image when increasing the batch size to ten, corresponding to the use case where the autonomous vehicle receives inputs from multiple cameras simultaneously. We show, through aggressive filter reduction, heterogeneous quantization-aware training, and an optimized implementation of convolutional layers, that the power consumption and resource utilization can be significantly reduced while maintaining accuracy on the Cityscapes dataset. ISSN: 2632-215
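    The "aggressive filter reduction" mentioned above is commonly done by ranking convolutional filters by magnitude and dropping the weakest. The sketch below is an illustrative assumption of that idea using an L1-norm criterion on toy filter values, not the paper's actual procedure.

```python
# Sketch of filter (channel) reduction: rank convolutional filters by
# L1 norm and keep only the strongest fraction. Filter values are toy data,
# and the L1 criterion is an assumed ranking, not the paper's exact method.

def reduce_filters(filters, keep_fraction):
    """Keep the `keep_fraction` of filters with the largest L1 norm."""
    ranked = sorted(filters, key=lambda f: sum(abs(w) for w in f), reverse=True)
    n_keep = max(1, round(len(filters) * keep_fraction))
    kept = ranked[:n_keep]
    # preserve the original ordering of the surviving filters
    return [f for f in filters if f in kept]

filters = [
    [0.9, -0.8, 0.7],   # L1 = 2.4  -> kept
    [0.1, 0.0, -0.1],   # L1 = 0.2  -> dropped
    [-0.5, 0.4, 0.6],   # L1 = 1.5  -> kept
    [0.05, 0.02, 0.0],  # L1 = 0.07 -> dropped
]
print(reduce_filters(filters, keep_fraction=0.5))
```

    Unlike weight-level pruning, removing whole filters shrinks every downstream layer's input as well, which is why it is effective for cutting both latency and on-chip resource use.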

    Autoencoders on field-programmable gate arrays for real-time, unsupervised new physics detection at 40 MHz at the Large Hadron Collider

    No full text
    In this paper, we show how to adapt and deploy anomaly-detection algorithms based on deep autoencoders for the unsupervised detection of new-physics signatures in the extremely challenging environment of a real-time event selection system at the Large Hadron Collider (LHC). We demonstrate that new-physics signatures can be enhanced by three orders of magnitude while staying within the strict latency and resource constraints of a typical LHC event filtering system. This would allow for collecting datasets potentially enriched with high-purity contributions from new-physics processes. Through per-layer, highly parallel implementations of network layers, support for autoencoder-specific losses on FPGAs, and latent-space-based inference, we demonstrate that anomaly detection can be performed in as little as 80 ns using less than 3% of the logic resources in the Xilinx Virtex VU9P FPGA, opening the way to real-life applications of this idea during the next data-taking campaign of the LHC.
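    The core selection criterion behind autoencoder-based anomaly detection is simple: events the model reconstructs poorly are flagged. The sketch below illustrates that criterion with a toy stand-in for the trained model; the saturating "reconstruction" and the threshold value are assumptions for demonstration only.

```python
# Sketch of the anomaly-detection criterion used with autoencoders:
# events whose reconstruction error exceeds a threshold are flagged.
# `toy_reconstruct` is a hypothetical stand-in for a trained autoencoder.

def mse(x, x_hat):
    """Mean squared reconstruction error between an event and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def flag_anomalies(events, reconstruct, threshold):
    """Return indices of events with reconstruction MSE above `threshold`."""
    return [i for i, e in enumerate(events) if mse(e, reconstruct(e)) > threshold]

def toy_reconstruct(event):
    # A model trained only on "standard" events reproduces them well
    # but fails on out-of-distribution inputs; here it saturates at 1.0.
    return [min(v, 1.0) for v in event]

events = [
    [0.2, 0.5, 0.9],   # typical event, reconstructed exactly
    [0.1, 3.0, 0.4],   # anomalous spike, poorly reconstructed
]
print(flag_anomalies(events, toy_reconstruct, threshold=0.1))  # → [1]
```

    The abstract's latent-space-based inference is a refinement of this idea: scoring events from the encoder's compressed representation avoids computing the full decoder on the FPGA, which helps reach the reported 80 ns.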

    Open-source FPGA-ML codesign for the MLPerf Tiny Benchmark

    No full text
    We present our development experience and recent results for the MLPerf Tiny Inference Benchmark on field-programmable gate array (FPGA) platforms. We use the open-source hls4ml and FINN workflows, which aim to democratize AI-hardware codesign of optimized neural networks on FPGAs. We present the design and implementation process for the keyword spotting, anomaly detection, and image classification benchmark tasks. The resulting hardware implementations are quantized, configurable, spatial dataflow architectures tailored for speed and efficiency, and introduce new generic optimizations and common workflows developed as part of this work. The full workflow is presented, from quantization-aware training to FPGA implementation. The solutions are deployed on system-on-chip (Pynq-Z2) and pure FPGA (Arty A7-100T) platforms. The resulting submissions achieve latencies as low as 20 μs and energy consumption as low as 30 μJ per inference. We demonstrate how emerging ML benchmarks on heterogeneous hardware platforms can catalyze collaboration and the development of new techniques and more accessible tools.
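    The two headline figures above jointly imply an average power draw during inference, which is a useful sanity check when comparing such submissions. The calculation below uses only the numbers quoted in the abstract.

```python
# Sanity check relating the reported latency and energy figures:
# average power during inference = energy per inference / latency.

def avg_power_watts(energy_joules, latency_seconds):
    return energy_joules / latency_seconds

# Using the abstract's figures: 30 μJ per inference at 20 μs latency.
power = avg_power_watts(30e-6, 20e-6)
print(f"{power:.1f} W")  # 1.5 W average during inference
```

    Note that the 20 μs and 30 μJ figures are each quoted as best-case minima, possibly from different submissions, so 1.5 W is an order-of-magnitude estimate rather than a measured operating point.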