FPGA-accelerated machine learning inference as a service for particle physics computing
New heterogeneous computing paradigms on dedicated hardware with increased
parallelization, such as Field Programmable Gate Arrays (FPGAs), offer exciting
solutions with large potential gains. The growing use of machine
learning algorithms in particle physics for simulation, reconstruction, and
analysis makes such platforms natural deployment targets. We demonstrate that the
acceleration of machine learning inference as a web service represents a
heterogeneous computing solution for particle physics experiments that
potentially requires minimal modification to the current computing model. As
examples, we retrain the ResNet-50 convolutional neural network to demonstrate
state-of-the-art performance for top quark jet tagging at the LHC and apply a
ResNet-50 model with transfer learning for neutrino event classification. Using
Project Brainwave by Microsoft to accelerate the ResNet-50 image classification
model, we achieve average inference times of 60 (10) milliseconds with our
experimental physics software framework using Brainwave as a cloud (edge or
on-premises) service, representing an improvement by a factor of approximately
30 (175) in model inference latency over traditional CPU inference in current
experimental hardware. A single FPGA service accessed by many CPUs achieves a
throughput of 600--700 inferences per second using an image batch of one,
comparable to large batch-size GPU throughput and significantly better than
small batch-size GPU throughput. Deployed as an edge or cloud service for the
particle physics computing model, coprocessor accelerators can have a higher
duty cycle and are potentially much more cost-effective.
Comment: 16 pages, 14 figures, 2 tables
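
To make the service pattern concrete, here is a minimal, hypothetical client-side sketch in Python: a CPU-bound producer ships one preprocessed image per request to a remote accelerator endpoint and times the round trip. The endpoint URL, payload layout, and response format are illustrative assumptions, not the actual Brainwave interface.

# Hypothetical client-side sketch of ML inference as a web service.
# The endpoint URL and JSON schema are assumptions for illustration,
# not the real Brainwave service interface.
import time
import numpy as np
import requests

SERVICE_URL = "http://fpga-service.example.org/v1/models/resnet50:predict"  # assumed

def classify(image: np.ndarray) -> np.ndarray:
    """Send one preprocessed image (batch of one) to the remote FPGA service."""
    payload = {"instances": image[np.newaxis].tolist()}
    start = time.perf_counter()
    response = requests.post(SERVICE_URL, json=payload, timeout=5.0)
    response.raise_for_status()
    latency_ms = (time.perf_counter() - start) * 1e3
    print(f"inference latency: {latency_ms:.1f} ms")
    return np.asarray(response.json()["predictions"][0])

# Example: a dummy 224x224 RGB image, the input size expected by ResNet-50.
scores = classify(np.random.rand(224, 224, 3).astype(np.float32))
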
Fast Neural Network Inference on FPGAs for Triggering on Long-Lived Particles at Colliders
Experimental particle physics demands a sophisticated trigger and acquisition
system capable of efficiently retaining the collisions of interest for further
investigation. Heterogeneous computing with FPGA cards may emerge as a leading
technology for the triggering strategy of the upcoming
high-luminosity program of the Large Hadron Collider at CERN. In this context,
we present two machine-learning algorithms for selecting events in which neutral
long-lived particles decay within the detector volume, studying their accuracy
and inference time when accelerated on commercially available Xilinx FPGA
accelerator cards. The inference time is also compared with that of CPU- and
GPU-based hardware setups. The proposed algorithms are shown to be efficient for
the considered benchmark physics scenario, and their accuracy is found not to
degrade when accelerated on the FPGA cards. The results indicate that all
tested architectures fit within the latency requirements of a second-level
trigger farm and that exploiting accelerator technologies for real-time
processing of particle-physics collisions is a promising research field that
deserves further investigation, in particular with machine-learning models
with a large number of trainable parameters.
Comment: 12 pages, 9 figures, 2 tables
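
For flavor, the sketch below shows the kind of batch-of-one latency measurement such studies perform, timing a small dense event classifier on CPU. The architecture, input size, and feature content are assumptions for illustration, not the models from the paper.

# Toy batch-of-one latency benchmark for a small trigger classifier.
# Architecture and input dimensionality are illustrative assumptions.
import time
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(56,)),                     # assumed detector features
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # LLP vs background score
])

x = np.random.rand(1, 56).astype(np.float32)
model(x)  # warm-up call to exclude one-time setup overhead

times = []
for _ in range(1000):
    start = time.perf_counter()
    model(x)
    times.append(time.perf_counter() - start)
print(f"median CPU latency: {np.median(times) * 1e6:.0f} us")
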
hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices
Accessible machine learning algorithms, software, and diagnostic tools for
energy-efficient devices and systems are extremely valuable across a broad
range of application domains. In scientific domains, real-time near-sensor
processing can drastically improve experimental design and accelerate
scientific discoveries. To support domain scientists, we have developed hls4ml,
an open-source software-hardware codesign workflow to interpret and translate
machine learning algorithms for implementation with both FPGA and ASIC
technologies. We expand on previous hls4ml work by extending capabilities and
techniques towards low-power implementations and increased usability: new
Python APIs, quantization-aware pruning, end-to-end FPGA workflows, long
pipeline kernels for low power, and new device backends, including an ASIC
workflow. Taken together, these and continued efforts in hls4ml will arm a new
generation of domain scientists with accessible, efficient, and powerful tools
for machine-learning-accelerated discovery.
Comment: 10 pages, 8 figures, TinyML Research Symposium 2021
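
As an illustration of the Python API mentioned above, a minimal hls4ml conversion of a Keras model might look like the following sketch; the stand-in model, output directory, and FPGA part number are placeholder choices.

# Minimal sketch of the hls4ml Python API: translate a (stand-in) Keras
# model into an HLS project for FPGA synthesis. Part number and output
# directory are placeholders; any trained model is converted the same way.
import numpy as np
import tensorflow as tf
import hls4ml

model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(5, activation="softmax"),
])

# The config exposes fixed-point precision and parallelism (ReuseFactor) knobs.
config = hls4ml.utils.config_from_keras_model(model, granularity="model")

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir="hls4ml_prj",       # generated HLS project directory
    part="xcu250-figd2104-2L-e",   # example Xilinx part; substitute your target
)
hls_model.compile()                # builds a bit-accurate C-simulation library

x_test = np.random.rand(100, 16).astype(np.float32)
print(hls_model.predict(x_test)[:2])  # compare against model.predict(x_test)
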
Fast convolutional neural networks on FPGAs with hls4ml
We introduce an automated tool for deploying ultra-low-latency, low-power
deep neural networks with convolutional layers on FPGAs. By extending the
hls4ml library, we demonstrate an inference latency of 5 µs using
convolutional architectures, targeting microsecond latency applications like
those at the CERN Large Hadron Collider. Considering benchmark models trained
on the Street View House Numbers Dataset, we demonstrate various methods for
model compression in order to fit the computational constraints of a typical
FPGA device used in trigger and data acquisition systems of particle detectors.
In particular, we discuss pruning and quantization-aware training, and
demonstrate how resource utilization can be significantly reduced with little
to no loss in model accuracy. We show that the FPGA critical resource
consumption can be reduced by 97% with zero loss in model accuracy, and by 99%
when tolerating a 6% accuracy degradation.
Comment: 18 pages, 18 figures, 4 tables
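
The two compression techniques discussed here can be sketched with the QKeras and TensorFlow Model Optimization libraries, which the hls4ml ecosystem builds on; the bit widths, layer sizes, and 75% sparsity target below are illustrative choices, not the paper's exact settings.

# Sketch of quantization-aware training (QKeras) combined with magnitude
# pruning (tfmot). Bit widths and the 75% sparsity target are illustrative.
import tensorflow as tf
import tensorflow_model_optimization as tfmot
from qkeras import QConv2D, QDense, QActivation, quantized_bits, quantized_relu

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),  # SVHN-sized RGB inputs
    QConv2D(16, (3, 3),
            kernel_quantizer=quantized_bits(6, 0, alpha=1),
            bias_quantizer=quantized_bits(6, 0, alpha=1)),
    QActivation(quantized_relu(6)),     # 6-bit activations
    tf.keras.layers.Flatten(),
    QDense(10,
           kernel_quantizer=quantized_bits(6, 0, alpha=1),
           bias_quantizer=quantized_bits(6, 0, alpha=1)),
    tf.keras.layers.Activation("softmax"),
])

# Wrap with magnitude pruning so 75% of the weights are zeroed during
# fine-tuning; QKeras layers implement tfmot's prunable-layer interface.
pruned = tfmot.sparsity.keras.prune_low_magnitude(
    model,
    pruning_schedule=tfmot.sparsity.keras.ConstantSparsity(0.75, begin_step=2000),
)
pruned.compile(optimizer="adam", loss="categorical_crossentropy")
# pruned.fit(x_train, y_train, callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
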
Accelerated Charged Particle Tracking with Graph Neural Networks on FPGAs
We develop and study FPGA implementations of algorithms for charged particle
tracking based on graph neural networks. The two complementary FPGA designs are
based on OpenCL, a framework for writing programs that execute across
heterogeneous platforms, and hls4ml, a high-level-synthesis-based compiler for
neural network to firmware conversion. We evaluate and compare the resource
usage, latency, and tracking performance of our implementations based on a
benchmark dataset. We find that a considerable speedup over CPU-based execution is
possible, potentially enabling such algorithms to be used effectively in future
computing workflows and the FPGA-based Level-1 trigger at the CERN Large Hadron
Collider.
Comment: 8 pages, 4 figures, to appear in Third Workshop on Machine Learning
and the Physical Sciences (NeurIPS 2020)
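
A minimal sketch of the edge-classifying pattern underlying such GNN trackers, reduced to a single NumPy step: hits are encoded, and each candidate track segment (edge) is scored from its endpoint embeddings. All dimensions and weights here are random placeholders, not the paper's architecture.

# Minimal NumPy sketch of edge classification on a hit graph, the core
# operation of GNN-based track segment finding. Everything is illustrative.
import numpy as np

rng = np.random.default_rng(0)

n_hits, n_edges, d = 8, 12, 4
x = rng.normal(size=(n_hits, 3))            # hit features, e.g. (r, phi, z)
senders = rng.integers(0, n_hits, n_edges)  # edge source hit indices
receivers = rng.integers(0, n_hits, n_edges)  # edge target hit indices

W_node = rng.normal(size=(3, d))            # hit encoder weights
W_edge = rng.normal(size=(2 * d, 1))        # segment classifier weights

h = np.tanh(x @ W_node)                     # encode hits into embeddings
# Score each candidate segment from its two endpoint embeddings.
edge_in = np.concatenate([h[senders], h[receivers]], axis=1)
segment_score = 1.0 / (1.0 + np.exp(-(edge_in @ W_edge)))  # sigmoid

print("true-segment probabilities:", segment_score.ravel().round(3))
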
Snowmass 2021 Computational Frontier CompF03 Topical Group Report: Machine Learning
The rapidly-developing intersection of machine learning (ML) with high-energy
physics (HEP) presents both opportunities and challenges to our community. Far
beyond applications of standard ML tools to HEP problems, genuinely new and
potentially revolutionary approaches are being developed by a generation of
talent literate in both fields. There is an urgent need to support the
interdisciplinary community driving these developments, including funding
dedicated research at the intersection of the two fields, investing in
high-performance computing at universities and tailoring allocation policies to
support this work, developing community tools and standards, and providing
education and career paths for young researchers attracted by the intellectual
vitality of machine learning for high-energy physics.
Comment: Contribution to Snowmass 2021
FPGA Implementation of Convolutional Neural Networks with Fixed-Point Calculations
Neural network-based methods for image processing are becoming widely used in
practical applications. Modern neural networks are computationally expensive
and require specialized hardware, such as graphics processing units. Since such
hardware is not always available in real-life applications, there is a
compelling need to design neural networks for mobile devices. Mobile
neural networks typically have a reduced number of parameters and require
relatively few arithmetic operations. However, they are usually still
executed in software and use floating-point calculations. The use
of mobile networks without further optimization may not provide sufficient
performance when high processing speed is required, for example, in real-time
video processing (30 frames per second). In this study, we suggest
optimizations to speed up computations in order to efficiently use already
trained neural networks on a mobile device. Specifically, we propose an
approach for speeding up neural networks by moving computation from software to
hardware and by using fixed-point calculations instead of floating-point. We
propose a number of methods for neural network architecture design to improve
the performance with fixed-point calculations. We also show an example of how
existing datasets can be modified and adapted for the recognition task at hand.
Finally, we present the design and implementation of a field-programmable gate
array (FPGA)-based device that solves the practical problem of real-time
handwritten digit classification from a mobile camera video feed.
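
The core fixed-point idea can be sketched in a few lines: represent each value as an integer with an implicit power-of-two scale, and run the multiply-accumulate entirely in integer arithmetic. The Q-format and bit widths below are assumptions for illustration, not the paper's chosen formats.

# Sketch of fixed-point inference arithmetic: floats become 16-bit integers
# with a power-of-two scale, and the dot product runs purely in integers.
import numpy as np

FRAC_BITS = 12        # 12 fractional bits -> resolution of 2**-12
SCALE = 1 << FRAC_BITS

def to_fixed(x):
    """Quantize a float array to 16-bit fixed point."""
    return np.round(x * SCALE).astype(np.int16)

def fixed_dot(w_q, x_q):
    # A 32-bit accumulator avoids overflow; the shift restores the Q-format.
    acc = np.sum(w_q.astype(np.int32) * x_q.astype(np.int32))
    return acc >> FRAC_BITS

w = np.array([0.5, -1.25, 0.1875])
x = np.array([1.0, 0.5, -2.0])

y_float = float(np.dot(w, x))
y_fixed = fixed_dot(to_fixed(w), to_fixed(x)) / SCALE
print(f"float: {y_float:.6f}  fixed-point: {y_fixed:.6f}")
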