FPGA-accelerated machine learning inference as a service for particle physics computing
New heterogeneous computing paradigms on dedicated hardware with increased
parallelization, such as Field Programmable Gate Arrays (FPGAs), offer exciting
solutions with large potential gains. The growing applications of machine
learning algorithms in particle physics for simulation, reconstruction, and
analysis are naturally deployed on such platforms. We demonstrate that the
acceleration of machine learning inference as a web service represents a
heterogeneous computing solution for particle physics experiments that
potentially requires minimal modification to the current computing model. As
examples, we retrain the ResNet-50 convolutional neural network to demonstrate
state-of-the-art performance for top quark jet tagging at the LHC and apply a
ResNet-50 model with transfer learning for neutrino event classification. Using
Project Brainwave by Microsoft to accelerate the ResNet-50 image classification
model, we achieve average inference times of 60 (10) milliseconds with our
experimental physics software framework using Brainwave as a cloud (edge or
on-premises) service, representing an improvement by a factor of approximately
30 (175) in model inference latency over traditional CPU inference in current
experimental hardware. A single FPGA service accessed by many CPUs achieves a
throughput of 600–700 inferences per second using an image batch of one,
comparable to large batch-size GPU throughput and significantly better than
small batch-size GPU throughput. Deployed as an edge or cloud service for the
particle physics computing model, coprocessor accelerators can have a higher
duty cycle and are potentially much more cost-effective.
Comment: 16 pages, 14 figures, 2 tables
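The service model described in this abstract, many CPU clients sharing one remote accelerator via batch-of-one requests, can be illustrated with a minimal sketch. The `remote_infer` function below is a hypothetical stand-in that simulates the round trip with a fixed 10 ms sleep (the paper's reported edge-service latency); it is not the actual Brainwave interface.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def remote_infer(image):
    """Hypothetical stand-in for a round trip to a remote FPGA
    inference service; simulates a fixed 10 ms service latency
    with a sleep and returns a dummy label."""
    time.sleep(0.010)
    return "top-quark"

def aggregate_throughput(n_clients, n_requests):
    """Issue batch-of-one requests from many concurrent CPU
    'clients' and report aggregate throughput in inferences/s."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        list(pool.map(remote_infer, range(n_requests)))
    elapsed = time.perf_counter() - start
    return n_requests / elapsed
```

With enough concurrent clients, aggregate throughput far exceeds what any single client's round-trip latency would allow, which is the effect the abstract reports for a single shared FPGA service.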
The Potential of the Intel Xeon Phi for Supervised Deep Learning
Supervised learning of Convolutional Neural Networks (CNNs), also known as
supervised Deep Learning, is a computationally demanding process. To find the
most suitable parameters of a network for a given application, numerous
training sessions are required. Therefore, reducing the training time per
session is essential to fully utilize CNNs in practice. While numerous research
groups have addressed the training of CNNs using GPUs, so far not much
attention has been paid to the Intel Xeon Phi coprocessor. In this paper we
investigate empirically and theoretically the potential of the Intel Xeon Phi
for supervised learning of CNNs. We design and implement a parallelization
scheme named CHAOS that exploits both the thread- and SIMD-parallelism of the
coprocessor. Our approach is evaluated on the Intel Xeon Phi 7120P using the
MNIST dataset of handwritten digits for various thread counts and CNN
architectures. Results show a 103.5x speedup when training our large network
for 15 epochs using 244 threads, compared to one thread on the coprocessor.
Moreover, we develop a performance model and use it to assess our
implementation and answer what-if questions.
Comment: The 17th IEEE International Conference on High Performance Computing
and Communications (HPCC 2015), Aug. 24 - 26, 2015, New York, US
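The reported speedup can be put in perspective with a back-of-the-envelope parallel-efficiency calculation; this is a standard reading of the numbers in the abstract, not part of the paper's own performance model.

```python
def parallel_efficiency(speedup, threads):
    """Parallel efficiency: achieved speedup divided by the
    ideal (linear) speedup at the given thread count."""
    return speedup / threads

# The reported CHAOS result: 103.5x speedup with 244 threads
# on the Xeon Phi 7120P, relative to one coprocessor thread.
eff = parallel_efficiency(103.5, 244)  # ~0.42, i.e. ~42% efficiency
```

An efficiency around 42% at 244 threads is plausible for a memory-bound CNN training workload, where SIMD and thread contention keep scaling sublinear.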
PerfSAGE: Generalized Inference Performance Predictor for Arbitrary Deep Learning Models on Edge Devices
The ability to accurately predict deep neural network (DNN) inference
performance metrics, such as latency, power, and memory footprint, for an
arbitrary DNN on a target hardware platform is essential to the design of DNN
based models. This ability is critical for the (manual or automatic) design,
optimization, and deployment of practical DNNs for a specific hardware
deployment platform. Unfortunately, these metrics are slow to evaluate using
simulators (where available) and typically require measurement on the target
hardware. This work describes PerfSAGE, a novel graph neural network (GNN) that
predicts inference latency, energy, and memory footprint on an arbitrary DNN
TFLite graph (TFL, 2017). In contrast, previously published performance
predictors can only predict latency and are restricted to pre-defined
construction rules or search spaces. This paper also describes the EdgeDLPerf
dataset of 134,912 DNNs randomly sampled from four task search spaces and
annotated with inference performance metrics from three edge hardware
platforms. Using this dataset, we train PerfSAGE and provide experimental
results that demonstrate state-of-the-art prediction accuracy with a Mean
Absolute Percentage Error of <5% across all targets and model search spaces.
These results: (1) Outperform previous state-of-the-art GNN-based predictors
(Dudziak et al., 2020), (2) Accurately predict performance on accelerators (a
shortfall of non-GNN-based predictors (Zhang et al., 2021)), and (3)
Demonstrate predictions on arbitrary input graphs without modifications to the
feature extractor.
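The accuracy metric quoted above, Mean Absolute Percentage Error, is straightforward to state precisely; the sketch below uses hypothetical latency values for illustration, not data from the PerfSAGE paper or the EdgeDLPerf dataset.

```python
def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent: the average
    of |true - predicted| / |true| over all samples."""
    assert len(y_true) == len(y_pred) and len(y_true) > 0
    errors = (abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * sum(errors) / len(y_true)

# Hypothetical measured vs. predicted inference latencies (ms):
measured  = [12.0, 30.0, 7.5]
predicted = [11.5, 31.2, 7.2]
error = mape(measured, predicted)  # ~4.1%, i.e. within the <5% bound
```

A MAPE below 5% means the predictor's latency, energy, and memory estimates are, on average, within a few percent of hardware measurements.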