High throughput spatial convolution filters on FPGAs
Digital signal processing (DSP) on field-programmable gate arrays (FPGAs) has long been appealing because the inherent parallelism in these computations can be easily exploited to accelerate such algorithms. FPGAs have evolved significantly to further enhance the mapping of these algorithms, including additional hard blocks such as the DSP blocks found in modern FPGAs. Although these DSP blocks can offer more efficient mapping of DSP computations, they are primarily designed for 1-D filter structures. We present a study on spatial convolutional filter implementations on FPGAs, optimizing around the structure of the DSP blocks to offer high throughput while maintaining the coefficient flexibility that other published architectures usually sacrifice. We show that it is possible to implement large filters for large 4K resolution image frames at frame rates of 30–60 FPS, while maintaining functional flexibility.
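To illustrate why a 2-D spatial filter can be mapped onto hardware built for 1-D filter structures, the sketch below computes a 2-D filter as a sum of 1-D row filters, one per kernel row. It is a minimal software model, not the architecture from the paper: the function names (`fir_1d`, `filter2d_by_rows`, `pixel_rate`) are hypothetical, the filter is written in correlation form (no kernel flip) with valid-mode borders, and "4K" is assumed to mean 3840x2160.

```python
def fir_1d(row, taps):
    """Valid-mode 1-D FIR in correlation form: out[x] = sum_j taps[j] * row[x + j]."""
    k = len(taps)
    return [sum(taps[j] * row[x + j] for j in range(k))
            for x in range(len(row) - k + 1)]


def filter2d_by_rows(image, kernel):
    """Valid-mode 2-D filter built as a sum of 1-D row filters.

    Kernel row i is applied as a 1-D FIR to image row y + i, and the
    partial results are accumulated into output row y.
    """
    kh = len(kernel)
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - len(kernel[0]) + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i, taps in enumerate(kernel):
        for y in range(out_h):
            partial = fir_1d(image[y + i], taps)
            for x in range(out_w):
                out[y][x] += partial[x]
    return out


def pixel_rate(width, height, fps):
    """Pixels per second that must be produced to sustain a given frame rate."""
    return width * height * fps


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
result = filter2d_by_rows(image, kernel)   # [[6, 8], [12, 14]]
rate_60 = pixel_rate(3840, 2160, 60)       # 497,664,000 pixels per second
```

Under the 3840x2160 assumption, 60 FPS corresponds to roughly 498 million output pixels per second, which gives a sense of the throughput such a design has to sustain.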
LightRW: FPGA Accelerated Graph Dynamic Random Walks
Graph dynamic random walks (GDRWs) have recently emerged as a powerful
paradigm for graph analytics and learning applications, including graph
embedding and graph neural networks. Despite the fact that many existing
studies optimize the performance of GDRWs on multi-core CPUs, massive random
memory accesses and costly synchronizations cause severe resource
underutilization, and the processing of GDRWs is usually the key performance
bottleneck in many graph applications. This paper studies an alternative
architecture, the FPGA, to address these issues in GDRWs: FPGAs allow hardware
customization, which lets us explore fine-grained pipeline execution and
specialized memory access optimizations. Specifically, we propose
LightRW, a novel FPGA-based accelerator for GDRWs. LightRW embraces a series
of optimizations to enable fine-grained pipeline execution on the chip and to
exploit the massive parallelism of FPGA while significantly reducing memory
accesses. As the sampling methods commonly used in GDRWs do not efficiently
support fine-grained pipeline execution, we develop a parallelized reservoir
sampling method to sample multiple vertices per cycle for efficient pipeline
execution. To address the random memory access issues, we propose a
degree-aware configurable caching method that buffers hot vertices on-chip to
alleviate random memory accesses and a dynamic burst access engine that
efficiently retrieves neighbors. Experimental results show that our
optimization techniques are able to improve the performance of GDRWs on FPGA
significantly. Moreover, LightRW delivers up to 9.55x and 9.10x speedup over
the state-of-the-art CPU-based MetaPath and Node2vec random walks,
respectively. This work is open-sourced on GitHub at
https://github.com/Xtra-Computing/LightRW.
Comment: Accepted to SIGMOD 202
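As background for the sampling step mentioned in the abstract, the sketch below shows the classic sequential form of weighted reservoir sampling for picking one neighbor in a single pass. It is not LightRW's parallelized method, which samples multiple vertices per cycle in hardware; the function name `weighted_reservoir_pick` and its parameters are hypothetical. The single-pass, constant-state structure is what makes reservoir sampling a natural fit for streaming pipelines, since no prefix-sum table over the neighbor weights is needed.

```python
import random


def weighted_reservoir_pick(neighbors, weights, rng=random):
    """Pick one neighbor with probability proportional to its weight, in one pass.

    Only the running weight total and the current choice are kept as state.
    Neighbors with non-positive weight are skipped; returns None if no
    neighbor is eligible.
    """
    chosen = None
    total = 0.0
    for vertex, weight in zip(neighbors, weights):
        if weight <= 0:
            continue
        total += weight
        # Replace the current choice with probability weight / running total.
        # The first eligible neighbor is always taken (probability 1).
        if rng.random() < weight / total:
            chosen = vertex
    return chosen


# One step of a random walk over a toy adjacency list (hypothetical data).
rng = random.Random(0)
adjacency = {0: ([1, 2], [1.0, 3.0])}
next_vertex = weighted_reservoir_pick(*adjacency[0], rng=rng)
```

In a dynamic random walk the weights would be recomputed at each step from the walker's state (for example, the previously visited vertex in Node2vec), which is why a method that does not require preprocessing the weights is attractive.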
MLPerf Inference Benchmark
Machine-learning (ML) hardware and software system demand is burgeoning.
Driven by ML applications, the number of different ML inference systems has
exploded. Over 100 organizations are building ML inference chips, and the
systems that incorporate existing models span at least three orders of
magnitude in power consumption and five orders of magnitude in performance;
they range from embedded devices to data-center solutions. Fueling the hardware
are a dozen or more software frameworks and libraries. The myriad combinations
of ML hardware and ML software make assessing ML-system performance in an
architecture-neutral, representative, and reproducible manner challenging.
There is a clear need for industry-wide standard ML benchmarking and evaluation
criteria. MLPerf Inference answers that call. In this paper, we present our
benchmarking method for evaluating ML inference systems. Driven by more than 30
organizations as well as more than 200 ML engineers and practitioners, MLPerf
prescribes a set of rules and best practices to ensure comparability across
systems with wildly differing architectures. The first call for submissions
garnered more than 600 reproducible inference-performance measurements from 14
organizations, representing over 30 systems that showcase a wide range of
capabilities. The submissions attest to the benchmark's flexibility and
adaptability.
Comment: ISCA 202