Optimising Sparse Matrix Vector multiplication for large scale FEM problems on FPGA
Sparse Matrix Vector multiplication (SpMV) is an important kernel in many scientific applications. In this work we propose an architecture and an automated customisation method to detect and optimise the architecture for block diagonal sparse matrices. We evaluate the proposed approach in the context of the spectral/hp Finite Element Method, using the local matrix assembly approach. This problem leads to a large sparse system of linear equations with a block diagonal matrix, which is typically solved using an iterative method such as the Preconditioned Conjugate Gradient. The efficiency of the proposed architecture, combined with the effectiveness of the proposed customisation method, reduces BRAM resource utilisation by as much as 10 times, while achieving throughput identical to existing state-of-the-art designs and requiring minimal development effort from the end user. In the context of the Finite Element Method, our approach enables the solution of larger problems than previously possible, extending the applicability of FPGAs to more interesting HPC problems.
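The storage saving the abstract exploits comes from the block diagonal structure: only the dense blocks need to be stored and multiplied, not the full sparse matrix. A minimal software sketch of block-diagonal SpMV (illustrative Python/NumPy only, not the paper's FPGA architecture; the block sizes are made up):

```python
import numpy as np

def block_diag_spmv(blocks, x):
    """Multiply a block-diagonal matrix, given as a list of dense
    diagonal blocks, by a vector -- only the nonzero blocks are touched."""
    y = np.empty_like(x)
    offset = 0
    for B in blocks:
        n = B.shape[0]
        y[offset:offset + n] = B @ x[offset:offset + n]
        offset += n
    return y

# Example: a 4x4 matrix made of two 2x2 diagonal blocks
blocks = [np.array([[2.0, 1.0], [0.0, 3.0]]),
          np.array([[1.0, 0.0], [4.0, 1.0]])]
x = np.array([1.0, 2.0, 3.0, 4.0])
print(block_diag_spmv(blocks, x))  # [ 4.  6.  3. 16.]
```

In a PCG solver this product is the inner-loop kernel, so reducing its memory footprint directly enlarges the problems that fit in on-chip BRAM.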
A unified approach for managing heterogeneous processing elements on FPGAs
FPGA designs do not typically include all available processing elements, e.g., LUTs, DSPs and embedded cores. Additional work is required to manage their different implementations and behaviour, which can unbalance parallel pipelines and complicate development. In this paper we introduce a novel management architecture that unifies heterogeneous processing elements into compute pools. A pool formed of E processing elements, each implementing the same function, serves D parallel function calls. A call-and-response approach to computation allows for different processing element implementations, connections, latencies and non-deterministic behaviour. Our rotating scheduler automatically arbitrates access to processing elements, uses greatly simplified routing, and scales linearly with D parallel accesses to the compute pool. Processing elements can easily be added to improve performance, or removed to reduce resource use and routing, facilitating higher operating frequencies. Migrating to larger or smaller FPGAs thus comes at a known performance cost. We assess our framework with a range of neural network activation functions (ReLU, LReLU, ELU, GELU, sigmoid, swish, softplus and tanh) on the Xilinx Alveo U280.
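The call-and-response pool idea can be modelled in a few lines of software: callers enqueue tagged requests, and each scheduling round dispatches up to E of them to identical processing elements. This is a hypothetical Python sketch of the scheduling behaviour only, not the paper's RTL design; the class and method names are invented:

```python
from collections import deque

class ComputePool:
    """Toy model of a pool of E identical processing elements serving
    D parallel callers via a rotating (round-robin) scheduler."""
    def __init__(self, fn, num_elements):
        self.fn = fn                    # the function every element implements
        self.elements = num_elements    # E
        self.pending = deque()          # call-and-response request queue

    def call(self, caller_id, arg):
        """A caller issues a tagged request into the pool."""
        self.pending.append((caller_id, arg))

    def tick(self):
        """One scheduling round: up to E pending calls are served, and
        responses come back tagged with their caller id."""
        responses = []
        for _ in range(min(self.elements, len(self.pending))):
            caller_id, arg = self.pending.popleft()
            responses.append((caller_id, self.fn(arg)))
        return responses

pool = ComputePool(fn=lambda x: max(0.0, x), num_elements=2)  # a ReLU pool
for cid, v in enumerate([-1.0, 0.5, 2.0]):
    pool.call(cid, v)
print(pool.tick())  # [(0, 0.0), (1, 0.5)] -- first round serves E = 2 calls
print(pool.tick())  # [(2, 2.0)]           -- the remaining call
```

Because responses carry caller ids, element latencies need not be uniform or deterministic, which is what lets heterogeneous implementations share one pool.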
Accelerating fully spectral CNNs with adaptive activation functions on FPGA
Computing convolutional layers in the frequency domain can largely reduce the computation overhead for training and inference of convolutional neural networks (CNNs). However, existing designs built on this idea require repeated spatial- and frequency-domain transforms due to the absence of nonlinear functions in the frequency domain, which makes the benefit less attractive for low-latency inference. This paper presents a fully spectral CNN approach by proposing a novel adaptive Rectified Linear Unit (ReLU) activation in the spectral domain. The proposed design maintains the non-linearity in the network while taking hardware efficiency into account at the algorithm level. The spectral model size is further optimized by merging and fusing layers. A customized hardware architecture is then proposed to implement the designed spectral network on an FPGA device, with DSP optimizations for 8-bit fixed-point multipliers. Our hardware accelerator is implemented on Intel's Arria 10 device and applied to the MNIST, SVHN, AT&T and CIFAR-10 datasets. Experimental results show a speed improvement of 6× to 10× and 4× to 5.7× over state-of-the-art spatial and FFT-based designs respectively, while achieving similar accuracy across the benchmark datasets.
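The saving that motivates spectral CNNs is the convolution theorem: a convolutional layer becomes an element-wise product in the frequency domain. A minimal 1-D check of that equivalence (illustrative NumPy only; the paper's adaptive spectral ReLU is a separate learned component not reproduced here):

```python
import numpy as np

# Circular convolution via the convolution theorem:
# conv(x, k) == ifft(fft(x) * fft(k))
x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([0.5, 0.25, 0.0, 0.25])

# Frequency-domain path: one element-wise product per "layer"
y_spec = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

# Direct circular convolution, for comparison
n = len(x)
y_direct = np.array([sum(x[j] * k[(i - j) % n] for j in range(n))
                     for i in range(n)])

print(np.allclose(y_spec, y_direct))  # True
```

A conventional FFT-based design must transform back to apply ReLU after every layer; keeping the nonlinearity in the spectral domain is what removes those repeated transforms.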
The range of the tangential Cauchy-Riemann system on a CR embedded manifold
We prove that every compact, pseudoconvex, orientable, CR manifold of \C^n bounds a complex manifold in the sense. In particular, the tangential Cauchy-Riemann system has closed range.
An FPGA implementation of the simplex algorithm
Sampling Distributions of Random Electromagnetic Fields in Mesoscopic or Dynamical Systems
We derive the sampling probability density function (pdf) of an ideal localized random electromagnetic field, its amplitude and intensity, in an electromagnetic environment that is quasi-statically time-varying statistically homogeneous or static statistically inhomogeneous. The results allow for the estimation of field statistics and confidence intervals when a single spatial or temporal stochastic process produces randomization of the field. Results for both coherent and incoherent detection techniques are derived, for Cartesian, planar and full-vectorial fields. We show that the functional form of the sampling pdf depends on whether the random variable is dimensioned (e.g., the sampled electric field proper) or is expressed in dimensionless standardized or normalized form (e.g., the sampled electric field divided by its sampled standard deviation). For dimensioned quantities, the electric field, its amplitude and intensity exhibit different types of Bessel sampling pdfs, which differ significantly from the asymptotic Gauss normal and ensemble pdfs when the sample size is relatively small. By contrast, for the corresponding standardized quantities, Student t, Fisher-Snedecor F and root-F sampling pdfs are obtained that exhibit heavier tails than comparable Bessel pdfs. Statistical uncertainties obtained from classical small-sample theory for dimensionless quantities are shown to be overestimated compared to dimensioned quantities. Differences in the sampling pdfs arising from de-normalization versus de-standardization are obtained.
Comment: 12 pages, 15 figures, accepted for publication in Phys. Rev. E, minor typos corrected
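The heavier tails of standardized small-sample statistics are easy to see numerically: for a Gaussian sample of size n, the standardized mean follows a Student t distribution rather than a normal. A Monte Carlo illustration of that generic effect (not a reproduction of the paper's field statistics; sample size and tail threshold are arbitrary):

```python
import numpy as np

# For small n, t = sqrt(n) * xbar / s (sample mean over sample std)
# has heavier tails than a standard normal, for which P(|Z| > 2) ~ 0.0455.
rng = np.random.default_rng(0)
n, trials = 5, 200_000
samples = rng.normal(size=(trials, n))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)       # sampled standard deviation
t = np.sqrt(n) * xbar / s             # standardized quantity

tail = np.mean(np.abs(t) > 2.0)
print(f"P(|t| > 2) = {tail:.3f}")     # noticeably larger than 0.0455
```

This is the same mechanism the abstract describes: dividing by a *sampled* standard deviation injects extra variability, so confidence intervals built from standardized quantities are wider than those for the dimensioned field itself.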
Efficient queue-balancing switch for FPGAs
This paper presents a novel FPGA-based switch design that achieves high algorithmic performance and an efficient FPGA implementation. Crossbar switches based on virtual output queues (VOQs) and their variations have been popular for implementing switches on FPGAs, with applications to network-on-chip (NoC) routers and network switches. The efficiency of VOQs is well documented on ASICs, though we show that their disadvantages can outweigh their advantages on FPGAs. Our proposed design uses an output-queued switch internally to simplify scheduling, and a queue-balancing technique to avoid queue fragmentation and reduce the need for memory-sharing VOQs. Our implementation approaches the scheduling performance of the state of the art, while requiring considerably fewer FPGA resources.
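The balancing idea can be sketched abstractly: instead of dedicating a queue per output (which fragments memory when traffic is skewed), arrivals join whichever internal queue is currently shortest, so occupancy stays even. A toy Python model of that policy (hypothetical illustration of shortest-queue balancing, not the paper's switch microarchitecture):

```python
from collections import deque

class BalancedQueues:
    """Toy queue-balancing policy: each arriving packet joins the
    currently shortest internal queue, keeping occupancy even and
    avoiding the fragmentation of per-output virtual output queues."""
    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]

    def enqueue(self, packet):
        q = min(self.queues, key=len)   # shortest-queue-first
        q.append(packet)

    def lengths(self):
        return [len(q) for q in self.queues]

b = BalancedQueues(4)
for p in range(10):
    b.enqueue(p)
print(b.lengths())  # [3, 3, 2, 2] -- lengths stay within one of each other
```

With per-output VOQs, the same skewed arrival pattern could leave some queues full while others sit empty; balancing keeps total buffer use close to the aggregate load.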
High-performance acceleration of 2-D and 3-D CNNs on FPGAs using static block floating point
Over the past few years, 2-D convolutional neural networks (CNNs) have demonstrated great success in a wide range of 2-D computer vision applications, such as image classification and object detection. At the same time, 3-D CNNs, as a variant of 2-D CNNs, have shown their excellent ability to analyze 3-D data, such as video and geometric data. However, the heavy algorithmic complexity of 2-D and 3-D CNNs imposes a substantial overhead on the speed of these networks, which limits their deployment in real-life applications. Although various domain-specific accelerators have been proposed to address this challenge, most of them focus only on accelerating 2-D CNNs, without considering their computational efficiency on 3-D CNNs. In this article, we propose a unified hardware architecture to accelerate both 2-D and 3-D CNNs with high hardware efficiency. Our experiments demonstrate that the proposed accelerator can achieve up to 92.4% and 85.2% multiply-accumulate efficiency on 2-D and 3-D CNNs, respectively. To improve the hardware performance, we propose a hardware-friendly quantization approach called static block floating point (BFP), which eliminates the frequent representation conversions required in traditional dynamic BFP arithmetic. Compared with integer linear quantization using a zero-point, static BFP quantization decreases the logic resource consumption of the convolutional kernel design by nearly 50% on a field-programmable gate array (FPGA). Without time-consuming retraining, the proposed static BFP quantization is able to quantize the precision to an 8-bit mantissa with negligible accuracy loss. As different CNNs on our reconfigurable system require different hardware and software parameters to achieve optimal hardware performance and accuracy, we also propose an automatic tool for parameter optimization. Based on our hardware design and optimization, we demonstrate that the proposed accelerator can achieve 3.8 to 5.6 times higher energy efficiency than a graphics processing unit (GPU) implementation. Compared with state-of-the-art FPGA-based accelerators, our design achieves higher generality and up to 1.4 to 2.2 times higher resource efficiency on both 2-D and 3-D CNNs.
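The core of block floating point is simple: values in a block share one exponent, and only their mantissas are quantized to fixed point. A minimal NumPy sketch of the quantize/dequantize round trip (illustrative only; it recomputes the shared exponent per block, whereas the paper's *static* BFP fixes exponents offline to avoid exactly this runtime conversion, and the block size and mantissa width here are arbitrary):

```python
import numpy as np

def bfp_quantize(x, mantissa_bits=8, block_size=8):
    """Block floating point round trip: each block of values shares one
    exponent; mantissas are rounded to signed fixed point and scaled back."""
    xb = x.reshape(-1, block_size)
    # Shared exponent per block, chosen so the largest magnitude fits
    max_abs = np.abs(xb).max(axis=1, keepdims=True)
    exp = np.ceil(np.log2(np.maximum(max_abs, 1e-38)))
    scale = 2.0 ** exp
    qmax = 2 ** (mantissa_bits - 1) - 1          # e.g. 127 for 8 bits
    mant = np.clip(np.round(xb / scale * qmax), -qmax, qmax)
    return (mant / qmax * scale).ravel()

x = np.linspace(-1.0, 1.0, 16)
xq = bfp_quantize(x)
print(np.max(np.abs(x - xq)))  # small quantization error
```

Because all multiplies inside a block then operate on plain fixed-point mantissas, the multiplier logic is far cheaper than full floating point, which is where the reported resource savings come from.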