552 research outputs found
Optimising runtime reconfigurable designs for high performance applications
This thesis proposes novel optimisations for high performance runtime reconfigurable designs.
For a reconfigurable design,
the proposed approach investigates idle resources introduced by static design approaches,
and exploits runtime reconfiguration to eliminate the inefficient resources.
The approach covers the circuit level, the function level, and the system level.
At the circuit level, a method is proposed for tuning reconfigurable designs with two analytical models:
a resource model for computational and memory resources and memory bandwidth,
and a performance model for estimating execution time.
This method is applied to tuning implementations of finite-difference algorithms,
optimising arithmetic operators and memory bandwidth based on algorithmic parameters,
and eliminating idle resources by runtime reconfiguration.
At the function level, a method is proposed to automatically identify and exploit runtime
reconfiguration opportunities while optimising resource utilisation.
The method is based on Reconfiguration Data Flow Graph,
a new hierarchical graph structure enabling runtime reconfigurable designs to be synthesised in three steps:
function analysis, configuration organisation, and runtime solution generation.
At the system level, a method is proposed for optimising reconfigurable designs by dynamically adapting the designs to available runtime resources in a reconfigurable system. This method includes two steps: compile-time optimisation and runtime scaling, which enable efficient workload distribution, asynchronous communication scheduling, and domain-specific optimisations. It can be used in developing effective servers for high performance applications.Open Acces
Smart technologies for effective reconfiguration: the FASTER approach
Current and future computing systems increasingly require that their functionality stays flexible after the system is operational, in order to cope with changing user requirements and improvements in system features, i.e. changing protocols and data-coding standards, evolving demands for support of different user applications, and newly emerging applications in communication, computing and consumer electronics. Therefore, extending the functionality and the lifetime of products requires the addition of new functionality to track and satisfy the customers needs and market and technology trends. Many contemporary products along with the software part incorporate hardware accelerators for reasons of performance and power efficiency. While adaptivity of software is straightforward, adaptation of the hardware to changing requirements constitutes a challenging problem requiring delicate solutions. The FASTER (Facilitating Analysis and Synthesis Technologies for Effective Reconfiguration) project aims at introducing a complete methodology to allow designers to easily implement a system specification on a platform which includes a general purpose processor combined with multiple accelerators running on an FPGA, taking as input a high-level description and fully exploiting, both at design time and at run time, the capabilities of partial dynamic reconfiguration. The goal is that for selected application domains, the FASTER toolchain will be able to reduce the design and verification time of complex reconfigurable systems providing additional novel verification features that are not available in existing tool flows
Optimising Sparse Matrix Vector multiplication for large scale FEM problems on FPGA
Sparse Matrix Vector multiplication (SpMV) is an important kernel in many scientific applications. In this work we propose an architecture and an automated customisation method to detect and optimise the architecture for block diagonal sparse matrices. We evaluate the proposed approach in the context of the spectral/hp Finite Element Method, using the local matrix assembly approach. This problem leads to a large sparse system of linear equations with block diagonal matrix which is typically solved using an iterative method such as the Preconditioned Conjugate Gradient. The efficiency of the proposed architecture combined with the effectiveness of the proposed customisation method reduces BRAM resource utilisation by as much as 10 times, while achieving identical throughput with existing state of the art designs and requiring minimal development effort from the end user. In the context of the Finite Element Method, our approach enables the solution of larger problems than previously possible, enabling the applicability of FPGAs to more interesting HPC problems
Approximate FPGA-based LSTMs under Computation Time Constraints
Recurrent Neural Networks and in particular Long Short-Term Memory (LSTM)
networks have demonstrated state-of-the-art accuracy in several emerging
Artificial Intelligence tasks. However, the models are becoming increasingly
demanding in terms of computational and memory load. Emerging latency-sensitive
applications including mobile robots and autonomous vehicles often operate
under stringent computation time constraints. In this paper, we address the
challenge of deploying computationally demanding LSTMs at a constrained time
budget by introducing an approximate computing scheme that combines iterative
low-rank compression and pruning, along with a novel FPGA-based LSTM
architecture. Combined in an end-to-end framework, the approximation method's
parameters are optimised and the architecture is configured to address the
problem of high-performance LSTM execution in time-constrained applications.
Quantitative evaluation on a real-life image captioning application indicates
that the proposed methods required up to 6.5x less time to achieve the same
application-level accuracy compared to a baseline method, while achieving an
average of 25x higher accuracy under the same computation time constraints.Comment: Accepted at the 14th International Symposium in Applied
Reconfigurable Computing (ARC) 201
Optimising algorithm and hardware for deep neural networks on FPGAs
This thesis proposes novel algorithm and hardware optimisation approaches to accelerate Deep Neural Networks (DNNs), including both Convolutional Neural Networks (CNNs) and Bayesian Neural Networks (BayesNNs).
The first contribution of this thesis is to propose an adaptable and reconfigurable hardware design to accelerate CNNs. By analysing the computational patterns of different CNNs, a unified hardware architecture is proposed for both 2-Dimension and 3-Dimension CNNs. The accelerator is also designed with runtime adaptability, which adopts different parallelism strategies for different convolutional layers at runtime.
The second contribution of this thesis is to propose a novel neural network architecture and hardware design co-optimisation approach, which improves the performance of CNNs at both algorithm and hardware levels. Our proposed three-phase co-design framework decouples network training from design space exploration, which significantly reduces the time-cost of the co-optimisation process.
The third contribution of this thesis is to propose an algorithmic and hardware co-optimisation framework for accelerating BayesNNs. At the algorithmic level, three categories of structured sparsity are explored to reduce the computational complexity of BayesNNs. At the hardware level, we propose a novel hardware architecture with the aim of exploiting the structured sparsity for BayesNNs. Both algorithmic and hardware optimisations are jointly applied to push the performance limit.Open Acces
It's all about data movement: Optimising FPGA data access to boost performance
The use of reconfigurable computing, and FPGAs in particular, to accelerate
computational kernels has the potential to be of great benefit to scientific
codes and the HPC community in general. However, whilst recent advanced in FPGA
tooling have made the physical act of programming reconfigurable architectures
much more accessible, in order to gain good performance the entire algorithm
must be rethought and recast in a dataflow style. Reducing the cost of data
movement for all computing devices is critically important, and in this paper
we explore the most appropriate techniques for FPGAs. We do this by describing
the optimisation of an existing FPGA implementation of an atmospheric model's
advection scheme. By taking an FPGA code that was over four times slower than
running on the CPU, mainly due to data movement overhead, we describe the
profiling and optimisation strategies adopted to significantly reduce the
runtime and bring the performance of our FPGA kernels to a much more practical
level for real-world use. The result of this work is a set of techniques,
steps, and lessons learnt that we have found significantly improves the
performance of FPGA based HPC codes and that others can adopt in their own
codes to achieve similar results.Comment: Preprint of article in 2019 IEEE/ACM International Workshop on
Heterogeneous High-performance Reconfigurable Computing (H2RC
Accelerating Reconfigurable Financial Computing
This thesis proposes novel approaches to the design, optimisation, and management of reconfigurable
computer accelerators for financial computing. There are three contributions. First, we propose novel
reconfigurable designs for derivative pricing using both Monte-Carlo and quadrature methods. Such
designs involve exploring techniques such as control variate optimisation for Monte-Carlo, and multi-dimensional
analysis for quadrature methods. Significant speedups and energy savings are achieved
using our Field-Programmable Gate Array (FPGA) designs over both Central Processing Unit (CPU)
and Graphical Processing Unit (GPU) designs. Second, we propose a framework for distributing computing
tasks on multi-accelerator heterogeneous clusters. In this framework, different computational
devices including FPGAs, GPUs and CPUs work collaboratively on the same financial problem based
on a dynamic scheduling policy. The trade-off in speed and in energy consumption of different accelerator
allocations is investigated. Third, we propose a mixed precision methodology for optimising
Monte-Carlo designs, and a reduced precision methodology for optimising quadrature designs. These
methodologies enable us to optimise throughput of reconfigurable designs by using datapaths with
minimised precision, while maintaining the same accuracy of the results as in the original designs
FPGA dynamic and partial reconfiguration : a survey of architectures, methods, and applications
Dynamic and partial reconfiguration are key differentiating capabilities of field programmable gate arrays (FPGAs). While they have been studied extensively in academic literature, they find limited use in deployed systems. We review FPGA reconfiguration, looking at architectures built for the purpose, and the properties of modern commercial architectures. We then investigate design flows, and identify the key challenges in making reconfigurable FPGA systems easier to design. Finally, we look at applications where reconfiguration has found use, as well as proposing new areas where this capability places FPGAs in a unique position for adoption
- …