459 research outputs found

Automatic generation of high-throughput systolic tree-based solvers for modern FPGAs

Author: Tavakkoli Aryan
Publication venue: Electrical and Electronic Engineering, Imperial College London
Publication date: 01/06/2019

Tree-based models are a class of numerical methods widely used in financial option pricing, which have a computational complexity that is quadratic with respect to the solution accuracy. Previous research has employed reconfigurable computing with small degrees of parallelism to provide faster hardware solutions compared with general-purpose processing software designs. However, due to the nature of their vector hardware architectures, they cannot scale their compute resources efficiently, leaving them with pricing latency figures which are quadratic with respect to the problem size, and hence to the solution accuracy. Also, their solutions are not productive as they require hardware engineering effort, and can only solve one type of tree problem, known as the standard American option. This thesis presents a novel methodology in the form of a high-level design framework which can capture any common tree-based problem, and automatically generates high-throughput field-programmable gate array (FPGA) solvers based on proposed scalable hardware architectures. The thesis has made three main contributions. First, systolic architectures were proposed for solving binomial and trinomial trees, which, due to their custom systolic data-movement mechanisms, can scale their compute resources efficiently to provide linear latency scaling for medium-size trees and improved quadratic latency scaling for large trees. Using the proposed systolic architectures, throughput speed-ups of up to 5.6X and 12X were achieved for modern FPGAs, compared to previous vector designs, for medium and large trees, respectively. Second, a productive high-level design framework was proposed that can capture any common binomial and trinomial tree problem, and a methodology was suggested to generate high-throughput systolic solvers with custom data precision, where the methodology requires no hardware design effort from the end user.
Third, a fully-automated tool-chain methodology was proposed that, compared to previous tree-based solvers, improves user productivity by removing the manual engineering effort of applying the design framework to option pricing problems. Using the productive design framework, high-throughput systolic FPGA solvers have been automatically generated from simple end-user C descriptions for several tree problems, such as American, Bermudan, and barrier options.

Spiral - Imperial College Digital Repository
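
The quadratic cost mentioned in the abstract above can be seen in a plain software lattice. The following is a minimal sketch, not the thesis's systolic architecture: it assumes a Cox-Ross-Rubinstein binomial tree and an American put, and the function name, its parameters, and the `node_updates` counter are hypothetical, introduced only for illustration.

```python
import math


def binomial_american_put(spot, strike, rate, vol, maturity, steps):
    """Price an American put on a CRR binomial tree (illustrative sketch).

    Returns (price, node_updates), where node_updates counts the backward
    induction updates and grows as steps * (steps + 1) / 2.
    """
    dt = maturity / steps
    up = math.exp(vol * math.sqrt(dt))      # up-move factor
    down = 1.0 / up                         # down-move factor
    disc = math.exp(-rate * dt)             # one-step discount factor
    p = (math.exp(rate * dt) - down) / (up - down)  # risk-neutral up probability

    # Payoffs at maturity: node j has seen j up-moves and (steps - j) down-moves.
    values = [
        max(strike - spot * up ** j * down ** (steps - j), 0.0)
        for j in range(steps + 1)
    ]

    node_updates = 0
    # Backward induction: level i has i + 1 nodes, so the total work is quadratic.
    for i in range(steps - 1, -1, -1):
        for j in range(i + 1):
            continuation = disc * (p * values[j + 1] + (1.0 - p) * values[j])
            exercise = strike - spot * up ** j * down ** (i - j)
            values[j] = max(continuation, exercise)  # early-exercise check
            node_updates += 1

    return values[0], node_updates


if __name__ == "__main__":
    for n in (100, 200, 400):
        price, updates = binomial_american_put(100.0, 100.0, 0.05, 0.2, 1.0, n)
        print(n, round(price, 4), updates)
```

Doubling `steps` roughly quadruples `node_updates`, which is the scaling that the abstract says vector designs inherit as latency. The inner loop over `j` has no dependency between nodes of the same level, which is the parallelism a hardware design can exploit.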

NanoStreams: A Microserver Architecture for Real-time Analytics on Fast Data Streams

Author: Barber P.
Bilos A.
Georgakoudis G.
Gillan C.
Kaloutsakis S.
Minhas U. I.
Nikolopoulos D. S.
Russell M.
Woods R.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/07/2018

Queen's University Belfast Research Portal

Energy-Efficient FPGA Implementation for Binomial Option Pricing Using OpenCL

Author: Baghdadi Amer
HOCHAPFEL Erik
HORREIN Pierre-Henri
MENA MORALES Valentin
Vaton Sandrine
Publication venue: HAL CCSD
Publication date: 01/01/2014

Energy efficiency of financial computations is a performance criterion that can no longer be dismissed, and is as crucial as raw acceleration and accuracy of the solution. In order to reduce the energy consumption of financial accelerators, FPGAs offer a good compromise with low power consumption and high parallelism. However, designing and prototyping an application on an FPGA-based platform are typically very time-consuming and require significant skills in hardware design. This issue constitutes a major drawback with respect to software-centric acceleration platforms and approaches. A high-level approach has been chosen, using Altera’s implementation of the OpenCL standard, to answer this issue. We present two FPGA implementations of the binomial option pricing model on American options. The results obtained on a Terasic DE4 - Stratix IV board form a solid basis to hold all the constraints necessary for a real world application. The best implementation can evaluate more than 2000 options/s with an average power of less than 20 W.

HAL-CentraleSupelec

Crossref

INRIA a CCSD electronic archive server

HAL-Université de Bretagne Occidentale

HAL-Rennes 1

Parallelization of Numerical Methods on Parallel Processor Architectures

Author: László Endre
Publication date: 01/01/2016

REAL-PhD

Evolutionary Algorithms and Computational Methods for Derivatives Pricing

Author: Palmer Samuel
Publication venue: UCL (University College London)
Publication date: 28/02/2019

This work aims to provide novel computational solutions to the problem of derivative pricing. To achieve this, a novel hybrid evolutionary algorithm (EA) based on particle swarm optimisation (PSO) and differential evolution (DE) is introduced and applied, along with various other state-of-the-art variants of PSO and DE, to the problem of calibrating the Heston stochastic volatility model. It is found that state-of-the-art DEs provide excellent calibration performance, and that previous use of rudimentary DEs in the literature undervalued the use of these methods. The use of neural networks with EAs for approximating the solution to derivatives pricing models is next investigated. A set of neural networks is trained from Monte Carlo (MC) simulation data to approximate the closed form solution for European, Asian and American style options. The results are comparable to MC pricing, but with offline evaluation of the price using the neural networks being orders of magnitude faster and computationally more efficient. Finally, the use of custom hardware for numerical pricing of derivatives is introduced. The solver presented here provides an energy efficient data-flow implementation for pricing derivatives, which has the potential to be incorporated into larger high-speed/low energy trading systems.

UCL Discovery

Custom optimization algorithms for efficient hardware implementation

Author: Juan Luis Jerez (supervised by George A. Constantinides and Eric C. Kerrigan)
Publication venue: Electrical and Electronic Engineering, Imperial College London
Publication date: 01/01/2013

The focus is on real-time optimal decision making with application in advanced control systems. These computationally intensive schemes, which involve the repeated solution of (convex) optimization problems within a sampling interval, require more efficient computational methods than currently available for extending their application to highly dynamical systems and setups with resource-constrained embedded computing platforms. A range of techniques are proposed to exploit synergies between digital hardware, numerical analysis and algorithm design. These techniques build on top of parameterisable hardware code generation tools that generate VHDL code describing custom computing architectures for interior-point methods and a range of first-order constrained optimization methods. Since memory limitations are often important in embedded implementations we develop a custom storage scheme for KKT matrices arising in interior-point methods for control, which reduces memory requirements significantly and prevents I/O bandwidth limitations from affecting the performance in our implementations. To take advantage of the trend towards parallel computing architectures and to exploit the special characteristics of our custom architectures we propose several high-level parallel optimal control schemes that can reduce computation time. A novel optimization formulation was devised for reducing the computational effort in solving certain problems independent of the computing platform used. In order to be able to solve optimization problems in fixed-point arithmetic, which is significantly more resource-efficient than floating-point, tailored linear algebra algorithms were developed for solving the linear systems that form the computational bottleneck in many optimization methods. These methods come with guarantees for reliable operation. 
We also provide finite-precision error analysis for fixed-point implementations of first-order methods that can be used to minimize the use of resources while meeting accuracy specifications. The suggested techniques are demonstrated on several practical examples, including a hardware-in-the-loop setup for optimization-based control of a large airliner.

CiteSeerX

Spiral - Imperial College Digital Repository

High-performance and hardware-aware computing: proceedings of the second International Workshop on New Frontiers in High-performance and Hardware-aware Computing (HipHaC'11), San Antonio, Texas, USA, February 2011; (in conjunction with HPCA-17)

Author: Buchty Rainer
Weiß Jan-Philipp
Publication venue: KIT Scientific Publishing, Karlsruhe
Publication date: 01/01/2011

High-performance system architectures are increasingly exploiting heterogeneity. The HipHaC workshop aims at combining new aspects of parallel, heterogeneous, and reconfigurable microprocessor technologies with concepts of high-performance computing and, particularly, numerical solution methods. Compute- and memory-intensive applications can only benefit from the full hardware potential if all features on all levels are taken into account in a holistic approach

KITopen

Methodology for complex dataflow application development

Author: Voss Nils
Publication venue: Computing, Imperial College London
Publication date: 01/06/2021

This thesis addresses problems inherent to the development of complex applications for reconfigurable systems. Many projects fail to complete or take much longer than originally estimated by relying on traditional iterative software development processes typically used with conventional computers. Even though designer productivity can be increased by abstract programming and execution models, e.g., dataflow, development methodologies considering the specific properties of reconfigurable systems do not exist. The first contribution of this thesis is a design methodology to facilitate systematic development of complex applications using reconfigurable hardware in the context of High-Performance Computing (HPC). The proposed methodology is built upon a careful analysis of the original application, a software model of the intended hardware system, an analytical prediction of performance and on-chip area usage, and an iterative architectural refinement to resolve identified bottlenecks before writing a single line of code targeting the reconfigurable hardware. It is successfully validated using two real applications and both achieve state-of-the-art performance. The second contribution extends this methodology to provide portability between devices in two steps. First, additional tool support for contemporary multi-die Field-Programmable Gate Arrays (FPGAs) is developed. An algorithm to automatically map logical memories to heterogeneous physical memories with special attention to die boundaries is proposed. As a result, only the proposed algorithm managed to successfully place and route all designs used in the evaluation while the second-best algorithm failed on one third of all large applications. Second, best practices for performance portability between different FPGA devices are collected and evaluated on a financial use case, showing efficient resource usage on five different platforms.
The third contribution applies the extended methodology to a real, highly demanding emerging application from the radiotherapy domain. A Monte-Carlo based simulation of dose accumulation in human tissue is accelerated using the proposed methodology to meet the real time requirements of adaptive radiotherapy.

Spiral - Imperial College Digital Repository

High-level FPGA accelerator design for structured-mesh-based numerical solvers

Author: Kamalakkannan Kamalavasan

Field Programmable Gate Arrays (FPGAs) have become highly attractive as accelerators due to their low power consumption and re-programmability. However, a key limitation is the time and know-how required to program them. Even with high-level synthesis tools, they still require significant hand-tuned/low-level customizations and design space exploration to gain good performance. The need to program FPGAs using the dataflow programming model, much less well known and practised by the high-performance computing (HPC) community, is a major barrier for adoption for HPC. The underlying motivation of this work is to bridge this gap - attaining near-optimal performance vs the ease of programming. To this end, we target the important class of applications based on structured meshes, focusing on numerical algorithms based on explicit and implicit techniques. We leverage the main characteristics of the application class, its computation-communication pattern and the hardware features. For explicit schemes, characterized by stencil computations, we unify the state-of-the-art techniques such as vectorization and unrolling with a number of new high-gain optimizations such as creating perfect data reuse data-paths, batching and tiling. A key new feature is their applicability to multiple stencil loops enabling the development of real-world workloads. For implicit schemes, we re-evaluate the characteristics of the tridiagonal system solver algorithms for FPGAs and develop a new high throughput batched multi-dimensional tridiagonal system solver library with orders of magnitude better performance than the state-of-the-art. New analytic models are developed to support the solvers, elucidating and modelling the critical path of execution and parameterizing the design. This together with the optimal designs and new library lead to a unified design work-flow for synthesis on FPGAs. 
The new workflow is used to implement a range of applications, from simple single stencil designs, multiple stencil loops to solvers with real-world utility. They are synthesized on the currently dominant Xilinx and Intel FPGAs. Benchmarking indicates the FPGAs matching or outperforming the best GPU implementations, the current best traditional architecture device solution. Over 30% energy saving can also be observed. The performance model demonstrates over 85% accuracy. The thesis discusses the determinants for these applications to be amenable for FPGA implementation, providing insights into the feasibility and profitability of a design. Finally, we propose initial steps in automating the workflow to be used through a DSL.

Warwick Research Archives Portal Repository
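
The abstract above refers to batched tridiagonal system solvers without naming the algorithms it evaluates. As a point of reference only, the sketch below uses the Thomas algorithm, one standard sequential method for tridiagonal systems; it is an assumption for illustration, not the thesis's library. The names `thomas_solve` and `solve_batch` and the array layout (with `lower[0]` and `upper[-1]` unused) are hypothetical.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve one tridiagonal system with the Thomas algorithm.

    lower[i] multiplies x[i-1] in row i (lower[0] is unused),
    upper[i] multiplies x[i+1] in row i (upper[-1] is unused).
    Assumes no pivoting is needed, e.g. a diagonally dominant matrix.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination: each row depends on the previous one.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution: each unknown depends on the next one.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def solve_batch(systems):
    """Solve many independent systems; each item is (lower, diag, upper, rhs)."""
    return [thomas_solve(*system) for system in systems]


if __name__ == "__main__":
    system = ([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0], [1.0, 0.0, 1.0])
    print(solve_batch([system, system]))
```

Within one system the two loops are strictly sequential, but the systems in a batch are independent of each other, so a batch gives a pipeline a steady stream of work to overlap.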