Automatic generation of high-throughput systolic tree-based solvers for modern FPGAs
Tree-based models are a class of numerical methods widely used in financial option pricing, which have a computational complexity that is quadratic with respect to the solution accuracy. Previous research has employed reconfigurable computing with small degrees of parallelism to provide faster hardware solutions compared with software designs running on general-purpose processors. However, due to the nature of their vector hardware architectures, these designs cannot scale their compute resources efficiently, leaving them with pricing latencies that are quadratic with respect to the problem size, and hence to the solution accuracy. Their solutions are also not productive, as they require hardware engineering effort and can only solve one type of tree problem, known as the standard American option. This thesis presents a novel methodology in the form of a high-level design framework which can capture any common tree-based problem, and automatically generates high-throughput field-programmable gate array (FPGA) solvers based on proposed scalable hardware architectures. The thesis makes three main contributions. First, systolic architectures were proposed for solving binomial and trinomial trees which, due to their custom systolic data-movement mechanisms, can scale their compute resources efficiently to provide linear latency scaling for medium-size trees and improved quadratic latency scaling for large trees. Using the proposed systolic architectures, throughput speed-ups of up to 5.6X and 12X were achieved for modern FPGAs, compared to previous vector designs, for medium and large trees, respectively. Second, a productive high-level design framework was proposed that can capture any common binomial and trinomial tree problem, and a methodology was suggested to generate high-throughput systolic solvers with custom data precision, where the methodology requires no hardware design effort from the end user.
Third, a fully-automated tool-chain methodology was proposed that, compared to previous tree-based solvers, improves user productivity by removing the manual engineering effort of applying the design framework to option pricing problems. Using the productive design framework, high-throughput systolic FPGA solvers have been automatically generated from simple end-user C descriptions for several tree problems, such as American, Bermudan, and barrier options.
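To illustrate the kind of tree problem the abstract above refers to, the following is a minimal software sketch of a binomial tree for an American put. It assumes the standard Cox-Ross-Rubinstein parameterization, which the abstract does not specify; the function name and parameters are hypothetical and this is not the thesis's FPGA design. The nested loops make the quadratic cost in the number of time steps visible.

```python
import math


def american_put_binomial(s0, strike, rate, sigma, maturity, steps):
    """Price an American put on a recombining binomial tree (CRR, assumed)."""
    dt = maturity / steps
    up = math.exp(sigma * math.sqrt(dt))      # up-move factor
    down = 1.0 / up                           # down-move factor
    disc = math.exp(-rate * dt)               # one-step discount factor
    p = (math.exp(rate * dt) - down) / (up - down)  # risk-neutral up probability

    # Option payoffs at the leaves (node j has j up-moves out of `steps`).
    values = [max(strike - s0 * up**j * down**(steps - j), 0.0)
              for j in range(steps + 1)]

    # Backward induction: O(steps^2) node updates in total.
    for i in range(steps - 1, -1, -1):
        for j in range(i + 1):
            continuation = disc * (p * values[j + 1] + (1.0 - p) * values[j])
            exercise = strike - s0 * up**j * down**(i - j)
            values[j] = max(continuation, exercise)  # early-exercise check
    return values[0]


price = american_put_binomial(100.0, 100.0, 0.05, 0.2, 1.0, 200)
```

Doubling `steps` roughly quadruples the number of node updates, which is the quadratic scaling that the systolic architectures in the abstract aim to improve.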
Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics
Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory technologies, but when applications are memory-bound, designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon among domain experts.
In this paper, we propose an automated tool flow from a domain-specific language (DSL) for tensor expressions to generate massively-parallel accelerators on HBM-equipped FPGAs. Designers can use this flow to integrate and evaluate various compiler or hardware optimizations. We use computational fluid dynamics (CFD) as a paradigmatic example.
Our flow starts from the high-level specification of tensor operations and combines an MLIR-based compiler with an in-house hardware generation flow to generate systems with parallel accelerators and a specialized memory architecture that moves data efficiently, aiming at fully exploiting the available CPU-FPGA bandwidth.
We simulated applications with millions of elements, achieving up to 103 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is up to 25x more energy efficient than expert-crafted Intel CPU implementations.
Neuro-memristive Circuits for Edge Computing: A review
The volume, veracity, variability, and velocity of data produced from the
ever-increasing network of sensors connected to the Internet pose challenges for
power management, scalability, and sustainability of cloud computing
infrastructure. Increasing the data processing capability of edge computing
devices at lower power requirements can reduce several overheads for cloud
computing solutions. This paper provides a review of neuromorphic
CMOS-memristive architectures that can be integrated into edge computing
devices. We discuss why the neuromorphic architectures are useful for edge
devices and show the advantages, drawbacks and open problems in the field of
neuro-memristive circuits for edge computing.
Numerical solutions of differential equations on FPGA-enhanced computers
Conventionally, to speed up scientific or engineering (S&E) computation programs
on general-purpose computers, one may elect to use faster CPUs, more memory, systems
with more efficient (though complicated) architecture, better software compilers, or even
coding with assembly languages. With the emergence of Field Programmable Gate
Array (FPGA) based Reconfigurable Computing (RC) technology, numerical scientists
and engineers now have another option: using FPGA devices as core components to
address their computational problems. The hardware-programmable, low-cost, but
powerful “FPGA-enhanced computer” has now become an attractive approach for many
S&E applications.
A new computer architecture model for FPGA-enhanced computer systems and its
detailed hardware implementation are proposed for accelerating the solutions of
computationally demanding and data intensive numerical PDE problems. New FPGA-optimized
algorithms/methods for rapid executions of representative numerical methods
such as Finite Difference Methods (FDM) and Finite Element Methods (FEM) are
designed, analyzed, and implemented on it. Linear wave equations based on seismic
data processing applications are adopted as the target PDE problems to demonstrate
the effectiveness of this new computer model. Their sustained computational
performances are compared with pure software programs operating on commodity CPU-based
general-purpose computers. Quantitative analysis is performed from a hierarchical
set of aspects, such as customized/extraordinary computer arithmetic or function units,
compact but flexible system architecture and memory hierarchy, and hardware-optimized
numerical algorithms or methods that may be inappropriate for conventional
general-purpose computers. The preferable property of in-system hardware
reconfigurability of the new system is emphasized, aiming at effectively accelerating the
execution of complex multi-stage numerical applications. Methodologies for
accelerating the target PDE problems, as well as other numerical PDE problems such
as heat equations and Laplace equations, utilizing programmable hardware resources are
summarized, which implies the broad usage of the proposed FPGA-enhanced computers.
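As a small illustration of the Finite Difference Methods mentioned in the abstract above, the following is a minimal software sketch of one explicit time step for the 1-D linear wave equation u_tt = c^2 u_xx. It assumes a second-order central-difference (leapfrog) scheme with fixed ends; the abstract does not state which discretization the thesis uses, and the function name and the `courant` parameter (c·dt/dx) are hypothetical.

```python
import math


def wave_step(u_prev, u_curr, courant):
    """Advance the 1-D wave equation by one time step (explicit leapfrog).

    u_prev, u_curr: grid values at times t - dt and t.
    courant: c * dt / dx; the scheme is stable for courant <= 1.
    """
    n = len(u_curr)
    u_next = [0.0] * n  # both end points are held at zero (Dirichlet)
    c2 = courant * courant
    for i in range(1, n - 1):
        laplacian = u_curr[i + 1] - 2.0 * u_curr[i] + u_curr[i - 1]
        u_next[i] = 2.0 * u_curr[i] - u_prev[i] + c2 * laplacian
    return u_next


# Standing wave u(x, t) = sin(pi x) cos(pi c t) on [0, 1] with courant = 1.
N = 16
dx = 1.0 / N
u_curr = [math.sin(math.pi * i * dx) for i in range(N + 1)]
u_prev = [v * math.cos(math.pi * dx) for v in u_curr]
u_next = wave_step(u_prev, u_curr, 1.0)
```

Each grid point depends only on its two neighbours and its own previous values, which is the kind of regular stencil that maps naturally onto parallel hardware.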
Fast algorithm for real-time rings reconstruction
The GAP project is dedicated to studying the application of GPUs in several contexts in which
real-time response is important for taking decisions. The definition of real-time depends on
the application under study, ranging from response times of μs up to several hours in the case
of very compute-intensive tasks. During this conference we presented our work on
low-level triggers [1] [2] and high-level triggers [3] in high energy physics experiments, and
specific applications for nuclear magnetic resonance (NMR) [4] [5] and cone-beam CT [6].
Apart from the study of dedicated solutions to decrease the latency due to data transport
and preparation, the computing algorithms play an essential role in any GPU application.
In this contribution, we show an original algorithm developed for trigger applications, to
accelerate ring reconstruction in RICH detectors when it is not possible to have seeds
for reconstruction from external trackers.
Studying Light-Harvesting Models with Superconducting Circuits
The process of photosynthesis, the main source of energy in the animate
world, converts sunlight into chemical energy. The surprisingly high efficiency
of this process is believed to be enabled by an intricate interplay between the
quantum nature of molecular structures in photosynthetic complexes and their
interaction with the environment. Investigating these effects in biological
samples is challenging due to their complex and disordered structure. Here we
experimentally demonstrate a new approach for studying photosynthetic models
based on superconducting quantum circuits. In particular, we demonstrate the
unprecedented versatility and control of our method in an engineered three-site
model of a pigment protein complex with realistic parameters scaled down in
energy by a factor of . With this system we show that the excitation
transport between quantum coherent sites disordered in energy can be enabled
through the interaction with environmental noise. We also show that the
efficiency of the process is maximized for structured noise resembling
intramolecular phononic environments found in photosynthetic complexes. Comment: 8+12 pages, 4+12 figures
ReSCon '09, Research Student Conference: Book of Abstracts
The second SED Research Student Conference (ReSCon2009) was hosted over three days, 22-24 June 2009, in the Lecture Centre at Brunel University. The conference consisted of technical presentations, a poster session and social events. The abstracts and presentations were the result of ongoing research by postgraduate research students from the School of Engineering and Design at Brunel University. The conference is held annually, and ReSCon plays a key role in contributing to research and innovations within the School.
Algorithm and Hardware Co-design for Learning On-a-chip
Machine learning technology has made remarkable progress in recent years. It has rivalled or exceeded human performance in many intellectual tasks, including image recognition, face detection and the game of Go. Many machine learning algorithms require a huge amount of computation, such as the multiplication of large matrices. As silicon technology has scaled to the sub-14nm regime, simply scaling down the device can no longer provide enough speed-up. New device technologies and system architectures are needed to improve the computing capacity. Hardware designed specifically for machine learning is in high demand, and effort needs to be put into the joint design and optimization of both hardware and algorithm.
For machine learning acceleration, traditional SRAM- and DRAM-based systems suffer from low capacity, high latency, and high standby power. Instead, emerging memories, such as Phase Change Random Access Memory (PRAM), Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM), and Resistive Random Access Memory (RRAM), are promising candidates providing low standby power, high data density, fast access and excellent scalability. This dissertation proposes a hierarchical memory modeling framework and models PRAM and STT-MRAM at four different levels of abstraction. With the proposed models, various simulations are conducted to investigate the performance, optimization, variability, reliability, and scalability.
Emerging memory devices such as RRAM can work as a 2-D crosspoint array to speed up the multiplication and accumulation in machine learning algorithms. This dissertation proposes a new parallel programming scheme to achieve in-memory learning with an RRAM crosspoint array. The programming circuitry is designed and simulated in TSMC 65nm technology, showing a 900X speedup for the dictionary learning task compared to the CPU performance.
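The multiply-and-accumulate role of a crosspoint array described above can be sketched in software under an idealized model: each cell is treated as a pure conductance, so by Ohm's law and Kirchhoff's current law the current on each column is the dot product of the row voltages with that column's conductances. The function name and the ideal-device assumption (no wire resistance, no device variability) are illustrative and not taken from the dissertation.

```python
def crossbar_mac(voltages, conductances):
    """Column currents of an idealized crosspoint array.

    voltages: one input voltage per row.
    conductances: conductances[i][j] is the cell at row i, column j.
    Returns I[j] = sum_i voltages[i] * conductances[i][j].
    """
    rows = len(conductances)
    cols = len(conductances[0])
    if len(voltages) != rows:
        raise ValueError("need exactly one voltage per row")
    return [
        sum(voltages[i] * conductances[i][j] for i in range(rows))
        for j in range(cols)
    ]


currents = crossbar_mac([1.0, 0.5], [[2.0, 0.0], [4.0, 2.0]])
```

In the physical array all column sums form at the same time, whereas this loop computes them one term after another, which is where the potential speed-up over a sequential processor comes from.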
From the algorithm perspective, inspired by the high accuracy and low power of the brain, this dissertation proposes a bio-plausible feedforward inhibition spiking neural network with a Spike-Rate-Dependent-Plasticity (SRDP) learning rule. It achieves more than 95% accuracy on the MNIST dataset, which is comparable to the sparse coding algorithm, but requires far fewer computations. The role of inhibition in this network is systematically studied and shown to improve the hardware efficiency in learning. Dissertation/Thesis: Doctoral Dissertation, Electrical Engineering 201