245 research outputs found
Precision analysis for hardware acceleration of numerical algorithms
The precision used in an algorithm affects the error and performance of individual computations, the
memory usage, and the potential parallelism for a fixed hardware budget. However, when migrating
an algorithm onto hardware, the potential improvements that can be obtained by tuning the precision
throughout an algorithm to meet a range or error specification are often overlooked; the major reason
is that it is hard to choose a number system which can guarantee any such specification can be met.
Instead, the problem is mitigated by opting to use IEEE standard double precision arithmetic so as to be
āno worseā than a software implementation. However, the flexibility in the number representation is one
of the key factors that can be exploited on reconfigurable hardware such as FPGAs, and hence ignoring
this potential significantly limits the performance achievable.
In order to optimise the performance of hardware reliably, we require a method that can tractably
calculate tight bounds for the error or range of any variable within an algorithm, but currently only a
handful of methods to calculate such bounds exist, and these either sacrifice tightness or tractability,
whilst simulation-based methods cannot guarantee the given error estimate. This thesis presents a new
method to calculate these bounds, taking into account both input ranges and finite precision effects,
which we show to be, in general, tighter in comparison to existing methods; this in turn can be used to
tune the hardware to the algorithm specifications.
We demonstrate the use of this software to optimise hardware for various algorithms to accelerate
the solution of a system of linear equations, which forms the basis of many problems in engineering
and science, and show that significant performance gains can be obtained by using this new approach in
conjunction with more traditional hardware optimisations
High sample-rate Givens rotations for recursive least squares
The design of an application-specific integrated circuit of a parallel array processor is considered
for recursive least squares by QR decomposition using Givens rotations, applicable
in adaptive filtering and beamforming applications. Emphasis is on high sample-rate operation,
which, for this recursive algorithm, means that the time to perform arithmetic operations
is critical. The algorithm, architecture and arithmetic are considered in a single
integrated design procedure to achieve optimum results.
A realisation approach using standard arithmetic operators, add, multiply and divide is
adopted. The design of high-throughput operators with low delay is addressed for fixed- and
floating-point number formats, and the application of redundant arithmetic considered. New
redundant multiplier architectures are presented enabling reductions in area of up to 25%,
whilst maintaining low delay. A technique is presented enabling the use of a conventional
tree multiplier in recursive applications, allowing savings in area and delay. Two new divider
architectures are presented showing benefits compared with the radix-2 modified SRT algorithm.
Givens rotation algorithms are examined to determine their suitability for VLSI implementation.
A novel algorithm, based on the Squared Givens Rotation (SGR) algorithm, is developed
enabling the sample-rate to be increased by a factor of approximately 6 and offering
area reductions up to a factor of 2 over previous approaches. An estimated sample-rate of
136 MHz could be achieved using a standard cell approach and O.35pm CMOS technology.
The enhanced SGR algorithm has been compared with a CORDIC approach and shown to
benefit by a factor of 3 in area and over 11 in sample-rate. When compared with a recent implementation
on a parallel array of general purpose (GP) DSP chips, it is estimated that a single
application specific chip could offer up to 1,500 times the computation obtained from a
single OP DSP chip
Comparison of logarithmic and floating-point number systems implemented on Xilinx Virtex-II field-programmable gate arrays
The aim of this thesis is to compare the implementation of parameterisable LNS (logarithmic number system) and floating-point high dynamic range number systems on FPGA. The Virtex/Virtex-II range of FPGAs from Xilinx, which are the most popular FPGA technology, are used to implement the designs. The study focuses on using the low level primitives of the technology in an efficient way and so initially the design issues in implementing fixed-point operators are considered. The four basic operations of addition, multiplication, division and square root are considered. Carry- free adders, ripple-carry adders, parallel multipliers and digit recurrence division and square root are discussed. The floating-point operators use the word format and exceptions as described by the IEEE std-754. A dual-path adder implementation is described in detail, as are floating-point multiplier, divider and square root components. Results and comparisons with other works are given. The efficient implementation of function evaluation methods is considered next. An overview of current FPGA methods is given and a new piecewise polynomial implementation using the Taylor series is presented and compared with other designs in the literature. In the next section the LNS word format, accuracy and exceptions are described and two new LNS addition/subtraction function approximations are described. The algorithms for performing multiplication, division and powering in the LNS domain are also described and are compared with other designs in the open literature. Parameterisable conversion algorithms to convert to/from the fixed-point domain from/to the LNS and floating-point domain are described and implementation results given. In the next chapter MATLAB bit-true software models are given that have the exact functionality as the hardware models. The interfaces of the models are given and a serial communication system to perform low speed system tests is described. A comparison of the LNS and floating-point number systems in terms of area and delay is given. Different functions implemented in LNS and floating-point arithmetic are also compared and conclusions are drawn. The results show that when the LNS is implemented with a 6-bit or less characteristic it is superior to floating-point. However, for larger characteristic lengths the floating-point system is more efficient due to the delay and exponential area increase of the LNS addition operator. The LNS is beneficial for larger characteristics than 6-bits only for specialist applications that require a high portion of division, multiplication, square root, powering operations and few additions
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS
Application-Specific Number Representation
Reconfigurable devices, such as Field Programmable Gate Arrays (FPGAs), enable application-
specific number representations. Well-known number formats include fixed-point, floating-
point, logarithmic number system (LNS), and residue number system (RNS). Such different
number representations lead to different arithmetic designs and error behaviours, thus produc-
ing implementations with different performance, accuracy, and cost.
To investigate the design options in number representations, the first part of this thesis presents
a platform that enables automated exploration of the number representation design space. The
second part of the thesis shows case studies that optimise the designs for area, latency or
throughput from the perspective of number representations.
Automated design space exploration in the first part addresses the following two major issues:
Ā² Automation requires arithmetic unit generation. This thesis provides optimised
arithmetic library generators for logarithmic and residue arithmetic units, which support
a wide range of bit widths and achieve significant improvement over previous designs.
Ā² Generation of arithmetic units requires specifying the bit widths for each
variable. This thesis describes an automatic bit-width optimisation tool called R-Tool,
which combines dynamic and static analysis methods, and supports different number
systems (fixed-point, floating-point, and LNS numbers).
Putting it all together, the second part explores the effects of application-specific number
representation on practical benchmarks, such as radiative Monte Carlo simulation, and seismic
imaging computations. Experimental results show that customising the number representations
brings benefits to hardware implementations: by selecting a more appropriate number format,
we can reduce the area cost by up to 73.5% and improve the throughput by 14.2% to 34.1%; by
performing the bit-width optimisation, we can further reduce the area cost by 9.7% to 17.3%.
On the performance side, hardware implementations with customised number formats achieve
5 to potentially over 40 times speedup over software implementations
Numerical solutions of differential equations on FPGA-enhanced computers
Conventionally, to speed up scientific or engineering (S&E) computation programs
on general-purpose computers, one may elect to use faster CPUs, more memory, systems
with more efficient (though complicated) architecture, better software compilers, or even
coding with assembly languages. With the emergence of Field Programmable Gate
Array (FPGA) based Reconfigurable Computing (RC) technology, numerical scientists
and engineers now have another option using FPGA devices as core components to
address their computational problems. The hardware-programmable, low-cost, but
powerful āFPGA-enhanced computerā has now become an attractive approach for many
S&E applications.
A new computer architecture model for FPGA-enhanced computer systems and its
detailed hardware implementation are proposed for accelerating the solutions of
computationally demanding and data intensive numerical PDE problems. New FPGAoptimized
algorithms/methods for rapid executions of representative numerical methods
such as Finite Difference Methods (FDM) and Finite Element Methods (FEM) are
designed, analyzed, and implemented on it. Linear wave equations based on seismic
data processing applications are adopted as the targeting PDE problems to demonstrate
the effectiveness of this new computer model. Their sustained computational
performances are compared with pure software programs operating on commodity CPUbased
general-purpose computers. Quantitative analysis is performed from a hierarchical
set of aspects as customized/extraordinary computer arithmetic or function units, compact but flexible system architecture and memory hierarchy, and hardwareoptimized
numerical algorithms or methods that may be inappropriate for conventional
general-purpose computers. The preferable property of in-system hardware
reconfigurability of the new system is emphasized aiming at effectively accelerating the
execution of complex multi-stage numerical applications. Methodologies for
accelerating the targeting PDE problems as well as other numerical PDE problems, such
as heat equations and Laplace equations utilizing programmable hardware resources are
concluded, which imply the broad usage of the proposed FPGA-enhanced computers
Vector Pascal: a computer programming language for the FPS-164 array processor
Support for vector operations in computer programming languages is analyzed to determine if programs employing such operations run faster;The programming language Vector Pascal is defined and compared to Fortran 8X and Actus. Vector Pascal contains definitions for matrix and vector operations and the Vector Pascal compiler translates vector expressions. The Vector Pascal compiler executes on an IBM Personal Computer AT and produces code for a Floating Point Systems FPS-164 Scientific Computer;The standard benchmark LINPACK, which solves systems of linear equations, is transcribed from Fortran to Standard Pascal and Vector Pascal. The Vector Pascal version of LINPACK exploits vector operations defined in the language. The speedup of the Vector Pascal version of LINPACK over the Standard Pascal version is presented
Custom optimization algorithms for efficient hardware implementation
The focus is on real-time optimal decision making with application in advanced control
systems. These computationally intensive schemes, which involve the repeated solution of
(convex) optimization problems within a sampling interval, require more efficient computational
methods than currently available for extending their application to highly dynamical
systems and setups with resource-constrained embedded computing platforms.
A range of techniques are proposed to exploit synergies between digital hardware, numerical
analysis and algorithm design. These techniques build on top of parameterisable
hardware code generation tools that generate VHDL code describing custom computing
architectures for interior-point methods and a range of first-order constrained optimization
methods. Since memory limitations are often important in embedded implementations we
develop a custom storage scheme for KKT matrices arising in interior-point methods for
control, which reduces memory requirements significantly and prevents I/O bandwidth
limitations from affecting the performance in our implementations. To take advantage of
the trend towards parallel computing architectures and to exploit the special characteristics
of our custom architectures we propose several high-level parallel optimal control
schemes that can reduce computation time. A novel optimization formulation was devised
for reducing the computational effort in solving certain problems independent of the computing
platform used. In order to be able to solve optimization problems in fixed-point
arithmetic, which is significantly more resource-efficient than floating-point, tailored linear
algebra algorithms were developed for solving the linear systems that form the computational
bottleneck in many optimization methods. These methods come with guarantees
for reliable operation. We also provide finite-precision error analysis for fixed-point implementations
of first-order methods that can be used to minimize the use of resources while
meeting accuracy specifications. The suggested techniques are demonstrated on several
practical examples, including a hardware-in-the-loop setup for optimization-based control
of a large airliner.Open Acces
Compiling dataflow graphs into hardware
Department Head: L. Darrell Whitley.2005 Fall.Includes bibliographical references (pages 121-126).Conventional computers are programmed by supplying a sequence of instructions that perform the desired task. A reconfigurable processor is "programmed" by specifying the interconnections between hardware components, thereby creating a "hardwired" system to do the particular task. For some applications such as image processing, reconfigurable processors can produce dramatic execution speedups. However, programming a reconfigurable processor is essentially a hardware design discipline, making programming difficult for application programmers who are only familiar with software design techniques. To bridge this gap, a programming language, called SA-C (Single Assignment C, pronounced "sassy"), has been designed for programming reconfigurable processors. The process involves two main steps - first, the SA-C compiler analyzes the input source code and produces a hardware-independent intermediate representation of the program, called a dataflow graph (DFG). Secondly, this DFG is combined with hardware-specific information to create the final configuration. This dissertation describes the design and implementation of a system that performs the DFG to hardware translation. The DFG is broken up into three sections: the data generators, the inner loop body, and the data collectors. The second of these, the inner loop body, is used to create a computational structure that is unique for each program. The other two sections are implemented by using prebuilt modules, parameterized for the particular problem. Finally, a "glue module" is created to connect the various pieces into a complete interconnection specification. The dissertation also explores optimizations that can be applied while processing the DFG, to improve performance. A technique for pipelining the inner loop body is described that uses an estimation tool for the propagation delay of the nodes within the dataflow graph. A scheme is also described that identifies subgraphs with the dataflow graph that can be replaced with lookup tables. The lookup tables provide a faster implementation than random logic in some instances
- ā¦