481 research outputs found
Video Processing Acceleration using Reconfigurable Logic and Graphics Processors
A vexing question is `which architecture will prevail as the core feature of the next state of
the art video processing system?' This thesis examines the substitutive and collaborative
use of the two alternatives of the reconfigurable logic and graphics processor architectures.
A structured approach to executing architecture comparison is presented - this includes a
proposed `Three Axes of Algorithm Characterisation' scheme and a formulation of perfor-
mance drivers. The approach is an appealing platform for clearly defining the problem,
assumptions and results of a comparison. In this work it is used to resolve the advanta-
geous factors of the graphics processor and reconfigurable logic for video processing, and
the conditions determining which one is superior. The comparison results prompt the
exploration of the customisable options for the graphics processor architecture. To clearly
define the architectural design space, the graphics processor is first identifed as part of
a wider scope of homogeneous multi-processing element (HoMPE) architectures. A novel
exploration tool is described which is suited to the investigation of the customisable op-
tions of HoMPE architectures. The tool adopts a systematic exploration approach and a
high-level parameterisable system model, and is used to explore pre- and post-fabrication
customisable options for the graphics processor. A positive result of the exploration is the
proposal of a reconfigurable engine for data access (REDA) to optimise graphics processor
performance for video processing-specific memory access patterns. REDA demonstrates
the viability of the use of reconfigurable logic as collaborative `glue logic' in the graphics
processor architecture
Hardware compilation of deep neural networks: an overview
Deploying a deep neural network model on a reconfigurable platform, such as an FPGA, is challenging due to the enormous design spaces of both network models and hardware design. A neural network model has various layer types, connection patterns and data representations, and the corresponding implementation can be customised with different architectural and modular parameters. Rather than manually exploring this design space, it is more effective to automate optimisation throughout an end-to-end compilation process. This paper provides an overview of recent literature proposing novel approaches to achieve this aim. We organise materials to mirror a typical compilation flow: front end, platform-independent optimisation and back end. Design templates for neural network accelerators are studied with a specific focus on their derivation methodologies. We also review previous work on network compilation and optimisation for other hardware platforms to gain inspiration regarding FPGA implementation. Finally, we propose some future directions for related research
Profile-directed specialisation of custom floating-point hardware
We present a methodology for generating
floating-point arithmetic hardware
designs which are, for suitable applications, much reduced in size, while still
retaining performance and IEEE-754 compliance. Our system uses three
key parts: a profiling tool, a set of customisable
floating-point units and a
selection of system integration methods.
We use a profiling tool for
floating-point behaviour to identify arithmetic
operations where fundamental elements of IEEE-754
floating-point may be
compromised, without generating erroneous results in the common case.
In the uncommon case, we use simple detection logic to determine when
operands lie outside the range of capabilities of the optimised hardware.
Out-of-range operations are handled by a separate, fully capable,
floatingpoint
implementation, either on-chip or by returning calculations to a host
processor. We present methods of system integration to achieve this errorcorrection.
Thus the system suffers no compromise in IEEE-754 compliance,
even when the synthesised hardware would generate erroneous results.
In particular, we identify from input operands the shift amounts required
for input operand alignment and post-operation normalisation. For operations
where these are small, we synthesise hardware with reduced-size
barrel-shifters. We also propose optimisations to take advantage of other
profile-exposed behaviours, including removing the hardware required to
swap operands in a floating-point adder or subtractor, and reducing the
exponent range to fit observed values.
We present profiling results for a range of applications, including a selection
of computational science programs, Spec FP 95 benchmarks and the
FFMPEG media processing tool, indicating which would be amenable to
our method. Selected applications which demonstrate potential for optimisation
are then taken through to a hardware implementation. We show up
to a 45% decrease in hardware size for a
floating-point datapath, with a
correctable error-rate of less then 3%, even with non-profiled datasets
Increasing the efficacy of automated instruction set extension
The use of Instruction Set Extension (ISE) in customising embedded processors for a specific
application has been studied extensively in recent years. The addition of a set of complex
arithmetic instructions to a baseline core has proven to be a cost-effective means of meeting
design performance requirements. This thesis proposes and evaluates a reconfigurable ISE
implementation called “Configurable Flow Accelerators” (CFAs), a number of refinements to
an existing Automated ISE (AISE) algorithm called “ISEGEN”, and the effects of source form
on AISE.
The CFA is demonstrated repeatedly to be a cost-effective design for ISE implementation.
A temporal partitioning algorithm called “staggering” is proposed and demonstrated on average
to reduce the area of CFA implementation by 37% for only an 8% reduction in acceleration.
This thesis then turns to concerns within the ISEGEN AISE algorithm. A methodology
for finding a good static heuristic weighting vector for ISEGEN is proposed and demonstrated.
Up to 100% of merit is shown to be lost or gained through the choice of vector. ISEGEN
early-termination is introduced and shown to improve the runtime of the algorithm by up to
7.26x, and 5.82x on average. An extension to the ISEGEN heuristic to account for pipelining
is proposed and evaluated, increasing acceleration by up to an additional 1.5x. An energyaware
heuristic is added to ISEGEN, which reduces the energy used by a CFA implementation
of a set of ISEs by an average of 1.6x, up to 3.6x. This result directly contradicts the frequently
espoused notion that “bigger is better” in ISE.
The last stretch of work in this thesis is concerned with source-level transformation: the effect
of changing the representation of the application on the quality of the combined hardwaresoftware
solution. A methodology for combined exploration of source transformation and ISE
is presented, and demonstrated to improve the acceleration of the result by an average of 35%
versus ISE alone. Floating point is demonstrated to perform worse than fixed point, for all
design concerns and applications studied here, regardless of ISEs employed
Improving low latency applications for reconfigurable devices
This thesis seeks to improve low latency application performance via architectural improvements in reconfigurable devices. This is achieved by improving resource utilisation and access, and by exploiting the different environments within which reconfigurable devices are deployed.
Our first contribution leverages devices deployed at the network level to enable the low latency processing of financial market data feeds. Financial exchanges transmit messages via two identical data feeds to reduce the chance of message loss. We present an approach to arbitrate these redundant feeds at the network level using a Field-Programmable Gate Array (FPGA). With support for any messaging protocol, we evaluate our design using the NASDAQ TotalView-ITCH, OPRA, and ARCA data feed protocols, and provide two simultaneous outputs: one prioritising low latency, and one prioritising high reliability with three dynamically configurable windowing methods.
Our second contribution is a new ring-based architecture for low latency, parallel access to FPGA memory. Traditional FPGA memory is formed by grouping block memories (BRAMs) together and accessing them as a single device. Our architecture accesses these BRAMs independently and in parallel. Targeting memory-based computing, which stores pre-computed function results in memory, we benefit low latency applications that rely on: highly-complex functions; iterative computation; or many parallel accesses to a shared resource. We assess square root, power, trigonometric, and hyperbolic functions within the FPGA, and provide a tool to convert Python functions to our new architecture.
Our third contribution extends the ring-based architecture to support any FPGA processing element. We unify E heterogeneous processing elements within compute pools, with each element implementing the same function, and the pool serving D parallel function calls. Our implementation-agnostic approach supports processing elements with different latencies, implementations, and pipeline lengths, as well as non-deterministic latencies. Compute pools evenly balance access to processing elements across the entire application, and are evaluated by implementing eight different neural network activation functions within an FPGA.Open Acces
Doctor of Philosophy
dissertationThe embedded system space is characterized by a rapid evolution in the complexity and functionality of applications. In addition, the short time-to-market nature of the business motivates the use of programmable devices capable of meeting the conflicting constraints of low-energy, high-performance, and short design times. The keys to achieving these conflicting constraints are specialization and maximally extracting available application parallelism. General purpose processors are flexible but are either too power hungry or lack the necessary performance. Application-specific integrated circuits (ASICS) efficiently meet the performance and power needs but are inflexible. Programmable domain-specific architectures (DSAs) are an attractive middle ground, but their design requires significant time, resources, and expertise in a variety of specialties, which range from application algorithms to architecture and ultimately, circuit design. This dissertation presents CoGenE, a design framework that automates the design of energy-performance-optimal DSAs for embedded systems. For a given application domain and a user-chosen initial architectural specification, CoGenE consists of a a Compiler to generate execution binary, a simulator Generator to collect performance/energy statistics, and an Explorer that modifies the current architecture to improve energy-performance-area characteristics. The above process repeats automatically until the user-specified constraints are achieved. This removes or alleviates the time needed to understand the application, manually design the DSA, and generate object code for the DSA. Thus, CoGenE is a new design methodology that represents a significant improvement in performance, energy dissipation, design time, and resources. This dissertation employs the face recognition domain to showcase a flexible architectural design methodology that creates "ASIC-like" DSAs. The DSAs are instruction set architecture (ISA)-independent and achieve good energy-performance characteristics by coscheduling the often conflicting constraints of data access, data movement, and computation through a flexible interconnect. This represents a significant increase in programming complexity and code generation time. To address this problem, the CoGenE compiler employs integer linear programming (ILP)-based 'interconnect-aware' scheduling techniques for automatic code generation. The CoGenE explorer employs an iterative technique to search the complete design space and select a set of energy-performance-optimal candidates. When compared to manual designs, results demonstrate that CoGenE produces superior designs for three application domains: face recognition, speech recognition and wireless telephony. While CoGenE is well suited to applications that exhibit a streaming behavior, multithreaded applications like ray tracing present a different but important challenge. To demonstrate its generality, CoGenE is evaluated in designing a novel multicore N-wide SIMD architecture, known as StreamRay, for the ray tracing domain. CoGenE is used to synthesize the SIMD execution cores, the compiler that generates the application binary, and the interconnection subsystem. Further, separating address and data computations in space reduces data movement and contention for resources, thereby significantly improving performance compared to existing ray tracing approaches
Recommended from our members
Complexity-reduced hardware-based track-trigger for CMS upgrade
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University LondonThe Compact Muon Solenoid (CMS) detector at the Large Hadron Collider (LHC)
is designed to study the results of proton-proton collisions. The Tracker
sub-detector is designed to detect and reconstruct the trajectories of charged
particles produced by the collisions. During the lifetime of the CMS detector,
there have been several upgrades aimed at increasing the chance of discovering
new physics through increased luminosity levels and instrumentation of
advanced technology. The High-Luminosity upgrade optimises the LHC to
accelerate high-energy particles with an average of 200 proton-proton
interactions per bunch crossing. The Level-1 Trigger system promptly analyses
and filters collisions using hardware to reduce the data volume in real-time. For
the upgrade, the trigger mechanism will use a particle trajectory estimator that
discriminates between particles based on their transverse momentum (pT ).
Particles with pT ≥ 2 GeV/c will be transmitted to the Level-1 Track-Trigger
system for trajectory reconstruction within a fixed 3 μs latency. This thesis
presents a novel Hardware-based Multivariate Linear Fitter (MVLF) system
focusing on robustness in tracking efficiency and reduction in logic resource
usage within the specified latency. The system components are implemented in
Field Programmable Gate Arrays (FPGA), targeting 16 nm FinFET UltraScale+
silicon technology. The development was performed using the High-Level
Synthesis (HLS) automation tools and the Hardware acceleration platform for
Application-Specific Integrated Circuits (ASIC). A firmware demonstrator has
been assembled to verify the feasibility and compatibility of the scaled system
with the CMS Level-1 Track-Trigger infrastructure. The system’s performance is
compared to past and current system developments, and the results are
presented accordingly
Improving Energy Efficiency of Application-Specific Instruction-Set Processors
Present-day consumer mobile devices seem to challenge the concept of embedded computing by bringing the equivalent of supercomputing power from two decades ago into hand-held devices. This challenge, however, is well met by pushing the boundaries of embedded computing further into areas previously monopolised by Application-Specific Integrated Circuits (ASICs). Furthermore, in areas traditionally associated with embedded computing, an increase in the complexity of algorithms and applications requires a continuous rise in availability of computing power and energy efficiency in order to fit within the same, or smaller, power budget. It is, ultimately, the amount of energy the application execution consumes that dictates the usefulness of a programmable embedded system, in comparison with implementation of an ASIC.This Thesis aimed to explore the energy efficiency overheads of Application-Specific InstructionSet Processors (ASIPs), a class of embedded processors aiming to compete with ASICs. While an ASIC can be designed to provide precise performance and energy efficiency required by a specific application without unnecessary overheads, the cost of design and verification, as well as the inability to upgrade or modify, favour more flexible programmable solutions. The ASIP designs can match the computing performance of the ASIC for specific applications. What is left, therefore, is achieving energy efficiency of a similar order of magnitude.In the past, one area of ASIP design that has been identified as a major consumer of energy is storage of temporal values produced during computation – the Register File (RF), with the associated interconnection network to transport those values between registers and computational Function Units (FUs). In this Thesis, the energy efficiency of RF and interconnection network is studied using the Transport Triggered Architectures (TTAs) template. Specifically, compiler optimisations aiming at reducing the traffic of temporal values between RF and FUs are presented in this Thesis. Bypassing of the temporal value, from the output of the FU which produces it directly in the input ports of the FUs that require it to continue with the computation, saves multiple RF reads. In addition, if all the uses of such a temporal value can be bypassed, the RF write can be eliminated as well. Such optimisations result in a simplification of the RF, via a reduction in the actual number of registers present or a reduction in the number of read and write ports in the RF and improved energy efficiency. In cases where the limited number of the simultaneous RF reads or writes cause a performance bottleneck, such optimisations result in performance improvements leading to faster execution times, therefore, allowing for execution at lower clock frequencies resulting in additional energy savings.Another area of the ASIP design consuming a significant amount of energy is the instruction memory subsystem, which is the artefact required for the programmability of the embedded processor. As this subsystem is not present in ASIC, the energy consumed for storing an application program and reading it from the instruction memories to control processor execution is an overhead that needs to be minimised. In this Thesis, one particular tool to improve the energy efficiency of the instruction memory subsystem – instruction buffer – is examined. While not trivially obvious, the presence of buffers for storing loop bodies, or parts of them, results in a reduced number of reads from the instruction memories. As a result, memories can be put to lower power state leading to lower overall energy consumption, pending energy-efficient buffer implementation. Specifically, an energy-efficient implementation of the instruction buffer is presented in this Thesis, together with analysis tools to identify candidate loops and assess their suitability for storing in the instruction buffer.The studies presented in this Thesis show that the energy overheads associated with the use of embedded processors, in comparison to ad-hoc ASIC solutions, are manageable when carefully considered during the design of an embedded system for a particular application, or application domain. Finally, the methods presented in this Thesis do not restrict the reprogrammability of the embedded system
- …