53 research outputs found
Automated application-specific optimisation of interconnects in multi-core systems
In embedded computer systems there are often tasks, implemented as stand-alone devices,
that are both application-specific and compute intensive. A recurring problem
in this area is to design these application-specific embedded systems as close to the
power and efficiency envelope as possible. Work has been done on optimizing single-core
systems and memory organisation, but current methods for achieving system design
goals are proving limited as the system capabilities and system size increase in the
multi- and many-core era. To address this problem, this thesis investigates machine
learning approaches to managing the design space presented in the interconnect design
of embedded multi-core systems. The design space presented is large due to the
system scale and level of interconnectivity, and also features interdependent parameters,
further complicating analysis. The results presented in this thesis demonstrate
that machine learning approaches, particularly wkNN and random forest, work well
in handling the complexity of the design space. The benefits of this approach are in
automation, saving time and effort in the system design phase as well as energy and
execution time in the finished system.
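As a rough illustration of the weighted k-nearest-neighbour (wkNN) idea named in this abstract, the sketch below predicts a cost metric for an unseen design point from a few measured ones. The parameter names (link width, buffer depth), the energy figures and the function `wknn_predict` are hypothetical and are not taken from the thesis. The inverse-distance weighting is one common choice, not necessarily the one the author used.

```python
import math


def wknn_predict(samples, query, k=3, eps=1e-9):
    """Predict a metric for `query` as the inverse-distance-weighted
    mean of the k nearest measured design points."""
    # samples: list of (parameter_vector, measured_value) pairs
    by_distance = sorted(
        (math.dist(params, query), value) for params, value in samples
    )
    nearest = by_distance[:k]
    # eps avoids division by zero when the query coincides with a sample
    weights = [1.0 / (d + eps) for d, _ in nearest]
    weighted_sum = sum(w * v for w, (_, v) in zip(weights, nearest))
    return weighted_sum / sum(weights)


# Hypothetical design points:
# (link width in bits, buffer depth) -> energy in arbitrary units
SAMPLES = [
    ((16, 2), 10.0),
    ((16, 4), 12.0),
    ((32, 2), 14.0),
    ((32, 4), 18.0),
    ((64, 8), 30.0),
]

if __name__ == "__main__":
    print(wknn_predict(SAMPLES, (24, 3), k=3))
```

In practice the parameters would normally be scaled to comparable ranges before distances are computed, since otherwise the parameter with the largest numeric range dominates the neighbour selection.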
A multi-level functional IR with rewrites for higher-level synthesis of accelerators
Specialised accelerators deliver orders of magnitude higher energy efficiency than
general-purpose processors. Field Programmable Gate Arrays (FPGAs) have become
the substrate of choice, because the ever-changing nature of modern workloads, such
as machine learning, demands reconfigurability. However, they are notoriously hard
to program directly using Hardware Description Languages (HDLs). Traditional High-Level Synthesis (HLS) tools improve productivity, but come with their own problems.
They often produce sub-optimal designs and programmers are still required to write
hardware-specific code, thus development cycles remain long.
This thesis proposes Shir, a higher-level synthesis approach for high-performance
accelerator design with a hardware-agnostic programming entry point, a multi-level
Intermediate Representation (IR), a compiler and rewrite rules for optimisation.
First, a novel, multi-level functional IR structure for accelerator design is described.
The IRs operate on different levels of abstraction, cleanly separating different hardware
concerns. They enable the expression of different forms of parallelism and standard
memory features, such as asynchronous off-chip memories or synchronous on-chip
buffers, as well as arbitration of such shared resources. Exposing these features at the
IR level is essential for achieving high performance.
Next, mechanical lowering procedures are introduced to automatically compile
a program specification through Shir’s functional IRs until low-level HDL code for
FPGA synthesis is emitted. Each lowering step gradually adds implementation details.
Finally, this thesis presents rewrite rules for automatic optimisations around parallelisation, buffering and data reshaping. Reshaping operations pose a challenge to
functional approaches in particular. They introduce overheads that compromise performance or even prevent the generation of synthesisable hardware designs altogether.
This fundamental issue is solved by the application of rewrite rules.
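To make the idea of rewrite rules over a functional IR concrete, here is a minimal sketch using nested tuples as expressions. The operator names (`map`, `join`, `split`, `compose`) and both rules are illustrative assumptions. They are not Shir's actual IR or rule set. The sketch only shows how a reshaping pair can be cancelled and two traversals fused by pattern matching.

```python
def rewrite(expr):
    """Rewrite an expression bottom-up using two illustrative rules.

    Expressions are nested tuples whose first element is an operator name:
      ("map", f, x), ("join", x), ("split", n, x), ("compose", f, g)
    Anything that is not a tuple is treated as a leaf.
    """
    if not isinstance(expr, tuple):
        return expr
    # Rewrite sub-expressions first, so rules see simplified children.
    expr = tuple(rewrite(e) for e in expr)

    # Rule 1: join(split(n, x)) => x  (the reshape pair cancels out)
    if expr[0] == "join" and isinstance(expr[1], tuple) and expr[1][0] == "split":
        return expr[1][2]

    # Rule 2: map(f, map(g, x)) => map(compose(f, g), x)  (one pass, not two)
    if expr[0] == "map" and isinstance(expr[2], tuple) and expr[2][0] == "map":
        f = expr[1]
        _, g, x = expr[2]
        return rewrite(("map", ("compose", f, g), x))

    return expr


# map f (join (split 4 (map g input)))
EXAMPLE = ("map", "f", ("join", ("split", 4, ("map", "g", "input"))))

if __name__ == "__main__":
    print(rewrite(EXAMPLE))
```

Removing the `split`/`join` pair first is what exposes the two adjacent `map` operators to the fusion rule, which mirrors the abstract's point that reshaping overheads can block later optimisation.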
The viability of this approach is demonstrated by running matrix multiplication
and 2D convolution on an Intel Arria 10 FPGA. A limited design space exploration is
conducted, confirming the ability of the IR to exploit various hardware features. Using
rewrite rules for optimisation, it is possible to generate high-performance designs
that are competitive with highly tuned OpenCL implementations and that outperform
hardware-agnostic OpenCL code. The performance impact of the optimisations is
further evaluated showing that they are essential to achieving high performance, and
in many cases also necessary to produce hardware that fits the resource constraints.
Correct synthesis and integration of compiler-generated function units
PhD Thesis
Computer architectures can use custom logic in addition to general purpose
processors to improve performance for a variety of applications. The
use of custom logic allows greater parallelism for some algorithms. While
conventional CPUs typically operate on words, fine-grained custom logic
can improve efficiency for many bit level operations. The commodification
of field programmable devices, particularly FPGAs, has improved
the viability of using custom logic in an architecture.
This thesis introduces an approach to reasoning about the correctness of
compilers that generate custom logic that can be synthesized to provide
hardware acceleration for a given application. Compiler intermediate
representations (IRs) and transformations that are relevant to generation
of custom logic are presented. Architectures may vary in the way
that custom logic is incorporated, and suitable abstractions are used in
order that the results apply to compilation for a variety of the design
parameters that are introduced by the use of custom logic.
Interpreted graph models
A model class called an Interpreted Graph Model (IGM) is defined. This class includes a large number of graph-based models that are used in asynchronous circuit design and other applications of concurrency. The defining characteristic of this model class is an underlying static graph-like structure where behavioural semantics are attached using additional entities, such as tokens or node/arc states. The similarities in notation and expressive power allow a number of operations on these formalisms, such as visualisation, interactive simulation, serialisation, schematic entry and model conversion to be generalised. A software framework called Workcraft was developed to take advantage of these properties of IGMs. Workcraft provides an environment for rapid prototyping of graph-like models and related tools. It provides a large set of standardised functions that considerably facilitate the task of providing tool support for any IGM. The concept of Interpreted Graph Models is the result of research on methods of application of lower level models, such as Petri nets, as a back-end for simulation and verification of higher level models that are more easily manipulated. The goal is to achieve a high degree of automation of this process. In particular, a method for verification of speed-independence of asynchronous circuits is presented. Using this method, the circuit is specified as a gate netlist and its environment is specified as a Signal Transition Graph. The circuit is then automatically translated into a behaviourally equivalent Petri net model. This model is then composed with the specification of the environment. A number of important properties can be established on this compound model, such as the absence of deadlocks and hazards. If a trace is found that violates the required property, it is automatically interpreted in terms of switching of the gates in the original gate-level circuit specification and may be presented visually to the circuit designer.
A similar technique is also used for the verification of a model called Static Data Flow Structure (SDFS). This high level model describes the behaviour of an asynchronous data path. SDFS is particularly interesting because it models complex behaviours such as preemption, early evaluation and speculation. Preemption is a technique that allows data objects in a computation pipeline to be destroyed if the result of the computation is no longer needed, reducing power consumption. Early evaluation allows a circuit to compute the output using a subset of its inputs, preempting the inputs that are not needed. In speculation, all conflicting branches of computation run concurrently without waiting for the selecting condition; once the selecting condition is computed, the unneeded branches are preempted. The automated Petri net based verification technique is especially useful in this case because of the complex nature of these features. As a result of this work, a number of cases are presented where the concept of IGMs and the Workcraft tool were instrumental. These include the design of two different types of arbiter circuits, the design and debugging of the SDFS model, synthesis of asynchronous circuits from the Conditional Partial Order Graph model and the modification of the workflow of the Balsa asynchronous circuit synthesis system.
EThOS - Electronic Theses Online Service; EPSRC; GB; United Kingdo
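The deadlock check on a Petri net model mentioned in this abstract can be sketched as a breadth-first search over reachable markings. This is a simplified illustration and not Workcraft's algorithm. It assumes a safe (1-bounded) net, so a marking is just a set of marked places, and it treats any marking with no enabled transition as a deadlock. The place and transition names are hypothetical.

```python
from collections import deque


def enabled(marking, transitions):
    """Names of transitions whose input places are all marked."""
    return [name for name, (pre, _) in transitions.items() if pre <= marking]


def fire(marking, transition):
    """Consume tokens from the input places, produce them on the outputs."""
    pre, post = transition
    return (marking - pre) | post


def find_deadlock(initial, transitions):
    """Return a firing trace that leads to a deadlock, or None if no
    reachable marking is a deadlock."""
    start = frozenset(initial)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        marking, trace = queue.popleft()
        names = enabled(marking, transitions)
        if not names:
            return trace  # nothing can fire: this marking is a deadlock
        for name in names:
            nxt = fire(marking, transitions[name])
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, trace + [name]))
    return None


# Hypothetical two-phase handshake: each transition is (input places, output places)
LIVE = {
    "req+": (frozenset({"idle"}), frozenset({"busy"})),
    "ack+": (frozenset({"busy"}), frozenset({"idle"})),
}
# The same net with the acknowledge transition missing
BROKEN = {
    "req+": (frozenset({"idle"}), frozenset({"busy"})),
}

if __name__ == "__main__":
    print(find_deadlock({"idle"}, LIVE))
    print(find_deadlock({"idle"}, BROKEN))
```

The returned trace plays the role of the violating trace described in the abstract: it lists the transitions to fire, in order, to reach the offending marking.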
Dynamically reconfigurable asynchronous processor
The main design requirements for today's mobile applications are:
· high throughput performance.
· high energy efficiency.
· high programmability.
Until now, the choice of platform has often been limited to Application-Specific
Integrated Circuits (ASICs), due to their best-of-breed performance and power
consumption. The economies of scale possible with these high-volume markets have
traditionally been able to hide the high Non-Recurring Engineering (NRE) costs
required for designing and fabricating new ASICs. However, with the NREs and
design time escalating with each generation of mobile applications, this practice may
be reaching its limit.
Designers today are looking at programmable solutions, so that they can respond
more rapidly to changes in the market and spread costs over several generations of
mobile applications. However, there have been few feasible alternatives to ASICs:
Digital Signal Processors (DSPs) and microprocessors cannot meet the throughput
requirements, whereas Field-Programmable Gate Arrays (FPGAs) require too much
area and power.
Coarse-grained dynamically reconfigurable architectures offer better solutions for
high throughput applications, when power and area considerations are taken into
account. One promising example is the Reconfigurable Instruction Cell Array
(RICA). RICA consists of an array of cells with an interconnect that can be
dynamically reconfigured on every cycle. This allows quite complex datapaths to be
rendered onto the fabric and executed in a single configuration - making these
architectures particularly suitable to stream processing. Furthermore, RICA can be
programmed from C, making it a good fit with existing design methodologies.
However, the RICA architecture has a drawback: poor scalability in terms of area and
power. As the core gets bigger, the number of sequential elements in the array must
be increased significantly to maintain the ability to achieve high throughputs through
pipelining. As a result, a larger clock tree is required to synchronise the increased
number of sequential elements. The clock tree therefore takes up a larger percentage
of the area and power consumption of the core.
This thesis presents a novel Dynamically Reconfigurable Asynchronous Processor
(DRAP), aimed at high-throughput mobile applications. DRAP is based on the RICA
architecture, but uses asynchronous design techniques - methods of designing digital
systems without clocks. The absence of a global clock signal makes DRAP more
scalable in terms of power and area overhead than its synchronous counterpart.
The DRAP architecture maintains most of the benefits of custom asynchronous
design, whilst also providing programmability via conventional high-level languages.
Results show that the DRAP processor delivers considerably lower power
consumption than a market-leading Very Long Instruction Word
(VLIW) processor and a low-power ARM processor. For example, DRAP consumed
20 times less power than the ARM7 processor, and
29 times less than the TI C64x VLIW, when running the same benchmark capped
to the same throughput and for the same process technology (0.13 μm). Compared
with an equivalent RICA design, DRAP was up to 22% larger but
reduced power by up to 1.9 times. It also achieved up
to 2.8 times higher throughput than RICA for the same benchmarks.
6502 emulator on FPGA
The 6502 microprocessor was used in many microcomputers of the 1980s,
including the Apple II line of computers, the Commodore PET, the Commodore 64,
the Atari 8-bit series and even the Nintendo Entertainment System (NES) video
game console.
The objective of this project is to emulate the once-famous 6502 microprocessor on an
FPGA chip. The FPGA-based 6502 microprocessor has to emulate the functionality of
a real 6502 microprocessor. Accurate pinout emulation is desired but not required.
The 6502 assembly language is easy to learn, and building a computer based on this
microprocessor requires very few parts, which makes this project a great experiential
learning process.
The project requires the student to have an in-depth understanding of
computer system architecture, especially the 6502 architecture; of Verilog, in order to understand
the existing 6502 source code from Bird Computer; and of the FPGA development process
(synthesis tools), in order to transfer the Verilog code to the FPGA chip.
Thus far, the resources and information available on the 6502 microprocessor look promising. The
student's original scope was to write the 6502 in Verilog HDL, but because
code is already available from Bird Computer (coded as a state machine), the student
changed his objective to understanding the existing code and implementing it on an FPGA
only. When problems arose during hardware implementation, the focus
switched again, to simulating the existing code, an ALU or a simple processor, in order to build
up the student's understanding and to provide documentation for future expansion of the project. To test
the functionality of the 6502 system, the student will either find an existing application or
write a simple program to run on the FPGA-based 6502 system.
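One way to prepare such a simple test program is to first run it on a small software reference model and then compare the FPGA result against it. The sketch below is a hypothetical reference model, not part of the project or of the Bird Computer code. It interprets only five documented 6502 opcodes (CLC, LDA immediate, ADC immediate in binary mode, STA zero page, BRK), uses BRK only to stop, assumes a load address of 0x0200, and leaves out the N, Z and V flags.

```python
def run_6502(program, max_steps=1000):
    """Interpret a tiny subset of 6502 opcodes.

    Returns (accumulator, carry flag, memory) after BRK or max_steps.
    """
    mem = bytearray(65536)
    origin = 0x0200  # assumed load address for this sketch
    mem[origin:origin + len(program)] = bytes(program)
    a, carry, pc = 0, 0, origin

    for _ in range(max_steps):
        opcode = mem[pc]
        if opcode == 0x00:        # BRK: used here only to halt
            break
        operand = mem[pc + 1]     # ignored by one-byte instructions
        if opcode == 0x18:        # CLC: clear carry
            carry = 0
            pc += 1
        elif opcode == 0xA9:      # LDA #imm: load accumulator
            a = operand
            pc += 2
        elif opcode == 0x69:      # ADC #imm: add with carry (binary mode)
            total = a + operand + carry
            carry = 1 if total > 0xFF else 0
            a = total & 0xFF
            pc += 2
        elif opcode == 0x85:      # STA zp: store accumulator in zero page
            mem[operand] = a
            pc += 2
        else:
            raise ValueError(f"unsupported opcode {opcode:#04x} at {pc:#06x}")
    return a, carry, mem


# CLC; LDA #$F0; ADC #$20; STA $10; BRK
TEST_PROGRAM = [0x18, 0xA9, 0xF0, 0x69, 0x20, 0x85, 0x10, 0x00]

if __name__ == "__main__":
    acc, c, memory = run_6502(TEST_PROGRAM)
    print(hex(acc), c, hex(memory[0x10]))
```

The example adds 0xF0 and 0x20, which overflows eight bits, so it exercises both the accumulator result and the carry flag in a single short program.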
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.
Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201