1,642 research outputs found
An Intermediate Language and Estimator for Automated Design Space Exploration on FPGAs
We present the TyTra-IR, a new intermediate language intended as a
compilation target for high-level language compilers and a front-end for HDL
code generators. We develop the requirements of this new language based on the
design-space of FPGAs that it should be able to express and the
estimation-space in which each configuration from the design-space should be
mappable in an automated design flow. We use a simple kernel to illustrate
multiple configurations using the semantics of TyTra-IR. The key novelty of
this work is the cost model for resource-costs and throughput for different
configurations of interest for a particular kernel. Through the realistic
example of a Successive Over-Relaxation kernel implemented both in TyTra-IR and
HDL, we demonstrate both the expressiveness of the IR and the accuracy of our
cost model.Comment: Pre-print and extended version of poster paper accepted at
international symposium on Highly Efficient Accelerators and Reconfigurable
Technologies (HEART2015) Boston, MA, USA, June 1-2, 201
GPU peer-to-peer techniques applied to a cluster interconnect
Modern GPUs support special protocols to exchange data directly across the
PCI Express bus. While these protocols could be used to reduce GPU data
transmission times, basically by avoiding staging to host memory, they require
specific hardware features which are not available on current generation
network adapters. In this paper we describe the architectural modifications
required to implement peer-to-peer access to NVIDIA Fermi- and Kepler-class
GPUs on an FPGA-based cluster interconnect. Besides, the current software
implementation, which integrates this feature by minimally extending the RDMA
programming model, is discussed, as well as some issues raised while employing
it in a higher level API like MPI. Finally, the current limits of the technique
are studied by analyzing the performance improvements on low-level benchmarks
and on two GPU-accelerated applications, showing when and how they seem to
benefit from the GPU peer-to-peer method.Comment: paper accepted to CASS 201
Precision analysis for hardware acceleration of numerical algorithms
The precision used in an algorithm affects the error and performance of individual computations, the
memory usage, and the potential parallelism for a fixed hardware budget. However, when migrating
an algorithm onto hardware, the potential improvements that can be obtained by tuning the precision
throughout an algorithm to meet a range or error specification are often overlooked; the major reason
is that it is hard to choose a number system which can guarantee any such specification can be met.
Instead, the problem is mitigated by opting to use IEEE standard double precision arithmetic so as to be
‘no worse’ than a software implementation. However, the flexibility in the number representation is one
of the key factors that can be exploited on reconfigurable hardware such as FPGAs, and hence ignoring
this potential significantly limits the performance achievable.
In order to optimise the performance of hardware reliably, we require a method that can tractably
calculate tight bounds for the error or range of any variable within an algorithm, but currently only a
handful of methods to calculate such bounds exist, and these either sacrifice tightness or tractability,
whilst simulation-based methods cannot guarantee the given error estimate. This thesis presents a new
method to calculate these bounds, taking into account both input ranges and finite precision effects,
which we show to be, in general, tighter in comparison to existing methods; this in turn can be used to
tune the hardware to the algorithm specifications.
We demonstrate the use of this software to optimise hardware for various algorithms to accelerate
the solution of a system of linear equations, which forms the basis of many problems in engineering
and science, and show that significant performance gains can be obtained by using this new approach in
conjunction with more traditional hardware optimisations
Real time unsupervised learning of visual stimuli in neuromorphic VLSI systems
Neuromorphic chips embody computational principles operating in the nervous
system, into microelectronic devices. In this domain it is important to
identify computational primitives that theory and experiments suggest as
generic and reusable cognitive elements. One such element is provided by
attractor dynamics in recurrent networks. Point attractors are equilibrium
states of the dynamics (up to fluctuations), determined by the synaptic
structure of the network; a `basin' of attraction comprises all initial states
leading to a given attractor upon relaxation, hence making attractor dynamics
suitable to implement robust associative memory. The initial network state is
dictated by the stimulus, and relaxation to the attractor state implements the
retrieval of the corresponding memorized prototypical pattern. In a previous
work we demonstrated that a neuromorphic recurrent network of spiking neurons
and suitably chosen, fixed synapses supports attractor dynamics. Here we focus
on learning: activating on-chip synaptic plasticity and using a theory-driven
strategy for choosing network parameters, we show that autonomous learning,
following repeated presentation of simple visual stimuli, shapes a synaptic
connectivity supporting stimulus-selective attractors. Associative memory
develops on chip as the result of the coupled stimulus-driven neural activity
and ensuing synaptic dynamics, with no artificial separation between learning
and retrieval phases.Comment: submitted to Scientific Repor
- …