513 research outputs found
Cooperative high-performance computing with FPGAs - matrix multiply case-study
In high-performance computing, there is great opportunity for systems
that use FPGAs to handle communication while also performing
computation on data in transit in an ``altruistic'' manner--that is,
using resources for computation that might otherwise be used for
communication, and in a way that improves overall system performance
and efficiency. We provide a specific definition of \textbf{Computing
in the Network} that captures this opportunity. We then outline some
overall requirements and guidelines for cooperative computing that
include this ability, and make suggestions for specific computing
capabilities to be added to the networking hardware in a system. We
then explore some algorithms running on a network so equipped
for a few specific computing tasks: dense matrix multiplication,
sparse matrix transposition and sparse matrix multiplication. In the
first instance we give limits of problem size and estimates of
performance that should be attainable with present-day FPGA hardware
Choose-Your-Own Adventure: A Lightweight, High-Performance Approach To Defect And Variation Mitigation In Reconfigurable Logic
For field-programmable gate arrays (FPGAs), fine-grained pre-computed alternative configurations, combined with simple test-based selection, produce limited per-chip specialization to counter yield loss, increased delay, and increased energy costs that come from fabrication defects and variation. This lightweight approach achieves much of the benefit of knowledge-based full specialization while reducing to practical, palatable levels the computational, testing, and load-time costs that obstruct the application of the knowledge-based approach. In practice this may more than double the power-limited computational capabilities of dies fabricated with 22nm technologies.
Contributions of this work:
• Choose-Your-own-Adventure (CYA), a novel, lightweight, scalable methodology to achieve defect and variation mitigation
• Implementation of CYA, including preparatory components (generation of diverse alternative paths) and FPGA load-time components
• Detailed performance characterization of CYA
– Comparison to conventional loading and dynamic frequency and voltage scaling (DFVS)
– Limit studies to characterize the quality of the CYA implementation and identify potential areas for further optimizatio
Neural networks-on-chip for hybrid bio-electronic systems
PhD ThesisBy modelling the brains computation we can further our understanding
of its function and develop novel treatments for neurological disorders. The
brain is incredibly powerful and energy e cient, but its computation does
not t well with the traditional computer architecture developed over the
previous 70 years. Therefore, there is growing research focus in developing
alternative computing technologies to enhance our neural modelling capability,
with the expectation that the technology in itself will also bene t from
increased awareness of neural computational paradigms.
This thesis focuses upon developing a methodology to study the design
of neural computing systems, with an emphasis on studying systems suitable
for biomedical experiments. The methodology allows for the design to be
optimized according to the application. For example, di erent case studies
highlight how to reduce energy consumption, reduce silicon area, or to
increase network throughput.
High performance processing cores are presented for both Hodgkin-Huxley
and Izhikevich neurons incorporating novel design features. Further, a complete
energy/area model for a neural-network-on-chip is derived, which is
used in two exemplar case-studies: a cortical neural circuit to benchmark
typical system performance, illustrating how a 65,000 neuron network could
be processed in real-time within a 100mW power budget; and a scalable highperformance
processing platform for a cerebellar neural prosthesis. From
these case-studies, the contribution of network granularity towards optimal
neural-network-on-chip performance is explored
SPICE²: A Spatial, Parallel Architecture for Accelerating the Spice Circuit Simulator
Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8X speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase, sparse dataflow parallelism in the Sparse Matrix-Solve phase and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code-generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms.
We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X(1.4--23X) across a range of non-linear device models and Matrix-Solve by 2.4X(0.6--13X) across various benchmark matrices while delivering a mean combined speedup of 2.8X(0.2--11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm). With our high-level framework, we can also accelerate Single-Precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures.
We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (\eg multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.</p
- …