Unifying software and hardware of multithreaded reconfigurable applications within operating system processes
Novel reconfigurable System-on-Chip (SoC) devices allow software to be combined with application-specific hardware accelerators to speed up applications. However, by mixing user software and user hardware, fundamental programming abstractions and system-software conveniences are usually lost, since hardware accelerators (1) have no execution context: it is typically the programmer who must provide one for each accelerator; (2) have no virtual memory abstraction: it is again the programmer who must communicate data from user software space to user hardware, which is usually burdensome and sometimes impossible; (3) cannot invoke system services (e.g., to allocate memory, open files, or communicate); and (4) are not easily portable: they depend mostly on system-level interfacing, although they logically belong to the application level. We introduce a unified Operating System (OS) process for codesigned reconfigurable applications that provides (1) a unified memory abstraction for the software and hardware parts of an application, (2) execution transfers from software to hardware and vice versa, enabling hardware accelerators to use system services and to call back other software and hardware functions, and (3) multithreaded execution of multiple software and hardware threads. The unified OS process ensures the portability of codesigned applications by providing standardised means of interfacing. Adding yet another abstraction layer usually affects performance; we show that runtime optimisations in the system layer supporting the unified OS process can minimise the performance loss and even outperform typical approaches. The unified OS process also fosters unrestricted automated synthesis of software to hardware, allowing unlimited migration of application components. We demonstrate the advantages of the unified OS process in practice, for Linux systems running on Xilinx Virtex-II Pro and Altera Excalibur reconfigurable devices.
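The execution-transfer idea above can be sketched in software alone: a "hardware" accelerator stub operates on the same memory as the software side (the unified memory abstraction) and, mid-computation, transfers control back to software to request a system service. All names here (`hw_accumulate`, `sw_alloc_service`, `syscall_cb`) are illustrative stand-ins, not the paper's actual API; this is a minimal model of the control flow, not an implementation of the unified OS process.

```c
/* Hypothetical sketch of a software<->hardware execution transfer.
 * The accelerator is modelled as a C function; the callback models the
 * transfer through which a hardware thread invokes an OS service. */

typedef int (*syscall_cb)(int nbytes);   /* software-side service entry */

/* Software-side service the accelerator can call back into
 * (a stand-in for, e.g., memory allocation). */
static int bytes_granted = 0;
static int sw_alloc_service(int nbytes) {
    bytes_granted += nbytes;             /* pretend to allocate */
    return bytes_granted;
}

/* "Hardware" accelerator stub: reads a buffer shared with software
 * (same pointer, i.e. a unified address space) and performs an
 * execution transfer to software partway through its work. */
static int hw_accumulate(const int *buf, int n, syscall_cb request) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += buf[i];
    request(n * (int)sizeof(int));       /* transfer: hardware -> OS service */
    return sum;                           /* resume and return to caller */
}
```

In the paper's setting the transfer crosses the software/hardware boundary through the system layer rather than a function pointer, but the observable contract is the same: the accelerator runs with a context, sees application memory directly, and can invoke services.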
Datapath and memory co-optimization for FPGA-based computation
With the large resource densities available on modern FPGAs it is often the available
memory bandwidth that limits the parallelism (and therefore performance) that can be
achieved. For this reason the focus of this thesis is the development of an integrated
scheduling and memory optimisation methodology to allow high levels of parallelism to be
exploited in FPGA-based designs.
A manual translation from C to hardware is first investigated as a case study,
exposing a number of potential optimisation techniques that have not been exploited in
existing work. An existing outer loop pipelining approach, originally developed for VLIW
processors, is extended and adapted for application to FPGAs. The outer loop pipelining
methodology is first developed to use a fixed memory subsystem design and then extended
to automate the optimisation of the memory subsystem. This approach allocates arrays
to physical memories and selects the set of data reuse structures to implement to match
the available and required memory bandwidths as the pipelining search progresses. The
final extension to this work is to include the partitioning of data from a single array across
multiple physical memories, increasing the number of memory ports through which data
may be accessed. The facility for loop unrolling is also added to increase the potential for
parallelism and exploit the additional bandwidth that partitioning can provide.
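A common way to realise this partitioning (the thesis does not fix the scheme here, so cyclic interleaving and the two-bank choice below are assumptions for illustration) is to map element `i` of a logical array to bank `i % NBANKS` at offset `i / NBANKS`, so that `NBANKS` consecutive elements sit in different physical memories and can be fetched through independent ports in the same cycle:

```c
/* Sketch of cyclic partitioning of one logical array across NBANKS
 * physical memories. NBANKS = 2 doubles the available memory ports;
 * unrolling the consuming loop by NBANKS then lets the pipeline issue
 * NBANKS loads per cycle. Names and constants are illustrative. */

#define NBANKS 2
#define N 8

static int bank[NBANKS][N / NBANKS];     /* the physical memories */

/* Scatter the logical array into banks: element i -> bank i % NBANKS,
 * offset i / NBANKS (the address decode a generated memory subsystem
 * would implement). */
static void partition(const int *a, int n) {
    for (int i = 0; i < n; i++)
        bank[i % NBANKS][i / NBANKS] = a[i];
}

/* Read logical element i back through its bank using the same decode. */
static int read_elem(int i) {
    return bank[i % NBANKS][i / NBANKS];
}
```

With this layout, an inner loop unrolled by a factor of `NBANKS` touches each bank exactly once per iteration, which is the bandwidth match the co-optimisation search is selecting for.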
We describe our approach based on formal methodologies and present the results
achieved when these methods are applied to a number of benchmarks. These results show
the advantages both of extending pipelining to levels above the innermost loop and of
co-optimising the datapath and memory subsystem.