1,441 research outputs found
Empowering parallel computing with field programmable gate arrays
After more than 30 years, reconfigurable computing has grown from a concept to a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array, a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumannlike architectures opens the way to eliminate the instruction overhead and to optimize the execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis support as well as a better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements
Survey on Combinatorial Register Allocation and Instruction Scheduling
Register allocation (mapping variables to processor registers or memory) and
instruction scheduling (reordering instructions to increase instruction-level
parallelism) are essential tasks for generating efficient assembly code in a
compiler. In the last three decades, combinatorial optimization has emerged as
an alternative to traditional, heuristic algorithms for these two tasks.
Combinatorial optimization approaches can deliver optimal solutions according
to a model, can precisely capture trade-offs between conflicting decisions, and
are more flexible at the expense of increased compilation time.
This paper provides an exhaustive literature review and a classification of
combinatorial optimization approaches to register allocation and instruction
scheduling, with a focus on the techniques that are most applied in this
context: integer programming, constraint programming, partitioned Boolean
quadratic programming, and enumeration. Researchers in compilers and
combinatorial optimization can benefit from identifying developments, trends,
and challenges in the area; compiler practitioners may discern opportunities
and grasp the potential benefit of applying combinatorial optimization
The Potential for a GPU-Like Overlay Architecture for FPGAs
We propose a soft processor programming
model and architecture inspired by graphics processing units
(GPUs) that are well-matched to the strengths of FPGAs,
namely, highly parallel and pipelinable computation. In
particular, our soft processor architecture exploits multithreading,
vector operations, and predication to supply a
floating-point pipeline of 64 stages via hardware support
for up to 256 concurrent thread contexts. The key new
contributions of our architecture are mechanisms for managing
threads and register files that maximize data-level and
instruction-level parallelism while overcoming the challenges
of port limitations of FPGA block memories as well as
memory and pipeline latency. Through simulation of a
system that (i) is programmable via NVIDIA's high-level
Cg language, (ii) supports AMD's CTM r5xx GPU ISA, and
(iii) is realizable on an XtremeData XD1000 FPGA-based
accelerator system, we demonstrate the potential for such
a system to achieve 100% utilization of a deeply pipelined
floating-point datapath
Recommended from our members
Silicon compilation
Silicon compilation is a term used for many different purposes. In this paper we define silicon compilation as a mapping from some higher level description into layout. We define the basic issues in structural and behavioral silicon compilation and some possible solutions to those issues. Finally, we define the concept of an intelligent silicon compiler in which the compiler evaluates the quality of the generated design and attempts to improve it if it is not satisfactory
Coarse-grained reconfigurable array architectures
Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code
Generic Pipelined Processor Modeling and High Performance Cycle-Accurate Simulator Generation
Detailed modeling of processors and high performance cycle-accurate
simulators are essential for today's hardware and software design. These
problems are challenging enough by themselves and have seen many previous
research efforts. Addressing both simultaneously is even more challenging, with
many existing approaches focusing on one over another. In this paper, we
propose the Reduced Colored Petri Net (RCPN) model that has two advantages:
first, it offers a very simple and intuitive way of modeling pipelined
processors; second, it can generate high performance cycle-accurate simulators.
RCPN benefits from all the useful features of Colored Petri Nets without
suffering from their exponential growth in complexity. RCPN processor models
are very intuitive since they are a mirror image of the processor pipeline
block diagram. Furthermore, in our experiments on the generated cycle-accurate
simulators for XScale and StrongArm processor models, we achieved an order of
magnitude (~15 times) speedup over the popular SimpleScalar ARM simulator.Comment: Submitted on behalf of EDAA (http://www.edaa.com/
M2: An architectural system for computer design
The number of embedded computer systems has been growing rapidly as system costs have declined and capabilities have increased. The rationale behind design decisions for embedded systems is often informal and based on estimates of key values rather than actual measurements. Because of the small number of programs typically executed by an embedded processor, significant opportunities for optimization exist;M2 is an architectural system for computer design. It consists of language tools, architectural tools, and implementation tools. The language tools gather information about programs at compile time and at execution time. This information is used by the implementation tools to generate candidate processor implementations which are evaluated with the architectural tools. The evaluation involves comparing the size, speed, power, cost, and reliability of candidates to constraints set by the M2 user;An M2 design is based on actual program measurements and is documented so its derivation can be publicly considered. It is generated in less time and with fewer errors than manual methods;The M2 project is an extension of work being performed at Stanford University on a workbench for computer architects and of work being performed at the University of Southwestern Louisiana on plausibility-driven design
- …