16 research outputs found

    Fast evaluation methodology for automatic custom hardware prototyping

    Hardware customization for scientific applications has shown great potential for reducing power consumption and increasing performance. In particular, the automatic generation of ISA extensions for General-Purpose Processors (GPPs) to accelerate domain-specific applications is an active field of research. Such domain-specific accelerated processors are mostly evaluated in simulation environments, due to the technical and programmability issues of using real hardware. There is no automatic mechanism to test these custom units in a real hardware environment. In this paper we present a toolchain that can automatically identify candidate parts of the code suitable for reconfigurable hardware acceleration. We validate our toolchain using ClustalW.

    Preliminary work on a mechanism for testing a customized architecture

    Hardware customization for scientific applications has shown great potential for reducing power consumption and increasing performance. In particular, the automatic generation of ISA extensions for General-Purpose Processors (GPPs) to accelerate domain-specific applications is an active field of research. Such domain-specific customized processors are mostly evaluated in simulation environments, due to the technical and programmability issues of using real hardware. There is no automatic mechanism to test ISA extensions in a real hardware environment. In this paper we present a toolchain that can automatically identify candidate parts of the code suitable for acceleration and test them in reconfigurable hardware. We validate our toolchain using a bioinformatics application, ClustalW, obtaining an overall speed-up of over 2x.
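    The overall speed-up reported above can be related to the accelerated fraction of the runtime via Amdahl's law. A minimal sketch (the fraction and kernel speed-up values below are illustrative assumptions, not figures from the paper):

    ```python
    def amdahl_speedup(accelerated_fraction, kernel_speedup):
        """Overall speed-up when only a fraction of the runtime is accelerated."""
        return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup)

    # Illustrative values: accelerating 70% of the runtime by 10x yields
    # roughly a 2.7x overall speed-up, consistent in spirit with the >2x
    # reported for ClustalW.
    print(round(amdahl_speedup(0.7, 10.0), 2))  # → 2.7
    ```

    The formula makes clear why identifying large candidate regions matters: the unaccelerated remainder bounds the achievable overall gain.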

    The hArtes Tool Chain

    This chapter describes the different design steps needed to go from legacy code to a transformed application that can be efficiently mapped on the hArtes platform.

    Rapid Prototyping and Exploration Environment for Generating C-to-Hardware-Compilers

    There is today an ever-increasing demand for more computational power, coupled with a desire to minimize energy requirements. Hardware accelerators currently appear to be the best solution to this problem. While general-purpose computation on GPUs seems to be very successful in this area, GPUs perform adequately only in those cases where the data access patterns and algorithms fit the underlying architecture. ASICs, on the other hand, can yield even better results in terms of performance and energy consumption, but are very inflexible, as they are manufactured with application-specific circuitry. Field-Programmable Gate Arrays (FPGAs) represent a combination of these approaches: with their application-specific hardware they provide high computational power while requiring, for many applications, less energy than a CPU or a GPU. At the same time, they are far more flexible than an ASIC due to their reconfigurability. The remaining problem is programming the FPGAs, as they are far more difficult to program than regular software. To allow common software developers, who have at best very limited knowledge of hardware design, to make use of these devices, tools were developed that take a regular high-level language and generate hardware from it. Among such tools, C-to-HDL compilers are a particularly widespread approach. These compilers attempt to translate common C code into a hardware description language from which a datapath is generated. Most of these compilers place many restrictions on their input and differ in their underlying generated microarchitecture, their scheduling method, their applied optimizations, their execution model, and even their target hardware. Thus, comparing a single aspect, such as the implemented scheduling method or the generated microarchitecture, is almost impossible, as the compilers differ in so many other respects.
This work provides a survey of the existing C-to-HDL compilers and presents a new approach to evaluating and exploring different microarchitectures for dynamic scheduling used by such compilers. From a mathematically formulated rule set, the Triad compiler generates a backend for the Scale compiler framework, which then implements a hardware-generation backend with the described dynamic scheduling. While more than a factor of four slower than hardware from highly optimized compilers, this environment allows easy comparison and exploration of different rule sets and of the microarchitecture for the dynamically scheduled datapaths generated from them. For demonstration purposes, a rule set modeling the COCOMA token flow model from the COMRADE 2.0 compiler was implemented. Multiple variants of it were explored: savings of up to 11% of the required hardware resources were possible.
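The core idea behind token-flow-style dynamic scheduling is that an operation fires as soon as tokens are present on all of its inputs, rather than at a statically fixed cycle. A highly simplified software analogue (this is a hypothetical sketch of the general token-firing principle, not the actual COCOMA model or its rule set):

```python
from collections import deque

def run_dataflow(ops, initial_tokens):
    """Fire operations whenever tokens are available on all inputs.

    ops: {op_name: (input_edge_names, function)}; each op writes its
    result as a token on the edge named after the op itself.
    initial_tokens: {edge_name: list of initial token values}.
    """
    tokens = {edge: deque(vals) for edge, vals in initial_tokens.items()}
    fired = []
    progress = True
    while progress:
        progress = False
        for name, (inputs, fn) in ops.items():
            if all(tokens.get(i) for i in inputs):  # tokens on every input?
                args = [tokens[i].popleft() for i in inputs]
                tokens.setdefault(name, deque()).append(fn(*args))
                fired.append(name)
                progress = True
    return fired, tokens

# Datapath for a*b + c, with operands arriving as tokens.
ops = {
    "mul": (["a", "b"], lambda x, y: x * y),
    "add": (["mul", "c"], lambda x, y: x + y),
}
fired, tokens = run_dataflow(ops, {"a": [3], "b": [4], "c": [5]})
print(fired, list(tokens["add"]))  # firing order and final result
```

In hardware, this availability-driven firing is what lets a dynamically scheduled datapath tolerate variable latencies (e.g. memory accesses) that a static schedule would have to conservatively over-provision for.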

    Dynamic Scheduling in High-Level Language Compilation for Adaptive Computers

    The single-thread performance of conventional CPUs has not improved significantly due to the stagnation of CPU frequencies since 2003. Adaptive computers, which combine a CPU with a reconfigurable hardware unit used as a hardware accelerator, represent a promising alternative compute platform. During the past 10 years, much research has been done to develop tools that enhance the usability of adaptive computers. An important goal here is the development of an adaptive compiler, which compiles hardware descriptions from common high-level languages such as C in a fully automated way. Most of the compilers developed to date use static scheduling for the generated hardware. However, for complex programs containing nested loops, irregular control flow, and arbitrary pointers, dynamic scheduling is more appropriate. This work examines the feasibility of compiling to dynamically scheduled hardware, an approach that has been the subject of only limited research efforts so far. Based on previous work, we have developed the adaptive compiler COMRADE 2.0, which generates synthesizable hardware descriptions (using dynamic scheduling) from ANSI C. For this, the compiler utilizes our COMRADE Controller Micro-Architecture (COCOMA), an intermediate representation that models even complex control and memory dependences and is thus especially suitable in a compile flow that supports complex C programs. We examine the effects of parameter variations and low-level optimizations on the simulation and synthesis results. The most promising optimization technique with respect to runtime is memory localization, which can significantly increase the memory bandwidth available to the compiled hardware kernels at low latency. Using memory localization, we have obtained hardware-kernel speed-ups of up to 37x over an embedded superscalar CPU.
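    Memory localization, the most effective optimization named above, amounts to copying a working set into fast local storage before computing on it, so the datapath reads from on-chip memory instead of slow external DRAM. A simplified software analogue (the block size and access pattern here are illustrative assumptions):

    ```python
    def sum_of_products(a, b, block=4):
        """Blocked dot product: copy a block of each operand into local
        buffers (standing in for on-chip BRAM) before the inner loop, so
        the compute stage works entirely on "localized" data."""
        total = 0
        for start in range(0, len(a), block):
            local_a = a[start:start + block]  # localized copy of operand a
            local_b = b[start:start + block]  # localized copy of operand b
            for x, y in zip(local_a, local_b):
                total += x * y
        return total

    print(sum_of_products([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))  # → 35
    ```

    The hardware benefit comes from the bulk transfer into the local buffer being far cheaper per element than individual external-memory accesses from inside the kernel.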

    Exploration of communication strategies for computation intensive Systems-On-Chip


    Methods and Tools for Using Reconfigurable Accelerators in Multicore Systems

    Computing systems with multicore processors are frequently extended with a reconfigurable accelerator such as an FPGA. Offloading parts of an application into hardware is usually done by specialists. So that users can program reconfigurable hardware themselves, my contribution is component-based programming and usage with automatic handling of data locality. In this way, the accelerators can provide benefits even for data-intensive applications.

    Real-Time Adaptive Pulse Compression on Reconfigurable System-on-Chip (SoC) Platforms

    New radar applications need to perform complex algorithms and process large quantities of data to generate useful information for users. This situation has motivated the search for better processing solutions, including low-power high-performance processors, efficient algorithms, and high-speed interfaces. In this work, a hardware implementation of adaptive pulse compression algorithms for real-time transceiver optimization is presented, based on a System-on-Chip architecture for reconfigurable hardware devices. This study also evaluates the performance of dedicated coprocessors as hardware accelerator units to speed up and improve the computation of compute-intensive tasks such as matrix multiplication and matrix inversion, which are essential for solving the covariance matrix. The tradeoffs between latency and hardware utilization are also presented. Moreover, the system architecture takes advantage of the embedded processor, which is interconnected with the logic resources through high-performance buses, to perform floating-point operations, control the processing blocks, and communicate with an external PC through a customized software interface. The overall system functionality is demonstrated and tested for real-time operation using a Ku-band testbed together with a low-cost channel emulator for different types of waveforms.
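    The matrix operations named above typically serve one core step: solving a small linear system against the covariance matrix to obtain adaptive filter weights, i.e. finding w with R w = s. A minimal pure-Python sketch of that step (the matrix and vector values are illustrative, not from the paper):

    ```python
    def solve(R, s):
        """Solve R w = s by Gaussian elimination with partial pivoting --
        the linear-algebra core that a matrix-inversion coprocessor would
        accelerate in hardware."""
        n = len(R)
        A = [row[:] + [s[i]] for i, row in enumerate(R)]  # augmented matrix
        for col in range(n):
            # Pivot on the largest remaining entry for numerical stability.
            pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
            A[col], A[pivot] = A[pivot], A[col]
            for r in range(col + 1, n):
                f = A[r][col] / A[col][col]
                for c in range(col, n + 1):
                    A[r][c] -= f * A[col][c]
        # Back-substitution.
        w = [0.0] * n
        for r in range(n - 1, -1, -1):
            w[r] = (A[r][n] - sum(A[r][c] * w[c] for c in range(r + 1, n))) / A[r][r]
        return w

    # Illustrative 2x2 covariance matrix R and steering vector s.
    R = [[4.0, 1.0], [1.0, 3.0]]
    s = [1.0, 2.0]
    print(solve(R, s))  # filter weights w satisfying R w = s
    ```

    The latency/area tradeoff the abstract mentions shows up exactly here: the elimination has cubic operation count, so a hardware unit can trade more parallel multipliers for fewer sequential steps.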