Search CORE

18 research outputs found

TERPS: The Embedded Reliable Processing System

Author: Gole Amol Vishwas
Publication venue
Publication date: 22/12/2003
Field of study

Electromagnetic Interference (EMI) can have an adverse effect on commercial electronics. As feature sizes of integrated circuits become smaller, their susceptibility to EMI increases. In light of this, integrated circuits will face substantial problems in the future either from electromagnetic disturbances or intentionally generated EMI from a malicious source. The Embedded Reliable Processing System (TERPS) is a fault tolerant system architecture which can significantly reduce the threat of EMI in computer systems. TERPS employs a checkpoint and rollback recovery mechanism tied with a multi-phase commit protocol and 3D IC technology. This enables it to recover from substantial EMI without having to shutdown or reboot. In the face of such EMI, only a loss in performance dictated by the strength and duration of the interference and the frequency of checkpointing will be seen. Various conditions in which chips can fail under the influence of EMI are described. The checkpoint and rollback recovery mechanism and the resulting TERPS architecture is stipulated. A thorough evaluation of the design correctness is provided. The technique is implemented in Verilog HDL using a 16-bit, 5-stage pipelined processor to show proof of concept. The performance overhead is calculated for different checkpointing intervals and is shown to be very reasonable (5-6% for checkpointing every 128 CPU cycles)

Digital Repository at the University of Maryland

Dynamic instruction scheduling and data forwarding in asynchronous superscalar processors

Author: Mullins Robert D.
Publication venue: The University of Edinburgh
Publication date: 01/01/2001
Field of study

Edinburgh Research Archive

Floating Point Arithmetic for Transport Triggered Architectures

Author: Viitanen Timo Tapani
Publication venue
Publication date: 05/12/2012
Field of study

Laskentajärjestelmiin kohdistuu usein suorituskyky- ja virrankulutusvaatimuksia, joita ei pystytä saavuttamaan yleiskäyttöisellä prosessorilla. Toistaalta laitteistokiihdyttimien suunnittelu voi vaatia kohtuuttoman paljon työaikaa. Ongelmaa voidaan lähestyä käyttämällä sovellusta varten räätälöityä sovelluskohtaista käskykantaprosessoria (Application-Specific Instruction set Processor, ASIP), joka on kuitenkin ohjelmoitava. Prosessorin räätälöinnin täytyy olla pitkälle automatisoitua säästääkseen kustannuksia. TTA-based Codesign Environment (TCE) on siirtoliipaistuun prosessoriarkkitehtuuriin (Transport Triggered Architecture, TTA) perustuva ASIP-kehitysympäristö. TTA on arkkitehtuurina helposti räätälöitävä ja joustaa pienistä ytimistä suuritehoisiin pitkän käskysanan suorittimiin. Useat tieteellisen laskennan ja signaalinkäsittelyn sovellukset, joissa TTA:n skaalautuvuudesta ja käskytason rinnakkaisuudesta olisi erityistä hyötyä, vaativat tuen laitteistokiihdytetylle liukulukulaskennalle. Tässä diplomityössä suunniteltiin ja toteutettiin TCE-projektia varten sarja liukulukuyksiköitä. Yksiköiden suunnittelussa pyrittiin alustariippumattomuuteen sekä korkeaan suorituskykyyn Field Programmable Gate Array alustoilla (FPGA) jopa tinkimällä tuetusta liukulukustandardista. Yksiköt sisältävät työkalut puolen tarkkuuden liukulukulaskentaan. Lisäksi työssä esitetään erikoiskäskyihin perustuvat nopeat algoritmit liukulukujakolaskun ja -neliöjuuren laskentaan. Yksiköiden toiminta varmistettiin automaattisella rekisterisiirtotason (Register Transfer Level, RTL) testipenkillä. Vertailussa Altera Stratix-II-FPGA:lla yksiköt pääsivät lähelle Alteran omien liukulukuyksiköiden suorituskykyä. Uudemmalla Xilinx Virtex-6-FPGA:lla korkein mahdollinen suorituskyky vaatisi tiheämpää liukuhihnoitusta

Trepo - Institutional Repository of Tampere University

Introduction to parallel computer architectures

Author: Thornton Phyllis Johnson
Publication venue
Publication date: 01/07/1988
Field of study

SHAREOK repository

In-Line Interrupt Handling and Lock-Up Free Translation Lookaside Buffers (TLBs)

Author: Jacob Bruce
Jaleel Aamer
Publication venue
Publication date: 01/05/2006
Field of study

The effects of the general-purpose precise interrupt mechanisms in use for the past few decades have received very little attention. When modern out-of-order processors handle interrupts precisely, they typically begin by flushing the pipeline to make the CPU available to execute handler instructions. In doing so, the CPU ends up flushing many instructions that have been brought in to the reorder buffer. In particular, these instructions may have reached a very deep stage in the pipeline—representing significant work that is wasted. In addition, an overhead of several cycles and wastage of energy (per exception detected) can be expected in refetching and reexecuting the instructions flushed. This paper concentrates on improving the performance of precisely handling software managed translation look-aside buffer (TLB) interrupts, one of the most frequently occurring interrupts. The paper presents a novel method of in-lining the interrupt handler within the reorder buffer. Since the first level interrupt-handlers of TLBs are usually small, they could potentially fit in the reorder buffer along with the user-level code already there. In doing so, the instructions that would otherwise be flushed from the pipe need not be refetched and reexecuted. Additionally, it allows for instructions independent of the exceptional instruction to continue to execute in parallel with the handler code. By in-lining the TLB interrupt handler, this provides lock-up free TLBs. This paper proposes the prepend and append schemes of in-lining the interrupt handler into the available reorder buffer space. The two schemes are implemented on a performance model of the Alpha 21264 processor built by Alpha designers at the Palo Alto Design Center (PADC), California. We compare the overhead and performance impact of handling TLB interrupts by the traditional scheme, the append in-lined scheme, and the prepend in-lined scheme. For small, medium, and large memory footprints, the overhead is quantified by comparing the number and pipeline state of instructions flushed, the energy savings, and the performance improvements. We find that lock-up free TLBs reduce the overhead of refetching and reexecuting the instructions flushed by 30-95 percent, reduce the execution time by 5-25 percent, and also reduce the energy wasted by 30-90 percent

Digital Repository at the University of Maryland

VLSI design of a twin register file for reducing the effects of conditional branches in a pipelined architecture

Author: Shah Manish
Publication venue: 'Oklahoma State University Library'
Publication date: 01/12/1992
Field of study

Pipelining is a major organizational technique which has been used by computer engineers to enhance the performance of computers. Pipelining improves the performance of computer systems by exploiting the instruction level parallelism of a program. In a pipelined processor the execution of instructions is overlapped, and each instruction is executed in a different stage of the pipeline. Most pipelined architectures are based on a sequential model of program execution in which a program counter sequences through instructions one by one.A fundamental disadvantage of pipelined processing is the loss incurred due to conditional branches. When a conditional branch instruction is encountered, more than one possible paths are following the instruction. The correct path can be known only upon the completion of the conditional branch instruction. The execution of the next instruction following a conditional branch cannot be started until the conditional branch instruction is resolved, resulting in stalling of the pipeline. One approach to avoid stalling is to predict the path to be executed and continue the execution of instructions along the predicted path. But in this case an incorrect prediction results in the execution of incorrect instructions. Hence . the results of these incorrect instructions have to be purged. Also, the instructions in the various stages of the pipeline must be removed and the pipeline has to start fetching instructions from the correct path. Thus incorrect prediction involves a flushing of the pipeline. This thesis proposes a twin processor architecture for reducing the effects of conditional branches. In such an architecture, both the paths following a conditional branch are executed simultaneously on two processors. When the conditional branch is resolved, the results of the incorrect path are discarded. Such an architecture requires a special purpose twin register file. It is the purpose of this thesis to design a twin register file consisting of two register files which can be independently accessed by the two processors. Each of the register files also has the capability of being copied into the other, making the design of the twin register file a complicated issue. The special pwpose twin register file is designed using layout tools Lager and Magic. The twin register file consists of two three-port register files which are capable of executing the 'read', 'write' and 'transfer' operations. The transfer of data from one register f.tle to another is accomplished in a single phase of the cl<X!k. The functionality of a 32-word-by-16-bit twin register file is verified by simulating it on IRSIM. The timing requirements for the read, write and transfer operations are detennined by simulating the twin register file on SPICE.Electrical Engineerin

SHAREOK repository

Reducing exception management overhead with software restart markers

Author: Hampton Mark Jerome, 1977-
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2008
Field of study

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008.Includes bibliographical references (p. 181-196).Modern processors rely on exception handling mechanisms to detect errors and to implement various features such as virtual memory. However, these mechanisms are typically hardware-intensive because of the need to buffer partially-completed instructions to implement precise exceptions and enforce in-order instruction commit, often leading to issues with performance and energy efficiency. The situation is exacerbated in highly parallel machines with large quantities of programmer-visible state, such as VLIW or vector processors. As architects increasingly rely on parallel architectures to achieve higher performance, the problem of exception handling is becoming critical. In this thesis, I present software restart markers as the foundation of an exception handling mechanism for explicitly parallel architectures. With this model, the compiler is responsible for delimiting regions of idempotent code. If an exception occurs, the operating system will resume execution from the beginning of the region. One advantage of this approach is that instruction results can be committed to architectural state in any order within a region, eliminating the need to buffer those values. Enabling out-of-order commit can substantially reduce the exception management overhead found in precise exception implementations, and enable the use of new architectural features that might be prohibitively costly with conventional precise exception implementations. Additionally, software restart markers can be used to reduce context switch overhead in a multiprogrammed environment. This thesis demonstrates the applicability of software restart markers to vector, VLIW, and multithreaded architectures. It also contains an implementation of this exception handling approach that uses the Trimaran compiler infrastructure to target the Scale vectorthread architecture. I show that using software restart markers incurs very little performance overhead for vector-style execution on Scale.(cont.) Finally, I describe the Scale compiler flow developed as part of this work and discuss how it targets certain features facilitated by the use of software restart markersby Mark Jerome Hampton.Ph.D

DSpace@MIT

Recommended from our members

Approaches to Distributed UNIX Systems

Author: Smith Jonathan M.
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/1986
Field of study

This paper examines several approaches to developing a distributed version of the UNIX Operating system. Relevant UNIX concepts are introduced and brief overviews of a number of distributed UNIX implementations are provided. The major issues discussed are concurrency and file system architecture: 1. The multiprocessor designs examined have the common problem of implementing critical sections in the kernel; several approaches are studied. 2. The me system architectures studied reflect differing views as to how the UNIX name space should be implemented in a distributed environment, and what the names will refer to; the issue of transparency is central to both, and is examined in some detail

Columbia University Academic Commons

Models, Design Methods and Tools for Improved Partial Dynamic Reconﬁguration

Author: Rullmann Markus
Publication venue
Publication date: 02/09/2006
Field of study

Partial dynamic reconﬁguration of FPGAs has attracted high attention from both academia and industry in recent years. With this technique, the functionality of the programmable devices can be adapted at runtime to changing requirements. The approach allows designers to use FPGAs more efﬁciently: E. g. FPGA resources can be time-shared between different functions and the functions itself can be adapted to changing workloads at runtime. Thus partial dynamic reconﬁguration enables a unique combination of software-like ﬂexibility and hardware-like performance. Still there exists no common understanding on how to assess the overhead introduced by partial dynamic reconﬁguration. This dissertation presents a new cost model for both the runtime and the memory overhead that results from partial dynamic reconﬁguration. It is shown how the model can be incorporated into all stages of the design optimization for reconﬁgurable hardware. In particular digital circuits can be mapped onto FPGAs such that only small fractions of the hardware must be reconﬁgured at runtime, which saves time, memory, and energy. The design optimization is most efﬁcient if it is applied during high level synthesis. This book describes how the cost model has been integrated into a new high level synthesis tool. The tool allows the designer to trade-off FPGA resource use versus reconﬁguration overhead. It is shown that partial reconﬁguration causes only small overhead if the design is optimized with regard to reconﬁguration cost. A wide range of experimental results is provided that demonstrates the beneﬁts of the applied method.:1 Introduction 1 1.1 Reconfigurable Computing . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.1.1 Reconfigurable System on a Chip (RSOC) . . . . . . . . . . . . 4 1.1.2 Anatomy of an Application . . . . . . . . . . . . . . . . . . . . . . 6 1.1.3 RSOC Design Characteristics and Trade-offs . . . . . . . . . . . 7 1.2 Classification of Reconfigurable Architectures . . . . . . . . . . . . . . . 10 1.2.1 Partial Reconfiguration . . . . . . . . . . . . . . . . . . . . . . . . 10 1.2.2 Runtime Reconfiguration (RTR) . . . . . . . . . . . . . . . . . . . 10 1.2.3 Multi-Context Configuration . . . . . . . . . . . . . . . . . . . . . 11 1.2.4 Fine-Grain Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.2.5 Coarse-Grain Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.3 Reconfigurable Computing Specific Design Issues . . . . . . . . . . . . 12 1.4 Overview of this Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . 14 2 Reconfigurable Computing Systems – Background 17 2.1 Examples for RSOCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2 Partially Reconfigurable FPGAs: Xilinx Virtex Device Family . . . . . . 20 2.2.1 Virtex-II/Virtex-II Pro Logic Architecture . . . . . . . . . . . . . 20 2.2.2 Reconfiguration Architecture and Reconfiguration Control . . 21 2.3 Methods for Design Entry . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.3.1 Behavioural Design Entry . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.2 Design Entry at Register-Transfer Level (RTL) . . . . . . . . . . 25 2.3.3 Xilinx Early Access Partial Reconfiguration Design Flow . . . . 26 2.4 Task Management in Reconfigurable Computing . . . . . . . . . . . . . 27 2.4.1 Online and Offline Task Management . . . . . . . . . . . . . . . 28 2.4.2 Task Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.4.3 Task Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.4.4 Reconfiguration Runtime Overhead . . . . . . . . . . . . . . . . 31 2.5 Configuration Data Compression . . . . . . . . . . . . . . . . . . . . . . . 32 2.6 Evaluation of Reconfigurable Systems . . . . . . . . . . . . . . . . . . . . 35 2.6.1 Energy Efficiency Models . . . . . . . . . . . . . . . . . . . . . . . 35 2.6.2 Area Efficiency Models . . . . . . . . . . . . . . . . . . . . . . . . 37 2.6.3 Runtime Efficiency Models . . . . . . . . . . . . . . . . . . . . . . 37 2.7 Similarity Based Reduction of Reconfiguration Overhead . . . . . . . . 38 2.7.1 Configuration Data Generation Methods . . . . . . . . . . . . . 39 2.7.2 Device Mapping Methods . . . . . . . . . . . . . . . . . . . . . . . 40 2.7.3 Circuit Design Methods . . . . . . . . . . . . . . . . . . . . . . . . 41 2.7.4 Model for Partial Configuration . . . . . . . . . . . . . . . . . . . 44 2.8 Contributions of this Work . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3 Runtime Reconfiguration Cost and Optimization Methods 47 3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.2 Reconfiguration State Graph . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.2.1 Reconfiguration Time Overhead . . . . . . . . . . . . . . . . . . 52 3.2.2 Dynamic Configuration Data Overhead . . . . . . . . . . . . . . 52 3.3 Configuration Cost at Bitstream Level . . . . . . . . . . . . . . . . . . . . 54 3.4 Configuration Cost at Structural Level . . . . . . . . . . . . . . . . . . . 56 3.4.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4.2 Virtual Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.4.3 Reconfiguration Costs in the VA Context . . . . . . . . . . . . . 65 3.5 Allocation Functions with Minimal Reconfiguration Costs . . . . . . . 67 3.5.1 Allocation of Node Pairs . . . . . . . . . . . . . . . . . . . . . . . 68 3.5.2 Direct Allocation of Nodes . . . . . . . . . . . . . . . . . . . . . . 76 3.5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 4 Implementation Tools for Reconfigurable Computing 95 4.1 Mapping of Netlists to FPGA Resources . . . . . . . . . . . . . . . . . . . 96 4.1.1 Mapping to Device Resources . . . . . . . . . . . . . . . . . . . . 96 4.1.2 Connectivity Transformations . . . . . . . . . . . . . . . . . . . . 99 4.1.3 Mapping Variants and Reconfiguration Costs . . . . . . . . . . . 100 4.1.4 Mapping of Circuit Macros . . . . . . . . . . . . . . . . . . . . . . 101 4.1.5 Global Interconnect . . . . . . . . . . . . . . . . . . . . . . . . . . 102 4.1.6 Netlist Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 4.2 Mapping Aware Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . 103 4.2.1 Generalized Node Mapping . . . . . . . . . . . . . . . . . . . . . 104 4.2.2 Successive Node Allocation . . . . . . . . . . . . . . . . . . . . . 105 4.2.3 Node Allocation with Ant Colony Optimization . . . . . . . . . 107 4.2.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 4.3 Netlist Mapping with Minimized Reconfiguration Cost . . . . . . . . . 110 4.3.1 Mapping Database . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 4.3.2 Mapping and Packing of Elements into Logic Blocks . . . . . . 112 4.3.3 Logic Element Selection . . . . . . . . . . . . . . . . . . . . . . . 114 4.3.4 Logic Element Selection for Min. Routing Reconfiguration . . 115 4.3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 5 High-Level Synthesis for Reconfigurable Computing 125 5.1 Introduction to HLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 5.1.1 HLS Tool Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 5.1.2 Realization of the Hardware Tasks . . . . . . . . . . . . . . . . . 128 5.2 New Concepts for Task-based Reconfiguration . . . . . . . . . . . . . . 131 5.2.1 Multiple Hardware Tasks in one Reconfigurable Module . . . . 132 5.2.2 Multi-Level Reconfiguration . . . . . . . . . . . . . . . . . . . . . 133 5.2.3 Resource Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 5.3 Datapath Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 5.3.1 Task Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 5.3.2 Resource Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 5.3.3 Resource Binding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 5.3.4 Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 5.3.5 Constraints for Scheduling and Resource Binding . . . . . . . . 151 5.4 Reconfiguration Optimized Datapath Implementation . . . . . . . . . . 153 5.4.1 Effects of Scheduling and Binding on Reconfiguration Costs . 153 5.4.2 Strategies for Resource Type Binding . . . . . . . . . . . . . . . 154 5.4.3 Strategies for Resource Instance Binding . . . . . . . . . . . . . 157 5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 5.5.1 Summary of Binding Methods and Tool Setup . . . . . . . . . . 163 5.5.2 Cost Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 5.5.3 Implementation Scenarios . . . . . . . . . . . . . . . . . . . . . . 166 5.5.4 Benchmark Characteristics . . . . . . . . . . . . . . . . . . . . . . 168 5.5.5 Benchmark Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 5.5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 6 Summary and Outlook 185 Bibliography 189 A Simulated Annealing 201Partielle dynamische Rekonfiguration von FPGAs hat in den letzten Jahren große Aufmerksamkeit von Wissenschaft und Industrie auf sich gezogen. Die Technik erlaubt es, die Funktionalität von progammierbaren Bausteinen zur Laufzeit an veränderte Anforderungen anzupassen. Dynamische Rekonfiguration erlaubt es Entwicklern, FPGAs effizienter einzusetzen: z.B. können Ressourcen für verschiedene Funktionen wiederverwendet werden und die Funktionen selbst können zur Laufzeit an veränderte Verarbeitungsschritte angepasst werden. Insgesamt erlaubt partielle dynamische Rekonfiguration eine einzigartige Kombination von software-artiger Flexibilität und hardware-artiger Leistungsfähigkeit. Bis heute gibt es keine Übereinkunft darüber, wie der zusätzliche Aufwand, der durch partielle dynamische Rekonfiguration verursacht wird, zu bewerten ist. Diese Dissertation führt ein neues Kostenmodell für Laufzeit und Speicherbedarf ein, welche durch partielle dynamische Rekonfiguration verursacht wird. Es wird aufgezeigt, wie das Modell in alle Ebenen der Entwurfsoptimierung für rekonfigurierbare Hardware einbezogen werden kann. Insbesondere wird gezeigt, wie digitale Schaltungen derart auf FPGAs abgebildet werden können, sodass nur wenig Ressourcen der Hardware zur Laufzeit rekonfiguriert werden müssen. Dadurch kann Zeit, Speicher und Energie eingespart werden. Die Entwurfsoptimierung ist am effektivsten, wenn sie auf der Ebene der High-Level-Synthese angewendet wird. Diese Arbeit beschreibt, wie das Kostenmodell in ein neuartiges Werkzeug für die High-Level-Synthese integriert wurde. Das Werkzeug erlaubt es, beim Entwurf die Nutzung von FPGA-Ressourcen gegen den Rekonfigurationsaufwand abzuwägen. Es wird gezeigt, dass partielle Rekonfiguration nur wenig Kosten verursacht, wenn der Entwurf bezüglich Rekonfigurationskosten optimiert wird. Eine Anzahl von Beispielen und experimentellen Ergebnissen belegt die Vorteile der angewendeten Methodik.:1 Introduction 1 1.1 Reconfigurable Computing . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.1.1 Reconfigurable System on a Chip (RSOC) . . . . . . . . . . . . 4 1.1.2 Anatomy of an Application . . . . . . . . . . . . . . . . . . . . . . 6 1.1.3 RSOC Design Characteristics and Trade-offs . . . . . . . . . . . 7 1.2 Classification of Reconfigurable Architectures . . . . . . . . . . . . . . . 10 1.2.1 Partial Reconfiguration . . . . . . . . . . . . . . . . . . . . . . . . 10 1.2.2 Runtime Reconfiguration (RTR) . . . . . . . . . . . . . . . . . . . 10 1.2.3 Multi-Context Configuration . . . . . . . . . . . . . . . . . . . . . 11 1.2.4 Fine-Grain Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.2.5 Coarse-Grain Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.3 Reconfigurable Computing Specific Design Issues . . . . . . . . . . . . 12 1.4 Overview of this Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . 14 2 Reconfigurable Computing Systems – Background 17 2.1 Examples for RSOCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2 Partially Reconfigurable FPGAs: Xilinx Virtex Device Family . . . . . . 20 2.2.1 Virtex-II/Virtex-II Pro Logic Architecture . . . . . . . . . . . . . 20 2.2.2 Reconfiguration Architecture and Reconfiguration Control . . 21 2.3 Methods for Design Entry . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.3.1 Behavioural Design Entry . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.2 Design Entry at Register-Transfer Level (RTL) . . . . . . . . . . 25 2.3.3 Xilinx Early Access Partial Reconfiguration Design Flow . . . . 26 2.4 Task Management in Reconfigurable Computing . . . . . . . . . . . . . 27 2.4.1 Online and Offline Task Management . . . . . . . . . . . . . . . 28 2.4.2 Task Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.4.3 Task Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.4.4 Reconfiguration Runtime Overhead . . . . . . . . . . . . . . . . 31 2.5 Configuration Data Compression . . . . . . . . . . . . . . . . . . . . . . . 32 2.6 Evaluation of Reconfigurable Systems . . . . . . . . . . . . . . . . . . . . 35 2.6.1 Energy Efficiency Models . . . . . . . . . . . . . . . . . . . . . . . 35 2.6.2 Area Efficiency Models . . . . . . . . . . . . . . . . . . . . . . . . 37 2.6.3 Runtime Efficiency Models . . . . . . . . . . . . . . . . . . . . . . 37 2.7 Similarity Based Reduction of Reconfiguration Overhead . . . . . . . . 38 2.7.1 Configuration Data Generation Methods . . . . . . . . . . . . . 39 2.7.2 Device Mapping Methods . . . . . . . . . . . . . . . . . . . . . . . 40 2.7.3 Circuit Design Methods . . . . . . . . . . . . . . . . . . . . . . . . 41 2.7.4 Model for Partial Configuration . . . . . . . . . . . . . . . . . . . 44 2.8 Contributions of this Work . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3 Runtime Reconfiguration Cost and Optimization Methods 47 3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.2 Reconfiguration State Graph . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.2.1 Reconfiguration Time Overhead . . . . . . . . . . . . . . . . . . 52 3.2.2 Dynamic Configuration Data Overhead . . . . . . . . . . . . . . 52 3.3 Configuration Cost at Bitstream Level . . . . . . . . . . . . . . . . . . . . 54 3.4 Configuration Cost at Structural Level . . . . . . . . . . . . . . . . . . . 56 3.4.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4.2 Virtual Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.4.3 Reconfiguration Costs in the VA Context . . . . . . . . . . . . . 65 3.5 Allocation Functions with Minimal Reconfiguration Costs . . . . . . . 67 3.5.1 Allocation of Node Pairs . . . . . . . . . . . . . . . . . . . . . . . 68 3.5.2 Direct Allocation of Nodes . . . . . . . . . . . . . . . . . . . . . . 76 3.5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 4 Implementation Tools for Reconfigurable Computing 95 4.1 Mapping of Netlists to FPGA Resources . . . . . . . . . . . . . . . . . . . 96 4.1.1 Mapping to Device Resources . . . . . . . . . . . . . . . . . . . . 96 4.1.2 Connectivity Transformations . . . . . . . . . . . . . . . . . . . . 99 4.1.3 Mapping Variants and Reconfiguration Costs . . . . . . . . . . . 100 4.1.4 Mapping of Circuit Macros . . . . . . . . . . . . . . . . . . . . . . 101 4.1.5 Global Interconnect . . . . . . . . . . . . . . . . . . . . . . . . . . 102 4.1.6 Netlist Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 4.2 Mapping Aware Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . 103 4.2.1 Generalized Node Mapping . . . . . . . . . . . . . . . . . . . . . 104 4.2.2 Successive Node Allocation . . . . . . . . . . . . . . . . . . . . . 105 4.2.3 Node Allocation with Ant Colony Optimization . . . . . . . . . 107 4.2.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 4.3 Netlist Mapping with Minimized Reconfiguration Cost . . . . . . . . . 110 4.3.1 Mapping Database . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 4.3.2 Mapping and Packing of Elements into Logic Blocks . . . . . . 112 4.3.3 Logic Element Selection . . . . . . . . . . . . . . . . . . . . . . . 114 4.3.4 Logic Element Selection for Min. Routing Reconfiguration . . 115 4.3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 5 High-Level Synthesis for Reconfigurable Computing 125 5.1 Introduction to HLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 5.1.1 HLS Tool Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 5.1.2 Realization of the Hardware Tasks . . . . . . . . . . . . . . . . . 128 5.2 New Concepts for Task-based Reconfiguration . . . . . . . . . . . . . . 131 5.2.1 Multiple Hardware Tasks in one Reconfigurable Module . . . . 132 5.2.2 Multi-Level Reconfiguration . . . . . . . . . . . . . . . . . . . . . 133 5.2.3 Resource Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 5.3 Datapath Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 5.3.1 Task Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 5.3.2 Resource Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 5.3.3 Resource Binding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 5.3.4 Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 5.3.5 Constraints for Scheduling and Resource Binding . . . . . . . . 151 5.4 Reconfiguration Optimized Datapath Implementation . . . . . . . . . . 153 5.4.1 Effects of Scheduling and Binding on Reconfiguration Costs . 153 5.4.2 Strategies for Resource Type Binding . . . . . . . . . . . . . . . 154 5.4.3 Strategies for Resource Instance Binding . . . . . . . . . . . . . 157 5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 5.5.1 Summary of Binding Methods and Tool Setup . . . . . . . . . . 163 5.5.2 Cost Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 5.5.3 Implementation Scenarios . . . . . . . . . . . . . . . . . . . . . . 166 5.5.4 Benchmark Characteristics . . . . . . . . . . . . . . . . . . . . . . 168 5.5.5 Benchmark Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 5.5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 6 Summary and Outlook 185 Bibliography 189 A Simulated Annealing 20

Digital Commons

Technische Universität Dresden: Qucosa

Models, Design Methods and Tools for Improved Partial Dynamic Reconﬁguration

Author: Rullmann Markus
Publication venue
Publication date: 26/02/2010
Field of study

Technische Universität Dresden: Qucosa