130 research outputs found

    Coarse-grained reconfigurable array architectures

    Get PDF
    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code

    A unified modulo scheduling and register allocation technique for clustered processors

    Get PDF
    This work presents a modulo scheduling framework for clustered ILP processors that integrates the cluster assignment, instruction scheduling and register allocation steps in a single phase. This unified approach is more effective than traditional approaches based on sequentially performing some (or all) of the three steps, since it allows optimizing the global code generation problem instead of searching for optimal solutions to each individual step. Besides, it avoids the iterative nature of traditional approaches, which require repeated applications of the three steps until a valid solution is found. The proposed framework includes a mechanism to insert spill code on-the-fly and heuristics to evaluate the quality of partial schedules considering simultaneously inter-cluster communications, memory pressure and register pressure. Transformations that allow trading pressure on a type of resource for another resource are also included. We show that the proposed technique outperforms previously proposed techniques. For instance, the average speed-up for the SPECfp95 is 36% for a 4-cluster configuration.Peer ReviewedPostprint (published version

    Virtual cluster scheduling through the scheduling graph

    Get PDF
    This paper presents an instruction scheduling and cluster assignment approach for clustered processors. The proposed technique makes use of a novel representation named the scheduling graph which describes all possible schedules. A powerful deduction process is applied to this graph, reducing at each step the set of possible schedules. In contrast to traditional list scheduling techniques, the proposed scheme tries to establish relations among instructions rather than assigning each instruction to a particular cycle. The main advantage is that wrong or poor schedules can be anticipated and discarded earlier. In addition, cluster assignment of instructions is performed using another novel concept called virtual clusters, which define sets of instructions that must execute in the same cluster. These clusters are managed during the deduction process to identify incompatibilities among instructions. The mapping of virtual to physical clusters is postponed until the scheduling of the instructions has finalized. The advantages this novel approach features include: (1) accurate scheduling information when assigning, and, (2) accurate information of the cluster assignment constraints imposed by scheduling decisions. We have implemented and evaluated the proposed scheme with superblocks extracted from Speclnt95 and MediaBench. The results show that this approach produces better schedules than the previous state-of-the-art. Speed-ups are up to 15%, with average speed-ups ranging from 2.5% (2-Clusters) to 9.5% (4-Clusters).Peer ReviewedPostprint (published version

    AGAMOS: A graph-based approach to modulo scheduling for clustered microarchitectures

    Get PDF
    This paper presents AGAMOS, a technique to modulo schedule loops on clustered microarchitectures. The proposed scheme uses a multilevel graph partitioning strategy to distribute the workload among clusters and reduces the number of intercluster communications at the same time. Partitioning is guided by approximate schedules (i.e., pseudoschedules), which take into account all of the constraints that influence the final schedule. To further reduce the number of intercluster communications, heuristics for instruction replication are included. The proposed scheme is evaluated using the SPECfp95 programs. The described scheme outperforms a state-of-the-art scheduler for all programs and different cluster configurations. For some configurations, the speedup obtained when using this new scheme is greater than 40 percent, and for selected programs, performance can be more than doubled.Peer ReviewedPostprint (published version

    VThreads: A novel VLIW chip multiprocessor with hardware-assisted PThreads

    Get PDF
    We discuss VThreads, a novel VLIW CMP with hardware-assisted shared-memory Thread support. VThreads supports Instruction Level Parallelism via static multiple-issue and Thread Level Parallelism via hardware-assisted POSIX Threads along with extensive customization. It allows the instantiation of tightlycoupled streaming accelerators and supports up to 7-address Multiple-Input, Multiple-Output instruction extensions. VThreads is designed in technology-independent Register-Transfer-Level VHDL and prototyped on 40 nm and 28 nm Field-Programmable gate arrays. It was evaluated against a PThreads-based multiprocessor based on the Sparc-V8 ISA. On a 65 nm ASIC implementation VThreads achieves up to x7.2 performance increase on synthetic benchmarks, x5 on a parallel Mandelbrot implementation, 66% better on a threaded JPEG implementation, 79% better on an edge-detection benchmark and ~13% improvement on DES compared to the Leon3MP CMP. In the range of 2 to 8 cores VThreads demonstrates a post-route (statistical) power reduction between 65% to 57% at an area increase of 1.2%-10% for 1-8 cores, compared to a similarly-configured Leon3MP CMP. This combination of micro-architectural features, scalability, extensibility, hardware support for low-latency PThreads, power efficiency and area make the processor an attractive proposition for low-power, deeply-embedded applications requiring minimum OS support

    Doctor of Philosophy

    Get PDF
    dissertationThe embedded system space is characterized by a rapid evolution in the complexity and functionality of applications. In addition, the short time-to-market nature of the business motivates the use of programmable devices capable of meeting the conflicting constraints of low-energy, high-performance, and short design times. The keys to achieving these conflicting constraints are specialization and maximally extracting available application parallelism. General purpose processors are flexible but are either too power hungry or lack the necessary performance. Application-specific integrated circuits (ASICS) efficiently meet the performance and power needs but are inflexible. Programmable domain-specific architectures (DSAs) are an attractive middle ground, but their design requires significant time, resources, and expertise in a variety of specialties, which range from application algorithms to architecture and ultimately, circuit design. This dissertation presents CoGenE, a design framework that automates the design of energy-performance-optimal DSAs for embedded systems. For a given application domain and a user-chosen initial architectural specification, CoGenE consists of a a Compiler to generate execution binary, a simulator Generator to collect performance/energy statistics, and an Explorer that modifies the current architecture to improve energy-performance-area characteristics. The above process repeats automatically until the user-specified constraints are achieved. This removes or alleviates the time needed to understand the application, manually design the DSA, and generate object code for the DSA. Thus, CoGenE is a new design methodology that represents a significant improvement in performance, energy dissipation, design time, and resources. This dissertation employs the face recognition domain to showcase a flexible architectural design methodology that creates "ASIC-like" DSAs. The DSAs are instruction set architecture (ISA)-independent and achieve good energy-performance characteristics by coscheduling the often conflicting constraints of data access, data movement, and computation through a flexible interconnect. This represents a significant increase in programming complexity and code generation time. To address this problem, the CoGenE compiler employs integer linear programming (ILP)-based 'interconnect-aware' scheduling techniques for automatic code generation. The CoGenE explorer employs an iterative technique to search the complete design space and select a set of energy-performance-optimal candidates. When compared to manual designs, results demonstrate that CoGenE produces superior designs for three application domains: face recognition, speech recognition and wireless telephony. While CoGenE is well suited to applications that exhibit a streaming behavior, multithreaded applications like ray tracing present a different but important challenge. To demonstrate its generality, CoGenE is evaluated in designing a novel multicore N-wide SIMD architecture, known as StreamRay, for the ray tracing domain. CoGenE is used to synthesize the SIMD execution cores, the compiler that generates the application binary, and the interconnection subsystem. Further, separating address and data computations in space reduces data movement and contention for resources, thereby significantly improving performance compared to existing ray tracing approaches

    Constraint driven operation assignment for retargetable VLIW compilers

    Get PDF
    In veel consumenten elektronica producten worden processoren toegepast voor het bewerken van gedigitaliseerde signalen. Deze processoren zijn gewoonlijk ingebed in een systeem en moeten wat rekenkracht, vermogensverbruik en fabricage kosten aan stringente eisen voldoen. Door het optimaliseren van een processor voor een specifieke taak, of een kleine verzameling van taken, kan er aan strengere eisen worden voldaan. Deze specialisatie heeft een grotere diversiteit aan processor types tot gevolg. Door het toepassen van geautomatiseerde processor ontwerp en programmeer systemen wordt er getracht om de ontwikkelkosten in de hand te houden. Een processor kan onder andere geoptimaliseerd worden door het toepassen van een incompleet communicatie netwerk in de processor. Daarnaast is het wenselijk om meerdere register files toe te passen in een processor met een groot aantal parallelle bewerkingseenheden. Deze optimalisaties hebben tot gevolg dat er veel hulp en expertise van programmeur nodig is om hoogwaardige microcode te genereren met behulp van traditionele code generatie technieken in een compiler. Met de in dit proefschrift beschreven code generatie methode is het in veel gevallen wel mogelijk om hoogwaardige microcode volledig automatisch te genereren. Het toepassen van een incompleet netwerk in de processor maakt het toekennen van basis bewerkingen aan bewerkingseenheden een moeilijke taak voor de code generator. Een toekenning moet namelijk zo plaatsvinden dat voor iedere bewerking die uitgevoerd wordt op een bewerkingseenheid er een kanaal in het netwerk van de processor is, dat gebruikt kan worden om het resultaat naar de bewerkingseenheid toe te sturen die de resultaat consumerende bewerking uitvoerd. Dit communicatiekanaal en de bewerkingseenheid moeten tevens op het gewenste tijdstip beschikbaar zijn. In de voorgestelde code generatie methode wordt er gezocht naar een oplossing. Na het nemen van een bewerkings toekenningsbelissing wordt er geanalyseerd welke toekomstige beslissings opties niet tot een oplossing kunnen behoren gegeven de reeds gemaakte beslissingen. Deze gevallen worden verwijderd uit de zoekruimte zodat tijdens toekomstige beslissingen andere toekenningsbeslissingen zullen worden geprobeerd. Indien er gedetecteerd wordt dat er gegeven de gemaakt beslissingen geen oplossing bestaat, dan worden er beslissingen ongedaan gemaakt en andere opties geprobeerd. Het verwijderen van zoveel mogelijk beslissings opties die niet tot een oplossing behoren, verminderd het aantal keer dat er op een beslissing terug gekomen moet worden en de tijd die nodig is om een oplossing te vinden Voor het bewerking aan bewerkingseenheid toekenings probleem wordt er een conflict graaf opgesteld waarin alle opties en combinatie van niet toegestane opties gerepresenteerd worden. Gevallen die zeker niet tot een oplossing behoren worden gevonden met algoritmes die rekentijd effici¨ent zijn. Indien door analyse wordt vastgesteld dat twee bewerkingen op hetzelfde tijdstip uitgevoerd moeten worden dan wordt er een kant in de conflict graaf toegevoegd. Deze kant sluit uit dat deze beide bewerkingen aan dezelfde bewerkingseenheid wordt toegekend. Indien er wordt vast gesteld dat een bewerking op een specifieke bewerkingseenheid moet worden uitgevoerd dan wordt deze informatie gebruikt om nauwkeuriger het tijdsinterval te bepalen waarin de operatie uitgevoerd kan worden. De voorgestelde toekenningstechnieken zijn ge-implementeerd in een prototype codegenerator FACTS. Deze code generator is gekoppeld aan de processor synthese omgeving AjRT-designer. Door het koppelen van FACTS aan AjRT-designer kunnen processoren, die bevroren zijn na synthese, hergeprogrammeerd worden. Deze omgeving is gebruikt om de codegeneratie technieken in FACTS te evalueren voor industrieel relevante applicatie domein specifieke processor ontwerpen. De resultaten tonen aan dat er met deze technieken in veel gevallen microcode gegenereerd kan worden die de opslag capaciteit van de register files en de beschikbare verbindingen in de VLIW-processor respecteert en aan stringente eisen wat betreft de rekentijd voldoet

    Mr.Wolf: An Energy-Precision Scalable Parallel Ultra Low Power SoC for IoT Edge Processing

    Get PDF
    This paper presents Mr. Wolf, a parallel ultra-low power (PULP) system on chip (SoC) featuring a hierarchical architecture with a small (12 kgates) microcontroller (MCU) class RISC-V core augmented with an autonomous IO subsystem for efficient data transfer from a wide set of peripherals. The small core can offload compute-intensive kernels to an eight-core floating-point capable of processing engine available on demand. The proposed SoC, implemented in a 40-nm LP CMOS technology, features a 108-mu W fully retentive memory (512 kB). The IO subsystem is capable of transferring up to 1.6 Gbit/s from external devices to the memory in less than 2.5 mW. The eight-core compute cluster achieves a peak performance of 850 million of 32-bit integer multiply and accumulate per second (MMAC/s) and 500 million of 32-bit floating-point multiply and accumulate per second (MFMAC/s) -1 GFlop/s-with an energy efficiency up to 15 MMAC/s/mW and 9 MFMAC/s/mW. These building blocks are supported by aggressive on-chip power conversion and management, enabling energy-proportional heterogeneous computing for always-on IoT end nodes improving performance by several orders of magnitude with respect to traditional single-core MCUs within a power envelope of 153 mW. We demonstrated the capabilities of the proposed SoC on a wide set of near-sensor processing kernels showing that Mr. Wolf can deliver performance up to 16.4 GOp/s with energy efficiency up to 274 MOp/s/mW on real-life applications, paving the way for always-on data analytics on high-bandwidth sensors at the edge of the Internet of Things

    Compiler-Driven Reconfiguration of Multiprocessors

    Get PDF
    Hussmann M, Thies M, Kastens U, Purnaprajna M, Porrmann M, Rückert U. Compiler-Driven Reconfiguration of Multiprocessors. In: Proceedings of the Workshop on Application Specific Processors (WASP) 2007. 2007.Multiprocessors enable parallel execution of a single large application to achieve a performance improvement. An application is split at instruction, data or task level (based on the granularity), such that the overhead of partitioning is minimal. Parallelization for multiprocessors is mostly restricted to a fixed granularity. Reconfiguration enables architectural variations to allow multiple granularities of operation within a multiprocessor. This adaptability optimizes resource utilization over a fixed organization. Here, a unified hardware-software approach to design a reconfigurable multiprocessor system called QuadroCore is presented. In our holistic methodology, compiler-driven reconfiguration selects from a fixed set of modes. Each mode relies on matching program analysis to exploit the architecture efficiently. For instance, a multiprocessor may adapt to different parallelization paradigms. The compiler can determine the best execution mode for each piece of code by analyzing the parallelism in a program. A fast, singlecycle, run-time reconfiguration between these predetermined modes is enabled by executing special instructions which switch coarse-grained components like instruction decoders, ALUs and register banks. Performance is evaluated in terms of execution cycles and achieved clock frequency. First results indicate suitability especially in audio and video processing applications

    HW-SW Emulation Framework for Temperature-Aware Design in MPSoCs

    Get PDF
    New tendencies envisage Multi-Processor Systems-On-Chip (MPSoCs) as a promising solution for the consumer electronics market. MPSoCs are complex to design, as they must execute multiple applications (games, video), while meeting additional design constraints (energy consumption, time-to-market). Moreover, the rise of temperature in the die for MPSoCs can seriously affect their final performance and reliability. In this paper, we present a new hardware-software emulation framework that allows designers a complete exploration of the thermal behavior of final MPSoC designs early in the design flow. The proposed framework uses FPGA emulation as the key element to model the hardware components of the considered MPSoC platform at multi-megahertz speeds. It automatically extracts detailed system statistics that are used as input to our software thermal library running in a host computer. This library calculates at run-time the temperature of on-chip components, based on the collected statistics from the emulated system and the final floorplan of the MPSoC. This enables fast testing of various thermal management techniques. Our results show speed-ups of three orders of magnitude compared to cycle-accurate MPSoC simulator
    corecore