34 research outputs found

    VThreads: A novel VLIW chip multiprocessor with hardware-assisted PThreads

    Get PDF
    We discuss VThreads, a novel VLIW CMP with hardware-assisted shared-memory Thread support. VThreads supports Instruction Level Parallelism via static multiple-issue and Thread Level Parallelism via hardware-assisted POSIX Threads along with extensive customization. It allows the instantiation of tightlycoupled streaming accelerators and supports up to 7-address Multiple-Input, Multiple-Output instruction extensions. VThreads is designed in technology-independent Register-Transfer-Level VHDL and prototyped on 40 nm and 28 nm Field-Programmable gate arrays. It was evaluated against a PThreads-based multiprocessor based on the Sparc-V8 ISA. On a 65 nm ASIC implementation VThreads achieves up to x7.2 performance increase on synthetic benchmarks, x5 on a parallel Mandelbrot implementation, 66% better on a threaded JPEG implementation, 79% better on an edge-detection benchmark and ~13% improvement on DES compared to the Leon3MP CMP. In the range of 2 to 8 cores VThreads demonstrates a post-route (statistical) power reduction between 65% to 57% at an area increase of 1.2%-10% for 1-8 cores, compared to a similarly-configured Leon3MP CMP. This combination of micro-architectural features, scalability, extensibility, hardware support for low-latency PThreads, power efficiency and area make the processor an attractive proposition for low-power, deeply-embedded applications requiring minimum OS support

    내장형 프로세서에서의 코드 크기 최적화를 위한 아키텍처 설계 및 컴파일러 지원

    Get PDF
    학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2014. 2. 백윤흥.Embedded processors usually need to satisfy very tight design constraints to achieve low power consumption, small chip area, and high performance. One of the obstacles to meeting these requirements is related to delivering instructions from instruction memory/caches. The size of instruction memory/cache considerably contributes total chip area. Further, frequent access to caches incurs high power/energy consumption and significantly hampers overall system performance due to cache misses. To reduce the negative effects of the instruction delivery, therefore, this study focuses on the sizing of instruction memory/cache through code size optimization. One observation for code size optimization is that very long instruction word (VLIW) architectures often consume more power and memory space than necessary due to long instruction bit-width. One way to lessen this problem is to adopt a reduced bit-width ISA (Instruction Set Architecture) that has a narrower instruction word length. In practice, however, it is impossible to convert a given ISA fully into an equivalent reduced bit-width one because the narrow instruction word, due to bitwidth restrictions, can encode only a small subset of normal instructions in the original ISA. To explore the possibility of complete conversion of an existing 32-bit ISA into a 16-bit one that supports effectively all 32-bit instructions, we propose the reduced bit-width (e.g. 16-bit × 4-way) VLIW architectures that equivalently behave as their original bit-width (e.g. 32-bit × 4-way) architectures with the help of dynamic implied addressing mode (DIAM). Second, we observe that code duplication techniques have been proposed to increase the reliability against soft errors in multi-issue embedded systems such as VLIW by exploiting empty slots for duplicated instructions. Unfortunately, all duplicated instructions cannot be allocated to empty slots, which enforces generating additional VLIW packets to include the duplicated instructions. The increase of code size due to the extra VLIW packets is necessarily accompanied with the enhanced reliability. In order to minimize code size, we propose a novel approach compiler-assisted dynamic code duplication scheme, which accepts an assembly code composed of only original instructions as input, and generates duplicated instructions at runtime with the help of encoded information attached to original instructions. Since the duplicates of original instructions are not explicitly present in the assembly code, the increase of code size due to the duplicated instructions can be avoided in the proposed scheme. Lastly, the third observation is that, to cope with soft errors similarly to the second observation, a recently proposed software-based technique with TMR (Triple Modular Redundancy) implemented on coarse-grained reconfigurable architectures (CGRA) incurs the increase of configuration size, which is corresponding to the code size of CGRA, and thus extreme overheads in terms of runtime and energy consumption mainly due to expensive voting mechanisms for the outputs from the triplication of every operation. To reduce the expensive performance overhead due to the large configuration from the validation mechanism, we propose selective validation mechanisms for efficient modular redundancy techniques in the datapath on CGRA. The proposed techniques selectively validate the results at synchronous operations rather than every operation.Abstract i Chapter 1 Introduction 1 1.1 Instruction Delivery . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 The causes of code size increase . . . . . . . . . . . . . . . . . . . . 2 1.2.1 Instruction Bit-width in VLIW Architectures . . . . . . . . . 2 1.2.2 Instruction Redundancy . . . . . . . . . . . . . . . . . . . . 3 Chapter 2 Reducing Instruction Bit-width with Dynamic Implied Addressing Mode (DIAM) 7 2.1 Conceptual View . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Architecture Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.1 ISA Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.2 Remote Operand Array Buffer . . . . . . . . . . . . . . . . . 15 2.2.3 Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3 Compiler Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3.1 16-bit Instruction Generation . . . . . . . . . . . . . . . . . . 24 2.3.2 DDG Construction & Scheduling . . . . . . . . . . . . . . . 26 2.4 VLES(Variable Length Execution Set) Architecture with a Reduced Bit-width Instruction Set . . . . . . . . . . . . . . . . . . . . . . . . 29 2.4.1 Architecture Design . . . . . . . . . . . . . . . . . . . . . . 30 2.4.2 Compiler Support . . . . . . . . . . . . . . . . . . . . . . . . 34 2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.5.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.5.3 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . 48 2.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Chapter 3 Compiler-assisted Dynamic Code Duplication Scheme for Soft Error Resilient VLIW Architectures 53 3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.2 Compiler-assisted Dynamic Code Duplication . . . . . . . . . . . . . 58 3.2.1 ISA Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.2.2 Modified Fetch Stage . . . . . . . . . . . . . . . . . . . . . . 62 3.3 Compilation Techniques . . . . . . . . . . . . . . . . . . . . . . . . 66 3.3.1 Static Code Duplication Algorithm . . . . . . . . . . . . . . 67 3.3.2 Vulnerability-aware Duplication Algorithm . . . . . . . . . . 68 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . 71 3.4.2 Effectiveness of Compiler-assisted Dynamic Code Duplication 73 3.4.3 Effectiveness of Vulnerability-aware Duplication Algorithm . 77 Chapter 4 Selective Validation Techniques for Robust CGRAs against Soft Errors 85 4.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.3 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.3.1 Selective Validation Mechanism . . . . . . . . . . . . . . . . 91 4.3.2 Compilation Flow and Performance Analysis . . . . . . . . . 92 4.3.3 Fault Coverage Analysis . . . . . . . . . . . . . . . . . . . . 96 4.3.4 Our Optimization - Minimizing Store Operation . . . . . . . . 97 4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 4.4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 4.4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . 100 Chapter 5 Conculsion 110 초록 122Docto

    Control of sectioned on-chip communication

    Get PDF

    From Parallel Programs to Customized Parallel Processors

    Get PDF
    The need for fast time to market of new embedded processor-based designs calls for a rapid design methodology of the included processors. The call for such a methodology is even more emphasized in the context of so called soft cores targeted to reconfigurable fabrics where per-design processor customization is commonplace. The C language has been commonly used as an input to hardware/software co-design flows. However, as C is a sequential language, its potential to generate parallel operations to utilize naturally parallel hardware constructs is far from optimal, leading to a customized processor design space with limited parallel resource scalability. In contrast, when utilizing a parallel programming language as an input, a wider processor design space can be explored to produce customized processors with varying degrees of utilized parallelism. This Thesis proposes a novel Multicore Application-Specific Instruction Set Processor (MCASIP) co-design methodology that exploits parallel programming languages as the application input format. In the methodology, the designer can explicitly capture the parallelism of the algorithm and exploit specialized instructions using a parallel programming language in contrast to being on the mercy of the compiler or the hardware to extract the parallelism from a sequential input. The Thesis proposes a multicore processor template based on the Transport Triggered Architecture, compiler techniques involved in static parallelization of computation kernels with barriers and a datapath integrated hardware accelerator for low overhead software synchronization implementation. These contributions enable scaling the customized processors both at the instruction and task levels to efficiently exploit the parallelism in the input program up to the implementation constraints such as the memory bandwidth or the chip area. The different contributions are validated with case studies, comparisons and design examples

    Enhanced applicability of loop transformations

    Get PDF

    KAVUAKA: a low-power application-specific processor architecture for digital hearing aids

    Get PDF
    The power consumption of digital hearing aids is very restricted due to their small physical size and the available hardware resources for signal processing are limited. However, there is a demand for more processing performance to make future hearing aids more useful and smarter. Future hearing aids should be able to detect, localize, and recognize target speakers in complex acoustic environments to further improve the speech intelligibility of the individual hearing aid user. Computationally intensive algorithms are required for this task. To maintain acceptable battery life, the hearing aid processing architecture must be highly optimized for extremely low-power consumption and high processing performance.The integration of application-specific instruction-set processors (ASIPs) into hearing aids enables a wide range of architectural customizations to meet the stringent power consumption and performance requirements. In this thesis, the application-specific hearing aid processor KAVUAKA is presented, which is customized and optimized with state-of-the-art hearing aid algorithms such as speaker localization, noise reduction, beamforming algorithms, and speech recognition. Specialized and application-specific instructions are designed and added to the baseline instruction set architecture (ISA). Among the major contributions are a multiply-accumulate (MAC) unit for real- and complex-valued numbers, architectures for power reduction during register accesses, co-processors and a low-latency audio interface. With the proposed MAC architecture, the KAVUAKA processor requires 16 % less cycles for the computation of a 128-point fast Fourier transform (FFT) compared to related programmable digital signal processors. The power consumption during register file accesses is decreased by 6 %to 17 % with isolation and by-pass techniques. The hardware-induced audio latency is 34 %lower compared to related audio interfaces for frame size of 64 samples.The final hearing aid system-on-chip (SoC) with four KAVUAKA processor cores and ten co-processors is integrated as an application-specific integrated circuit (ASIC) using a 40 nm low-power technology. The die size is 3.6 mm2. Each of the processors and co-processors contains individual customizations and hardware features with a varying datapath width between 24-bit to 64-bit. The core area of the 64-bit processor configuration is 0.134 mm2. The processors are organized in two clusters that share memory, an audio interface, co-processors and serial interfaces. The average power consumption at a clock speed of 10 MHz is 2.4 mW for SoC and 0.6 mW for the 64-bit processor.Case studies with four reference hearing aid algorithms are used to present and evaluate the proposed hardware architectures and optimizations. The program code for each processor and co-processor is generated and optimized with evolutionary algorithms for operation merging,instruction scheduling and register allocation. The KAVUAKA processor architecture is com-pared to related processor architectures in terms of processing performance, average power consumption, and silicon area requirements

    Dependable Embedded Systems

    Get PDF
    This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. This book introduces the most prominent reliability concerns from today’s points of view and roughly recapitulates the progress in the community so far. Unlike other books that focus on a single abstraction level such circuit level or system level alone, the focus of this book is to deal with the different reliability challenges across different levels starting from the physical level all the way to the system level (cross-layer approaches). The book aims at demonstrating how new hardware/software co-design solution can be proposed to ef-fectively mitigate reliability degradation such as transistor aging, processor variation, temperature effects, soft errors, etc. Provides readers with latest insights into novel, cross-layer methods and models with respect to dependability of embedded systems; Describes cross-layer approaches that can leverage reliability through techniques that are pro-actively designed with respect to techniques at other layers; Explains run-time adaptation and concepts/means of self-organization, in order to achieve error resiliency in complex, future many core systems

    An FPGA implementation of an investigative many-core processor, Fynbos : in support of a Fortran autoparallelising software pipeline

    Get PDF
    Includes bibliographical references.In light of the power, memory, ILP, and utilisation walls facing the computing industry, this work examines the hypothetical many-core approach to finding greater compute performance and efficiency. In order to achieve greater efficiency in an environment in which Moore’s law continues but TDP has been capped, a means of deriving performance from dark and dim silicon is needed. The many-core hypothesis is one approach to exploiting these available transistors efficiently. As understood in this work, it involves trading in hardware control complexity for hundreds to thousands of parallel simple processing elements, and operating at a clock speed sufficiently low as to allow the efficiency gains of near threshold voltage operation. Performance is there- fore dependant on exploiting a new degree of fine-grained parallelism such as is currently only found in GPGPUs, but in a manner that is not as restrictive in application domain range. While removing the complex control hardware of traditional CPUs provides space for more arithmetic hardware, a basic level of control is still required. For a number of reasons this work chooses to replace this control largely with static scheduling. This pushes the burden of control primarily to the software and specifically the compiler, rather not to the programmer or to an application specific means of control simplification. An existing legacy tool chain capable of autoparallelising sequential Fortran code to the degree of parallelism necessary for many-core exists. This work implements a many-core architecture to match it. Prototyping the design on an FPGA, it is possible to examine the real world performance of the compiler-architecture system to a greater degree than simulation only would allow. Comparing theoretical peak performance and real performance in a case study application, the system is found to be more efficient than any other reviewed, but to also significantly under perform relative to current competing architectures. This failing is apportioned to taking the need for simple hardware too far, and an inability to implement static scheduling mitigating tactics due to lack of support for such in the compiler
    corecore