236 research outputs found

    Compilation for Delay Impact Minimization in VLIW Embedded Systems

    Tomorrow’s embedded devices must run high-resolution multimedia and support multistandard wireless systems, which demand enormous computational complexity under very low energy budgets and very high performance constraints. In this context, the register file is one of the key sources of power consumption and a performance bottleneck, and its inappropriate design and management can severely affect the performance of the system. In this paper, we present a new compilation approach to mitigate the performance implications of technology variation in the shared register file in upcoming embedded VLIW architectures with several processing units. The compilation approach is based on a redefined register assignment policy and a set of architectural modifications to this device. Experimental results show up to a 67% performance improvement with our technique.

    The Chameleon Architecture for Streaming DSP Applications

    We focus on architectures for streaming DSP applications such as wireless baseband processing and image processing. We aim at a single generic architecture that is capable of dealing with different DSP applications. This architecture has to be energy efficient and fault tolerant. We introduce a heterogeneous tiled architecture and present the details of a domain-specific reconfigurable tile processor called Montium. This reconfigurable processor has a small footprint (1.8 mm² in a 130 nm process), is power efficient and exploits the locality of reference principle. Reconfiguring the device is very fast; for example, loading the coefficients for a 200 tap FIR filter is done within 80 clock cycles. The tiles on the tiled architecture are connected to a Network-on-Chip (NoC) via a network interface (NI). Two NoCs have been developed: a packet-switched and a circuit-switched version. Both provide two types of services: guaranteed throughput (GT) and best effort (BE). For both NoCs estimates of power consumption are presented. The NI synchronizes data transfers, configures and starts/stops the tile processor. For dynamically mapping applications onto the tiled architecture, we introduce a run-time mapping tool.

    Architecture Design and Compiler Support for Code Size Optimization in Embedded Processors

    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2014. Yunheung Paek.

    Embedded processors usually need to satisfy very tight design constraints to achieve low power consumption, small chip area, and high performance. One of the obstacles to meeting these requirements is related to delivering instructions from instruction memory/caches. The size of the instruction memory/cache contributes considerably to total chip area. Further, frequent access to caches incurs high power/energy consumption and significantly hampers overall system performance due to cache misses. To reduce the negative effects of instruction delivery, therefore, this study focuses on the sizing of instruction memory/cache through code size optimization. One observation for code size optimization is that very long instruction word (VLIW) architectures often consume more power and memory space than necessary due to their long instruction bit-width. One way to lessen this problem is to adopt a reduced bit-width ISA (Instruction Set Architecture) that has a narrower instruction word length. In practice, however, it is impossible to convert a given ISA fully into an equivalent reduced bit-width one because the narrow instruction word, due to bit-width restrictions, can encode only a small subset of normal instructions in the original ISA. To explore the possibility of complete conversion of an existing 32-bit ISA into a 16-bit one that supports effectively all 32-bit instructions, we propose reduced bit-width (e.g. 16-bit × 4-way) VLIW architectures that behave equivalently to their original bit-width (e.g. 32-bit × 4-way) architectures with the help of dynamic implied addressing mode (DIAM).

    Second, we observe that code duplication techniques have been proposed to increase reliability against soft errors in multi-issue embedded systems such as VLIW by exploiting empty slots for duplicated instructions. Unfortunately, not all duplicated instructions can be allocated to empty slots, which forces the generation of additional VLIW packets to hold the remaining duplicates. The enhanced reliability is therefore necessarily accompanied by an increase in code size due to the extra VLIW packets. In order to minimize code size, we propose a novel compiler-assisted dynamic code duplication scheme, which accepts assembly code composed of only original instructions as input and generates duplicated instructions at runtime with the help of encoded information attached to the original instructions. Since the duplicates of original instructions are not explicitly present in the assembly code, the increase of code size due to the duplicated instructions can be avoided in the proposed scheme.

    Lastly, the third observation is that, to cope with soft errors similarly to the second observation, a recently proposed software-based technique with TMR (Triple Modular Redundancy) implemented on coarse-grained reconfigurable architectures (CGRA) incurs an increase in configuration size, which corresponds to the code size of a CGRA, and thus extreme overheads in runtime and energy consumption, mainly due to expensive voting mechanisms for the outputs from the triplication of every operation. To reduce the performance overhead caused by the large configuration that the validation mechanism requires, we propose selective validation mechanisms for efficient modular redundancy techniques in the datapath on CGRA. The proposed techniques selectively validate the results at synchronous operations rather than at every operation.

    Contents:
    Abstract
    Chapter 1 Introduction
        1.1 Instruction Delivery
        1.2 The causes of code size increase
            1.2.1 Instruction Bit-width in VLIW Architectures
            1.2.2 Instruction Redundancy
    Chapter 2 Reducing Instruction Bit-width with Dynamic Implied Addressing Mode (DIAM)
        2.1 Conceptual View
        2.2 Architecture Design
            2.2.1 ISA Design
            2.2.2 Remote Operand Array Buffer
            2.2.3 Microarchitecture
        2.3 Compiler Support
            2.3.1 16-bit Instruction Generation
            2.3.2 DDG Construction & Scheduling
        2.4 VLES (Variable Length Execution Set) Architecture with a Reduced Bit-width Instruction Set
            2.4.1 Architecture Design
            2.4.2 Compiler Support
        2.5 Experiments
            2.5.1 Setup
            2.5.2 Results
            2.5.3 Sensitivity Analysis
        2.6 Related Work
    Chapter 3 Compiler-assisted Dynamic Code Duplication Scheme for Soft Error Resilient VLIW Architectures
        3.1 Related Work
        3.2 Compiler-assisted Dynamic Code Duplication
            3.2.1 ISA Design
            3.2.2 Modified Fetch Stage
        3.3 Compilation Techniques
            3.3.1 Static Code Duplication Algorithm
            3.3.2 Vulnerability-aware Duplication Algorithm
        3.4 Experiments
            3.4.1 Experimental Setup
            3.4.2 Effectiveness of Compiler-assisted Dynamic Code Duplication
            3.4.3 Effectiveness of Vulnerability-aware Duplication Algorithm
    Chapter 4 Selective Validation Techniques for Robust CGRAs against Soft Errors
        4.1 Related Works
        4.2 Motivation
        4.3 Our Approach
            4.3.1 Selective Validation Mechanism
            4.3.2 Compilation Flow and Performance Analysis
            4.3.3 Fault Coverage Analysis
            4.3.4 Our Optimization - Minimizing Store Operation
        4.4 Experiments
            4.4.1 Setup
            4.4.2 Experimental Results
    Chapter 5 Conclusion
    Abstract (in Korean)
    Docto

    DeSyRe: on-Demand System Reliability

    The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect-free and fault-free system would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints.

    Fault tolerance issues in nanoelectronics

    The astonishing success story of microelectronics cannot go on indefinitely. In fact, once devices reach the few-atom scale (nanoelectronics), transient quantum effects are expected to impair their behaviour. Fault tolerant techniques will then be required. The aim of this thesis is to investigate the problem of transient errors in nanoelectronic devices. Transient error rates are estimated for a selection of nanoelectronic gates, based upon quantum cellular automata and single electron devices, in which the electrostatic interaction between electrons is used to create Boolean circuits. On the basis of these results, various fault tolerant solutions are proposed, for both logic and memory nanochips. As for logic chips, traditional techniques are found to be unsuitable. A new technique, in which the voting approach of triple modular redundancy (TMR) is extended by cascading TMR units composed of nanogate clusters, is proposed and generalised to other voting approaches. For memory chips, an error correcting code approach is found to be suitable. Various codes are considered and a lookup table approach is proposed for encoding and decoding. We are then able to give estimations for the redundancy level to be provided on nanochips, so as to make their mean time between failures acceptable. It is found that, for logic chips, space redundancies up to a few tens are required if mean times between failures are to be of the order of a few years. Space redundancy can also be traded for time redundancy. As for memory chips, mean times between failures of the order of a few years are found to imply both space and time redundancies of the order of ten.
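
The voting step that triple modular redundancy relies on can be illustrated with a minimal Python sketch. It shows only plain, single-stage TMR with a bitwise 2-of-3 majority voter, not the cascaded scheme the thesis proposes; the names `majority_vote`, `tmr` and `fault_masks` are hypothetical and chosen for illustration, and faults are modelled simply as bit flips XORed into a replica's output.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(compute, x, fault_masks=(0, 0, 0)):
    """Evaluate `compute` three times and vote on the results.

    `fault_masks` is an illustrative fault model (an assumption, not from the
    abstract): each mask is XORed into one replica's output to flip bits.
    """
    replicas = [compute(x) ^ mask for mask in fault_masks]
    return majority_vote(*replicas)


if __name__ == "__main__":
    low_byte = lambda v: v & 0xFF
    # One corrupted replica is outvoted by the two healthy ones.
    print(bin(tmr(low_byte, 0b10101100, (0, 0b00000100, 0))))
    # Two replicas corrupted in the same bit defeat the voter.
    print(bin(tmr(low_byte, 0b10101100, (0b1, 0b1, 0))))
```

A single voter masks any fault confined to one replica per bit position, which is why the second call returns a wrong result: two replicas agree on the flipped bit.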

    System-on-chip Computing and Interconnection Architectures for Telecommunications and Signal Processing

    This dissertation proposes novel architectures and design techniques targeting SoC building blocks for telecommunications and signal processing applications. Hardware implementation of Low-Density Parity-Check decoders is approached at both the algorithmic and the architecture level. Low-Density Parity-Check codes are a promising coding scheme for future communication standards due to their outstanding error correction performance. This work proposes a methodology for analyzing the effects of finite precision arithmetic on error correction performance and hardware complexity. The methodology is employed throughout for co-designing the decoder. First, a low-complexity check node based on the P-output decoding principle is designed and characterized on a CMOS standard-cells library. Results demonstrate implementation loss below 0.2 dB down to a BER of 10^{-8} and a saving in complexity of up to 59% with respect to other works in recent literature. High-throughput and low-latency issues are addressed with modified single-phase decoding schedules. A new "memory-aware" schedule is proposed requiring down to 20% of the memory of traditional two-phase flooding decoding. Additionally, throughput is doubled and logic complexity is reduced by 12%. These advantages are traded off against error correction performance, thus making the solution attractive only for long codes, such as those adopted in the DVB-S2 standard. The "layered decoding" principle is extended to codes not specifically conceived for this technique. The proposed architectures exhibit complexity savings in the order of 40% for both area and power consumption figures, while implementation loss is smaller than 0.05 dB. Most modern communication standards employ Orthogonal Frequency Division Multiplexing (OFDM) as part of their physical layer. The core of OFDM is the Fast Fourier Transform (FFT) and its inverse, which are in charge of symbol (de)modulation. Requirements on throughput and energy efficiency call for FFT hardware implementation, while the ubiquity of the FFT suggests the design of parametric, re-configurable and re-usable IP hardware macrocells. In this context, this thesis describes an FFT/IFFT core compiler particularly suited for implementation of OFDM communication systems. The tool employs an accuracy-driven configuration engine which automatically profiles the internal arithmetic and generates a core with minimum operand bit-width and thus minimum circuit complexity. The engine performs a closed-loop optimization over three different internal arithmetic models (fixed-point, block floating-point and convergent block floating-point) using the numerical accuracy budget given by the user as a reference point. The flexibility and re-usability of the proposed macrocell are illustrated through several case studies which encompass all current state-of-the-art OFDM communications standards (WLAN, WMAN, xDSL, DVB-T/H, DAB and UWB). Implementation results are presented for two deep sub-micron standard-cells libraries (65 and 90 nm) and commercially available FPGA devices. Compared with other FFT core compilers, the proposed environment produces macrocells with lower circuit complexity and the same system level performance (throughput, transform size and numerical accuracy). The final part of this dissertation focuses on the Network-on-Chip design paradigm, whose goal is building scalable communication infrastructures connecting hundreds of cores. A low-complexity link architecture for mesochronous on-chip communication is discussed. The link enables looser skew constraints in clock tree synthesis, frequency speed-up, power consumption reduction and faster back-end turnarounds. The proposed architecture reaches a maximum clock frequency of 1 GHz on a 65 nm low-leakage CMOS standard-cells library. In a complex test case with a full-blown NoC infrastructure, the link overhead is only 3% of chip area and 0.5% of leakage power consumption. Finally, a new methodology, named metacoding, is proposed. Metacoding generates correct-by-construction technology independent RTL codebases for NoC building blocks. The RTL coding phase is abstracted and modeled with an Object Oriented framework, integrated within a commercial tool for IP packaging (Synopsys CoreTools suite). Compared with traditional coding styles based on pre-processor directives, metacoding produces 65% smaller codebases and reduces the number of configurations to verify by up to three orders of magnitude.
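
Block floating-point, one of the arithmetic models named in the abstract above, can be illustrated with a minimal Python sketch: a block of values shares a single exponent while each value keeps its own integer mantissa. This is the generic textbook idea under stated assumptions, not the thesis's configuration engine; the function names and the `mantissa_bits` parameter are hypothetical.

```python
import math


def to_block_floating_point(values, mantissa_bits=8):
    """Quantize a block of floats to signed integer mantissas with one shared exponent.

    Assumption for illustration: mantissas are signed integers limited to
    +/-(2**(mantissa_bits - 1) - 1), and the exponent is the smallest one
    for which the largest magnitude in the block still fits.
    """
    limit = 2 ** (mantissa_bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0] * len(values), 0
    exponent = math.ceil(math.log2(peak / limit))
    scale = 2.0 ** exponent
    # Clamp to guard against rounding at the very edge of the mantissa range.
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return mantissas, exponent


def from_block_floating_point(mantissas, exponent):
    """Reconstruct approximate floats from mantissas and the shared exponent."""
    return [m * 2.0 ** exponent for m in mantissas]


if __name__ == "__main__":
    block = [0.5, -0.25, 0.125, 1.0]
    mantissas, exponent = to_block_floating_point(block)
    print(mantissas, exponent)
    print(from_block_floating_point(mantissas, exponent))
```

Because the exponent is set by the largest value in the block, small values sharing a block with a large one lose precision; that trade-off is what an accuracy budget would have to account for.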

    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)

    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016), Timisoara, Romania, February 8-11, 2016. The PhD Symposium was a very good opportunity for young researchers to share information and knowledge, to present their current research, and to discuss topics with other students in order to look for synergies and common research topics. The idea was very successful and the assessment made by the PhD students was very good. It also helped to achieve one of the major goals of the NESUS Action: to establish an open European research network targeting sustainable solutions for ultrascale computing, aiming at cross-fertilization among HPC, large scale distributed systems, and big data management; to provide training; and to bring together disparate researchers working across different areas by providing a meeting ground where they can exchange ideas, identify synergies, and pursue common activities in research topics such as sustainable software solutions (applications and system software stack), data management, energy efficiency, and resilience. European Cooperation in Science and Technology. COS

    KAVUAKA: a low-power application-specific processor architecture for digital hearing aids

    The power consumption of digital hearing aids is very restricted due to their small physical size, and the available hardware resources for signal processing are limited. However, there is a demand for more processing performance to make future hearing aids more useful and smarter. Future hearing aids should be able to detect, localize, and recognize target speakers in complex acoustic environments to further improve the speech intelligibility of the individual hearing aid user. Computationally intensive algorithms are required for this task. To maintain acceptable battery life, the hearing aid processing architecture must be highly optimized for extremely low power consumption and high processing performance. The integration of application-specific instruction-set processors (ASIPs) into hearing aids enables a wide range of architectural customizations to meet the stringent power consumption and performance requirements. In this thesis, the application-specific hearing aid processor KAVUAKA is presented, which is customized and optimized with state-of-the-art hearing aid algorithms such as speaker localization, noise reduction, beamforming algorithms, and speech recognition. Specialized and application-specific instructions are designed and added to the baseline instruction set architecture (ISA). Among the major contributions are a multiply-accumulate (MAC) unit for real- and complex-valued numbers, architectures for power reduction during register accesses, co-processors and a low-latency audio interface. With the proposed MAC architecture, the KAVUAKA processor requires 16 % fewer cycles for the computation of a 128-point fast Fourier transform (FFT) compared to related programmable digital signal processors. The power consumption during register file accesses is decreased by 6 % to 17 % with isolation and by-pass techniques. The hardware-induced audio latency is 34 % lower compared to related audio interfaces for a frame size of 64 samples. The final hearing aid system-on-chip (SoC) with four KAVUAKA processor cores and ten co-processors is integrated as an application-specific integrated circuit (ASIC) using a 40 nm low-power technology. The die size is 3.6 mm². Each of the processors and co-processors contains individual customizations and hardware features, with a datapath width varying from 24 to 64 bits. The core area of the 64-bit processor configuration is 0.134 mm². The processors are organized in two clusters that share memory, an audio interface, co-processors and serial interfaces. The average power consumption at a clock speed of 10 MHz is 2.4 mW for the SoC and 0.6 mW for the 64-bit processor. Case studies with four reference hearing aid algorithms are used to present and evaluate the proposed hardware architectures and optimizations. The program code for each processor and co-processor is generated and optimized with evolutionary algorithms for operation merging, instruction scheduling and register allocation. The KAVUAKA processor architecture is compared to related processor architectures in terms of processing performance, average power consumption, and silicon area requirements.
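
A complex-valued multiply-accumulate, the operation the MAC unit in the abstract above supports, can be written out in real arithmetic as a minimal Python sketch. It shows only the standard arithmetic identity (four real multiplications per complex product) and says nothing about how the KAVUAKA hardware implements it; `complex_mac` and `complex_dot` are hypothetical names, and complex numbers are represented as `(real, imag)` tuples for illustration.

```python
def complex_mac(acc, a, b):
    """Return acc + a * b for complex numbers given as (real, imag) tuples."""
    acc_re, acc_im = acc
    a_re, a_im = a
    b_re, b_im = b
    # (a_re + j*a_im) * (b_re + j*b_im) expanded into real operations.
    return (acc_re + a_re * b_re - a_im * b_im,
            acc_im + a_re * b_im + a_im * b_re)


def complex_dot(xs, ys):
    """Accumulate the products of two equally long sequences of complex tuples."""
    acc = (0, 0)
    for x, y in zip(xs, ys):
        acc = complex_mac(acc, x, y)
    return acc


if __name__ == "__main__":
    print(complex_mac((0, 0), (1, 2), (3, 4)))
    print(complex_dot([(1, 2), (0, 1)], [(3, 4), (0, 1)]))
```

FFT butterflies and beamforming filters are dominated by this accumulate pattern, which is why a dedicated unit for it can reduce cycle counts.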