Search CORE

101 research outputs found

Performance optimizations for compiler-based error detection

Author: Mitropoulou Konstantina
Publication venue: The University of Edinburgh
Publication date: 29/06/2015
Field of study

The trend towards smaller transistor technologies and lower operating voltages stresses the hardware and makes transistors more susceptible to transient errors. In future systems, performance and power gains will come at the cost of unreliable areas on the chip. For this reason, there is an increased need for low-overhead highly-reliable error detection methodologies. In the last years, several techniques have been proposed. The majority of them are based on redundancy which can be implemented at several levels (e.g., hardware, instruction, thread, process, etc). In instruction-level error detection approaches, the compiler replicates the instructions of the program and inserts checks wherever they are needed. The checks evaluate code correctness and decide whether or not an error has occurred. This type of error detection is more flexible than the hardware alternatives. It allows the programmer to choose the protected area of the program and it can be applied without any hardware modifications. On the other hand, the replicated instructions and the checks cause a large slowdown making software techniques less appealing. In this thesis, we propose two techniques that aim at reducing the error detection overhead of compiler-based approaches and improving system’s performance without sacrificing the fault-coverage. The first technique, DRIFT, achieves this by decoupling the execution of the code (original and replicated) from the checks. The checks are compare and jump instructions. The latter ones tend to make the code sequential and prohibit the compiler from performing aggressive instruction scheduling optimizations. We call this phenomenon basic-block fragmentation. DRIFT reduces the impact of basic-block fragmentation by breaking the synchronized execute-check-confirm-execute cycle. In this way, DRIFT generates a scheduler-friendly code with more instruction-level parallelism (ILP). As a result, it reduces the performance overhead down to 1.29× (on average) and outperforms the state-of-the-art by up to 29.7% retaining the same fault-coverage. Next, CASTED focuses on reducing the impact of error detection overhead on single-chip scalable architectures that are composed of tightly-coupled cores. The proposed compiler methodology adaptively distributes the error detection overhead to the available resources across multiple cores, fully exploiting the abundant ILP of these architectures. CASTED adapts to a wide range of architecture configurations (issue-width, inter-core communication). The results show that CASTED matches the performance of, and often outperforms, sometimes by as mush as 21.2%, the best fixed state-of-the-art approach while maintaining the same fault coverage

CiteSeerX

Edinburgh Research Archive

Efficient memory-level parallelism extraction with decoupled strands

Author: Crago Neal
Publication venue
Publication date: 01/05/2011
Field of study

We present Outrider, an architecture for throughput-oriented processors that exploits intra-thread memory-level parallelism (MLP) to improve performance efficiency on highly threaded workloads. Outrider enables a single thread of execution to be presented to the architecture as multiple decoupled instruction streams, consisting of either memory accessing or memory consuming instructions. The key insight is that by decoupling the instruction streams, the processor pipeline can expose MLP in a way similar to out-of-order designs while relying on a low-complexity in-order micro-architecture. Instead of adding more threads as is done in modern GPUs, Outrider can expose the same MLP with fewer threads and reduced contention for resources shared among threads. We demonstrate that Outrider can outperform single-threaded cores by 23-131% and a 4-way simultaneous multi-threaded core by up to 87% in data parallel applications in a 1024-core system. Outrider achieves these performance gains without incurring the overhead of additional hardware thread contexts, which results in improved efficiency compared to a multi-threaded core

Illinois Digital Environment for Access to Learning and Scholarship Repository

Code Generation and Global Optimization Techniques for a Reconfigurable PRAM-NUMA Multicore Architecture

Author
Publication venue: 'Linkoping University Electronic Press'
Publication date
Field of study

Crossref

내장형 프로세서에서의 코드 크기 최적화를 위한 아키텍처 설계 및 컴파일러 지원

Author: 이종원
Publication venue: 서울대학교 대학원
Publication date: 01/02/2014
Field of study

학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2014. 2. 백윤흥.Embedded processors usually need to satisfy very tight design constraints to achieve low power consumption, small chip area, and high performance. One of the obstacles to meeting these requirements is related to delivering instructions from instruction memory/caches. The size of instruction memory/cache considerably contributes total chip area. Further, frequent access to caches incurs high power/energy consumption and significantly hampers overall system performance due to cache misses. To reduce the negative effects of the instruction delivery, therefore, this study focuses on the sizing of instruction memory/cache through code size optimization. One observation for code size optimization is that very long instruction word (VLIW) architectures often consume more power and memory space than necessary due to long instruction bit-width. One way to lessen this problem is to adopt a reduced bit-width ISA (Instruction Set Architecture) that has a narrower instruction word length. In practice, however, it is impossible to convert a given ISA fully into an equivalent reduced bit-width one because the narrow instruction word, due to bitwidth restrictions, can encode only a small subset of normal instructions in the original ISA. To explore the possibility of complete conversion of an existing 32-bit ISA into a 16-bit one that supports effectively all 32-bit instructions, we propose the reduced bit-width (e.g. 16-bit × 4-way) VLIW architectures that equivalently behave as their original bit-width (e.g. 32-bit × 4-way) architectures with the help of dynamic implied addressing mode (DIAM). Second, we observe that code duplication techniques have been proposed to increase the reliability against soft errors in multi-issue embedded systems such as VLIW by exploiting empty slots for duplicated instructions. Unfortunately, all duplicated instructions cannot be allocated to empty slots, which enforces generating additional VLIW packets to include the duplicated instructions. The increase of code size due to the extra VLIW packets is necessarily accompanied with the enhanced reliability. In order to minimize code size, we propose a novel approach compiler-assisted dynamic code duplication scheme, which accepts an assembly code composed of only original instructions as input, and generates duplicated instructions at runtime with the help of encoded information attached to original instructions. Since the duplicates of original instructions are not explicitly present in the assembly code, the increase of code size due to the duplicated instructions can be avoided in the proposed scheme. Lastly, the third observation is that, to cope with soft errors similarly to the second observation, a recently proposed software-based technique with TMR (Triple Modular Redundancy) implemented on coarse-grained reconfigurable architectures (CGRA) incurs the increase of configuration size, which is corresponding to the code size of CGRA, and thus extreme overheads in terms of runtime and energy consumption mainly due to expensive voting mechanisms for the outputs from the triplication of every operation. To reduce the expensive performance overhead due to the large configuration from the validation mechanism, we propose selective validation mechanisms for efficient modular redundancy techniques in the datapath on CGRA. The proposed techniques selectively validate the results at synchronous operations rather than every operation.Abstract i Chapter 1 Introduction 1 1.1 Instruction Delivery . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 The causes of code size increase . . . . . . . . . . . . . . . . . . . . 2 1.2.1 Instruction Bit-width in VLIW Architectures . . . . . . . . . 2 1.2.2 Instruction Redundancy . . . . . . . . . . . . . . . . . . . . 3 Chapter 2 Reducing Instruction Bit-width with Dynamic Implied Addressing Mode (DIAM) 7 2.1 Conceptual View . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Architecture Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.1 ISA Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.2 Remote Operand Array Buffer . . . . . . . . . . . . . . . . . 15 2.2.3 Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3 Compiler Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3.1 16-bit Instruction Generation . . . . . . . . . . . . . . . . . . 24 2.3.2 DDG Construction & Scheduling . . . . . . . . . . . . . . . 26 2.4 VLES(Variable Length Execution Set) Architecture with a Reduced Bit-width Instruction Set . . . . . . . . . . . . . . . . . . . . . . . . 29 2.4.1 Architecture Design . . . . . . . . . . . . . . . . . . . . . . 30 2.4.2 Compiler Support . . . . . . . . . . . . . . . . . . . . . . . . 34 2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.5.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.5.3 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . 48 2.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Chapter 3 Compiler-assisted Dynamic Code Duplication Scheme for Soft Error Resilient VLIW Architectures 53 3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.2 Compiler-assisted Dynamic Code Duplication . . . . . . . . . . . . . 58 3.2.1 ISA Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.2.2 Modified Fetch Stage . . . . . . . . . . . . . . . . . . . . . . 62 3.3 Compilation Techniques . . . . . . . . . . . . . . . . . . . . . . . . 66 3.3.1 Static Code Duplication Algorithm . . . . . . . . . . . . . . 67 3.3.2 Vulnerability-aware Duplication Algorithm . . . . . . . . . . 68 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . 71 3.4.2 Effectiveness of Compiler-assisted Dynamic Code Duplication 73 3.4.3 Effectiveness of Vulnerability-aware Duplication Algorithm . 77 Chapter 4 Selective Validation Techniques for Robust CGRAs against Soft Errors 85 4.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.3 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.3.1 Selective Validation Mechanism . . . . . . . . . . . . . . . . 91 4.3.2 Compilation Flow and Performance Analysis . . . . . . . . . 92 4.3.3 Fault Coverage Analysis . . . . . . . . . . . . . . . . . . . . 96 4.3.4 Our Optimization - Minimizing Store Operation . . . . . . . . 97 4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 4.4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 4.4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . 100 Chapter 5 Conculsion 110 초록 122Docto

SNU Open Repository and Archive

KAVUAKA: a low-power application-specific processor architecture for digital hearing aids

Author: Gerlach Lukas
Publication venue: Hannover : Institutionelles Repositorium der Leibniz Universität Hannover
Publication date: 01/01/2021
Field of study

The power consumption of digital hearing aids is very restricted due to their small physical size and the available hardware resources for signal processing are limited. However, there is a demand for more processing performance to make future hearing aids more useful and smarter. Future hearing aids should be able to detect, localize, and recognize target speakers in complex acoustic environments to further improve the speech intelligibility of the individual hearing aid user. Computationally intensive algorithms are required for this task. To maintain acceptable battery life, the hearing aid processing architecture must be highly optimized for extremely low-power consumption and high processing performance.The integration of application-specific instruction-set processors (ASIPs) into hearing aids enables a wide range of architectural customizations to meet the stringent power consumption and performance requirements. In this thesis, the application-specific hearing aid processor KAVUAKA is presented, which is customized and optimized with state-of-the-art hearing aid algorithms such as speaker localization, noise reduction, beamforming algorithms, and speech recognition. Specialized and application-specific instructions are designed and added to the baseline instruction set architecture (ISA). Among the major contributions are a multiply-accumulate (MAC) unit for real- and complex-valued numbers, architectures for power reduction during register accesses, co-processors and a low-latency audio interface. With the proposed MAC architecture, the KAVUAKA processor requires 16 % less cycles for the computation of a 128-point fast Fourier transform (FFT) compared to related programmable digital signal processors. The power consumption during register file accesses is decreased by 6 %to 17 % with isolation and by-pass techniques. The hardware-induced audio latency is 34 %lower compared to related audio interfaces for frame size of 64 samples.The final hearing aid system-on-chip (SoC) with four KAVUAKA processor cores and ten co-processors is integrated as an application-specific integrated circuit (ASIC) using a 40 nm low-power technology. The die size is 3.6 mm2. Each of the processors and co-processors contains individual customizations and hardware features with a varying datapath width between 24-bit to 64-bit. The core area of the 64-bit processor configuration is 0.134 mm2. The processors are organized in two clusters that share memory, an audio interface, co-processors and serial interfaces. The average power consumption at a clock speed of 10 MHz is 2.4 mW for SoC and 0.6 mW for the 64-bit processor.Case studies with four reference hearing aid algorithms are used to present and evaluate the proposed hardware architectures and optimizations. The program code for each processor and co-processor is generated and optimized with evolutionary algorithms for operation merging,instruction scheduling and register allocation. The KAVUAKA processor architecture is com-pared to related processor architectures in terms of processing performance, average power consumption, and silicon area requirements

Institutionelles Repositorium der Leibniz Universität Hannover

10281 Abstracts Collection -- Dynamically Reconfigurable Architectures

Author: Athanas Peter M.
Verbauwhede Ingrid
Publication venue: Dagstuhl Seminar Proceedings. 10281 - Dynamically Reconfigurable Architectures
Publication date: 01/01/2010
Field of study

From 11.07.10 to 16.07.10, Dagstuhl Seminar 10281 ``Dynamically Reconfigurable Architectures \u27\u27 was held in Schloss Dagstuhl~--~Leibniz Center for Informatics. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar as well as abstracts of seminar results and ideas are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided, if available

Dagstuhl Research Online Publication Server

The Chameleon Architecture for Streaming DSP Applications

Author: Burgwal Marcel D. van de
Heysters Paul M.
Hölzenspies Philip K.F.
Kokkeler André B.J.
Smit Gerard J.M.
Wolkotte Pascal T.
Publication venue: Hindawi Publishing Corporation
Publication date: 01/01/2007
Field of study

We focus on architectures for streaming DSP applications such as wireless baseband processing and image processing. We aim at a single generic architecture that is capable of dealing with different DSP applications. This architecture has to be energy efficient and fault tolerant. We introduce a heterogeneous tiled architecture and present the details of a domain-specific reconfigurable tile processor called Montium. This reconfigurable processor has a small footprint (1.8 mm

^2

in a 130 nm process), is power efficient and exploits the locality of reference principle. Reconfiguring the device is very fast, for example, loading the coefficients for a 200 tap FIR filter is done within 80 clock cycles. The tiles on the tiled architecture are connected to a Network-on-Chip (NoC) via a network interface (NI). Two NoCs have been developed: a packet-switched and a circuit-switched version. Both provide two types of services: guaranteed throughput (GT) and best effort (BE). For both NoCs estimates of power consumption are presented. The NI synchronizes data transfers, configures and starts/stops the tile processor. For dynamically mapping applications onto the tiled architecture, we introduce a run-time mapping tool

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

University of Twente Research Information

A hardware-software codesign framework for cellular computing

Author: Mudry Pierre-André
Publication venue: Lausanne, EPFL
Publication date: 29/01/2009
Field of study

Until recently, the ever-increasing demand of computing power has been met on one hand by increasing the operating frequency of processors and on the other hand by designing architectures capable of exploiting parallelism at the instruction level through hardware mechanisms such as super-scalar execution. However, both these approaches seem to have reached a plateau, mainly due to issues related to design complexity and cost-effectiveness. To face the stabilization of performance of single-threaded processors, the current trend in processor design seems to favor a switch to coarser-grain parallelization, typically at the thread level. In other words, high computational power is achieved not only by a single, very fast and very complex processor, but through the parallel operation of several processors, each executing a different thread. Extrapolating this trend to take into account the vast amount of on-chip hardware resources that will be available in the next few decades (either through further shrinkage of silicon fabrication processes or by the introduction of molecular-scale devices), together with the predicted features of such devices (e.g., the impossibility of global synchronization or higher failure rates), it seems reasonable to foretell that current design techniques will not be able to cope with the requirements of next-generation electronic devices and that novel design tools and programming methods will have to be devised. A tempting source of inspiration to solve the problems implied by a massively parallel organization and inherently error-prone substrates is biology. In fact, living beings possess characteristics, such as robustness to damage and self-organization, which were shown in previous research as interesting to be implemented in hardware. For instance, it was possible to realize relatively simple systems, such as a self-repairing watch. Overall, these bio-inspired approaches seem very promising but their interest for a wider audience is problematic because their heavily hardware-oriented designs lack some of the flexibility achievable with a general purpose processor. In the context of this thesis, we will introduce a processor-grade processing element at the heart of a bio-inspired hardware system. This processor, based on a single-instruction, features some key properties that allow it to maintain the versatility required by the implementation of bio-inspired mechanisms and to realize general computation. We will also demonstrate that the flexibility of such a processor enables it to be evolved so it can be tailored to different types of applications. In the second half of this thesis, we will analyze how the implementation of a large number of these processors can be used on a hardware platform to explore various bio-inspired mechanisms. Based on an extensible platform of many FPGAs, configured as a networked structure of processors, the hardware part of this computing framework is backed by an open library of software components that provides primitives for efficient inter-processor communication and distributed computation. We will show that this dual software–hardware approach allows a very quick exploration of different ways to solve computational problems using bio-inspired techniques. In addition, we also show that the flexibility of our approach allows it to exploit replication as a solution to issues that concern standard embedded applications

Infoscience - École polytechnique fédérale de Lausanne