958 research outputs found

    Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review

    Get PDF
    The technological evolution has increased the number of transistors for a given die area significantly and increased the switching speed from few MHz to GHz range. Such inversely proportional decline in size and boost in performance consequently demands shrinking of supply voltage and effective power dissipation in chips with millions of transistors. This has triggered substantial amount of research in power reduction techniques into almost every aspect of the chip and particularly the processor cores contained in the chip. This paper presents an overview of techniques for achieving the power efficiency mainly at the processor core level but also visits related domains such as buses and memories. There are various processor parameters and features such as supply voltage, clock frequency, cache and pipelining which can be optimized to reduce the power consumption of the processor. This paper discusses various ways in which these parameters can be optimized. Also, emerging power efficient processor architectures are overviewed and research activities are discussed which should help reader identify how these factors in a processor contribute to power consumption. Some of these concepts have been already established whereas others are still active research areas. © 2009 ACADEMY PUBLISHER

    The Design of a System Architecture for Mobile Multimedia Computers

    Get PDF
    This chapter discusses the system architecture of a portable computer, called Mobile Digital Companion, which provides support for handling multimedia applications energy efficiently. Because battery life is limited and battery weight is an important factor for the size and the weight of the Mobile Digital Companion, energy management plays a crucial role in the architecture. As the Companion must remain usable in a variety of environments, it has to be flexible and adaptable to various operating conditions. The Mobile Digital Companion has an unconventional architecture that saves energy by using system decomposition at different levels of the architecture and exploits locality of reference with dedicated, optimised modules. The approach is based on dedicated functionality and the extensive use of energy reduction techniques at all levels of system design. The system has an architecture with a general-purpose processor accompanied by a set of heterogeneous autonomous programmable modules, each providing an energy efficient implementation of dedicated tasks. A reconfigurable internal communication network switch exploits locality of reference and eliminates wasteful data copies

    Resource management and application customization for hardware accelerated systems

    Get PDF
    Computational demands are continuously increasing, driven by the growing resource demands of applications. At the era of big-data, big-scale applications, and real-time applications, there is an enormous need for quick processing of big amounts of data. To meet these demands, computer systems have shifted towards multi-core solutions. Technology scaling has allowed the incorporation of even larger numbers of transistors and cores into chips. Nevertheless, area constrains, power consumption limitations, and thermal dissipation limit the ability to design and sustain ever increasing chips. To overpassthese limitations, system designers have turned towards the usage of hardware accelerators. These accelerators can take the form of modules attached to each core of a multi-core system, forming a network on chip of cores with attached accelerators. Another option of hardware accelerators are Graphics Processing Units (GPUs). GPUs can be connected through a host-device model with a general purpose system, and are used to off-load parts of a workload to them. Additionally, accelerators can be functionality dedicated units. They can be part of a chip and the main processor can offload specific workloads to the hardware accelerator unit.In this dissertation we present: (a) a microcoded synchronization mechanism for systems with hardware accelerators that provide distributed shared memory, (b) a Streaming Multiprocessor (SM) allocation policy for single application execution on GPUs, (c) an SM allocation policy for concurrent applications that execute on GPUs, and (d) a framework to map neural network (NN) weights to approximate multiplier accuracy levels. Theaforementioned mechanisms coexist in the resource management domain. Specifically, the methodologies introduce ways to boost system performance by using hardware accelerators. In tandem with improved performance, the methodologies explore and balance trade-offs that the use of hardware accelerators introduce

    Energy-Efficient Neural Network Architectures

    Full text link
    Emerging systems for artificial intelligence (AI) are expected to rely on deep neural networks (DNNs) to achieve high accuracy for a broad variety of applications, including computer vision, robotics, and speech recognition. Due to the rapid growth of network size and depth, however, DNNs typically result in high computational costs and introduce considerable power and performance overheads. Dedicated chip architectures that implement DNNs with high energy efficiency are essential for adding intelligence to interactive edge devices, enabling them to complete increasingly sophisticated tasks by extending battery lie. They are also vital for improving performance in cloud servers that support demanding AI computations. This dissertation focuses on architectures and circuit technologies for designing energy-efficient neural network accelerators. First, a deep-learning processor is presented for achieving ultra-low power operation. Using a heterogeneous architecture that includes a low-power always-on front-end and a selectively-enabled high-performance back-end, the processor dynamically adjusts computational resources at runtime to support conditional execution in neural networks and meet performance targets with increased energy efficiency. Featuring a reconfigurable datapath and a memory architecture optimized for energy efficiency, the processor supports multilevel dynamic activation of neural network segments, performing object detection tasks with 5.3x lower energy consumption in comparison with a static execution baseline. Fabricated in 40nm CMOS, the processor test-chip dissipates 0.23mW at 5.3 fps. It demonstrates energy scalability up to 28.6 TOPS/W and can be configured to run a variety of workloads, including severely power-constrained ones such as always-on monitoring in mobile applications. To further improve the energy efficiency of the proposed heterogeneous architecture, a new charge-recovery logic family, called zero-short-circuit current (ZSCC) logic, is proposed to decrease the power consumption of the always-on front-end. By relying on dedicated circuit topologies and a four-phase clocking scheme, ZSCC operates with significantly reduced short-circuit currents, realizing order-of-magnitude power savings at relatively low clock frequencies (in the order of a few MHz). The efficiency and applicability of ZSCC is demonstrated through an ANSI S1.11 1/3 octave filter bank chip for binaural hearing aids with two microphones per ear. Fabricated in a 65nm CMOS process, this charge-recovery chip consumes 13.8µW with a 1.75MHz clock frequency, achieving 9.7x power reduction per input in comparison with a 40nm monophonic single-input chip that represents the published state of the art. The ability of ZSCC to further increase the energy efficiency of the heterogeneous neural network architecture is demonstrated through the design and evaluation of a ZSCC-based front-end. Simulation results show 17x power reduction compared with a conventional static CMOS implementation of the same architecture.PHDElectrical and Computer EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/147614/1/hsiwu_1.pd

    MORA - an architecture and programming model for a resource efficient coarse grained reconfigurable processor

    Get PDF
    This paper presents an architecture and implementation details for MORA, a novel coarse grained reconfigurable processor for accelerating media processing applications. The MORA architecture involves a 2-D array of several such processors, to deliver low cost, high throughput performance in media processing applications. A distinguishing feature of the MORA architecture is the co-design of hardware architecture and low-level programming language throughout the design cycle. The implementation details for the single MORA processor, and benchmark evaluation using a cycle accurate simulator are presented

    Cross-Layer Automated Hardware Design for Accuracy-Configurable Approximate Computing

    Get PDF
    Approximate Computing trades off computation accuracy against performance or energy efficiency. It is a design paradigm that arose in the last decade as an answer to diminishing returns from Dennard\u27s scaling and a shift in the prominent workloads. A range of modern workloads, categorized mainly as recognition, mining, and synthesis, features an inherent tolerance to approximations. Their characteristics, such as redundancies in their input data and robust-to-noise algorithms, allow them to produce outputs of acceptable quality, despite an approximation in some of their computations. Approximate Computing leverages the application tolerance by relaxing the exactness in computation towards primary design goals of increasing performance or improving energy efficiency. Existing techniques span across the abstraction layers of computer systems where cross-layer techniques are shown to offer a larger design space and yield higher savings. Currently, the majority of the existing work aims at meeting a single accuracy. The extent of approximation tolerance, however, significantly varies with a change in input characteristics and applications. In this dissertation, methods and implementations are presented for cross-layer and automated design of accuracy-configurable Approximate Computing to maximally exploit the performance and energy benefits. In particular, this dissertation addresses the following challenges and introduces novel contributions: A main Approximate Computing category in hardware is to scale either voltage or frequency beyond the safe limits for power or performance benefits, respectively. The rationale is that timing errors would be gradual and for an initial range tolerable. This scaling enables a fine-grain accuracy-configurability by varying the timing error occurrence. However, conventional synthesis tools aim at meeting a single delay for all paths within the circuit. Subsequently, with voltage or frequency scaling, either all paths succeed, or a large number of paths fail simultaneously, with a steep increase in error rate and magnitude. This dissertation presents an automated method for minimizing path delays by individually constraining the primary outputs of combinational circuits. As a result, it reduces the number of failing paths and makes the timing errors significantly more gradual, and also rarer and smaller on average. Additionally, it reveals that delays can be significantly reduced towards the least significant bit (LSB) and allows operating at a higher frequency when small operands are computed. Precision scaling, i.e., reducing the representation of data and its accuracy is widely used in multiple abstraction layers in Approximate Computing. Reducing data precision also reduces the transistor toggles, and therefore the dynamic power consumption. Application and architecture level precision scaling results in using only LSBs of the circuit. Arithmetic circuits often have less complexity and logic depth in LSBs compared to most significant bits (MSB). To take advantage of this circuit property, a delay-altering synthesis methodology is proposed. The method finds energy-optimal delay values under configurable precision usage and assigns them to primary outputs used for different precisions. Thereby, it enables dynamic frequency-precision scalable circuits for energy efficiency. Within the hardware architecture, it is possible to instantiate multiple units with the same functionality with different fixed approximation levels, where each block benefits from having fewer transistors and also synthesis relaxations. These blocks can be selected dynamically and thus allow to configure the accuracy during runtime. Instantiating such approximate blocks can be a lower dynamic power but higher area and leakage cost alternative to the current state-of-the-art gating mechanisms which switch off a group of paths in the circuit to reduce the toggling activity. Jointly, instantiating multiple blocks and gating mechanisms produce a large design space of accuracy-configurable hardware, where energy-optimal solutions require a cross-layer search in architecture and circuit levels. To that end, an approximate hardware synthesis methodology is proposed with joint optimizations in architecture and circuit for dynamic accuracy scaling, and thereby it enables energy vs. area trade-offs
    • …
    corecore