
    Fast, predictable and low energy memory references through architecture-aware compilation

    The design of future high-performance embedded systems is hampered by two problems: first, the required hardware needs more energy than batteries can supply; second, current cache-based approaches for bridging the growing speed gap between processors and memories cannot guarantee predictable real-time behavior. This paper contributes to solving both problems by describing a comprehensive set of algorithms that can be applied at design time to maximally exploit scratch pad memories (SPMs). We show that, by establishing a strong link between the memory architecture and the compiler, both the energy consumption and the computed worst-case execution time (WCET) can be reduced by up to 80% and 48%, respectively.
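
    As a rough illustration of the allocation decision such architecture-aware compilation has to make, the sketch below treats SPM allocation as a 0/1 knapsack over program objects with profiled sizes and energy benefits. It is only a minimal sketch, not the paper's algorithms, and the object names, sizes, and benefit values are hypothetical.

```python
# Minimal sketch: choose which program objects to place in a scratch pad memory
# (SPM) so that the estimated energy benefit is maximized without exceeding the
# SPM size. Plain 0/1-knapsack dynamic program; all numbers are illustrative.

def allocate_to_spm(objects, spm_size):
    """objects: list of (name, size_in_bytes, energy_benefit); returns (benefit, chosen)."""
    # best[c] = (best benefit achievable within c bytes of SPM, chosen object names)
    best = [(0.0, []) for _ in range(spm_size + 1)]
    for name, size, benefit in objects:
        for cap in range(spm_size, size - 1, -1):       # iterate capacity downwards (0/1 knapsack)
            cand = best[cap - size][0] + benefit
            if cand > best[cap][0]:
                best[cap] = (cand, best[cap - size][1] + [name])
    return best[spm_size]

# Hypothetical profile data: (object, size in bytes, estimated energy saved if SPM-resident)
objs = [("main_loop_code", 512, 9.0), ("coeff_table", 256, 4.5),
        ("stack_hot", 128, 3.0), ("rare_init_code", 768, 1.0)]
benefit, chosen = allocate_to_spm(objs, spm_size=1024)
print(chosen, benefit)
```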

    Design and Code Optimization for Systems with Next-generation Racetrack Memories

    With the rise of computationally expensive application domains such as machine learning, genomics, and fluid simulation, the quest for performance- and energy-efficient computing has gained unprecedented momentum. The significant increase in computing and memory devices in modern systems has resulted in an unsustainable surge in energy consumption, a substantial portion of which is attributed to the memory system. The scaling of conventional memory technologies and their suitability for next-generation systems are also questionable. This has led to the emergence and rise of nonvolatile memory (NVM) technologies. Today, several NVM technologies at different stages of development are competing for rapid market adoption. Racetrack memory (RTM) is one such nonvolatile memory technology that promises SRAM-comparable latency, reduced energy consumption, and unprecedented density compared to other technologies. However, RTM is sequential in nature: data in an RTM cell must be shifted to an access port before it can be accessed, and these shift operations incur performance and energy penalties. An ideal RTM, requiring at most one shift per access, can easily outperform SRAM; in the worst-case shifting scenario, however, RTM can be an order of magnitude slower than SRAM. This thesis presents an overview of RTM device physics, its evolution, strengths and challenges, and its application in the memory subsystem. We develop tools that allow the programming and modeling of RTM-based systems. To minimize shifts, we propose a set of techniques, including optimal, near-optimal, and evolutionary algorithms, for efficient scalar and instruction placement in RTMs. For array accesses, we explore schedule and layout transformations that eliminate the longer overhead shifts in RTMs. We present an automatic compilation framework that analyzes static control-flow programs and transforms the loop traversal order and memory layout to maximize accesses to consecutive RTM locations and minimize shifts. We develop a simulation framework called RTSim that models various RTM parameters and enables accurate architectural-level simulation. Finally, to demonstrate the RTM potential in non-von Neumann in-memory computing paradigms, we exploit its device attributes to implement logic and arithmetic operations. As a concrete use case, we implement an entire hyperdimensional computing framework in RTM to accelerate the language recognition problem. Our evaluation shows considerable performance and energy improvements compared to conventional von Neumann models and state-of-the-art accelerators.
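
    As a rough illustration of why data placement governs shift cost in RTMs, the sketch below models a single track with one access port, counts shifts for a given placement and access trace, and compares a declaration-order placement against a simple frequency-based heuristic. The model, trace, and placements are illustrative assumptions, not the thesis's actual placement algorithms.

```python
# Toy RTM shift-cost model: a track holds variables in sequential cells and must
# be shifted so the requested cell lines up with the single access port, so the
# shift count depends on the distance between consecutive accesses.
from collections import Counter

def shift_cost(placement, trace):
    """Count shifts for a placement (variable -> cell index) over an access trace."""
    port = 0          # cell currently aligned with the access port
    shifts = 0
    for var in trace:
        cell = placement[var]
        shifts += abs(cell - port)   # shift the track until `cell` reaches the port
        port = cell
    return shifts

trace = ["a", "b", "a", "c", "a", "b", "d"]          # hypothetical access sequence

# Declaration-order placement vs. a simple "hot variables first" heuristic.
declaration_order = {"c": 0, "d": 1, "b": 2, "a": 3}
freq = Counter(trace)
by_frequency = {v: i for i, (v, _) in enumerate(freq.most_common())}

print("declaration-order shifts:", shift_cost(declaration_order, trace))
print("frequency-based shifts:  ", shift_cost(by_frequency, trace))
```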

    Compiler-directed energy reduction using dynamic voltage scaling and voltage islands for embedded systems

    Addressing power and energy consumption related issues early in the system design flow ensures good design and minimizes iterations for a faster turnaround time. In particular, optimizations at the software level, e.g., those supported by compilers, are very important for minimizing the energy consumption of embedded applications. Recent research demonstrates that voltage islands provide the flexibility to reduce power by selectively shutting down different regions of the chip and/or running selected parts of the chip at different voltage/frequency levels. Unlike most prior work on voltage islands, which has mainly focused on architecture design and IP placement issues, this paper studies the compiler support necessary for voltage islands. Specifically, we focus on an embedded multiprocessor architecture that supports both voltage islands and control domains within these islands, and determine how an optimizing compiler can automatically map an embedded application onto this architecture. Such automated support is critical, since it is unrealistic to expect an application programmer to arrive at a good mapping while balancing multiple factors such as performance and energy at the same time. Our experiments with the proposed compiler support show that our approach is very effective in reducing energy consumption. The experiments also show that the energy savings we achieve are consistent across a wide range of values of our major simulation parameters.
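
    To give a flavor of the energy/performance trade-off such compiler support navigates, the sketch below brute-forces a voltage/frequency level per task so that a deadline is met at minimal dynamic energy, modelled roughly as cycles times V squared. It is a deliberately simplified single-processor illustration with hypothetical operating points, tasks, and deadline, not the paper's multiprocessor voltage-island mapping.

```python
# Minimal sketch of a DVS-style decision: pick an operating point per task so the
# deadline is met while dynamic energy (~ cycles * V^2) is minimized. Serial
# execution on one core is assumed; all numbers are made up for illustration.
import itertools

LEVELS = [(1.2, 600), (1.0, 400), (0.8, 200)]   # hypothetical (voltage V, frequency MHz) points
TASKS = {"fft": 120, "filter": 80, "encode": 200}   # task -> cycle count in millions (profiled)
DEADLINE_MS = 1500.0

def energy(cycles_m, volt):            # relative dynamic energy: C * V^2
    return cycles_m * volt ** 2

def runtime_ms(cycles_m, freq_mhz):    # (1e6 * cycles_m) / (1e6 * freq_mhz) seconds, in ms
    return cycles_m / freq_mhz * 1000.0

best = None
for choice in itertools.product(LEVELS, repeat=len(TASKS)):
    total_t = sum(runtime_ms(c, f) for (v, f), c in zip(choice, TASKS.values()))
    total_e = sum(energy(c, v) for (v, f), c in zip(choice, TASKS.values()))
    if total_t <= DEADLINE_MS and (best is None or total_e < best[0]):
        best = (total_e, dict(zip(TASKS, choice)))

print(best)   # lowest-energy assignment of (voltage, frequency) per task that meets the deadline
```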

    Optimizing the flash-RAM energy trade-off in deeply embedded systems

    Deeply embedded systems often have the tightest constraints on energy consumption, requiring that they consume tiny amounts of current and run on batteries for years. However, they typically execute code directly from flash instead of the more energy-efficient RAM. We implement a novel compiler optimization that exploits the relative efficiency of RAM by statically moving carefully selected basic blocks from flash to RAM. Our technique uses integer linear programming with an energy cost model to select a good set of basic blocks to place into RAM without impacting stack or data storage. We evaluate our optimization on a common ARM microcontroller and succeed in reducing average power consumption by up to 41% and energy consumption by up to 22%, at the cost of increased execution time. A case study is presented in which an application executes code and then sleeps for a period of time. For this example we show that our optimization could allow the application to run on battery for up to 32% longer. We also show that, for this scenario, the total application energy can be reduced even though the optimization increases the execution time of the code.
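
    The selection step can be pictured as a small ILP: one binary variable per basic block, an energy-benefit objective, and a RAM-capacity constraint. The sketch below, which uses the third-party PuLP package, only illustrates that shape; the block names, sizes, and benefit values are made up, and the paper's cost model is considerably richer.

```python
# Simplified ILP sketch of flash-to-RAM basic block placement: maximize modelled
# energy benefit subject to a RAM budget. All values below are hypothetical.
import pulp

blocks = {                       # name: (size in bytes, modelled energy benefit)
    "bb_loop_body": (96, 120.0),
    "bb_isr_handler": (64, 45.0),
    "bb_init": (200, 2.0),
    "bb_crc_inner": (48, 80.0),
}
RAM_BUDGET = 160                 # bytes of RAM the optimization may claim

prob = pulp.LpProblem("flash_to_ram_placement", pulp.LpMaximize)
x = {b: pulp.LpVariable(f"in_ram_{b}", cat="Binary") for b in blocks}

prob += pulp.lpSum(blocks[b][1] * x[b] for b in blocks)                  # objective: total benefit
prob += pulp.lpSum(blocks[b][0] * x[b] for b in blocks) <= RAM_BUDGET    # capacity constraint

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([b for b in blocks if x[b].value() > 0.5])    # blocks chosen for RAM
```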

    ILP-based energy minimization techniques for banked memories

    Main memories can consume a significant portion of overall energy in many data-intensive embedded applications. One way of reducing this energy consumption is banking, that is, dividing the available memory space into multiple banks and placing unused (idle) memory banks into low-power operating modes. Prior work investigated approaches based on code restructuring and data-layout reorganization for increasing the energy benefits that can be obtained from a banked memory architecture. This article explores different techniques that can coexist within the same optimization framework for maximizing the benefits of low-power operating modes. These techniques include employing nonuniform bank sizes, data migration, data compression, and data replication. By using these techniques, we try to increase the chances of utilizing low-power operating modes more effectively, and achieve further energy savings over what could be achieved by exploiting low-power modes alone. Specifically, nonuniform banking tries to match bank sizes with application data access patterns. The goal of data migration is to cluster data with similar access patterns in the same set of banks. Data compression reduces the size of the data used by an application, and thus helps reduce the number of memory banks occupied by data. Finally, data replication increases bank idleness by duplicating select read-only data blocks across banks. We formulate each of these techniques as an integer linear programming (ILP) problem and solve them using a commercial solver. Our experimental analysis using several benchmarks indicates that all the techniques presented in this framework are successful in reducing memory energy consumption. Based on our experience with these techniques, we recommend that compiler writers targeting banked memories consider data compression, replication, and migration. © 2008 ACM
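
    A minimal sketch of the underlying low-power-mode decision these techniques try to enable is shown below: for each idle period of a bank, enter the deepest mode whose wake-up penalty is still amortized by the idle time. The mode table, break-even thresholds, and idle-period trace are illustrative assumptions, not values or formulations from the article.

```python
# Toy model of banked-memory power management: longer idle periods allow deeper
# (cheaper) low-power modes, which is why the article's techniques try to
# lengthen bank idleness. All parameters below are invented for illustration.

# (mode name, relative power vs. active, minimum idle cycles to be worthwhile)
MODES = [("power_down", 0.05, 5000), ("napping", 0.30, 500), ("standby", 0.70, 50)]
ACTIVE_IDLE_POWER = 1.0            # relative power if the idle bank stays active

def pick_mode(idle_cycles):
    for name, power, break_even in MODES:      # deepest mode first
        if idle_cycles >= break_even:
            return name, power
    return "active", ACTIVE_IDLE_POWER

idle_periods = [120, 8000, 40, 900, 26000]     # hypothetical idle gaps of one bank (cycles)
baseline = sum(p * ACTIVE_IDLE_POWER for p in idle_periods)
managed = 0.0
for p in idle_periods:
    mode, power = pick_mode(p)
    managed += p * power
    print(f"idle {p:6d} cycles -> {mode}")
print(f"idle energy reduced to {managed / baseline:.0%} of baseline")
```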

    A Construction Kit for Efficient Low Power Neural Network Accelerator Designs

    Implementing embedded neural network processing at the edge requires efficient hardware acceleration that couples high computational performance with low power consumption. Driven by the rapid evolution of network architectures and their algorithmic features, accelerator designs are constantly updated and improved. To evaluate and compare hardware design choices, designers can refer to a myriad of accelerator implementations in the literature. Surveys provide an overview of these works but are often limited to system-level and benchmark-specific performance metrics, making it difficult to quantitatively compare the individual effect of each optimization technique. This complicates the evaluation of optimizations for new accelerator designs and slows down research progress. This work provides a survey of neural network accelerator optimization approaches that have been used in recent works and reports their individual effects on edge processing performance. It presents the list of optimizations and their quantitative effects as a construction kit, allowing designers to assess the design choices for each building block separately. Reported optimizations range from memory savings of up to 10,000x to energy reductions of up to 33x, giving chip designers an overview of design choices for implementing efficient low-power neural network accelerators.

    WCET-Aware Scratchpad Memory Management for Hard Real-Time Systems

    Cyber-physical systems and hard real-time systems have strict timing constraints that specify deadlines by which tasks must finish their execution. Missing a deadline can cause unexpected outcomes or endanger human lives in safety-critical applications, such as automotive or aeronautical systems. It is, therefore, of utmost importance to obtain and optimize a safe upper bound on each task's execution time, the worst-case execution time (WCET), to guarantee the absence of any missed deadline. Unfortunately, conventional microarchitectural components, such as caches and branch predictors, are optimized only for average-case performance and often make WCET analysis complicated and pessimistic. Caches especially have a large impact on worst-case performance due to the expensive off-chip memory accesses involved in cache miss handling. In this regard, software-controlled scratchpad memories (SPMs) have become a promising alternative to caches. An SPM is a raw SRAM, controlled only by explicitly executing data movement instructions at runtime, and such explicit control facilitates static analyses that obtain safe and tight upper bounds on WCETs. SPM management techniques, used in compilers targeting an SPM-based processor, determine how to use a given SPM space by deciding where to insert data movement instructions and what operations to perform at those program locations. This dissertation presents several management techniques for program code and stack data that aim to optimize the WCET of a given program. The proposed code management techniques include optimal allocation algorithms and a polynomial-time heuristic for allocating functions to the SPM space, with or without the use of an abstraction of SPM regions, and a heuristic for splitting functions into smaller partitions. The proposed stack data management technique, on the other hand, finds an optimal set of program locations at which to evict and restore stack frames to avoid stack overflows when the call stack resides in a size-limited SPM. In the evaluation, the WCETs of various benchmarks, including real-world automotive applications, are statically calculated for SPMs and caches in several different memory configurations.
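
    The stack-data side of the problem can be illustrated with a toy trace: frames live in a size-limited SPM region, and evict/restore operations must be placed so the resident stack never overflows. The sketch below uses a simple evict-the-oldest-frame policy with hypothetical frame sizes and a hypothetical call sequence; the dissertation instead computes an optimal set of eviction and restore points.

```python
# Toy SPM stack management: walk a call/return trace, keep frames in a
# size-limited SPM region, and show where evict/restore operations are needed.
SPM_STACK_BYTES = 256
FRAME = {"main": 64, "parse": 96, "filter": 128, "emit": 80}   # hypothetical frame sizes

resident = []          # frames currently in the SPM stack region, oldest first
evicted = []           # outermost frames spilled to main memory
used = 0

def call(fn):
    global used
    need = FRAME[fn]
    while used + need > SPM_STACK_BYTES:      # make room by evicting the oldest frame
        victim = resident.pop(0)
        evicted.append(victim)
        used -= FRAME[victim]
        print(f"  evict {victim} before calling {fn}")
    resident.append(fn)
    used += need

def ret():
    global used
    fn = resident.pop()
    used -= FRAME[fn]
    # With the evict-oldest policy, the caller is either the frame below in the
    # SPM or, if none remain resident, the most recently evicted frame.
    if not resident and evicted:
        caller = evicted.pop()
        resident.append(caller)
        used += FRAME[caller]
        print(f"  restore {caller} after returning from {fn}")

for op, fn in [("call", "main"), ("call", "parse"), ("call", "filter"),
               ("ret", "filter"), ("call", "emit"), ("ret", "emit"),
               ("ret", "parse"), ("ret", "main")]:
    print(op, fn)
    call(fn) if op == "call" else ret()
```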

    Heap Data Allocation to Scratch-Pad Memory in Embedded Systems

    This thesis presents the first-ever compile-time method for allocating a portion of a program's dynamic data to scratch-pad memory. A scratch-pad is a fast, directly addressed, compiler-managed SRAM that replaces the hardware-managed cache. It is motivated by its better real-time guarantees compared to caches and by its significantly lower overheads in access time, energy consumption, area, and overall runtime. Dynamic data refers to all objects allocated at run-time in a program, as opposed to static data objects, which are allocated at compile-time. Existing compiler methods for allocating data to scratch-pad can place only code, global, and stack data (static data) in scratch-pad memory; heap and recursive-function objects (dynamic data) are allocated entirely in DRAM, resulting in poor performance for these dynamic data types. Runtime methods based on software caching can place data in scratch-pad, but because of their high overheads from software address translation, they have not been successful, especially for dynamic data. In this thesis we present a dynamic yet compiler-directed allocation method for dynamic data that, for the first time, (i) is able to place a portion of the dynamic data in scratch-pad; (ii) has no software-caching tags; (iii) requires no extra per-access run-time address translation; and (iv) is able to move dynamic data back and forth between scratch-pad and DRAM to better track the program's locality characteristics. With our method, code, global, stack, and heap variables can share the same scratch-pad. Compared to placing all dynamic data in DRAM and only static data in scratch-pad, our results show that our method reduces the average runtime of our benchmarks by 22.3% and the average power consumption by 26.7%, for a scratch-pad size fixed at 5% of the total data size. Significant savings in runtime and energy across a large number of benchmarks were also observed when compared against cache memory organizations, showing our method's success under constrained SRAM sizes when dealing with dynamic data. Lastly, through careful analysis of static and dynamic profile information, our method minimizes the profile-dependence issues that plague all similar allocation methods.
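
    A minimal sketch of the compiler-directed idea, under assumed profile data, is shown below: heap allocation sites marked hot at compile time are served from a fixed scratch-pad pool with no per-access tags or address translation, while cold sites and overflow go to DRAM. The site names, sizes, and pool size are hypothetical, and the thesis additionally migrates objects between scratch-pad and DRAM at runtime.

```python
# Toy compiler-directed heap placement: hot allocation sites (chosen from profiles
# at compile time) draw from a scratch-pad pool; everything else falls back to DRAM.
SPM_POOL_BYTES = 1024
HOT_SITES = {"packet_buf", "fft_window"}       # hypothetical hot allocation sites
spm_used = 0

def heap_alloc(site, size):
    """Decide where an object allocated at `site` is placed."""
    global spm_used
    if site in HOT_SITES and spm_used + size <= SPM_POOL_BYTES:
        spm_used += size                        # reserve space in the scratch-pad pool
        return "SPM"
    return "DRAM"                               # cold site, or scratch-pad pool exhausted

for site, size in [("packet_buf", 512), ("log_entry", 256),
                   ("fft_window", 400), ("packet_buf", 512)]:
    print(f"{site:>11s} ({size:4d} B) -> {heap_alloc(site, size)}")
```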

    Software caching techniques and hardware optimizations for on-chip local memories

    Despite the fact that caches are the most viable option for L1 memories in processors, on-chip local memories have recently received considerable attention. Local memories are an interesting design option due to their many benefits: lower area occupancy, reduced energy consumption, and fast, constant access time. These benefits are especially interesting for the design of modern multicore processors, since power and latency are important assets in computer architecture today. Local memories also do not generate coherence traffic, which is important for the scalability of multicore systems. Unfortunately, local memories have not yet been widely adopted in modern processors, mainly due to their poor programmability. Systems with on-chip local memories have no hardware support for transparent data transfers between local and global memories, so ease of programming is one of the main impediments to the broad acceptance of these systems. This thesis addresses software and hardware optimizations regarding the programmability and usage of on-chip local memories in the context of both single-core and multicore systems. The software optimizations concern software caching techniques. A software cache is a robust approach to providing the user with a transparent view of the memory architecture, but it can suffer from poor performance. In this thesis, we start by optimizing a traditional software cache, proposing a hierarchical, hybrid software-cache architecture. Afterwards, we develop several optimizations to speed up our hybrid software cache as much as possible. As a result of these software optimizations, our hybrid software cache performs 4 to 10 times faster than a traditional software cache on a set of NAS parallel benchmarks. We do not stop at software caching: to further improve the performance of architectures with on-chip local memories, we also examine other aspects, such as the quality of the generated code and its relation to the quality of buffer management in local memories. We pursue this line of research until we reach the limits of software and start proposing optimizations at the hardware level. Two hardware proposals are presented in this thesis: one relaxes the alignment constraints imposed by architectures with on-chip local memories, and the other accelerates the management of local memories by providing hardware support for the majority of actions performed in our software cache.
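
    A minimal sketch of the baseline such a design improves on, a direct-mapped software cache for a local memory, is shown below: each access performs a software tag check and, on a miss, copies a whole line from global memory into the local buffer. The line size, line count, and access pattern are toy values; the thesis's hierarchical, hybrid design is considerably more efficient per access.

```python
# Toy direct-mapped software cache for a local memory: software tag check on
# every access, whole-line fill from "global" memory on a miss.
LINE_WORDS = 4
NUM_LINES = 8

global_mem = list(range(1024))                        # backing "global" memory
lines = [[0] * LINE_WORDS for _ in range(NUM_LINES)]  # local-store cache lines
tags = [None] * NUM_LINES
misses = 0

def sw_cache_read(addr):
    global misses
    line_no = addr // LINE_WORDS
    idx = line_no % NUM_LINES                          # direct-mapped index
    if tags[idx] != line_no:                           # software tag check
        misses += 1
        base = line_no * LINE_WORDS
        lines[idx] = global_mem[base:base + LINE_WORDS]  # copy the line in (stands in for a DMA)
        tags[idx] = line_no
    return lines[idx][addr % LINE_WORDS]

total = sum(sw_cache_read(a) for a in range(0, 64))    # sequential sweep over 64 words
print("sum:", total, "misses:", misses)                # 16 misses: one per line
```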