309 research outputs found

    Improvement Energy Efficiency for a Hybrid Multibank Memory in Energy Critical Applications

    A high-performance, low-power multiprocessor/multibank memory system requires a compiler that provides efficient data partitioning and mapping procedures. This paper introduces two compiler techniques for mapping data to multibank memory, since data mapping is still an open problem in need of a better solution. The multibank memory can consist of volatile and non-volatile components to support ultra-low-power wearable devices; such a hybrid memory system makes mapping data onto it considerably more complex. To solve this mapping problem efficiently, we formulate it as a simple decision problem and, based on that definition, propose two efficient algorithms that determine the placement of data in the multibank memory. The proposed techniques account for the defining characteristic of non-volatile memory: although it offers ultra-low operating power and nearly zero leakage current, its write operations consume more energy than those of volatile memory. The proposed technique mitigates this drawback through efficient data placement and the hybrid memory architecture. Experimental results show that the proposed techniques improve energy savings by up to 59.5% for the hybrid multibank memory architecture.
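    The abstract does not spell out the two algorithms, so the sketch below only illustrates the general idea it describes: write-heavy data is steered toward volatile banks because NVM writes are the expensive operation. The bank parameters, energy costs, and greedy heuristic are all assumptions made for the example, not the paper's method.

```python
# Hypothetical greedy placement for a hybrid volatile/NVM multibank
# memory: NVM offers near-zero leakage but costly writes, so write-heavy
# objects are steered toward volatile (SRAM) banks.
from dataclasses import dataclass

@dataclass
class Bank:
    name: str
    capacity: int          # bytes
    is_nvm: bool
    used: int = 0

# Invented per-access energy costs (nJ); real values are device-specific.
READ_COST = {False: 0.5, True: 0.4}    # volatile vs. NVM reads
WRITE_COST = {False: 0.5, True: 2.0}   # NVM writes are the expensive case

def access_energy(obj, bank):
    return (obj["reads"] * READ_COST[bank.is_nvm]
            + obj["writes"] * WRITE_COST[bank.is_nvm])

def place(objects, banks):
    """Greedy heuristic: most-accessed objects first, each into the
    feasible bank with the lowest total access energy."""
    placement = {}
    for obj in sorted(objects, key=lambda o: o["reads"] + o["writes"],
                      reverse=True):
        fits = [b for b in banks if b.used + obj["size"] <= b.capacity]
        best = min(fits, key=lambda b: access_energy(obj, b))
        best.used += obj["size"]
        placement[obj["name"]] = best.name
    return placement

banks = [Bank("sram0", 4096, is_nvm=False), Bank("nvm0", 16384, is_nvm=True)]
objects = [
    {"name": "fir_coeffs", "size": 512, "reads": 9000, "writes": 10},
    {"name": "sample_buf", "size": 2048, "reads": 4000, "writes": 4000},
]
print(place(objects, banks))   # read-mostly data to NVM, write-heavy to SRAM
```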

    PYDAC: A DISTRIBUTED RUNTIME SYSTEM AND PROGRAMMING MODEL FOR A HETEROGENEOUS MANY-CORE ARCHITECTURE

    Heterogeneous many-core architectures that consist of big, fast cores and small, energy-efficient cores are very promising for future high-performance computing (HPC) systems. These architectures offer a good balance between single-threaded performance and multithreaded throughput. Such systems impose challenges on the design of the programming model and runtime system. Specifically, these challenges include (a) how to fully utilize the chip's performance, (b) how to manage heterogeneous, unreliable hardware resources, and (c) how to generate and manage a large number of parallel tasks. This dissertation proposes and evaluates a Python-based programming framework called PyDac. PyDac supports a two-level programming model. At the high level, a programmer creates a very large number of tasks using the divide-and-conquer strategy. At the low level, tasks are written in an imperative programming style. The runtime system seamlessly manages the parallel tasks, system resilience, and inter-task communication with architectural support. PyDac has been implemented both on a field-programmable gate array (FPGA) emulation of an unconventional heterogeneous architecture and on a conventional multicore microprocessor. To evaluate the performance, resilience, and programmability of the proposed system, several micro-benchmarks were developed. We found that (a) PyDac abstracts away task communication and achieves programmability, (b) the micro-benchmarks scale on the hardware prototype, although (predictably) serial operations limit some of them, and (c) the trade-off between degree of protection and speed in redundant threading can be varied transparently to programmers.
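    PyDac's actual API is not given in the abstract; the following plain-Python sketch merely illustrates the two-level model it describes, with concurrent.futures standing in for PyDac's runtime-managed task pool. The threshold and function names are illustrative.

```python
# Plain-Python stand-in for the two-level model: divide-and-conquer task
# generation at the high level, imperative leaf tasks at the low level.
from concurrent.futures import ThreadPoolExecutor

THRESHOLD = 1_000   # below this size a chunk becomes one leaf task

def leaf_sum(chunk):
    # Low level: ordinary imperative code, executed as a single task.
    total = 0
    for x in chunk:
        total += x
    return total

def divide(data):
    # High level: recursively divide the problem until leaf-sized.
    if len(data) <= THRESHOLD:
        return [data]
    mid = len(data) // 2
    return divide(data[:mid]) + divide(data[mid:])

with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = divide(list(range(10_000)))
    # The pool schedules leaf tasks; results are combined by the parent.
    print(sum(pool.map(leaf_sum, chunks)))   # 49995000
```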

    Dynamic Binary Translation for Embedded Systems with Scratchpad Memory

    Embedded software development has recently changed with advances in computing. Rather than fully co-designing software and hardware to perform a relatively simple task, nowadays embedded and mobile devices are designed as platforms where multiple applications can be run, new applications can be added, and existing applications can be updated. In this scenario, traditional constraints in embedded systems design (i.e., performance, memory and energy consumption, and real-time guarantees) are more difficult to address. New concerns (e.g., security) have become important and increase software complexity as well. In general-purpose systems, Dynamic Binary Translation (DBT) has been used to address these issues with services such as Just-In-Time (JIT) compilation, dynamic optimization, virtualization, power management, and code security. In embedded systems, however, DBT is not usually employed due to its performance, memory, and power overhead. This dissertation presents StrataX, a low-overhead DBT framework for embedded systems. StrataX addresses the challenges faced by DBT in embedded systems using novel techniques. To reduce DBT overhead, StrataX loads code from NAND-Flash storage and translates it into a Scratchpad Memory (SPM), a software-managed on-chip SRAM with limited capacity. SPM has access latency similar to a hardware cache but consumes less power and chip area. StrataX manages the SPM as a software instruction cache, and employs victim compression and pinning to reduce retranslation cost and capture frequently executed code in the SPM. To prevent performance loss due to excessive code expansion, StrataX minimizes the amount of code inserted by DBT to maintain control of program execution. When a hardware instruction cache is available, StrataX dynamically partitions translated code between the SPM and main memory. With these techniques, StrataX has low performance overhead relative to native execution for MiBench programs. Further, it simplifies embedded software and hardware design by operating transparently to applications, without any special hardware support. StrataX achieves sufficiently low overhead to make it feasible to use DBT in embedded systems to address important design goals and requirements.
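    As a rough illustration of a software instruction cache with pinning and victim compression, here is an invented sketch; StrataX's real policies operate on translated code fragments and are more involved, and all thresholds and data structures below are assumptions.

```python
# Invented sketch of an SPM-resident translated-code cache with pinning
# and compressed victims (decompressing a victim beats retranslating it).
import zlib
from collections import OrderedDict

class SpmCodeCache:
    def __init__(self, capacity, pin_threshold=8):
        self.capacity = capacity      # SPM bytes reserved for code
        self.used = 0
        self.live = OrderedDict()     # fragment addr -> translated bytes (LRU)
        self.victims = {}             # fragment addr -> compressed translation
        self.exec_count = {}
        self.pinned = set()           # hot fragments, never evicted
        self.pin_threshold = pin_threshold

    def fetch(self, addr, translate):
        self.exec_count[addr] = self.exec_count.get(addr, 0) + 1
        if addr in self.live:                 # SPM hit
            self.live.move_to_end(addr)
        elif addr in self.victims:            # cheaper than retranslation
            self._install(addr, zlib.decompress(self.victims.pop(addr)))
        else:                                 # cold miss: translate from flash
            self._install(addr, translate(addr))
        if self.exec_count[addr] >= self.pin_threshold:
            self.pinned.add(addr)
        return self.live[addr]

    def _install(self, addr, code):
        # Evict LRU unpinned fragments, keeping them around in compressed
        # form (assumes the working set is never entirely pinned).
        while self.used + len(code) > self.capacity:
            victim, vcode = next((a, c) for a, c in self.live.items()
                                 if a not in self.pinned)
            del self.live[victim]
            self.used -= len(vcode)
            self.victims[victim] = zlib.compress(vcode)
        self.live[addr] = code
        self.used += len(code)

cache = SpmCodeCache(capacity=4096)
code = cache.fetch(0x8000, lambda addr: b"\x90" * 512)  # translated on first use
```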

    Volatile STT-RAM Scratchpad Design and Data Allocation for Low Energy

    On-chip power consumption is one of the fundamental challenges of current technology scaling. Cache memories consume a sizable part of this power, particularly due to leakage energy. STT-RAM is one of several new memory technologies that have been proposed to improve power consumption while preserving performance. It features high density and low leakage, but at the expense of write energy and performance. This article explores the use of STT-RAM-based scratchpad memories that trade nonvolatility for faster and less energetically expensive accesses, making them feasible for on-chip implementation in embedded systems. A novel multiretention scratchpad partitioning is proposed, featuring multiple storage spaces with different retention, energy, and performance characteristics. A customized compiler-based allocation algorithm suitable for use with such a scratchpad organization is described. Our experiments indicate that a multiretention STT-RAM scratchpad can provide energy savings of 53% with respect to an iso-area, hardware-managed SRAM cache.
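    A toy version of the allocation idea, assuming hypothetical retention regions: each variable is placed in the cheapest-to-write region whose retention still covers the variable's lifetime. All region parameters and variable profiles below are made up for the sketch, not taken from the article.

```python
# Invented multiretention regions: shorter retention means cheaper writes,
# so each value goes to the weakest region that still outlives it.
REGIONS = [
    # (name, retention in cycles, write energy in pJ, capacity in bytes)
    ("short",  10_000,    0.3, 2048),
    ("medium", 1_000_000, 0.6, 2048),
    ("long",   10**9,     1.0, 4096),
]

def allocate(variables):
    """variables: (name, size, lifetime_cycles) tuples from the compiler's
    liveness analysis; longest-lived variables are placed first."""
    free = {name: cap for name, _, _, cap in REGIONS}
    placement = {}
    for name, size, lifetime in sorted(variables, key=lambda v: -v[2]):
        for rname, retention, _, _ in REGIONS:
            if retention >= lifetime and free[rname] >= size:
                free[rname] -= size
                placement[name] = rname
                break
        else:
            placement[name] = "dram"   # spill when no region can hold it
    return placement

profile = [("loop_tmp", 256, 5_000), ("lut", 1024, 10**8), ("acc", 128, 500_000)]
print(allocate(profile))
# {'lut': 'long', 'acc': 'medium', 'loop_tmp': 'short'}
```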

    Using data compression for increasing memory system utilization

    The memory system presents one of the critical challenges in embedded system design and optimization. This is mainly due to the ever-increasing code complexity of embedded applications and the exponential increase in the amount of data they manipulate. The memory bottleneck is even more important for multiprocessor-system-on-a-chip (MPSoC) architectures due to the high cost of off-chip memory accesses in terms of both energy and performance. As a result, reducing the memory-space occupancy of embedded applications is very important and will be even more important in the next decade. While it is true that the on-chip memory capacity of embedded systems is continuously increasing, the increases in the complexity of embedded applications and the sizes of the data sets they process are far greater. Motivated by this observation, this paper presents and evaluates a compiler-driven approach to data compression for reducing memory-space occupancy. Our goal is to study how automated compiler support can help in deciding the set of data elements to compress/decompress and the points during execution at which these compressions/decompressions should be performed. We first study this problem in the context of single-core systems and then extend it to MPSoCs, where we schedule compressions and decompressions intelligently so that they conflict with application execution as little as possible. In particular, in MPSoCs one needs to decide which processors should participate in the compression and decompression activities at any given point during execution. We propose both static and dynamic algorithms for this purpose. In the static scheme, the processors are divided into two groups, those performing compression/decompression and those executing the application, and this grouping is maintained throughout the execution of the application. In the dynamic scheme, on the other hand, execution starts with some grouping, but the grouping can change during execution, depending on dynamic variations in the data access pattern. Our experimental results show that, in a single-core system, the proposed approach reduces maximum memory occupancy by 47.9% and average memory occupancy by 48.3% when averaged over all the benchmarks. Our results also indicate that, in an MPSoC, the average energy saving is 12.7% when all eight benchmarks are considered. While compressions, decompressions, and related bookkeeping activities take extra cycles and memory space and consume additional energy, we found that the improvements they bring in memory space, execution cycles, and energy are much higher than these overheads.
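    The dynamic scheme can be pictured as a feedback loop that resizes the helper group as the compression backlog changes. The sketch below is only a guess at the simplest such policy, with invented thresholds; the paper's actual algorithm is driven by data access patterns.

```python
# Guessed feedback policy for the dynamic scheme: grow the helper group
# (cores doing compression/decompression) when the request backlog grows,
# shrink it when the backlog drains. Thresholds are invented.
def regroup(n_procs, pending_jobs, helpers, low=2, high=8):
    """Return the new number of helper processors."""
    if pending_jobs > high and helpers < n_procs - 1:
        helpers += 1    # backlog growing: lend a core to compression
    elif pending_jobs < low and helpers > 0:
        helpers -= 1    # backlog drained: return a core to the application
    return helpers

helpers = 2             # the static scheme would keep this fixed forever
for pending in [1, 12, 15, 9, 3, 0]:
    helpers = regroup(8, pending, helpers)
    print(f"pending={pending:2d} -> helpers={helpers}")
```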

    Heap Data Allocation to Scratch-Pad Memory in Embedded Systems

    This thesis presents the first-ever compile-time method for allocating a portion of a program's dynamic data to scratch-pad memory. A scratch-pad is a fast, directly addressed, compiler-managed SRAM memory that replaces the hardware-managed cache. It is motivated by its better real-time guarantees versus caches and by its significantly lower overheads in access time, energy consumption, area, and overall runtime. Dynamic data refers to all objects allocated at run-time in a program, as opposed to static data objects, which are allocated at compile-time. Existing compiler methods for allocating data to scratch-pad are able to place only code, global, and stack data (static data) in scratch-pad memory; heap and recursive-function objects (dynamic data) are allocated entirely in DRAM, resulting in poor performance for these dynamic data types. Runtime methods based on software caching can place data in scratch-pad, but because of their high overheads from software address translation, they have not been successful, especially for dynamic data. In this thesis we present a dynamic yet compiler-directed allocation method for dynamic data that, for the first time, (i) is able to place a portion of the dynamic data in scratch-pad; (ii) has no software-caching tags; (iii) requires no run-time per-access extra address translation; and (iv) is able to move dynamic data back and forth between scratch-pad and DRAM to better track the program's locality characteristics. With our method, code, global, stack, and heap variables can share the same scratch-pad. When compared to placing all dynamic data variables in DRAM and only static data in scratch-pad, our results show that our method reduces the average runtime of our benchmarks by 22.3% and the average power consumption by 26.7%, for the same scratch-pad size fixed at 5% of total data size. Significant savings in runtime and energy across a large number of benchmarks were also observed when compared against cache memory organizations, showing our method's success under constrained SRAM sizes when dealing with dynamic data. Lastly, our method is able to minimize the profile-dependence issues that plague all similar allocation methods through careful analysis of static and dynamic profile information.
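    The sketch below only illustrates the flavor of the underlying compile-time decision: per program region, which profiled heap allocation sites deserve scratch-pad space, and which transfers are planned at region boundaries. Site names, sizes, and frequencies are invented; the thesis method additionally guarantees tag-free, translation-free accesses, which a Python sketch cannot show.

```python
# Invented illustration of the per-region decision: give SPM space to the
# most intensely accessed heap allocation sites and plan DRAM<->SPM
# transfers at region boundaries.
def plan_spm(regions, spm_bytes):
    """regions: {region: [(site, size, accesses_per_byte)]} from profiling."""
    plan, prev = {}, set()
    for region, sites in regions.items():
        chosen, used = set(), 0
        for site, size, freq in sorted(sites, key=lambda s: -s[2]):
            if used + size <= spm_bytes:
                chosen.add(site)
                used += size
        plan[region] = {
            "resident": chosen,
            "copy_in": chosen - prev,    # DRAM -> SPM at region entry
            "copy_out": prev - chosen,   # evicted back to DRAM
        }
        prev = chosen
    return plan

profile = {
    "init":      [("node_pool", 3000, 2.0), ("hash_tab", 2000, 0.5)],
    "main_loop": [("node_pool", 3000, 9.0), ("work_q", 1000, 7.5)],
}
for region, moves in plan_spm(profile, spm_bytes=4096).items():
    print(region, moves)
```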

    Automated Compilation Framework for Scratchpad-based Real-Time Systems

    ScratchPad Memory (SPM) is widely adopted in real-time systems because it exhibits predictable behaviour. SPM is software-managed: instructions are explicitly inserted to move code and data between the SPM and main memory. However, deciding how to manage the SPM and manually modifying the code to insert memory transfers is a tedious job, so an automated compilation tool is essential to utilize the SPM efficiently. Another key problem with SPM is the latency the system suffers due to memory transfers; hiding this latency is important for high-performance systems. In this thesis, we address the problems of managing SPM and reducing the impact of memory latency. To automate our work, we develop a compilation framework based on the LLVM compiler to analyze and transform program code. We exploit our framework to improve the performance of single-task and multi-task execution in real-time systems. For single-task execution, the Worst-Case Execution Time (WCET) is of great importance to assure correct and safe behaviour of the system, so we propose a WCET-driven allocation technique for data SPM that employs software prefetching to manage the SPM efficiently and to overlap memory transfers with task execution in a predictable way. Multi-tasking, on the other hand, requires the system to be schedulable such that all tasks can meet their timing requirements; however, executing multiple tasks on a multi-processor platform suffers from contention for accesses to the shared main memory. To avoid this contention, several scheduling techniques adopt the 3-phase execution model, which executes a task as a sequence of memory and computation phases. This provides the means to avoid contention as well as to hide memory latency by using a Direct Memory Access (DMA) engine, since executing memory transfers with the DMA allows them to overlap with computation on the processor. Using the 3-phase model in systems with small local SPMs may necessitate segmenting the task, and automating the segmentation process is necessary, especially for systems with large task sets. Hence, we propose a set of efficient segmentation algorithms that follow the 3-phase execution model; applying these algorithms shows a significant improvement in system schedulability. To make our segmentation algorithms more widely applicable, we extend the 3-phase model to allow programs with multiple paths, represented as conditional Directed Acyclic Graphs (DAGs), unlike previous works that targeted sequential programs. We also introduce a multi-streaming model to exploit the benefits of prefetching by overlapping the memory and computation phases of the same task, which was not allowed in previous approaches. By combining automated compilation with the proposed algorithms, we achieve our goal of efficiently managing data SPM in real-time systems.
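    The benefit of overlapping memory and computation phases can be shown with a toy timing model in which the DMA loads segment i+1 while the processor computes segment i. Segment times are invented and contention is ignored; this is a back-of-the-envelope model, not the thesis's analysis.

```python
# Toy timing model: with a DMA engine, the load phase of segment i+1
# overlaps the compute phase of segment i.
def total_time(segments, overlap):
    """segments: list of (load_time, compute_time) pairs."""
    if not overlap:        # load and compute strictly alternate
        return sum(load + compute for load, compute in segments)
    dma_free = cpu_free = 0.0
    for load, compute in segments:
        dma_free += load                          # DMA fetches eagerly
        cpu_free = max(cpu_free, dma_free) + compute
    return cpu_free

segments = [(2, 5), (2, 5), (2, 5)]
print(total_time(segments, overlap=False))   # 21: transfers on critical path
print(total_time(segments, overlap=True))    # 17: only the first load exposed
```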

    Access pattern-based code compression for memory-constrained systems

    Compared to the large spectrum of performance optimizations, relatively little effort has been dedicated to optimizing other aspects of embedded applications such as memory space requirements, power, real-time predictability, and reliability. In particular, many modern embedded systems operate under tight memory space constraints. One way of addressing this constraint is to compress executable code and data as much as possible. While researchers have studied efficient hardware- and software-based code compression strategies, many of these techniques do not take application behavior into account; that is, the same compression/decompression strategy is used irrespective of the application being optimized. This article presents an application-sensitive code compression strategy based on the control flow graph (CFG) representation of the embedded program. The idea is to start with a memory image in which all basic blocks of the application are compressed, and to decompress only the blocks that are predicted to be needed in the near future. When the current access to a basic block is over, our approach also decides the point at which the block could be compressed again. We propose and evaluate several compression and decompression strategies that try to reduce memory requirements without excessively increasing the original instruction cycle counts. Some of our strategies make use of profile data, whereas others are fully automatic. Our experimental evaluation using seven applications from the MediaBench suite and three large embedded applications reveals that the proposed code compression strategy is very successful in practice. Our results also indicate that working at basic block granularity, as opposed to procedure granularity, is important for maximizing memory space savings. © 2008 ACM
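    A minimal sketch of the decompress-on-predicted-need idea: blocks live compressed in the memory image, a block's CFG successors are expanded ahead of execution, and expanded copies are dropped again under a memory budget. The CFG, block contents, and eviction rule below are invented, standing in for the article's profile-driven strategies.

```python
# Invented demonstration of decompress-on-predicted-need at basic block
# granularity; the compressed image of every block always survives.
import zlib

cfg = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
blocks = {b: zlib.compress((b + ": machine code ").encode() * 4) for b in cfg}
decompressed = {}
MAX_RESIDENT = 2       # memory budget: expanded blocks allowed at once

def execute(block):
    if block not in decompressed:            # demand decompression
        decompressed[block] = zlib.decompress(blocks[block])
    for succ in cfg[block]:                  # predictive decompression
        if succ not in decompressed:
            decompressed[succ] = zlib.decompress(blocks[succ])
    while len(decompressed) > MAX_RESIDENT:  # recompress cold blocks
        victim = next(b for b in decompressed if b != block)
        del decompressed[victim]

for b in ["entry", "loop", "loop", "exit"]:
    execute(b)
    print(b, "resident:", sorted(decompressed))
```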

    Transparent management of scratchpad memories in shared memory programming models

    Cache-coherent shared memory has traditionally been the favorite memory organization for chip multiprocessors thanks to its high programmability. In this organization the cache hierarchy is in charge of moving the data and keeping it coherent between all the caches, enabling the use of shared memory programming models in which the programmer does not need to carry out any data management operation. Unfortunately, performing all data management operations in hardware causes severe problems, the primary concerns being the power consumed in the caches and the amount of coherence traffic in the interconnection network. A good solution is to introduce ScratchPad Memories (SPMs) alongside the cache hierarchy, forming a hybrid memory hierarchy. SPMs are more power-efficient than caches and do not generate coherence traffic, but they degrade programmability: they require the programmer to partition the data, to program data transfers, and to keep coherence between different copies of the data. A promising way to exploit the benefits of SPMs without harming programmability is to let programmers use shared memory programming models and automatically generate code that manages the SPMs. Unfortunately, current compilers and runtime systems face serious limitations in automatically generating code for hybrid memory hierarchies from shared memory programming models. This thesis proposes to transparently manage the SPMs of hybrid memory hierarchies in shared memory programming models. To achieve this goal, the thesis proposes a combination of hardware and compiler techniques to manage the SPMs in fork-join programming models, and a set of runtime system techniques to manage the SPMs in task programming models. The proposed techniques allow hybrid memory hierarchies to be programmed with these two well-known and easy-to-use forms of shared memory programming models, capitalizing on the benefits of hybrid memory hierarchies in power consumption and network traffic without harming programmability. The first contribution of this thesis is a hardware/software co-designed coherence protocol to transparently manage the SPMs of hybrid memory hierarchies in fork-join programming models. The solution allows the compiler to always generate code that manages the SPMs with tiling software caches, even in the presence of unknown memory aliasing hazards between memory references to the SPMs and to the cache hierarchy. On the software side, the compiler generates a special form of memory instruction for memory references with possible aliasing hazards. On the hardware side, the special memory instructions are diverted to the correct copy of the data using a set of directories that track what data is mapped to the SPMs. The second contribution of this thesis is a set of runtime system techniques to manage the SPMs of hybrid memory hierarchies in task programming models. The proposed runtime system techniques exploit the characteristics of these programming models to map the data specified in the task dependences to the SPMs. Different policies are proposed to mitigate the communication costs of the data transfers, overlapping them with other execution phases such as the task scheduling phase or the execution of the previous task. The runtime system can also reduce the number of data transfers by using a task scheduler that exploits data locality in the SPMs.
    In addition, the proposed techniques are combined with mechanisms that reduce the impact of fine-grained tasks, such as hardware runtime systems or large SPM sizes. The accomplishment of this thesis is that hybrid memory hierarchies can be programmed with fork-join and task programming models. Consequently, architectures with hybrid memory hierarchies can be exposed to the programmer as shared memory multiprocessors, taking advantage of the benefits of SPMs while maintaining the programming simplicity of shared memory programming models.
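    The runtime-system contribution can be caricatured in a few lines: the data named in each task's dependences is staged into the SPM, staging for task i+1 is issued around the execution of task i (sequential here, whereas the real runtime overlaps it via DMA with scheduling or the previous task), and data already resident is not re-transferred. The Task structure and data below are invented for the sketch.

```python
# Caricature of the runtime techniques: stage each task's declared
# dependences into the SPM, prefetch the next task's inputs, and skip
# transfers for data that is already resident (locality-aware scheduling).
from collections import namedtuple

Task = namedtuple("Task", "name inputs body")

class SpmRuntime:
    def __init__(self):
        self.spm = {}                    # dependence name -> data in SPM

    def _stage(self, task, dram):
        for dep in task.inputs:
            if dep not in self.spm:      # locality-aware: skip residents
                self.spm[dep] = list(dram[dep])

    def run(self, tasks, dram):
        for i, task in enumerate(tasks):
            if i == 0:
                self._stage(task, dram)
            task.body(self.spm)                     # compute out of the SPM
            if i + 1 < len(tasks):
                self._stage(tasks[i + 1], dram)     # prefetch the next task

def scale_a(spm):
    spm["a"] = [2 * x for x in spm["a"]]

def sum_ab(spm):
    print(sum(spm["a"]) + sum(spm["b"]))            # prints 27

dram = {"a": [1, 2, 3], "b": [4, 5, 6]}
SpmRuntime().run([Task("scale_a", ["a"], scale_a),
                  Task("sum_ab", ["a", "b"], sum_ab)], dram)
```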

    Many-Core Architectures: Hardware-Software Optimization and Modeling Techniques

    During the last few decades, unprecedented technological growth has been at the center of embedded systems design, with Moore's Law being the leading factor of this trend. Today an ever-increasing number of cores can be integrated on the same die, marking the transition from state-of-the-art multi-core chips to the new many-core design paradigm. Despite the extraordinarily high computing power, the complexity of many-core chips opens the door to several challenges. As a result of the increased silicon density of modern Systems-on-a-Chip (SoC), the design space that must be explored to find the best design has exploded, leaving hardware designers with a huge search problem. Virtual Platforms have long been used to enable hardware-software co-design, but today they must cope with the huge complexity of both hardware and software systems. In this thesis two different research works on Virtual Platforms are presented: the first is intended for the hardware developer, to easily allow complex cycle-accurate simulations of many-core SoCs; the second exploits the parallel computing power of off-the-shelf General Purpose Graphics Processing Units (GPGPUs), with the goal of increased simulation speed. The term Virtualization can be used in the context of many-core systems not only to refer to the aforementioned hardware emulation tools (Virtual Platforms), but also for two other main purposes: 1) to help the programmer achieve the maximum possible performance of an application by hiding the complexity of the underlying hardware, and 2) to efficiently exploit the highly parallel hardware of many-core chips in environments with multiple active Virtual Machines. This thesis focuses on virtualization techniques with the goal of mitigating, and overcoming when possible, some of the challenges introduced by the many-core design paradigm.