Search CORE

24 research outputs found

Architecture extensions for efficient managament of scratch-pad Memory

Author: Busquets Mataix José Vicente
Catalá Carlos
Martí Campoy Antonio
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011
Field of study

Nowadays, many embedded processors include in their architecture on-chip static memories, so called scratch-pad memories (SPM). Compared to cache, these memories do not require complex control logic, thus resulting in increased efficiency both in silicon area and energy consumption. Last years, many papers have proposed algorithms to allocate memory segments in SPM in order to enhance its usage. However, very few care about the SPM architecture itself, to make it more controllable, more power efficient and faster. This paper proposes architecture extensions to automatically load code into the SPM whilst it is fetched for execution to reduce the SPM updating delays, which motivates a very dynamic use of the SPM. We test our proposal in a derivation of the Simplescalar simulator, with typical embedded benchmarks. The results show improvements, on average, of 30.6% in energy saving and 7.6% in performance compared to a system with cache. © 2011 Springer-Verlag.This research was sponsored by local Government “Generalitat Valenciana” under project GV07/ 2007/122.Busquets Mataix, JV.; Catalá, C.; Martí Campoy, A. (2011). Architecture extensions for efficient managament of scratch-pad Memory. En Integrated Circuit and System Design. Power and Timing Modeling, Optimization, and Simulation. Springer Verlag (Germany). (6951):43-52. https://doi.org/10.1007/978-3-642-24154-3_5S43526951Banakar, R., Steinke, S., Lee, B.-S., Balakrishnan, M., Marwedel, P.: Scratchpad memory: design alternative for cache on-chip memory in embedded systems. In: CODES 2002, pp. 73–78 (2002)Verma, M., Wehmeyer, L., Marwedel, P.: Cache-Aware Scratchpad Allocation Algorithm. In: DATE 2004, pp. 1264–1269 (2004)Verma, M., Marwedel, P.: Advanced memory optimization techniques for low-power embedded processors, pp. I-XII, 1–188. Springer, Heidelberg (2007)Nguyen, N., Dominguez, A., Barua, R.: Memory allocation for embedded systems with a compile-time-unknown scratch-pad size. In: CASES 2005, pp. 115–125 (2005)Egger, B., Kim, C., Jang, C., Nam, Y., Lee, J., Min, S.L.: A dynamic code placement technique for scratchpad memory using postpass optimization. In: CASES 2006, pp. 223–233 (2006)Egger, B., Lee, J., Shin, H.: Scratchpad memory management for portable systems with a memory management unit. In: EMSOFT 2006, pp. 321–330 (2006)Egger, B., Lee, J., Shin, H.: Dynamic scratchpad memory management for code in portable systems with an MMU. ACM Trans. Embedded Comput. Syst. 7(2) (2008)Cho, H., Egger, B., Lee, J., Shin, H.: Dynamic data scratchpad memory management for a memory subsystem with an MMU. In: LCTES 2007, pp. 195–206 (2007)Janapsatya, A., Parameswaran, S., Ignjatovic, A.: Hardware/software managed scratchpad memory for embedded system. In: ICCAD 2004, pp. 370–377 (2004)Balakrishnan, M., Marwedel, P., Wehmeyer, L., Grunwald, N., Banakar, R., Steinke, S.: Reducing Energy Consumption by Dynamic Copying of Instructions onto Onchip Memory. In: ISSS 2002, pp. 213–218 (2002)Poletti, F., Marchal, P., Atienza, D., Benini, L., Catthoor, F., Mendias, J.M.: An integrated hardware/software approach for run-time scratchpad management. In: DAC 2004, pp. 238–243 (2004)Li, L., Gao, L., Xue, J.: Memory Coloring: A Compiler Approach for Scratchpad Memory Management. In: IEEE PACT 2005, pp. 329–338 (2005)Lee, L.H., Moyer, B., Arends, J.: Instruction fetch energy reduction using loop caches for embedded applications with small tight loops. In: ISLPED 1999, pp. 267–269 (1999)Victorio, J.A., Torres Moren, E.F., Yúfera, V.V.: Vatios: Simulador de Procesador con Estimación de Potencia. XVIII Jornadas de Paralelismo, Zaragoza (2007)Burger, D., Austin, T.M.: The SimpleScalar Tool Set Version 2.0. Technical Report 1342, Computer Sciences Department. University of Wisconsin–Madison (May 1997)Brooks, D., Tiwari, V., Martonosi, M.: Wattch: a framework for architectural-level power analysis and optimizations. In: ISCA 2000, pp. 83–94 (2000)Tarjan, D., Thoziyoor, S., Jouppi, N.: CACTI 4.0, P. HPL-2006- 86 20060606The Mälardalen WCET research group. The Mälardalen WCET benchmarks homepage, http://www.mrtc.mdh.se/projects/wcet/benchmarks.htmlCho, D., Pasricha, S., Issenin, I., Dutt, N.D., Ahn, M., Paek, Y.: Adaptive Scratch Pad Memory Management for Dynamic Behavior of Multimedia Applications. IEEE Trans. on CAD of Integrated Circuits and Systems (TCAD) 28(4), 554–567 (2009

Crossref

RiuNet

Dynamic code mapping for limited local memory systems

Author: Aviral Shrivastava
Ke Bai
Seung Chul Jung
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2010
Field of study

Abstract—This paper presents heuristics for dynamic man-agement of application code on limited local memories present in high-performance multi-core processors. Previous techniques formulate the problem using call graphs, which do not capture the temporal ordering of functions. In addition, they only use a conservative estimate of the interference cost between functions to obtain a mapping. As a result previous techniques are unable to achieve efficient code mapping. Techniques proposed in this paper overcome both these limitations and achieve superior code mapping. Experimental results from executing benchmarks from MiBench onto the Cell processor in the Sony Playstation 3 demonstrate upto 29 % and average 12 % performance improve-ment, at tolerable compile-time overhead. I

CiteSeerX

Crossref

SPM management using markov chain based data access prediction

Author: Kandemir M.
Ozturk O.
Srikantaiah S.
Yemliha T.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2008
Field of study

Leveraging the power of scratchpad memories (SPMs) available in most embedded systems today is crucial to extract maximum performance from application programs. While regular accesses like scalar values and array expressions with affine subscript functions have been tractable for compiler analysis (to be prefetched into SPM), irregular accesses like pointer accesses and indexed array accesses have not been easily amenable for compiler analysis. This paper presents an SPM management technique using Markov chain based data access prediction for such irregular accesses. Our approach takes advantage of inherent, but hidden reuse in data accesses made by irregular references. We have implemented our proposed approach using an optimizing compiler. In this paper, we also present a thorough comparison of our different dynamic prediction schemes with other SPM management schemes. SPM management using our approaches produces 12.7% to 28.5% improvements in performance across a range of applications with both regular and irregular access patterns, with an average improvement of 20.8%

Crossref

Bilkent University Institutional Repository

Cooperative Data and Computation Partitioning for Decentralized Architectures.

Author: Chu Michael L.
Publication venue
Publication date: 01/01/2007
Field of study

Scalability of future wide-issue processor designs is severely hampered by the use of centralized resources such as register files, memories and interconnect networks. While the use of centralized resources eases both hardware design and compiler code generation efforts, they can become performance bottlenecks as access latencies increase with larger designs. The natural solution to this problem is to adapt the architecture to use smaller, decentralized resources. Decentralized architectures use smaller, faster components and exploit distributed instruction-level parallelism across the resources. A multicluster architecture is an example of such a decentralized processor, where subsets of smaller register files, functional units, and memories are grouped together in a tightly coupled unit, forming a cluster. These clusters can then be replicated and connected together to form a scalable, high-performance architecture. The main difficulty with decentralized architectures resides in compiler code generation. In a centralized Very Long Instruction Word (VLIW) processor, the compiler must statically schedule each operation to both a functional unit and a time slot for execution. In contrast, for a decentralized multicluster VLIW, the compiler must consider the additional effects of cluster assignment, recognizing that communication between clusters will result in a delay penalty. In addition, if the multicluster processor also has partitioned data memories, the compiler has the additional task of assigning data objects to their respective memories. Each decision, of cluster, functional unit, memory, and time slot, are highly interrelated and can have dramatic effects on the best choice for every other decision. This dissertation addresses the issues of extracting and exploiting inherent parallelism across decentralized resources through compiler analysis and code generation techniques. First, a static analysis technique to partition data objects is presented, which maps data objects to decentralized scratchpad memories. Second, an alternative profile-guided technique for memory partitioning is presented which can effectively map data access operations onto distributed caches. Finally, a detailed, resource-aware partitioning algorithm is presented which can effectively split computation operations of an application across a set of processing elements. These partitioners work in tandem to create a high-performance partition assignment of both memory and computation operations for decentralized multicluster or multicore processors.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/57649/2/mchu_1.pd

Deep Blue Documents at the University of Michigan

Processor Energy Characterization for Compiler-Assisted Software Energy Reduction

Author
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2012
Field of study

Crossref

Application-aware Performance Optimization for Software Managed Manycore Architectures

Author
Publication venue
Publication date: 01/01/2019
Field of study

abstract: One of the main goals of computer architecture design is to improve performance without much increase in the power consumption. It cannot be achieved by adding increasingly complex intelligent schemes in the hardware, since they will become increasingly less power-efficient. Therefore, parallelism comes up as the solution. In fact, the irrevocable trend of computer design in near future is still to keep increasing the number of cores while reducing the operating frequency. However, it is not easy to scale number of cores. One important challenge is that existing cores consume too much power. Another challenge is that cache-based memory hierarchy poses a serious limitation due to the rapidly increasing demand of area and power for coherence maintenance. In this dissertation, opportunities to resolve the aforementioned issues were explored in two aspects. Firstly, the possibility of removing hardware cache altogether, and replacing it with scratchpad memory with software management was explored. Scratchpad memory consumes much less power than caches. However, as data management logic is completely shifted to Software, how to reduce software overhead is challenging. This thesis presents techniques to manage scratchpad memory judiciously by exploiting application semantics and knowledge of data access patterns, thereby enabling optimization of data movement across the memory hierarchy. Experimental results show that the optimization was able to reduce stack data management overhead by 13X, produce better code mapping in more than 80% of the case, and improve performance by 83% in heap management. Secondly, the possibility of using software branch hinting to replace hardware branch prediction to completely eliminate power consumption on corresponding hardware components was explored. As branch predictor is removed from hardware, software logic is responsible for reducing branch penalty. Techniques to minimize the branch penalty by optimizing branch hint placement were proposed, which can reduce branch penalty by 35.4% over the state-of-the-art.Dissertation/ThesisDoctoral Dissertation Computer Science 201

ASU Digital Repository

Compiler and Runtime for Memory Management on Software Managed Manycore Processors

Author
Publication venue
Publication date: 01/01/2014
Field of study

abstract: We are expecting hundreds of cores per chip in the near future. However, scaling the memory architecture in manycore architectures becomes a major challenge. Cache coherence provides a single image of memory at any time in execution to all the cores, yet coherent cache architectures are believed will not scale to hundreds and thousands of cores. In addition, caches and coherence logic already take 20-50% of the total power consumption of the processor and 30-60% of die area. Therefore, a more scalable architecture is needed for manycore architectures. Software Managed Manycore (SMM) architectures emerge as a solution. They have scalable memory design in which each core has direct access to only its local scratchpad memory, and any data transfers to/from other memories must be done explicitly in the application using Direct Memory Access (DMA) commands. Lack of automatic memory management in the hardware makes such architectures extremely power-efficient, but they also become difficult to program. If the code/data of the task mapped onto a core cannot fit in the local scratchpad memory, then DMA calls must be added to bring in the code/data before it is required, and it may need to be evicted after its use. However, doing this adds a lot of complexity to the programmer's job. Now programmers must worry about data management, on top of worrying about the functional correctness of the program - which is already quite complex. This dissertation presents a comprehensive compiler and runtime integration to automatically manage the code and data of each task in the limited local memory of the core. We firstly developed a Complete Circular Stack Management. It manages stack frames between the local memory and the main memory, and addresses the stack pointer problem as well. Though it works, we found we could further optimize the management for most cases. Thus a Smart Stack Data Management (SSDM) is provided. In this work, we formulate the stack data management problem and propose a greedy algorithm for the same. Later on, we propose a general cost estimation algorithm, based on which CMSM heuristic for code mapping problem is developed. Finally, heap data is dynamic in nature and therefore it is hard to manage it. We provide two schemes to manage unlimited amount of heap data in constant sized region in the local memory. In addition to those separate schemes for different kinds of data, we also provide a memory partition methodology.Dissertation/ThesisPh.D. Computer Science 201

ASU Digital Repository

Scratchpad Management in Software Managed Manycore Architectures

Author
Publication venue
Publication date: 01/01/2017
Field of study

abstract: Caches have long been used to reduce memory access latency. However, the increased complexity of cache coherence brings significant challenges in processor design as the number of cores increases. While making caches scalable is still an important research problem, some researchers are exploring the possibility of a more power-efficient SRAM called scratchpad memories or SPMs. SPMs consume significantly less area, and are more energy-efficient per access than caches, and therefore make the design of on-chip memories much simpler. Unlike caches, which fetch data from memories automatically, an SPM requires explicit instructions for data transfers. SPM-only architectures are thus named as software managed manycore (SMM), since the data movements of such architectures rely on software. SMM processors have been widely used in different areas, such as embedded computing, network processing, or even high performance computing. While SMM processors provide a low-power platform, the hardware alone does not guarantee power efficiency, if applications on such processors deliver low performance. Efficient software techniques are therefore required. A big body of management techniques for SMM architectures are compiler-directed, as inserting data movement operations by hand forces programmers to trace flow of data, which can be error-prone and sometimes difficult if not impossible. This thesis develops compiler-directed techniques to manage data transfers for embedded applications on SMMs efficiently. The techniques analyze and find out the proper program points and insert data movement instructions accordingly. The techniques manage code, stack and heap data of applications, and reduce execution time by 14%, 52% and 80% respectively compared to their predecessors on typical embedded applications. On top of managing local data, a technique is also developed for shared data in SMM architectures. Experimental results show it achieves more than 2X speedup than the previous technique on average.Dissertation/ThesisDoctoral Dissertation Computer Science 201

ASU Digital Repository

Power-Efficient and Low-Latency Memory Access for CMP Systems with Heterogeneous Scratchpad On-Chip Memory

Author: Chen Zhi
Publication venue: UKnowledge
Publication date: 01/01/2013
Field of study

The gradually widening speed disparity of between CPU and memory has become an overwhelming bottleneck for the development of Chip Multiprocessor (CMP) systems. In addition, increasing penalties caused by frequent on-chip memory accesses have raised critical challenges in delivering high memory access performance with tight power and latency budgets. To overcome the daunting memory wall and energy wall issues, this thesis focuses on proposing a new heterogeneous scratchpad memory architecture which is configured from SRAM, MRAM, and Z-RAM. Based on this architecture, we propose two algorithms, a dynamic programming and a genetic algorithm, to perform data allocation to different memory units, therefore reducing memory access cost in terms of power consumption and latency. Extensive and intensive experiments are performed to show the merits of the heterogeneous scratchpad architecture over the traditional pure memory system and the effectiveness of the proposed algorithms

University of Kentucky

WCET-Aware Scratchpad Memory Management for Hard Real-Time Systems

Author
Publication venue
Publication date: 01/01/2017
Field of study

abstract: Cyber-physical systems and hard real-time systems have strict timing constraints that specify deadlines until which tasks must finish their execution. Missing a deadline can cause unexpected outcome or endanger human lives in safety-critical applications, such as automotive or aeronautical systems. It is, therefore, of utmost importance to obtain and optimize a safe upper bound of each task’s execution time or the worst-case execution time (WCET), to guarantee the absence of any missed deadline. Unfortunately, conventional microarchitectural components, such as caches and branch predictors, are only optimized for average-case performance and often make WCET analysis complicated and pessimistic. Caches especially have a large impact on the worst-case performance due to expensive off- chip memory accesses involved in cache miss handling. In this regard, software-controlled scratchpad memories (SPMs) have become a promising alternative to caches. An SPM is a raw SRAM, controlled only by executing data movement instructions explicitly at runtime, and such explicit control facilitates static analyses to obtain safe and tight upper bounds of WCETs. SPM management techniques, used in compilers targeting an SPM-based processor, determine how to use a given SPM space by deciding where to insert data movement instructions and what operations to perform at those program locations. This dissertation presents several management techniques for program code and stack data, which aim to optimize the WCETs of a given program. The proposed code management techniques include optimal allocation algorithms and a polynomial-time heuristic for allocating functions to the SPM space, with or without the use of abstraction of SPM regions, and a heuristic for splitting functions into smaller partitions. The proposed stack data management technique, on the other hand, finds an optimal set of program locations to evict and restore stack frames to avoid stack overflows, when the call stack resides in a size-limited SPM. In the evaluation, the WCETs of various benchmarks including real-world automotive applications are statically calculated for SPMs and caches in several different memory configurations.Dissertation/ThesisDoctoral Dissertation Computer Science 201

ASU Digital Repository