15 research outputs found

    WCET-Directed Dynamic Scratchpad Memory Allocation of Data

    Many embedded systems feature processors coupled with a small and fast scratchpad memory. Unlike caches, allocation of data to scratchpad memory must be handled by software; the major gain is improved predictability of memory access latencies. A compile-time dynamic allocation approach enables eviction and placement of data in the scratchpad memory at runtime. Previous dynamic scratchpad memory allocation approaches aimed to reduce average-case program execution time or the energy consumption due to memory accesses. For real-time systems, worst-case execution time (WCET) is the main metric to optimize. In this paper, we propose a WCET-directed algorithm to dynamically allocate static data and stack data of a program to scratchpad memory. The granularity of placement of memory transfers (e.g., at function or basic block boundaries) is discussed from the perspective of its computational complexity and the quality of the allocation.
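
    A rough illustration of the kind of decision such an allocator makes is sketched below. It is a plain greedy gain-per-byte heuristic over hypothetical data objects, not the paper's algorithm, which additionally chooses when to transfer data and works at function or basic-block granularity.

        /* Minimal sketch of a WCET-directed scratchpad allocation heuristic.
         * Assumptions (not from the paper): each candidate data object carries
         * an estimated WCET gain if it resides in scratchpad, and we greedily
         * pick objects by gain per byte until the SPM is full. */
        #include <stdio.h>
        #include <stdlib.h>

        typedef struct {
            const char *name;
            unsigned    size;       /* bytes occupied in scratchpad          */
            unsigned    wcet_gain;  /* estimated WCET cycles saved if in SPM */
            int         in_spm;     /* set by the allocator                  */
        } DataObject;

        static int by_gain_density(const void *a, const void *b)
        {
            const DataObject *x = a, *y = b;
            double dx = (double)x->wcet_gain / x->size;
            double dy = (double)y->wcet_gain / y->size;
            return (dy > dx) - (dy < dx);   /* descending gain per byte */
        }

        static unsigned allocate_spm(DataObject *objs, size_t n, unsigned spm_size)
        {
            unsigned used = 0, saved = 0;
            qsort(objs, n, sizeof objs[0], by_gain_density);
            for (size_t i = 0; i < n; i++) {
                if (used + objs[i].size <= spm_size) {
                    objs[i].in_spm = 1;
                    used  += objs[i].size;
                    saved += objs[i].wcet_gain;
                }
            }
            return saved;
        }

        int main(void)
        {
            DataObject objs[] = {
                { "stack_frame_f", 256, 4000, 0 },
                { "lut",          1024, 9000, 0 },
                { "state",         128, 3000, 0 },
            };
            printf("estimated WCET cycles saved: %u\n", allocate_spm(objs, 3, 1024));
            return 0;
        }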

    Automatic Safe Data Reuse Detection for the WCET Analysis of Systems With Data Caches

    Worst-case execution time (WCET) analysis of systems with data caches is one of the key challenges in real-time systems. Caches exploit the inherent reuse properties of programs, temporarily storing certain memory contents near the processor, so that further accesses to such contents do not require costly memory transfers. Current worst-case data cache analysis methods focus on specific cache organizations (LRU, locked, ACDC, etc.). In this article, we analyze data reuse (in the worst case) as a property of the program, and thus independent of the data cache. Our analysis method uses Abstract Interpretation on the compiled program to extract, for each static load/store instruction, a linear expression for the address pattern of its data accesses, according to the Loop Nest Data Reuse Theory. Each data access expression is compared to that of prior (dominant) memory instructions to verify whether it presents a guaranteed reuse. Our proposal manages references to scalars, arrays, and non-linear accesses, provides both temporal and spatial reuse information, and does not require the exploration of explicit data access sequences. As a proof of concept we analyze the TACLeBench benchmark suite, showing that most loads/stores present data reuse, and how compiler optimizations affect it. Using a simple hit/miss estimation on our reuse results, the time devoted to data accesses in the worst case is reduced to 27% compared to an always-miss system, equivalent to a data hit ratio of 81%. With compiler optimization, such time is reduced to 6.5%.
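
    The comparison step can be pictured as follows. The structure and thresholds are illustrative; the paper derives the linear address expressions by abstract interpretation of the compiled binary and also handles non-linear accesses.

        /* Minimal sketch of comparing linear address expressions for reuse.
         * Assumptions (illustrative only): each memory instruction's address in
         * a loop nest is summarized as base + sum(coeff[d] * i_d) over the loop
         * indices. Identical expressions give guaranteed temporal reuse; equal
         * strides with nearby bases suggest spatial reuse. */
        #include <stdbool.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define MAX_DEPTH 4
        #define LINE_SIZE 64

        typedef struct {
            long base;               /* constant part of the address    */
            long coeff[MAX_DEPTH];   /* stride per enclosing loop index */
        } LinearAccess;

        static bool same_strides(const LinearAccess *a, const LinearAccess *b)
        {
            for (int d = 0; d < MAX_DEPTH; d++)
                if (a->coeff[d] != b->coeff[d])
                    return false;
            return true;
        }

        /* Temporal reuse: identical address expression. */
        static bool temporal_reuse(const LinearAccess *a, const LinearAccess *b)
        {
            return same_strides(a, b) && a->base == b->base;
        }

        /* Spatial reuse (simplified): same strides, bases within one line.
         * A precise check would compare line-aligned blocks instead. */
        static bool spatial_reuse(const LinearAccess *a, const LinearAccess *b)
        {
            return same_strides(a, b) && labs(a->base - b->base) < LINE_SIZE;
        }

        int main(void)
        {
            LinearAccess load_a = { 0x1000, { 4, 400, 0, 0 } };  /* a[i][j]   */
            LinearAccess load_b = { 0x1004, { 4, 400, 0, 0 } };  /* a[i][j+1] */
            printf("temporal: %d, spatial: %d\n",
                   temporal_reuse(&load_a, &load_b),
                   spatial_reuse(&load_a, &load_b));
            return 0;
        }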

    Speculative tag access for reduced energy dissipation in set-associative L1 data caches

    For performance reasons, all ways in set-associative level-one (L1) data caches are accessed in parallel for load operations even though the requested data can only reside in one of the ways. Thus, a significant amount of energy is wasted when loads are performed. We propose a speculation technique that performs the tag comparison in parallel with the address calculation, leading to the access of only one way during the following cycle on successful speculations. The technique incurs no execution time penalty, has an insignificant area overhead, and does not require any customized SRAM implementation. Assuming a 16 kB 4-way set-associative L1 data cache implemented in a 65-nm process technology, our evaluation based on 20 different MiBench benchmarks shows that the proposed technique on average leads to a 24% data cache energy reduction.
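
    The speculation condition can be sketched as a simple check: an early tag/index lookup performed with the base register value is valid only if adding the offset leaves the tag and index bits unchanged, in which case only the matching way needs to be enabled on the next cycle. The cache geometry below is an example, not taken from the paper.

        /* Minimal sketch of the speculation check behind single-way access.
         * Assumed geometry: 16 KB, 4-way, 32-byte lines => 128 sets. */
        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define LINE_BITS   5                      /* 32-byte lines         */
        #define SET_BITS    7                      /* 16 KB / 4 ways / 32 B */
        #define INDEX_SHIFT LINE_BITS
        #define TAG_SHIFT   (LINE_BITS + SET_BITS)

        static uint32_t set_index(uint32_t addr)
        {
            return (addr >> INDEX_SHIFT) & ((1u << SET_BITS) - 1);
        }

        static uint32_t tag_of(uint32_t addr)
        {
            return addr >> TAG_SHIFT;
        }

        /* Early lookup uses only the base register; speculation succeeds when
         * the final address (base + offset) keeps the same set and tag. */
        static bool speculation_hit(uint32_t base, int32_t offset)
        {
            uint32_t addr = base + (uint32_t)offset;
            return set_index(addr) == set_index(base) && tag_of(addr) == tag_of(base);
        }

        int main(void)
        {
            printf("%d\n", speculation_hit(0x00012340, 12));   /* same line: 1 */
            printf("%d\n", speculation_hit(0x00012340, 4096)); /* new set:   0 */
            return 0;
        }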

    Towards a performance- and energy-efficient data filter cache

    As CPU data requests to the level-one (L1) data cache (DC) can represent as much as 25% of an embedded processor's total power dissipation, techniques that decrease L1 DC accesses can significantly enhance processor energy efficiency. Filter caches are known to efficiently decrease the number of accesses to instruction caches. However, due to the irregular access pattern of data accesses, a conventional data filter cache (DFC) has a high miss rate, which degrades processor performance. We propose to integrate a DFC with a fast address calculation technique to significantly reduce the impact of misses and to improve performance by enabling one-cycle loads. Furthermore, we show that DFC stalls can be eliminated even after unsuccessful fast address calculations, by simultaneously accessing the DFC and L1 DC on the following cycle. We quantitatively evaluate different DFC configurations, with and without the fast address calculation technique, using different write allocation policies, and qualitatively describe their impact on energy efficiency. The proposed design provides an efficient DFC that yields both energy and performance improvements.
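
    One way to picture the fast address calculation is as a small add over only the low-order address bits, with the speculation declared valid when there is no carry into the bits that select the DFC line. The bit widths and validity test below are assumptions made for illustration, not the paper's design.

        /* Minimal sketch of a fast, speculative index computation for a small
         * data filter cache (DFC). On misspeculation the proposed design
         * accesses the DFC and L1 DC together on the next cycle, avoiding a
         * stall; here we only model the validity check. */
        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define DFC_LOW_BITS 9   /* example: line-offset + index bits of a small DFC */
        #define LOW_MASK     ((1u << DFC_LOW_BITS) - 1u)

        typedef struct {
            uint32_t speculative_index;  /* bits used to probe the DFC early */
            bool     valid;              /* false => carry-out, fall back    */
        } FastAddr;

        static FastAddr fast_address(uint32_t base, int32_t offset)
        {
            uint32_t low = (base & LOW_MASK) + ((uint32_t)offset & LOW_MASK);
            FastAddr fa;
            /* Valid only for small offsets that do not carry out of the low bits. */
            fa.valid = (offset >= 0) && ((uint32_t)offset <= LOW_MASK) && (low <= LOW_MASK);
            fa.speculative_index = low & LOW_MASK;
            return fa;
        }

        int main(void)
        {
            FastAddr a = fast_address(0x00012340, 8);    /* no carry: valid         */
            FastAddr b = fast_address(0x000123F8, 64);   /* carry-out: misspeculate */
            printf("a: valid=%d idx=%u\n", a.valid, a.speculative_index);
            printf("b: valid=%d idx=%u\n", b.valid, b.speculative_index);
            return 0;
        }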

    Understanding Optimization Phase Interactions to Reduce the Phase Order Search Space

    Compiler optimization phase ordering is a longstanding problem, and is of particular relevance to the performance-oriented and cost-constrained domain of embedded systems applications. Optimization phases are known to interact with each other, enabling and disabling opportunities for successive phases. Therefore, varying the order of applying these phases often generates distinct output codes, with different speed, code-size, and power consumption characteristics. Most current approaches to address this issue focus on developing innovative methods to selectively evaluate the vast phase order search space to produce a good (but potentially suboptimal) representation for each program. In contrast, the goal of this thesis is to study and reduce the phase order search space by: (1) identifying common causes of optimization phase interactions across all phases, and then devising techniques to eliminate them, and (2) exploiting natural phase independence to prune the phase order search space. We observe that several phase interactions are caused by false register dependence during many optimization phases. We explore the potential of cleanup phases, such as register remapping and copy propagation, at reducing false dependences. We show that innovative implementation and application of these phases not only reduces the size of the phase order search space substantially, but can also improve the quality of code generated by optimizing compilers. We examine the effect of removing cleanup phases, such as dead assignment elimination, which should not interact with other compiler phases, from the phase order search space. Finally, we show that reorganization of the phase order search into a multi-staged approach employing sets of mutually independent optimizations can reduce the search space to a fraction of its original size without sacrificing performance.
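
    The core observation can be illustrated with a toy model: when several phase orderings produce identical output, only the distinct outcomes need to be explored. The three "phases" below are stand-ins operating on an integer, not real compiler passes, and the enumeration is hardcoded for clarity.

        /* Toy sketch of phase-order exploration: apply all 3! orderings of
         * three transformations and count how many distinct results appear.
         * Orderings that collapse to the same result need not be searched. */
        #include <stdio.h>

        typedef int (*Phase)(int);

        static int copy_prop(int code)  { return code | 1;  }  /* toy transforms */
        static int dead_elim(int code)  { return code & ~8; }
        static int remap_regs(int code) { return code * 3;  }

        static int apply_order(int code, Phase order[3])
        {
            for (int i = 0; i < 3; i++)
                code = order[i](code);
            return code;
        }

        int main(void)
        {
            Phase p[3] = { copy_prop, dead_elim, remap_regs };
            int perms[6][3] = { {0,1,2}, {0,2,1}, {1,0,2}, {1,2,0}, {2,0,1}, {2,1,0} };
            int results[6], distinct = 0;

            for (int i = 0; i < 6; i++) {
                Phase order[3] = { p[perms[i][0]], p[perms[i][1]], p[perms[i][2]] };
                results[i] = apply_order(42, order);
                int seen = 0;
                for (int j = 0; j < i; j++)
                    if (results[j] == results[i]) { seen = 1; break; }
                if (!seen) distinct++;
                printf("ordering %d -> %d\n", i, results[i]);
            }
            printf("distinct results: %d of 6 orderings\n", distinct);
            return 0;
        }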

    Compiler Transformations to Generate Reentrant C Programs to Assist Software Parallelization

    As we move through the multi-core era into the many-core era, it becomes obvious that thread-based programming is here to stay. This trend in the development of general-purpose hardware is augmented by the fact that, while writing sequential programs is considered a non-trivial task, writing parallel applications to take advantage of the advances in the number of cores in a processor severely complicates the process. Writing parallel applications requires programs and functions to be reentrant. Therefore, we cannot use globals and statics. However, globals and statics are useful in certain contexts. Globals allow an easy programming mechanism to share data between several functions. Statics provide the only mechanism of data hiding in C for variables that are global in scope. Writing parallel programs restricts users from using globals and statics in their programs, as doing so would make the program non-reentrant. Moreover, there is a large existing legacy code base of sequential programs that are non-reentrant, since they rely on statics and globals. Several of these sequential programs display significant amounts of data parallelism by operating on independent chunks of input data, and therefore can be easily converted into parallel versions to exploit multi-core processors. Indeed, several such programs have been manually converted into parallel versions. However, manually eliminating all globals and statics to make the program reentrant is tedious, time-consuming, and error-prone. In this paper we describe a system to provide a semi-automated mechanism for users to still be able to use statics and globals in their programs, and to let the compiler automatically convert them into their semantically-equivalent reentrant versions, enabling their parallelization later.
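
    The flavor of the rewrite can be shown directly in C: a static local becomes a field of a context structure that each caller owns and threads through the call. The names and context layout below are illustrative; the actual system also handles globals, initialization, and scoping details.

        /* Sketch of the non-reentrant/reentrant pair the paper's transformation
         * automates. The "counter" state moves from a static into a caller-owned
         * context, so concurrent callers no longer share hidden state. */
        #include <stdio.h>

        /* Original, non-reentrant: the static counter is shared by all callers. */
        static int next_id_nonreentrant(void)
        {
            static int counter = 0;
            return ++counter;
        }

        /* Reentrant version: the former static becomes a field of a context. */
        typedef struct {
            int counter;
        } IdContext;

        static int next_id(IdContext *ctx)
        {
            return ++ctx->counter;
        }

        int main(void)
        {
            IdContext worker_a = { 0 }, worker_b = { 0 };
            printf("%d\n", next_id_nonreentrant());  /* 1                            */
            printf("%d\n", next_id_nonreentrant());  /* 2: state leaks across calls  */
            printf("%d\n", next_id(&worker_a));      /* 1                            */
            printf("%d\n", next_id(&worker_b));      /* 1: each context is private   */
            return 0;
        }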

    Path splitting--a technique for improving data flow analysis

    Thesis (M.Eng.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1995. Includes bibliographical references (pp. 83-87). By Massimiliano Antonio Poletto.

    Compiler architecture using a portable intermediate language

    The back end of a compiler performs machine-dependent tasks and low-level optimisations that are laborious to implement and difficult to debug. In addition, in languages that require run-time services such as garbage collection, the back end must interface with the run-time system to provide those services. The net result is that building a compiler back end entails a high implementation cost. In this dissertation I describe reusable code generation infrastructure that enables the construction of a complete programming language implementation (compiler and run-time system) with reduced effort. The infrastructure consists of a portable intermediate language, a compiler for this language, and a low-level run-time system. I provide an implementation of this system and show that it can support a variety of source programming languages, that it reduces the overall effort required to implement a programming language, that it can capture and retain information necessary to support run-time services and optimisations, and that it produces efficient code.
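
    To make the idea concrete, the sketch below shows a front end emitting a tiny stack-based IL that a shared back end (here, just an interpreter) consumes. The instruction set is invented for illustration and is not the dissertation's intermediate language, which also carries the information needed for run-time services such as garbage collection.

        /* Hypothetical portable IL: a front end lowers source code to these
         * instructions; any back end that understands them can then compile or
         * execute the program. */
        #include <stdio.h>

        typedef enum { IL_PUSH, IL_ADD, IL_MUL, IL_PRINT, IL_HALT } IlOp;

        typedef struct {
            IlOp op;
            int  operand;   /* used only by IL_PUSH */
        } IlInstr;

        /* A back end (here, a tiny interpreter) consumes the portable IL. */
        static void run(const IlInstr *code)
        {
            int stack[64];
            int sp = 0;
            for (const IlInstr *pc = code; ; pc++) {
                switch (pc->op) {
                case IL_PUSH:  stack[sp++] = pc->operand;        break;
                case IL_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
                case IL_MUL:   sp--; stack[sp - 1] *= stack[sp]; break;
                case IL_PRINT: printf("%d\n", stack[sp - 1]);    break;
                case IL_HALT:  return;
                }
            }
        }

        int main(void)
        {
            /* Front end output for: print((2 + 3) * 4); */
            IlInstr program[] = {
                { IL_PUSH, 2 }, { IL_PUSH, 3 }, { IL_ADD, 0 },
                { IL_PUSH, 4 }, { IL_MUL, 0 },  { IL_PRINT, 0 }, { IL_HALT, 0 }
            };
            run(program);
            return 0;
        }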