Loop Nest Splitting for WCET-Optimization and Predictability Improvement
This paper presents the influence of the loop nest splitting source code optimization on the worst-case execution time (WCET). Loop nest splitting minimizes the number of executed if-statements in loop nests of embedded multimedia applications. Loops and if-statements in high-level languages are an inherent source of unpredictability and loss of precision for WCET analysis, because it is difficult to obtain safe and tight worst-case estimates of an application's flow of control through these high-level constructs. In addition, the corresponding control flow redirections expressed at the assembly level reduce predictability even more, due to the complex pipeline and branch prediction behavior of modern embedded processors.
The analysis techniques for loop nest splitting are based on precise mathematical models combined with genetic algorithms. On the one hand, these techniques achieve a significantly more homogeneous control flow structure. On the other hand, the precision of our analyses leads to the generation of very accurate high-level flow facts for loops and if-statements. Applying our implemented algorithms to three real-life multimedia benchmarks leads to average speed-ups of 25.0%-30.1%, while the WCET is reduced by between 34.0% and 36.3%.
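The effect of the transformation can be sketched in C. The loop bounds, conditions and kernel names below are illustrative, not taken from the paper's benchmarks: both versions execute the same branch bodies the same number of times, but the split version evaluates no if-statement inside the loops, which makes the control flow homogeneous and easy to bound for WCET analysis.

```c
#include <assert.h>

#define X 36
#define Y 49

long then_count, else_count;           /* how often each branch body runs */
void then_part(void) { then_count++; }
void else_part(void) { else_count++; }

/* Original nest: the if-statement is evaluated X*Y times. */
void original(void) {
    for (int x = 0; x < X; x++)
        for (int y = 0; y < Y; y++)
            if (x < 10 && y < 14) then_part();
            else                  else_part();
}

/* After loop nest splitting: the condition is resolved by the loop
 * bounds, so the innermost bodies contain no if-statement at all. */
void split(void) {
    for (int x = 0; x < 10; x++) {
        for (int y = 0; y < 14; y++) then_part();
        for (int y = 14; y < Y; y++) else_part();
    }
    for (int x = 10; x < X; x++)
        for (int y = 0; y < Y; y++) else_part();
}
```

Because both loop nests perform identical work, the split version can replace the original without changing program semantics, while removing X*Y condition evaluations from the worst-case path.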
WCET-Aware Scratchpad Memory Management for Hard Real-Time Systems
Cyber-physical systems and hard real-time systems have strict timing constraints that specify deadlines by which tasks must finish their execution. Missing a deadline can cause unexpected outcomes or endanger human lives in safety-critical applications, such as automotive or aeronautical systems. It is therefore of utmost importance to obtain and optimize a safe upper bound on each task's execution time, the worst-case execution time (WCET), to guarantee the absence of any missed deadline. Unfortunately, conventional microarchitectural components, such as caches and branch predictors, are optimized only for average-case performance and often make WCET analysis complicated and pessimistic. Caches in particular have a large impact on worst-case performance, due to the expensive off-chip memory accesses involved in cache miss handling. In this regard, software-controlled scratchpad memories (SPMs) have become a promising alternative to caches. An SPM is a raw SRAM, controlled only by explicitly executing data movement instructions at runtime, and such explicit control facilitates static analyses that obtain safe and tight upper bounds on WCETs. SPM management techniques, used in compilers targeting an SPM-based processor, determine how to use a given SPM space by deciding where to insert data movement instructions and what operations to perform at those program locations. This dissertation presents several management techniques for program code and stack data, which aim to optimize the WCET of a given program. The proposed code management techniques include optimal allocation algorithms and a polynomial-time heuristic for allocating functions to the SPM space, with or without the use of an abstraction of SPM regions, and a heuristic for splitting functions into smaller partitions.
The proposed stack data management technique, on the other hand, finds an optimal set of program locations at which to evict and restore stack frames to avoid stack overflows when the call stack resides in a size-limited SPM. In the evaluation, the WCETs of various benchmarks, including real-world automotive applications, are statically calculated for SPMs and caches in several different memory configurations.
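A minimal sketch of the stack-data idea, under assumed names and sizes (this is not the dissertation's actual algorithm or API): compiler-inserted prologue and epilogue code keeps the call stack inside a fixed-size SPM by evicting the oldest resident frames to main memory when a new frame would overflow, and restoring them when space frees up on return.

```c
#include <assert.h>
#include <string.h>

#define SPM_SIZE 256               /* illustrative scratchpad capacity */

int resident[64], n_res;           /* frame sizes currently in the SPM */
int spilled[64],  n_spill;         /* frames evicted to main memory    */
int spm_used;                      /* bytes of stack held in the SPM   */

/* Compiler-inserted prologue: make room for a `size`-byte frame by
 * evicting the oldest resident frames (bottom of the stack) to DRAM. */
void stack_enter(int size) {
    while (spm_used + size > SPM_SIZE) {
        spilled[n_spill++] = resident[0];
        spm_used -= resident[0];
        memmove(resident, resident + 1, --n_res * sizeof resident[0]);
    }
    resident[n_res++] = size;
    spm_used += size;
}

/* Compiler-inserted epilogue: pop the frame, then pull spilled caller
 * frames back into the SPM while they fit again. */
void stack_exit(void) {
    spm_used -= resident[--n_res];
    while (n_spill > 0 && spm_used + spilled[n_spill - 1] <= SPM_SIZE) {
        memmove(resident + 1, resident, n_res * sizeof resident[0]);
        resident[0] = spilled[--n_spill];
        n_res++;
        spm_used += resident[0];
    }
}
```

The invariant `spm_used <= SPM_SIZE` holds at every point, which is what lets a static analysis bound each stack access by the fixed SPM latency; the dissertation's contribution is choosing the optimal program locations for these operations, which this sketch does not model.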
Separation logic for high-level synthesis
High-level synthesis (HLS) promises a significant shortening of the digital hardware design cycle by raising the abstraction level of the design entry to high-level languages such as C/C++. However, applications using dynamic, pointer-based data structures remain difficult to implement well, yet such constructs are widely used in software. Automated optimisations that leverage the memory bandwidth of dedicated hardware implementations by distributing the application data over separate on-chip memories and parallelising the implementation are often ineffective in the presence of dynamic data structures, due to the lack of an automated analysis that disambiguates pointer-based memory accesses. This thesis takes a step towards closing this gap. We explore recent advances in separation logic, a rigorous mathematical framework that enables formal reasoning about the memory accesses of heap-manipulating programs. We develop a static analysis that automatically splits heap-allocated data structures into provably disjoint regions. Our algorithm focuses on dynamic data structures accessed in loops and is accompanied by automated source-to-source transformations which enable loop parallelisation and physical memory partitioning by off-the-shelf HLS tools.
We then extend the scope of our technique to pointer-based memory-intensive implementations that require access to an off-chip memory. The extended HLS design aid generates parallel on-chip multi-cache architectures. It uses the disjointness property of memory accesses to support non-overlapping memory regions with private caches. It also identifies regions which are shared after parallelisation and which are supported by parallel caches with a coherency mechanism and synchronisation, resulting in automatically specialised memory systems. We show up to 15x acceleration from heap partitioning, parallelisation and the insertion of the custom cache system in demonstrably practical applications.
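The kind of layout the analysis enables can be illustrated as follows (names and pool sizes are hypothetical, and the disjointness proof itself is not shown): two heap-allocated linked lists whose nodes are proven to occupy disjoint regions can be bound to separate on-chip memory banks, after which their traversals touch no common address and can be scheduled in parallel by an HLS tool.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node { int val; struct node *next; } node;

#define POOL 16
node bank0[POOL], bank1[POOL];   /* two disjoint on-chip memory banks */

/* Build an n-element list inside one bank; every node of the list is
 * confined to that bank's address range, which is what the separation
 * logic analysis would establish for genuinely dynamic allocation. */
node *build(node *pool, int n, int base) {
    for (int i = 0; i < n; i++) {
        pool[i].val  = base + i;
        pool[i].next = (i + 1 < n) ? &pool[i + 1] : NULL;
    }
    return &pool[0];
}

/* Each traversal touches only its own bank, so two calls on lists from
 * different banks are independent and parallelisable. */
int sum_list(const node *p) {
    int s = 0;
    for (; p; p = p->next) s += p->val;
    return s;
}
```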
Compiler-Decided Dynamic Memory Allocation for Scratch-Pad Based Embedded Systems
In this research we propose a highly predictable, low-overhead and yet dynamic memory allocation strategy for embedded systems with scratch-pad memory. A scratch-pad is a fast compiler-managed SRAM memory that replaces the hardware-managed cache. It is motivated by its better real-time guarantees versus caches and by its significantly lower overheads in energy consumption, area and overall runtime, even with a simple allocation scheme.

Scratch-pad allocation methods are primarily of two types. First, software-caching schemes emulate the workings of a hardware cache in software. Instructions are inserted before each load/store to check the software-maintained cache tags. Such methods incur large overheads in runtime, code size, energy consumption and SRAM space for tags, and deliver poor real-time guarantees, just like hardware caches. A second category of algorithms partitions variables at compile-time into the two banks. However, a drawback of such static allocation schemes is that they do not account for dynamic program behavior.

We propose a dynamic allocation methodology for global and stack data and program code that (i) accounts for changing program requirements at runtime, (ii) has no software-caching tags, (iii) requires no run-time checks, (iv) has extremely low overheads, and (v) yields 100% predictable memory access times. In this method, data that is about to be accessed frequently is copied into the scratch-pad using compiler-inserted code at fixed and infrequent points in the program. Earlier data is evicted if necessary. When compared to an existing static allocation scheme, results show that our scheme reduces runtime by up to 39.8% and energy by up to 31.3% for our benchmarks, depending on the SRAM size used. The actual gain depends on the SRAM size, but our results show that close to the maximum benefit in run-time and energy is achieved for a substantial range of small SRAM sizes commonly found in embedded systems. Our comparison with a direct-mapped cache shows that our method performs roughly as well as a cached architecture in runtime and energy while delivering better real-time benefits.
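The core mechanism can be sketched as follows (sizes and names are illustrative, not from the paper): a single compiler-inserted copy at an infrequent program point moves the soon-to-be-hot data into the scratch-pad, overwriting earlier data, and every subsequent access in the hot region needs no tag lookup or run-time check, so its latency is fixed.

```c
#include <assert.h>
#include <string.h>

#define SPM_WORDS 64

int dram_table[256];             /* slow off-chip memory */
int spm[SPM_WORDS];              /* fast, fixed-latency scratch-pad */

void fill_table(int v) {
    for (int i = 0; i < 256; i++) dram_table[i] = v;
}

/* The copy runs once, at a fixed and infrequent program point chosen by
 * the compiler; every access inside the hot loop then hits the SPM, and
 * no tag check or run-time test is ever executed. */
long hot_loop(void) {
    memcpy(spm, dram_table, sizeof spm);  /* compiler-inserted copy-in,
                                             evicting earlier SPM data */
    long acc = 0;
    for (int r = 0; r < 1000; r++)
        for (int i = 0; i < SPM_WORDS; i++)
            acc += spm[i];                /* 100% predictable access */
    return acc;
}
```

Contrast this with software caching, where each of the 64,000 accesses in the loop would be preceded by a tag check; here the only overhead is the one copy.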
Power-efficient data management for dynamic applications
In recent years, the semiconductor industry has turned its focus towards heterogeneous multi-processor platforms. They are an economically viable solution for coping with the growing setup and manufacturing cost of silicon systems. Furthermore, their inherent flexibility also perfectly supports the emerging market of interactive, mobile data and content services. The platform's performance and energy depend largely on how well the data-dominated services are mapped onto the memory subsystem. A crucial aspect thereby is how efficiently data is transferred between the different memory layers. Several compilation techniques have been developed to optimally use the available bandwidth. Unfortunately, they do not take into account the interaction between multiple threads running on the different processors, they only locally optimize the bandwidth, and they do not deal with the dynamic behavior of these applications. The contributions of this chapter are to outline the main limitations of current techniques and to introduce an approach for dealing with the dynamic multi-threaded behavior of our application domain.
Scaling of a large-scale simulation of synchronous slow-wave and asynchronous awake-like activity of a cortical model with long-range interconnections
Cortical synapse organization supports a range of dynamic states on multiple
spatial and temporal scales, from synchronous slow wave activity (SWA),
characteristic of deep sleep or anesthesia, to fluctuating, asynchronous
activity during wakefulness (AW). Such dynamic diversity poses a challenge for
producing efficient large-scale simulations that embody realistic metaphors of
short- and long-range synaptic connectivity. In fact, during SWA and AW
different spatial extents of the cortical tissue are active in a given timespan
and at different firing rates, which implies a wide variety of loads of local
computation and communication. A balanced evaluation of simulation performance
and robustness should therefore include tests of a variety of cortical dynamic
states. Here, we demonstrate performance scaling of our proprietary Distributed
and Plastic Spiking Neural Networks (DPSNN) simulation engine in both SWA and
AW for bidimensional grids of neural populations, which reflects the modular
organization of the cortex. We explored networks up to 192x192 modules, each
composed of 1250 integrate-and-fire neurons with spike-frequency adaptation,
and exponentially decaying inter-modular synaptic connectivity with varying
spatial decay constant. For the largest networks the total number of synapses
was over 70 billion. The execution platform included up to 64 dual-socket
nodes, each socket mounting 8 Intel Xeon Haswell processor cores @ 2.40GHz
clock rates. Network initialization time, memory usage, and execution time
showed good scaling performances from 1 to 1024 processes, implemented using
the standard Message Passing Interface (MPI) protocol. We achieved simulation
speeds of between 2.3x10^9 and 4.1x10^9 synaptic events per second for both
cortical states in the explored range of inter-modular interconnections.
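The neuron model named in the abstract can be illustrated with a minimal single-neuron sketch (all constants are illustrative, not the DPSNN parameters): an Euler-stepped leaky integrate-and-fire neuron whose adaptation current grows with each spike and decays slowly, so constant input produces progressively longer inter-spike intervals.

```c
#include <assert.h>

/* Leaky integrate-and-fire neuron with spike-frequency adaptation.
 * Each spike increments an adaptation current w, which decays slowly
 * and is subtracted from the input, lengthening later intervals. */
typedef struct { double v, w; } neuron;

int step(neuron *n, double input, double dt) {
    const double tau_v = 20.0, tau_w = 200.0;  /* ms, illustrative */
    const double v_th = 1.0, v_reset = 0.0, b = 0.2;
    n->w += dt * (-n->w / tau_w);              /* adaptation decays    */
    n->v += dt * ((-n->v + input - n->w) / tau_v);
    if (n->v >= v_th) {                        /* spike: reset + adapt */
        n->v  = v_reset;
        n->w += b;
        return 1;
    }
    return 0;
}

/* Drive one neuron with constant input; report the spike count and the
 * first and last inter-spike intervals. */
int run_trial(int steps, double input, int *first_isi, int *last_isi) {
    neuron n = {0.0, 0.0};
    int spikes = 0, last_t = 0;
    *first_isi = *last_isi = 0;
    for (int t = 1; t <= steps; t++)
        if (step(&n, input, 1.0)) {
            int isi = t - last_t;
            if (spikes == 0) *first_isi = isi;
            *last_isi = isi;
            last_t = t;
            spikes++;
        }
    return spikes;
}
```

In the full simulation, millions of such updates per module are interleaved with MPI exchange of spike events between modules; this sketch shows only the local per-neuron computation.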
Parallel object-oriented algorithms for simulation of multiphysics : application to thermal systems
Present and future expectations in parallel computing pose a generational change in simulation and computing. Modern High Performance Computing (HPC) facilities have high computational power in terms of operations per second, today peta-FLOPS (10^15 FLOPS) and growing toward the exascale (10^18 FLOPS), which is expected in a few years. This opens the way for using simulation tools in a wide range of new engineering and scientific applications. For example, CFD&HT codes will be effectively used in the design phase of industrial devices, obtaining valuable information with reasonable time expenses. However, the use of the emerging computer architectures depends on enhancements and innovation in software design patterns. So far, powerful codes for individually studying heat and mass transfer phenomena at multiple levels of modeling are available. However, there is no way to combine them for resolving complex coupled problems, in which several physical phenomena interact simultaneously. In this context, this PhD thesis presents the development of parallel methodologies, and their implementation as an object-oriented software platform, for the simulation of multiphysics systems. By means of this new software platform, called NEST, the distinct codes can now be integrated into single simulation tools for specific applications of social and industrial interest. This is done in an intuitive and simple way, so that researchers do not have to worry either about the coexistence of several codes at the same time or about how they interact with each other. The coupling of the involved components is controlled from a low-level code layer that is transparent to the users. This brings appealing benefits, first to software project management and later to the flexibility and features of the simulations. In sum, the presented approach poses a new paradigm in the production of physics simulation programs. Although the thesis pursues general-purpose applications, special emphasis is placed on the simulation of thermal systems, in particular on building energy assessment and on hermetic reciprocating compressors.
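The coupling idea can be sketched with a toy example (the interface below is hypothetical, not the actual NEST API): each physics component exposes the same step() signature, and a low-level coupling layer iterates the components, exchanging interface values, until the coupled residual converges, without knowing anything about the physics behind each component.

```c
#include <assert.h>
#include <math.h>

/* Every component exposes a uniform step() interface; the coupler
 * needs no knowledge of the physics it encapsulates. */
typedef struct component {
    double state;
    double (*step)(struct component *self, double boundary);
} component;

/* Toy "physics": relax the component state toward the value imposed
 * by its neighbour at the shared interface. */
double relax(component *c, double boundary) {
    c->state += 0.5 * (boundary - c->state);
    return c->state;
}

/* Low-level coupling layer, transparent to the user: iterate the two
 * components, exchanging interface values, until they agree. */
double couple(component *a, component *b, double tol) {
    double res;
    do {
        double va = a->step(a, b->state);
        double vb = b->step(b, a->state);
        res = fabs(va - vb);
    } while (res > tol);
    return a->state;
}
```

Adding a new physics code then means implementing the step() interface once, rather than rewriting the interaction logic for every pair of coupled codes.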