302 research outputs found
Coarse-grained reconfigurable array architectures
Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code
Auto-tuning Distributed Stream Processing Systems using Reinforcement Learning
Fine tuning distributed systems is considered to be a craftsmanship, relying
on intuition and experience. This becomes even more challenging when the
systems need to react in near real time, as streaming engines have to do to
maintain pre-agreed service quality metrics. In this article, we present an
automated approach that builds on a combination of supervised and reinforcement
learning methods to recommend the most appropriate lever configurations based
on previous load. With this, streaming engines can be automatically tuned
without requiring a human to determine the right way and proper time to deploy
them. This opens the door to new configurations that are not being applied
today since the complexity of managing these systems has surpassed the
abilities of human experts. We show how reinforcement learning systems can find
substantially better configurations in less time than their human counterparts
and adapt to changing workloads
Understanding Optimization Phase Interactions to Reduce the Phase Order Search Space
Compiler optimization phase ordering is a longstanding problem, and is of particular relevance to the performance-oriented and cost-constrained domain of embedded systems applications. Optimization phases are known to interact with each other, enabling and disabling opportunities for successive phases. Therefore, varying the order of applying these phases often generates distinct output codes, with different speed, code-size and power consumption characteristics. Most cur- rent approaches to address this issue focus on developing innovative methods to selectively evaluate the vast phase order search space to produce a good (but, potentially suboptimal) representation for each program. In contrast, the goal of this thesis is to study and reduce the phase order search space by: (1) identifying common causes of optimization phase interactions across all phases, and then devising techniques to eliminate them, and (2) exploiting natural phase independence to prune the phase order search space. We observe that several phase interactions are caused by false register dependence during many optimization phases. We explore the potential of cleanup phases, such as register remapping and copy propagation, at reducing false dependences. We show that innovative implementation and application of these phases not only reduces the size of the phase order search space substantially, but can also improve the quality of code generated by optimizing compilers. We examine the effect of removing cleanup phases, such as dead assignment elimination, which should not interact with other compiler phases, from the phase order search space. Finally, we show that reorganization of the phase order search into a multi-staged approach employing sets of mutually independent optimizations can reduce the search space to a fraction of its original size without sacrificing performance
HyFM: Function Merging for Free
Function merging is an important optimization for reducing code size. It merges multiple functions into a single one, eliminating duplicate code among them. The existing state-of-the-art relies on a well-known sequence alignment algorithm to identify duplicate code across whole functions. However, this algorithm is quadratic in time and space on the number of instructions. This leads to very high time overheads and prohibitive levels of memory usage even for medium-sized benchmarks. For larger programs, it becomes impractical.
This is made worse by an overly eager merging approach. All selected pairs of functions will be merged. Only then will this approach estimate the potential benefit from merging and decide whether to replace the original functions with the merged one. Given that most pairs are unprofitable, a significant amount of time is wasted producing merged functions that are simply thrown away.
In this paper, we propose HyFM, a novel function merging technique that delivers similar levels of code size reduction for significantly lower time overhead and memory usage. Unlike the state-of-the-art, our alignment strategy works at the block level. Since basic blocks are usually much shorter than functions, even a quadratic alignment is acceptable. However, we also propose a linear algorithm for aligning blocks of the same size at a much lower cost. We extend this strategy with a multi-tier profitability analysis that bails out early from unprofitable merging attempts. By aligning individual pairs of blocks, we are able to decide their alignment’s profitability separately and before actually generating code.
Experimental results on SPEC 2006 and 2017 show that HyFM needs orders of magnitude less memory, using up to 48 MB or 5.6 MB, depending on the variant used, while the state-of-the-art requires 32 GB in the worst case. HyFM also runs over 4.5×× faster, while still achieving comparable code size reduction. Combined with the speedup of later compilation stages due to the reduced number of functions, HyFM contributes to a reduced end-to-end compilation time
goSLP: Globally Optimized Superword Level Parallelism Framework
Modern microprocessors are equipped with single instruction multiple data
(SIMD) or vector instruction sets which allow compilers to exploit superword
level parallelism (SLP), a type of fine-grained parallelism. Current SLP
auto-vectorization techniques use heuristics to discover vectorization
opportunities in high-level language code. These heuristics are fragile, local
and typically only present one vectorization strategy that is either accepted
or rejected by a cost model. We present goSLP, a novel SLP auto-vectorization
framework which solves the statement packing problem in a pairwise optimal
manner. Using an integer linear programming (ILP) solver, goSLP searches the
entire space of statement packing opportunities for a whole function at a time,
while limiting total compilation time to a few minutes. Furthermore, goSLP
optimally solves the vector permutation selection problem using dynamic
programming. We implemented goSLP in the LLVM compiler infrastructure,
achieving a geometric mean speedup of 7.58% on SPEC2017fp, 2.42% on SPEC2006fp
and 4.07% on NAS benchmarks compared to LLVM's existing SLP auto-vectorizer.Comment: Published at OOPSLA 201
Optimizing the FPGA memory design for a Sobel edge detector
This research explored different memory systems on FPGA chips in order to show the various trade-offs involved with choosing one memory system over another. We explored the different memory components that are found on FPGA chips using the example of a Sobel edge detector. We demonstrated how the different FPGA chip’s memories affected I/O performance and area. By exploiting the trade-offs between these a designer should be able to find an optimal on-chip memory system for a given application. Given further study, we believe we can develop application-specific memory templates that can be used with a hardware compiler to generate optimal on-chip memory system
A switchable approach to large object allocation in real-time Java
Over the last 20 years object-oriented programming languages and managed run-times like Java have been very popular because of their software engineering benefits. Despite their popularity in many application areas, they have not been considered suitable for real-time programming. Besides many other factors, one of the barriers that prevent their acceptance in the development of real-time systems is the long pause times that may arise during large object allocation. This paper examines different kinds of solutions that have been developed so far and introduces a switchable approach to large object allocation in real-time Java. A synthetic benchmark application that is developed to evaluate the effectiveness of the presented technique against other currently implemented techniques is also described
Worst-case analysis of heap allocations
Abstract. In object oriented languages, dynamic memory allocation is a fundamental concept. When using such a language in hard real-time systems, it becomes important to bound both the worst-case execution time and the worst-case memory consumption. In this paper, we present an analysis to determine the worst-case heap allocations of tasks. The analysis builds upon techniques that are well established for worst-case execution time analysis. The difference is that the cost function is not the execution time of instructions in clock cycles, but the allocation in bytes. In contrast to worst-case execution time analysis, worst-case heap allocation analysis is not processor dependent. However, the cost function depends on the object layout of the runtime system. The analysis is evaluated with several real-time benchmarks to establish the usefulness of the analysis, and to compare the memory consumption of different object layouts.
- …