582 research outputs found
Static analysis of energy consumption for LLVM IR programs
Energy models can be constructed by characterizing the energy consumed by
executing each instruction in a processor's instruction set. This can be used
to determine how much energy is required to execute a sequence of assembly
instructions, without the need to instrument or measure hardware.
However, statically analyzing low-level program structures is hard, and the
gap between the high-level program structure and the low-level energy models
needs to be bridged. We have developed techniques for performing a static
analysis on the intermediate compiler representations of a program.
Specifically, we target LLVM IR, a representation used by modern compilers,
including Clang. Using these techniques we can automatically infer an estimate
of the energy consumed when running a function under different platforms, using
different compilers.
One of the challenges in doing so is that of determining an energy cost of
executing LLVM IR program segments, for which we have developed two different
approaches. When this information is used in conjunction with our analysis, we
are able to infer energy formulae that characterize the energy consumption for
a particular program. This approach can be applied to any languages targeting
the LLVM toolchain, including C and XC or architectures such as ARM Cortex-M or
XMOS xCORE, with a focus towards embedded platforms. Our techniques are
validated on these platforms by comparing the static analysis results to the
physical measurements taken from the hardware. Static energy consumption
estimation enables energy-aware software development, without requiring
hardware knowledge
Towards a Time-predictable Dual-Issue Microprocessor: The Patmos Approach
Current processors are optimized for average case performance, often leading to a high worst-case execution time (WCET). Many architectural features that increase the average case performance are hard to be modeled for the WCET analysis. In this paper we present Patmos, a processor optimized for low WCET bounds rather than high average case performance. Patmos is a dual-issue, statically scheduled RISC processor. The instruction cache is organized as a method cache and the data cache is organized as a split cache in order to simplify the cache WCET analysis. To fill the dual-issue pipeline with enough useful instructions, Patmos relies on a customized compiler. The compiler also plays a central role in optimizing the application for the WCET instead of average case performance
Runtime Optimizations for Prediction with Tree-Based Models
Tree-based models have proven to be an effective solution for web ranking as
well as other problems in diverse domains. This paper focuses on optimizing the
runtime performance of applying such models to make predictions, given an
already-trained model. Although exceedingly simple conceptually, most
implementations of tree-based models do not efficiently utilize modern
superscalar processor architectures. By laying out data structures in memory in
a more cache-conscious fashion, removing branches from the execution flow using
a technique called predication, and micro-batching predictions using a
technique called vectorization, we are able to better exploit modern processor
architectures and significantly improve the speed of tree-based models over
hard-coded if-else blocks. Our work contributes to the exploration of
architecture-conscious runtime implementations of machine learning algorithms
An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor
Modern system-on-chips augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and thus have evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a customised VLIW chip multiprocessor (CMP) architecture, known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework on a highly precise cycle-accurate simulator. This is achieved through the execution of 12 benchmarks across 240 different machine configurations, as well as further results utilising an incomplete development branch of the compiler. It is shown that the problems generally scale well with the LE1 architecture, up to eight cores, when the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (x9 for the bitonic sort benchmark with 8 dual-issue cores) with further improvements from compiler optimisations (x14 for bitonic with the same configuration
VThreads: A novel VLIW chip multiprocessor with hardware-assisted PThreads
We discuss VThreads, a novel VLIW CMP with hardware-assisted shared-memory Thread support. VThreads supports Instruction Level Parallelism via static multiple-issue and Thread Level Parallelism via hardware-assisted POSIX Threads along with extensive customization. It allows the instantiation of tightlycoupled streaming accelerators and supports up to 7-address Multiple-Input, Multiple-Output instruction extensions. VThreads is designed in technology-independent Register-Transfer-Level VHDL and prototyped on 40 nm and 28 nm Field-Programmable gate arrays. It was evaluated against a PThreads-based multiprocessor
based on the Sparc-V8 ISA. On a 65 nm ASIC implementation VThreads achieves up to x7.2
performance increase on synthetic benchmarks, x5 on a parallel Mandelbrot implementation, 66% better on a threaded JPEG implementation, 79% better on an edge-detection benchmark and ~13% improvement on DES compared to the Leon3MP CMP. In the range of 2 to 8 cores VThreads demonstrates a post-route (statistical) power reduction between 65% to 57% at an area increase of 1.2%-10% for 1-8 cores, compared to a similarly-configured Leon3MP CMP. This combination of micro-architectural features, scalability, extensibility,
hardware support for low-latency PThreads, power efficiency and area make the processor an attractive proposition for low-power, deeply-embedded applications requiring minimum OS support
A Survey on Compiler Autotuning using Machine Learning
Since the mid-1990s, researchers have been trying to use machine-learning
based approaches to solve a number of different compiler optimization problems.
These techniques primarily enhance the quality of the obtained results and,
more importantly, make it feasible to tackle two main compiler optimization
problems: optimization selection (choosing which optimizations to apply) and
phase-ordering (choosing the order of applying optimizations). The compiler
optimization space continues to grow due to the advancement of applications,
increasing number of compiler optimizations, and new target architectures.
Generic optimization passes in compilers cannot fully leverage newly introduced
optimizations and, therefore, cannot keep up with the pace of increasing
options. This survey summarizes and classifies the recent advances in using
machine learning for the compiler optimization field, particularly on the two
major problems of (1) selecting the best optimizations and (2) the
phase-ordering of optimizations. The survey highlights the approaches taken so
far, the obtained results, the fine-grain classification among different
approaches and finally, the influential papers of the field.Comment: version 5.0 (updated on September 2018)- Preprint Version For our
Accepted Journal @ ACM CSUR 2018 (42 pages) - This survey will be updated
quarterly here (Send me your new published papers to be added in the
subsequent version) History: Received November 2016; Revised August 2017;
Revised February 2018; Accepted March 2018
Detecting and Understanding Dynamically Dead Instructions for Contemporary Architectures
Instructions executed by the processor are dynamically dead if the values they produce are not used by the program. Executing such useless instructions can potentially slow-down program execution and waste power. The goal of this work is to quantify and understand the occurrence of dynamically dead instructions (DDI) for programs compiled using modern compilers for the most relevant contemporary architectures. We expect our extensive study to highlight the issue of DDI and to play a critical role in the development of compiler and/or architectural techniques to avoid DDI execution at runtime. In this thesis, we introduce our novel GCC&ndashbased instrumentation and analysis framework to determine DDI during program execution. We present the ratio and characteristics of DDI in our benchmark programs. We find that programs compiled with GCC (with and without optimizations) execute a significant fraction of DDI on x86 and ARM based machines. Additionally, an ample amount of predication employed by GCC results in a large fraction of executed instructions on the ARM to be dynamically dead. We observe that a handful of static instructions contribute a large majority to the overall DDI in standard benchmark programs. We also find that employing a small amount of static context information can significantly benefit the detection of DDI at run&ndashtime. Additionally, we describe the results of our manual study to analyze and categorize the DDI instances in our x86 benchmarks. We briefly outline compiler and architecture based techniques that can be used to eliminate each category of DDI in future programs. Overall, we believe that a close synergy between compiler and architecture techniques may be the most effective strategy to eliminate DDI to improve sequential program performance and energy efficiency on modern machines
Optimizing SIMD execution in HW/SW co-designed processors
SIMD accelerators are ubiquitous in microprocessors from different computing domains. Their high compute power and hardware simplicity improve overall performance in an energy efficient manner. Moreover, their replicated functional units and simple control mechanism make them amenable to scaling to higher vector lengths. However, code generation for these accelerators has been a challenge from the days of their inception. Compilers generate vector code conservatively to ensure correctness. As a result they lose significant vectorization opportunities and fail to extract maximum benefits out of SIMD accelerators.
This thesis proposes to vectorize the program binary at runtime in a speculative manner, in addition to the compile time static vectorization. There are different environments that support runtime profiling and optimization support required for dynamic vectorization, one of most prominent ones being: 1) Dynamic Binary Translators and Optimizers (DBTO) and 2) Hardware/Software (HW/SW) Co-designed Processors. HW/SW co-designed environment provides several advantages over DBTOs like transparent incorporations of new hardware features, binary compatibility, etc. Therefore, we use HW/SW co-designed environment to assess the potential of speculative dynamic vectorization.
Furthermore, we analyze vector code generation for wider vector units and find out that even though SIMD accelerators are amenable to scaling from the hardware point of view, vector code generation at higher vector length is even more challenging. The two major factors impeding vectorization for wider SIMD units are: 1) Reduced dynamic instruction stream coverage for vectorization and 2) Large number of permutation instructions. To solve the first problem we propose Variable Length Vectorization that iteratively vectorizes for multiple vector lengths to improve dynamic instruction stream coverage. Secondly, to reduce the number of permutation instructions we propose Selective Writing that selectively writes to different parts of a vector register and avoids permutations.
Finally, we tackle the problem of leakage energy in SIMD accelerators. Since SIMD accelerators consume significant amount of real estate on the chip, they become the principle source of leakage if not utilized judiciously. Power gating is one of the most widely used techniques to reduce leakage energy of functional units. However, power gating has its own energy and performance overhead associated with it. We propose to selectively devectorize the vector code when higher SIMD lanes are used intermittently. This selective devectorization keeps the higher SIMD lanes idle and power gated for maximum duration. Therefore, resulting in overall leakage energy reduction.Postprint (published version
- âŠ