Search CORE

1,184 research outputs found

Optimizing SIMD execution in HW/SW co-designed processors

Author: Kumar Rakesh
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2014
Field of study

SIMD accelerators are ubiquitous in microprocessors from different computing domains. Their high compute power and hardware simplicity improve overall performance in an energy efficient manner. Moreover, their replicated functional units and simple control mechanism make them amenable to scaling to higher vector lengths. However, code generation for these accelerators has been a challenge from the days of their inception. Compilers generate vector code conservatively to ensure correctness. As a result they lose significant vectorization opportunities and fail to extract maximum benefits out of SIMD accelerators. This thesis proposes to vectorize the program binary at runtime in a speculative manner, in addition to the compile time static vectorization. There are different environments that support runtime profiling and optimization support required for dynamic vectorization, one of most prominent ones being: 1) Dynamic Binary Translators and Optimizers (DBTO) and 2) Hardware/Software (HW/SW) Co-designed Processors. HW/SW co-designed environment provides several advantages over DBTOs like transparent incorporations of new hardware features, binary compatibility, etc. Therefore, we use HW/SW co-designed environment to assess the potential of speculative dynamic vectorization. Furthermore, we analyze vector code generation for wider vector units and find out that even though SIMD accelerators are amenable to scaling from the hardware point of view, vector code generation at higher vector length is even more challenging. The two major factors impeding vectorization for wider SIMD units are: 1) Reduced dynamic instruction stream coverage for vectorization and 2) Large number of permutation instructions. To solve the first problem we propose Variable Length Vectorization that iteratively vectorizes for multiple vector lengths to improve dynamic instruction stream coverage. Secondly, to reduce the number of permutation instructions we propose Selective Writing that selectively writes to different parts of a vector register and avoids permutations. Finally, we tackle the problem of leakage energy in SIMD accelerators. Since SIMD accelerators consume significant amount of real estate on the chip, they become the principle source of leakage if not utilized judiciously. Power gating is one of the most widely used techniques to reduce leakage energy of functional units. However, power gating has its own energy and performance overhead associated with it. We propose to selectively devectorize the vector code when higher SIMD lanes are used intermittently. This selective devectorization keeps the higher SIMD lanes idle and power gated for maximum duration. Therefore, resulting in overall leakage energy reduction.Postprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules

Author: Dong Rongchao
Jiang Jinhu
Liang Chaoyi
Wang Wenwen
Yang Zhaohui
Yew Pen-Chung
Zhang Weihua
Zhou Zhongjun
Publication venue
Publication date: 14/02/2024
Field of study

System-level emulators have been used extensively for system design, debugging and evaluation. They work by providing a system-level virtual machine to support a guest operating system (OS) running on a platform with the same or different native OS that uses the same or different instruction-set architecture. For such system-level emulation, dynamic binary translation (DBT) is one of the core technologies. A recently proposed learning-based DBT approach has shown a significantly improved performance with a higher quality of translated code using automatically learned translation rules. However, it has only been applied to user-level emulation, and not yet to system-level emulation. In this paper, we explore the feasibility of applying this approach to improve system-level emulation, and use QEMU to build a prototype. ... To achieve better performance, we leverage several optimizations that include coordination overhead reduction to reduce the overhead of each coordination, and coordination elimination and code scheduling to reduce the coordination frequency. Experimental results show that it can achieve an average of 1.36X speedup over QEMU 6.1 with negligible coordination overhead in the system emulation mode using SPEC CINT2006 as application benchmarks and 1.15X on real-world applications.Comment: 10 pages, 19 figures, to be published in International Symposium on Code Generation and Optimization (CGO) 202

arXiv.org e-Print Archive

The Janus triad: Exploiting parallelism through dynamic binary modification

Author: Erdős M
Jones TM
Wort G
Zhou R
Publication venue: VEE 2019 - Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments
Publication date: 01/01/2019
Field of study

We present a unified approach for exploiting thread-level, data-level, and memory-level parallelism through a same-ISA dynamic binary modifier guided by static binary analysis. A static binary analyser first examines an executable and determines the operations required to extract parallelism at runtime, encoding them as a series of rewrite rules that a dynamic binary modifier uses to perform binary transformation. We demonstrate this framework by exploiting three different kinds of parallelism to perform automatic vectorisation, software prefetching, and automatic parallelisation together on legacy application binaries. Software prefetch insertion alone achieves an average speedup of 1.2×, comparing favourably with an automatic compiler pass. Automatic vectorisation brings speedups of 2.7× on the TSVC benchmarks, significantly beating a compiler approach for some workloads. Finally, combining prefetching, vectorisation, and parallelisation realises a speedup of 3.8× on a representative application loop

Crossref

Apollo (Cambridge)

NANOCONTROLLER PROGRAM OPTIMIZATION USING ITE DAGS

Author: Rajachidambaram Sarojini Priyadarshini
Publication venue: UKnowledge
Publication date: 01/01/2007
Field of study

Kentucky Architecture nanocontrollers employ a bit-serial SIMD-parallel hardware design to execute MIMD control programs. A MIMD program is transformed into equivalent SIMD code by a process called Meta-State Conversion (MSC), which makes heavy use of enable masking to distinguish which code should be executed by each processing element. Both the bit-serial operations and the enable masking imposed on them are expressed in terms of if-then-else (ITE) operations implemented by a 1-of-2 multiplexor, greatly simplifying the hardware. However, it takes a lot of ITEs to implement even a small program fragment. Traditionally, bit-serial SIMD machines had been programmed by expanding a fixed bitserial pattern for each word-level operation. Instead, nanocontrollers can make use of the fact that ITEs are equivalent to the operations in Binary Decision Diagrams (BDDs), and can apply BDD analysis to optimize the ITEs. This thesis proposes and experimentally evaluates a number of techniques for minimizing the complexity of the BDDs, primarily by manipulating normalization ordering constraints. The best method found is a new approach in which a simple set of optimization transformations is followed by normalization using an ordering determined by a Genetic Algorithm (GA)

University of Kentucky

goSLP: Globally Optimized Superword Level Parallelism Framework

Author: Amarasinghe Saman
Mendis Charith
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 30/10/2018
Field of study

Modern microprocessors are equipped with single instruction multiple data (SIMD) or vector instruction sets which allow compilers to exploit superword level parallelism (SLP), a type of fine-grained parallelism. Current SLP auto-vectorization techniques use heuristics to discover vectorization opportunities in high-level language code. These heuristics are fragile, local and typically only present one vectorization strategy that is either accepted or rejected by a cost model. We present goSLP, a novel SLP auto-vectorization framework which solves the statement packing problem in a pairwise optimal manner. Using an integer linear programming (ILP) solver, goSLP searches the entire space of statement packing opportunities for a whole function at a time, while limiting total compilation time to a few minutes. Furthermore, goSLP optimally solves the vector permutation selection problem using dynamic programming. We implemented goSLP in the LLVM compiler infrastructure, achieving a geometric mean speedup of 7.58% on SPEC2017fp, 2.42% on SPEC2006fp and 4.07% on NAS benchmarks compared to LLVM's existing SLP auto-vectorizer.Comment: Published at OOPSLA 201

arXiv.org e-Print Archive

DSpace@MIT

Vapor SIMD: Auto-Vectorize Once, Run Everywhere

Author: Cohen Albert
Dyshel Sergei
Nuzman Dorit
Rohou Erven
Rosen Ira
Williams Kevin
Yuste David
Zaks Ayal
Publication venue: HAL CCSD
Publication date: 01/01/2011
Field of study

International audienceJust-in-Time (JIT) compiler technology offers portability while facilitating target- and context-specific specialization. Single-Instruction-Multiple-Data (SIMD) hardware is ubiquitous and markedly diverse, but can be difficult for JIT compilers to efficiently target due to resource and budget constraints. We present our design for a synergistic auto-vectorizing compilation scheme. The scheme is composed of an aggressive, generic offline stage coupled with a lightweight, target-specific online stage. Our method leverages the optimized intermediate results provided by the first stage across disparate SIMD architectures from different vendors, having distinct characteristics ranging from different vector sizes, memory alignment and access constraints, to special computational idioms.We demonstrate the effectiveness of our design using a set of kernels that exercise innermost loop, outer loop, as well as straight-line code vectorization, all automatically extracted by the common offline compilation stage. This results in performance comparable to that provided by specialized monolithic offline compilers. Our framework is implemented using open-source tools and standards, thereby promoting interoperability and extendibility

HAL-CentraleSupelec

INRIA a CCSD electronic archive server

Hal-Diderot

HAL-Rennes 1