Search CORE

2 research outputs found

Helium: lifting high-performance stencil kernels from stripped x86 binaries to halide DSL code

Author: Amarasinghe Saman P.
Bosboom Jeffrey William
Kamil Shoaib
Mendis Thirimadura C. Yasend
Paris Sylvain
Ragan-Kelley Jonathan
Wu Kevin
Zhao Qin
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/06/2015
Field of study

Highly optimized programs are prone to bit rot, where performance quickly becomes suboptimal in the face of new hardware and compiler techniques. In this paper we show how to automatically lift performance-critical stencil kernels from a stripped x86 binary and generate the corresponding code in the high-level domain-specific language Halide. Using Halide’s state-of-the-art optimizations targeting current hardware, we show that new optimized versions of these kernels can replace the originals to rejuvenate the application for newer hardware. The original optimized code for kernels in stripped binaries is nearly impossible to analyze statically. Instead, we rely on dynamic traces to regenerate the kernels. We perform buffer structure reconstruction to identify input, intermediate and output buffer shapes. We abstract from a forest of concrete dependency trees which contain absolute memory addresses to symbolic trees suitable for high-level code generation. This is done by canonicalizing trees, clustering them based on structure, inferring higher-dimensional buffer accesses and finally by solving a set of linear equations based on buffer accesses to lift them up to simple, high-level expressions. Helium can handle highly optimized, complex stencil kernels with input-dependent conditionals. We lift seven kernels from Adobe Photoshop giving a 75% performance improvement, four kernels from IrfanView, leading to 4.97× performance, and one stencil from the miniGMG multigrid benchmark netting a 4.25× improvement in performance. We manually rejuvenated Photoshop by replacing eleven of Photoshop’s filters with our lifted implementations, giving 1.12× speedup without affecting the user experience.United States. Dept. of Energy (Award DE-SC0005288)United States. Dept. of Energy (Award DE-SC0008923)United States. Defense Advanced Research Projects Agency (Agreement FA8759-14-2-0009)MIT Energy Initiative (Fellowship

DSpace@MIT

Crossref

BHive: A Benchmark Suite and Measurement Framework for Validating x86-64 Basic Block Performance Models

Author: Amarasinghe Saman P
Atkinson Eric Hamilton
Brahmakshatriya Ajay
Carbin Michael James
Chen Yishen
Mendis Thirimadura C. Yasend
Renda Alex
Sykora Ondrej
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 23/11/2020
Field of study

Compilers and performance engineers use hardware performance models to simplify program optimizations. Performance models provide a necessary abstraction over complex modern processors. However, constructing and maintaining a performance model can be onerous, given the numerous microarchi-tectural optimizations employed by modern processors. Despite their complexity and reported inaccuracy (e.g., deviating from native measurement by more than 30%), existing performance models-such as IACA and llvm-mca-have not been systematically validated, because there is no scalable machine code profiler that can automatically obtain throughput of arbitrary basic blocks while conforming to common modeling assumptions. In this paper, we present a novel profiler that can profile arbitrary memory-accessing basic blocks without any user intervention. We used this profiler to build BHive, a benchmark for systematic validation of performance models of x86-64 basic blocks. We used BHive to evaluate four existing performance models: IACA, llvm-mca, Ithemal, and OSACA. We automatically cluster basic blocks in the benchmark suite based on their utilization of CPU resources. Using this clustering, our benchmark can give a detailed analysis of a performance model's strengths and weaknesses on different workloads (e.g., vectorized vs. scalar basic blocks). We additionally demonstrate that our dataset well captures basic properties of two Google applications: Spanner and Dremel

DSpace@MIT