Search CORE

6 research outputs found

Ubiquitous Memory Introspection (Preliminary Manuscript)

Author: Amarasinghe Saman
Rabbah Rodric
Rudolph Larry
Wong Weng-Fai
Zhao Qin
Publication venue
Publication date: 25/09/2006
Field of study

Modern memory systems play a critical role in the performance ofapplications, but a detailed understanding of the application behaviorin the memory system is not trivial to attain. It requires timeconsuming simulations of the memory hierarchy using long traces, andoften using detailed modeling. It is increasingly possible to accesshardware performance counters to measure events in the memory system,but the measurements remain coarse grained, better suited forperformance summaries than providing instruction level feedback. Theavailability of a low cost, online, and accurate methodology forderiving fine-grained memory behavior profiles can prove extremelyuseful for runtime analysis and optimization of programs.This paper presents a new methodology for Ubiquitous MemoryIntrospection (UMI). It is an online and lightweight mini-simulationmethodology that focuses on simulating short memory access tracesrecorded from frequently executed code regions. The simulations arefast and can provide profiling results at varying granularities, downto that of a single instruction or address. UMI naturally complementsruntime optimizations techniques and enables new opportunities formemory specific optimizations.In this paper, we present a prototype implementation of a runtimesystem implementing UMI. The prototype is readily deployed oncommodity processors, requires no user intervention, and can operatewith stripped binaries and legacy software. The prototype operateswith an average runtime overhead of 20% but this slowdown is only 6%slower than a state of the art binary instrumentation tool. We used32 benchmarks, including the full suite of SPEC2000 benchmarks, forour evaluation. We show that the mini-simulation results accuratelyreflect the cache performance of two existing memory systems, anIntel Pentium~4 and an AMD Athlon MP (K7) processor. We alsodemonstrate that low level profiling information from the onlinesimulation can serve to identify high-miss rate load instructions with a77% rate of accuracy compared to full offline simulations thatrequired days to complete. The online profiling results are used atruntime to implement a simple software prefetching strategy thatachieves a speedup greater than 60% in the best case

DSpace@MIT

Helium: lifting high-performance stencil kernels from stripped x86 binaries to halide DSL code

Author: Amarasinghe Saman P.
Bosboom Jeffrey William
Kamil Shoaib
Mendis Thirimadura C. Yasend
Paris Sylvain
Ragan-Kelley Jonathan
Wu Kevin
Zhao Qin
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/06/2015
Field of study

Highly optimized programs are prone to bit rot, where performance quickly becomes suboptimal in the face of new hardware and compiler techniques. In this paper we show how to automatically lift performance-critical stencil kernels from a stripped x86 binary and generate the corresponding code in the high-level domain-specific language Halide. Using Halide’s state-of-the-art optimizations targeting current hardware, we show that new optimized versions of these kernels can replace the originals to rejuvenate the application for newer hardware. The original optimized code for kernels in stripped binaries is nearly impossible to analyze statically. Instead, we rely on dynamic traces to regenerate the kernels. We perform buffer structure reconstruction to identify input, intermediate and output buffer shapes. We abstract from a forest of concrete dependency trees which contain absolute memory addresses to symbolic trees suitable for high-level code generation. This is done by canonicalizing trees, clustering them based on structure, inferring higher-dimensional buffer accesses and finally by solving a set of linear equations based on buffer accesses to lift them up to simple, high-level expressions. Helium can handle highly optimized, complex stencil kernels with input-dependent conditionals. We lift seven kernels from Adobe Photoshop giving a 75% performance improvement, four kernels from IrfanView, leading to 4.97× performance, and one stencil from the miniGMG multigrid benchmark netting a 4.25× improvement in performance. We manually rejuvenated Photoshop by replacing eleven of Photoshop’s filters with our lifted implementations, giving 1.12× speedup without affecting the user experience.United States. Dept. of Energy (Award DE-SC0005288)United States. Dept. of Energy (Award DE-SC0008923)United States. Defense Advanced Research Projects Agency (Agreement FA8759-14-2-0009)MIT Energy Initiative (Fellowship

DSpace@MIT

Crossref

Transparent dynamic instrumentation

Author: Amarasinghe Saman P.
Bruening Derek
Zhao Qin
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2012
Field of study

Process virtualization provides a virtual execution environment within which an unmodified application can be monitored and controlled while it executes. The provided layer of control can be used for purposes ranging from sandboxing to compatibility to profiling. The additional operations required for this layer are performed clandestinely alongside regular program execution. Software dynamic instrumentation is one method for implementing process virtualization which dynamically instruments an application such that the application's code and the inserted code are interleaved together. DynamoRIO is a process virtualization system implemented using software code cache techniques that allows users to build customized dynamic instrumentation tools. There are many challenges to building such a runtime system. One major obstacle is transparency. In order to support executing arbitrary applications, DynamoRIO must be fully transparent so that an application cannot distinguish between running inside the virtual environment and native execution. In addition, any desired extra operations for a particular tool must avoid interfering with the behavior of the application. Transparency has historically been provided on an ad-hoc basis, as a reaction to observed problems in target applications. This paper identifies a necessary set of transparency requirements for running mainstream Windows and Linux applications. We discuss possible solutions to each transparency issue, evaluate tradeoffs between different choices, and identify cases where maintaining transparency is not practically solvable. We believe this will provide a guideline for better design and implementation of transparent dynamic instrumentation, as well as other similar process virtualization systems using software code caches

CiteSeerX

DSpace@MIT

Crossref

Ubiquitous memory introspection

Author: Amarasinghe S.
Rabbah R.
Rudolph L.
Wong W.-F.
Zhao Q.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2007
Field of study

10.1109/CGO.2007.12International Symposium on Code Generation and Optimization, CGO 2007299-31

CiteSeerX

Crossref

ScholarBank@NUS

Ubiquitous Memory Introspection (Preliminary Manuscript)

Author: Larry Rudolph
Larry Rudolph
Qin Zhao
Qin Zhao
Rodric Rabbah
Rodric Rabbah
Saman Amarasinghe
Saman Amarasinghe
Weng-fai Wong
Weng-fai Wong
Publication venue
Publication date
Field of study

CiteSeerX