Search CORE

33,424 research outputs found

A Similarity Measure for GPU Kernel Subgraph Matching

Author: A Sabne
BP Miller
C Böhm
F Zhang
G Ammons
L Adhianto
MH Williams
R Lim
R Singh
RC Gonzales
SS Shende
T Ball
Publication venue
Publication date: 21/03/2019
Field of study

Accelerator architectures specialize in executing SIMD (single instruction, multiple data) in lockstep. Because the majority of CUDA applications are parallelized loops, control flow information can provide an in-depth characterization of a kernel. CUDAflow is a tool that statically separates CUDA binaries into basic block regions and dynamically measures instruction and basic block frequencies. CUDAflow captures this information in a control flow graph (CFG) and performs subgraph matching across various kernel's CFGs to gain insights to an application's resource requirements, based on the shape and traversal of the graph, instruction operations executed and registers allocated, among other information. The utility of CUDAflow is demonstrated with SHOC and Rodinia application case studies on a variety of GPU architectures, revealing novel thread divergence characteristics that facilitates end users, autotuners and compilers in generating high performing code

arXiv.org e-Print Archive

Crossref

Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential

Author: Elango Venmugil
Fauzia Naznin
Pouchet Louis-Noël
Ramanujam J.
Rastello Fabrice
Ravishankar Mahesh
Rountev Atanas
Sadayappan P.
Publication venue
Publication date: 01/12/2013
Field of study

Emerging computer architectures will feature drastically decreased flops/byte (ratio of peak processing rate to memory bandwidth) as highlighted by recent studies on Exascale architectural trends. Further, flops are getting cheaper while the energy cost of data movement is increasingly dominant. The understanding and characterization of data locality properties of computations is critical in order to guide efforts to enhance data locality. Reuse distance analysis of memory address traces is a valuable tool to perform data locality characterization of programs. A single reuse distance analysis can be used to estimate the number of cache misses in a fully associative LRU cache of any size, thereby providing estimates on the minimum bandwidth requirements at different levels of the memory hierarchy to avoid being bandwidth bound. However, such an analysis only holds for the particular execution order that produced the trace. It cannot estimate potential improvement in data locality through dependence preserving transformations that change the execution schedule of the operations in the computation. In this article, we develop a novel dynamic analysis approach to characterize the inherent locality properties of a computation and thereby assess the potential for data locality enhancement via dependence preserving transformations. The execution trace of a code is analyzed to extract a computational directed acyclic graph (CDAG) of the data dependences. The CDAG is then partitioned into convex subsets, and the convex partitioning is used to reorder the operations in the execution trace to enhance data locality. The approach enables us to go beyond reuse distance analysis of a single specific order of execution of the operations of a computation in characterization of its data locality properties. It can serve a valuable role in identifying promising code regions for manual transformation, as well as assessing the effectiveness of compiler transformations for data locality enhancement. We demonstrate the effectiveness of the approach using a number of benchmarks, including case studies where the potential shown by the analysis is exploited to achieve lower data movement costs and better performance.Comment: Transaction on Architecture and Code Optimization (2014

arXiv.org e-Print Archive

HAL-ENS-LYON

INRIA a CCSD electronic archive server

Hal-Diderot

Securing Real-Time Internet-of-Things

Author: Chen Chien-Ying
Hasan Monowar
Mohan Sibin
Publication venue: 'MDPI AG'
Publication date: 01/12/2018
Field of study

Modern embedded and cyber-physical systems are ubiquitous. A large number of critical cyber-physical systems have real-time requirements (e.g., avionics, automobiles, power grids, manufacturing systems, industrial control systems, etc.). Recent developments and new functionality requires real-time embedded devices to be connected to the Internet. This gives rise to the real-time Internet-of-things (RT-IoT) that promises a better user experience through stronger connectivity and efficient use of next-generation embedded devices. However RT- IoT are also increasingly becoming targets for cyber-attacks which is exacerbated by this increased connectivity. This paper gives an introduction to RT-IoT systems, an outlook of current approaches and possible research challenges towards secure RT- IoT frameworks

arXiv.org e-Print Archive

Multidisciplinary Digital Publishing Institute

Directory of Open Access Journals

Leveraging Program Analysis to Reduce User-Perceived Latency in Mobile Applications

Author: Mickens James W
Netravali Ravi
Ossa B De La
PRESTO
Ravindranath Lenin
Wang Xiao Sophia
Wu C.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 20/10/2018
Field of study

Reducing network latency in mobile applications is an effective way of improving the mobile user experience and has tangible economic benefits. This paper presents PALOMA, a novel client-centric technique for reducing the network latency by prefetching HTTP requests in Android apps. Our work leverages string analysis and callback control-flow analysis to automatically instrument apps using PALOMA's rigorous formulation of scenarios that address "what" and "when" to prefetch. PALOMA has been shown to incur significant runtime savings (several hundred milliseconds per prefetchable HTTP request), both when applied on a reusable evaluation benchmark we have developed and on real applicationsComment: ICSE 201

arXiv.org e-Print Archive

Crossref