33,424 research outputs found
A Similarity Measure for GPU Kernel Subgraph Matching
Accelerator architectures specialize in executing SIMD (single instruction,
multiple data) in lockstep. Because the majority of CUDA applications are
parallelized loops, control flow information can provide an in-depth
characterization of a kernel. CUDAflow is a tool that statically separates CUDA
binaries into basic block regions and dynamically measures instruction and
basic block frequencies. CUDAflow captures this information in a control flow
graph (CFG) and performs subgraph matching across various kernel's CFGs to gain
insights to an application's resource requirements, based on the shape and
traversal of the graph, instruction operations executed and registers
allocated, among other information. The utility of CUDAflow is demonstrated
with SHOC and Rodinia application case studies on a variety of GPU
architectures, revealing novel thread divergence characteristics that
facilitates end users, autotuners and compilers in generating high performing
code
Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential
Emerging computer architectures will feature drastically decreased flops/byte
(ratio of peak processing rate to memory bandwidth) as highlighted by recent
studies on Exascale architectural trends. Further, flops are getting cheaper
while the energy cost of data movement is increasingly dominant. The
understanding and characterization of data locality properties of computations
is critical in order to guide efforts to enhance data locality. Reuse distance
analysis of memory address traces is a valuable tool to perform data locality
characterization of programs. A single reuse distance analysis can be used to
estimate the number of cache misses in a fully associative LRU cache of any
size, thereby providing estimates on the minimum bandwidth requirements at
different levels of the memory hierarchy to avoid being bandwidth bound.
However, such an analysis only holds for the particular execution order that
produced the trace. It cannot estimate potential improvement in data locality
through dependence preserving transformations that change the execution
schedule of the operations in the computation. In this article, we develop a
novel dynamic analysis approach to characterize the inherent locality
properties of a computation and thereby assess the potential for data locality
enhancement via dependence preserving transformations. The execution trace of a
code is analyzed to extract a computational directed acyclic graph (CDAG) of
the data dependences. The CDAG is then partitioned into convex subsets, and the
convex partitioning is used to reorder the operations in the execution trace to
enhance data locality. The approach enables us to go beyond reuse distance
analysis of a single specific order of execution of the operations of a
computation in characterization of its data locality properties. It can serve a
valuable role in identifying promising code regions for manual transformation,
as well as assessing the effectiveness of compiler transformations for data
locality enhancement. We demonstrate the effectiveness of the approach using a
number of benchmarks, including case studies where the potential shown by the
analysis is exploited to achieve lower data movement costs and better
performance.Comment: Transaction on Architecture and Code Optimization (2014
Securing Real-Time Internet-of-Things
Modern embedded and cyber-physical systems are ubiquitous. A large number of
critical cyber-physical systems have real-time requirements (e.g., avionics,
automobiles, power grids, manufacturing systems, industrial control systems,
etc.). Recent developments and new functionality requires real-time embedded
devices to be connected to the Internet. This gives rise to the real-time
Internet-of-things (RT-IoT) that promises a better user experience through
stronger connectivity and efficient use of next-generation embedded devices.
However RT- IoT are also increasingly becoming targets for cyber-attacks which
is exacerbated by this increased connectivity. This paper gives an introduction
to RT-IoT systems, an outlook of current approaches and possible research
challenges towards secure RT- IoT frameworks
Leveraging Program Analysis to Reduce User-Perceived Latency in Mobile Applications
Reducing network latency in mobile applications is an effective way of
improving the mobile user experience and has tangible economic benefits. This
paper presents PALOMA, a novel client-centric technique for reducing the
network latency by prefetching HTTP requests in Android apps. Our work
leverages string analysis and callback control-flow analysis to automatically
instrument apps using PALOMA's rigorous formulation of scenarios that address
"what" and "when" to prefetch. PALOMA has been shown to incur significant
runtime savings (several hundred milliseconds per prefetchable HTTP request),
both when applied on a reusable evaluation benchmark we have developed and on
real applicationsComment: ICSE 201
- …