52 research outputs found
A Similarity Measure for GPU Kernel Subgraph Matching
Accelerator architectures specialize in executing SIMD (single instruction,
multiple data) in lockstep. Because the majority of CUDA applications are
parallelized loops, control flow information can provide an in-depth
characterization of a kernel. CUDAflow is a tool that statically separates CUDA
binaries into basic block regions and dynamically measures instruction and
basic block frequencies. CUDAflow captures this information in a control flow
graph (CFG) and performs subgraph matching across various kernel's CFGs to gain
insights to an application's resource requirements, based on the shape and
traversal of the graph, instruction operations executed and registers
allocated, among other information. The utility of CUDAflow is demonstrated
with SHOC and Rodinia application case studies on a variety of GPU
architectures, revealing novel thread divergence characteristics that
facilitates end users, autotuners and compilers in generating high performing
code
In-Vitro Anti-Fungal Activity and Phytochemical Screening of Stem Bark Extracts from Ventilago denticulata
The objective of the present study was to assess the antifungal activity of pet. Ether extract, acetone extract, ethyl acetate, and ethanol bark extract of Ventilago denticulata (VD).The material was dried in shade made to a coarse powder and weighted quantity of the powder (1000 g) was subjected to hot percolation in a soxhlet apparatus using petroleum ether, ethyl acetate, acetone and ethanol, at a temperature range of 40-800C. Phytochemical tests were done in presence of phytoconstituents like glycosides, alkaloids, tannins, steroids, flavonoids. The anti-fungal activity was carried out by using cup method using Sabraud’s agar as medium. Plates were incubated at 250C for 42hr and later observed for zones of inhibition. The effect of the extracts on fungal isolates was compared with Griseofluvin at a concentration of 10 mg/ml. The Ethyl acetate extract at low as well as high doses gives antifungal effect. Pet-ether extract, acetone extract and ethanolic extract did not produce any antifungal effect at both doses. Ethyl acetate extract shows zone of inhibition at low dose (T1 10 mg/ml) 10 mm and at high dose (T2 20 mg/ml) 16 mm.
Keyword: Ventilago denticulata, Anti- fungal, Griseofluvin
Performance analysis of gpu programming models using the roofline scaling trajectories
Performance analysis is a daunting job, especially for the rapid-evolving accelerator technologies. The Roofline Scaling Trajectories technique aims at diagnosing various performance bottlenecks for GPU programming models through the visually intuitive Roofline plots. In this work, we introduce the use of the Roofline Scaling Trajectories to capture major performance bottlenecks on NVIDIA Volta GPU architectures, such as warp efficiency, occupancy, and locality. Using this analysis technique, we explain the performance characteristics of the NAS Parallel Benchmarks (NPB) written with two programming models, CUDA and OpenACC. We present the influence of the programming model on the performance and scaling characteristics. We also leverage the insights of the Roofline Scaling Trajectory analysis to tune some of the NAS Parallel Benchmarks, achieving up to 2 speedup
Overview of Application Instrumentation for Performance Analysis and Tuning
Profiling and tuning of parallel applications is an essential part of HPC. Analysis and improvement of the hot spots of an application can be done using one of many available tools, that provides measurement of resources consumption for each instrumented part of the code. Since complex applications show different behavior in each part of the code, it is desired to insert instrumentation to separate these parts. Besides manual instrumentation, some profiling libraries provide different ways of instrumentation. Out of these, the binary patching is the most universal mechanism, that highly improves user-friendliness and robustness of the tool. We provide an overview of the most often used binary patching tools and show a workflow of how to use them to implement a binary instrumentation tool for any profiler or autotuner. We have also evaluated the minimum overhead of the manual and binary instrumentation
Analysis of the Jobs Resource Utilization on a Production System
Abstract. In HPC community the System Utilization metric enables to determine if the resources of the cluster are efficiently used by the batch scheduler. This metric considers that all the allocated resources (memory, disk, processors, etc) are full-time utilized. To optimize the system performance, we have to consider the effective physical consumption by jobs regarding the resource allocations. This information gives an insight into whether the cluster resources are efficiently used by the jobs. In this work we propose an analysis of production clusters based on the jobs resource utilization. The principle is to collect simultaneously traces from the job scheduler (provided by logs) and jobs resource consumptions. The latter has been realized by developing a job monitoring tool, whose impact on the system has been measured as lightweight (0.35 % speed-down). The key point is to statistically analyze both traces to detect and explain underutilization of the resources. This could enable to detect abnormal behavior, bottlenecks in the cluster leading to a poor scalability, and justifying optimizations such as gang scheduling or besteffort scheduling. This method has been applied to two medium sized production clusters on a period of eight months
Score-P and OMPT: Navigating the Perils of Callback-Driven Parallel Runtime Introspection
Event-based performance analysis aims at modeling the behavior of parallel applications through a series of state transitions during execution. Different approaches to obtain such transition points for OpenMP programs include source-level instrumentation (e.g., OPARI) and callback-driven runtime support (e.g., OMPT).In this paper, we revisit a previous evaluation and comparison of OPARI and an LLVM OMPT implementation—now updated to the OpenMP 5.0 specification—in the context of Score-P. We describe the challenges faced while trying to use OMPT as a drop-in replacement for the existing instrumentation-based approach and the changes in event order that could not be avoided. Furthermore, we provide details on Score-P measurements using OPARI and OMPT as event sources with the EPCC and SPEC OpenMP benchmark suites
- …