1,213 research outputs found
A Fast Causal Profiler for Task Parallel Programs
This paper proposes TASKPROF, a profiler that identifies parallelism
bottlenecks in task parallel programs. It leverages the structure of a task
parallel execution to perform fine-grained attribution of work to various parts
of the program. TASKPROF's use of hardware performance counters to perform
fine-grained measurements minimizes perturbation. TASKPROF's profile execution
runs in parallel using multi-cores. TASKPROF's causal profile enables users to
estimate improvements in parallelism when a region of code is optimized even
when concrete optimizations are not yet known. We have used TASKPROF to isolate
parallelism bottlenecks in twenty three applications that use the Intel
Threading Building Blocks library. We have designed parallelization techniques
in five applications to in- crease parallelism by an order of magnitude using
TASKPROF. Our user study indicates that developers are able to isolate
performance bottlenecks with ease using TASKPROF.Comment: 11 page
ReSHAPE: A Framework for Dynamic Resizing and Scheduling of Homogeneous Applications in a Parallel Environment
Applications in science and engineering often require huge computational
resources for solving problems within a reasonable time frame. Parallel
supercomputers provide the computational infrastructure for solving such
problems. A traditional application scheduler running on a parallel cluster
only supports static scheduling where the number of processors allocated to an
application remains fixed throughout the lifetime of execution of the job. Due
to the unpredictability in job arrival times and varying resource requirements,
static scheduling can result in idle system resources thereby decreasing the
overall system throughput. In this paper we present a prototype framework
called ReSHAPE, which supports dynamic resizing of parallel MPI applications
executed on distributed memory platforms. The framework includes a scheduler
that supports resizing of applications, an API to enable applications to
interact with the scheduler, and a library that makes resizing viable.
Applications executed using the ReSHAPE scheduler framework can expand to take
advantage of additional free processors or can shrink to accommodate a high
priority application, without getting suspended. In our research, we have
mainly focused on structured applications that have two-dimensional data arrays
distributed across a two-dimensional processor grid. The resize library
includes algorithms for processor selection and processor mapping. Experimental
results show that the ReSHAPE framework can improve individual job turn-around
time and overall system throughput.Comment: 15 pages, 10 figures, 5 tables Submitted to International Conference
on Parallel Processing (ICPP'07
GiViP: A Visual Profiler for Distributed Graph Processing Systems
Analyzing large-scale graphs provides valuable insights in different
application scenarios. While many graph processing systems working on top of
distributed infrastructures have been proposed to deal with big graphs, the
tasks of profiling and debugging their massive computations remain time
consuming and error-prone. This paper presents GiViP, a visual profiler for
distributed graph processing systems based on a Pregel-like computation model.
GiViP captures the huge amount of messages exchanged throughout a computation
and provides an interactive user interface for the visual analysis of the
collected data. We show how to take advantage of GiViP to detect anomalies
related to the computation and to the infrastructure, such as slow computing
units and anomalous message patterns.Comment: Appears in the Proceedings of the 25th International Symposium on
Graph Drawing and Network Visualization (GD 2017
Are Web Applications Ready for Parallelism?
In recent years, web applications have become pervasive. Their backbone is JavaScript, the only programming language supported by all major web browsers. Most browsers run on desktop or mobile devices with parallel hardware. However, JavaScript is by design sequential, and current web applications make little use of hardware parallelism. Are web applications ready to exploit parallel hardware? \ \ To answer this question we take a two-step approach. First, we survey 174 web developers regarding the potential and challenges of using parallelism. Then, we study the performance and computation shape of a set of web applications that are representative for the emerging web. We identify performance bottlenecks and examine memory access patterns to determine possible data parallelism. \ \ Our findings indicate that emerging web applications do have latent data parallelism, and JavaScript developers\u27 programming style are not a significant impediment to exploiting this parallelism
Data-centric Performance Measurement and Mapping for Highly Parallel Programming Models
Modern supercomputers have complex features: many hardware threads, deep memory hierarchies, and many co-processors/accelerators. Productively and effectively designing programs to utilize those hardware features is crucial in gaining the best performance. There are several highly parallel programming models in active development that allow programmers to write efficient code on those architectures. Performance profiling is a very important technique in the development to achieve the best performance.
In this dissertation, I proposed a new performance measurement and mapping technique that can associate performance data with program variables instead of code blocks. To validate the applicability of my data-centric profiling idea, I designed and implemented a profiler for PGAS and CUDA. For PGAS, I developed ChplBlamer, for both single-node and multi-node Chapel programs. My tool also provides new features such as data-centric inter-node load imbalance identification. For CUDA, I developed CUDABlamer for GPU-accelerated applications. CUDABlamer also attributes performance data to program variables, which is a feature that was not found in any previous CUDA profilers. Directed by the insights from the tools, I optimized several widely-studied benchmarks and significantly improved program performance by a factor of up to 4x for Chapel and 47x for CUDA kernels
- …