270 research outputs found

    Advances in Engineering Software for Multicore Systems

    Get PDF
    The vast amounts of data to be processed by today’s applications demand higher computational power. To meet application requirements and achieve reasonable application performance, it becomes increasingly profitable, or even necessary, to exploit any available hardware parallelism. For both new and legacy applications, successful parallelization is often subject to high cost and price. This chapter proposes a set of methods that employ an optimistic semi-automatic approach, which enables programmers to exploit parallelism on modern hardware architectures. It provides a set of methods, including an LLVM-based tool, to help programmers identify the most promising parallelization targets and understand the key types of parallelism. The approach reduces the manual effort needed for parallelization. A contribution of this work is an efficient profiling method to determine the control and data dependences for performing parallelism discovery or other types of code analysis. Another contribution is a method for detecting code sections where parallel design patterns might be applicable and suggesting relevant code transformations. Our approach efficiently reports detailed runtime data dependences. It accurately identifies opportunities for parallelism and the appropriate type of parallelism to use as task-based or loop-based

    LLOV: A Fast Static Data-Race Checker for OpenMP Programs

    Full text link
    In the era of Exascale computing, writing efficient parallel programs is indispensable and at the same time, writing sound parallel programs is very difficult. Specifying parallelism with frameworks such as OpenMP is relatively easy, but data races in these programs are an important source of bugs. In this paper, we propose LLOV, a fast, lightweight, language agnostic, and static data race checker for OpenMP programs based on the LLVM compiler framework. We compare LLOV with other state-of-the-art data race checkers on a variety of well-established benchmarks. We show that the precision, accuracy, and the F1 score of LLOV is comparable to other checkers while being orders of magnitude faster. To the best of our knowledge, LLOV is the only tool among the state-of-the-art data race checkers that can verify a C/C++ or FORTRAN program to be data race free.Comment: Accepted in ACM TACO, August 202

    khmer: Working with Big Data in Bioinformatics

    Full text link
    We introduce design and optimization considerations for the 'khmer' package.Comment: Invited chapter for forthcoming book on Performance of Open Source Application

    Low-Impact Profiling of Streaming, Heterogeneous Applications

    Get PDF
    Computer engineers are continually faced with the task of translating improvements in fabrication process technology: i.e., Moore\u27s Law) into architectures that allow computer scientists to accelerate application performance. As feature-size continues to shrink, architects of commodity processors are designing increasingly more cores on a chip. While additional cores can operate independently with some tasks: e.g. the OS and user tasks), many applications see little to no improvement from adding more processor cores alone. For many applications, heterogeneous systems offer a path toward higher performance. Significant performance and power gains have been realized by combining specialized processors: e.g., Field-Programmable Gate Arrays, Graphics Processing Units) with general purpose multi-core processors. Heterogeneous applications need to be programmed differently than traditional software. One approach, stream processing, fits these systems particularly well because of the segmented memories and explicit expression of parallelism. Unfortunately, debugging and performance tools that support streaming, heterogeneous applications do not exist. This dissertation presents TimeTrial, a performance measurement system that enables performance optimization of streaming applications by profiling the application deployed on a heterogeneous system. TimeTrial performs low-impact measurements by dedicating computing resources to monitoring and by aggressively compressing performance traces into statistical summaries guided by user specification of the performance queries of interest

    Automatic Test Case Reduction for OpenCL

    Get PDF
    ABSTRACT We report on an extension to the C-Reduce tool, for automatic reduction of C test cases, to handle OpenCL kernels. This enables an automated method for detecting bugs in OpenCL compilers, by generating large random kernels using the CLsmith generator, identifying kernels that yield result differences across OpenCL platforms and optimisation levels, and using our novel extension to C-Reduce to automatically reduce such kernels to minimal forms that can be filed as bug reports. A major part of our effort involved the design of ShadowKeeper, a new plugin for the Oclgrind simulator that provides accurate detection of accesses to uninitialised data. We present experimental results showing the effectiveness of our method for finding bugs in a number of OpenCL compilers

    Dynamic slicing long running programs through execution fast forwarding

    Full text link
    Fixing runtime bugs in long running programs using tracing based analyses such as dynamic slicing was believed to be prohibitively expensive. In this paper, we present a novel execution fast forward-ing technique that makes it feasible. While a naive solution is to divide the entire execution by checkpoints, and then apply dynamic slicing enabled by tracing on one checkpoint interval at a time, it is still too costly even with state-of-the-art tracing techniques. Our technique is derived from two key observations. The first one is that long running programs are usually driven by events, which has been taken advantage of by checkpointing/replaying techniques to deterministically replay an execution from the event log. The sec-ond observation is that all the events are not relevant to replaying a particular part of the execution, in which the programmer sus-pects an error happened. We develop a slicing-like technique on the event log such that many irrelevant events are successfully pruned. Driven by the reduced log, the replayed execution is now traced for fault location. This replayed execution has the effect of fast forwarding, i.e the amount of executed instructions is significantly reduced without losing the accuracy of reproducing the failure. We describe how execution fast forwarding is combined with check-pointing and tracing based dynamic slicing, which we believe is the first attempt to integrate these two techniques. The dynamic slices of a set of reported bugs for long running programs are studied to show the effectiveness of dynamic slicing, which is a significant step forward compared to our prior work. 1

    Addressing concerns in performance prediction : the impact of data dependencies and denormal arithmetic in scientific codes

    Get PDF
    To meet the increasing computational requirements of the scientific community, the use of parallel programming has become commonplace, and in recent years distributed applications running on clusters of computers have become the norm. Both parallel and distributed applications face the problem of predictive uncertainty and variations in runtime. Modern scientific applications have varying I/O, cache, and memory profiles that have significant and difficult to predict effects on their runtimes. Data-dependent sensitivities such as the costs of denormal floating point calculations introduce more variations in runtime, further hindering predictability. Applications with unpredictable performance or which have highly variable runtimes can cause several problems. If the runtime of an application is unknown or varies widely, workflow schedulers cannot e�ciently allocate them to compute nodes, leading to the under-utilisation of expensive resources. Similarly, a lack of accurate knowledge of the performance of an application on new hardware can lead to misguided procurement decisions. In heavily parallel applications, minor variations in runtime on individual nodes can have disproportionate effects on the overall application runtime. Even on a smaller scale, a lack of certainty about an application's runtime can preclude its use in real-time or time-critical applications such as clinical diagnosis. This thesis investigates two sources of data-dependent performance variability. The first source is algorithmic and is seen in a state-of-the-art C++ biomedical imaging application. It identifies the cause of the variability in the application and develops a means of characterising the variability. This 'probe task' based model is adapted for use with a workflow scheduler, and the scheduling improvements it brings are examined. The second source of variability is more subtle as it is micro-architectural in nature. Depending on the input data, two runs of an application executing exactly the same sequence of instructions and with exactly the same memory access patterns can have large differences in runtime due to deficiencies in common hardware implementations of denormal arithmetic1. An exception-based profiler is written to detect occurrences of denormal arithmetic and it is shown how this is insufficient to isolate the sources of denormal arithmetic in an application. A novel tool based on theValgrind binary instrumentation framework is developed which can trace the origins of denormal values and the frequency of their occurrence in an application's data structures. This second tool is used to isolate and remove the cause of denormal arithmetic both from a simple numerical code, and then from a face recognition application

    Dynamic Analysis of Embedded Software

    Get PDF
    abstract: Most embedded applications are constructed with multiple threads to handle concurrent events. For optimization and debugging of the programs, dynamic program analysis is widely used to collect execution information while the program is running. Unfortunately, the non-deterministic behavior of multithreaded embedded software makes the dynamic analysis difficult. In addition, instrumentation overhead for gathering execution information may change the execution of a program, and lead to distorted analysis results, i.e., probe effect. This thesis presents a framework that tackles the non-determinism and probe effect incurred in dynamic analysis of embedded software. The thesis largely consists of three parts. First of all, we discusses a deterministic replay framework to provide reproducible execution. Once a program execution is recorded, software instrumentation can be safely applied during replay without probe effect. Second, a discussion of probe effect is presented and a simulation-based analysis is proposed to detect execution changes of a program caused by instrumentation overhead. The simulation-based analysis examines if the recording instrumentation changes the original program execution. Lastly, the thesis discusses data race detection algorithms that help to remove data races for correctness of the replay and the simulation-based analysis. The focus is to make the detection efficient for C/C++ programs, and to increase scalability of the detection on multi-core machines.Dissertation/ThesisDoctoral Dissertation Computer Science 201
    corecore