
    PARSNIP: Performant Architecture for Race Safety with No Impact on Precision

    Data race detection is a useful dynamic analysis for multithreaded programs: it is a key building block in record-and-replay, enforcing strong consistency models, and detecting concurrency bugs. Existing software race detectors are precise but slow, and hardware support for precise data race detection relies on assumptions like type safety that many programs violate in practice. We propose PARSNIP, a fully precise hardware-supported data race detector. PARSNIP exploits new insights into the redundancy of race detection metadata to reduce storage overheads. PARSNIP also adopts new race detection metadata encodings that accelerate the common case while preserving soundness and completeness. When bounded hardware resources are exhausted, PARSNIP falls back to a software race detector to preserve correctness. PARSNIP does not assume that target programs are type safe, and is thus suitable for race detection on arbitrary code. Our evaluation of PARSNIP on several PARSEC benchmarks shows that performance overheads range from negligible to 2.6x, with an average overhead of just 1.5x. Moreover, PARSNIP outperforms the state-of-the-art Radish hardware race detector by 4.6x.
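
    The metadata PARSNIP compresses and caches is the per-location access history that a precise happens-before race detector maintains. Below is a minimal sketch of that baseline software check, the kind of detector PARSNIP falls back to when hardware resources run out; the vector-clock encoding and names are illustrative assumptions, not PARSNIP's actual metadata format.

```cpp
#include <cstdio>
#include <vector>

// Illustrative precise happens-before check; not the paper's actual encoding.
using VectorClock = std::vector<unsigned>;   // one logical clock entry per thread

struct LocationMeta {                        // per-address race detection metadata
    VectorClock lastWrites;                  // clock of each thread's last write
    VectorClock lastReads;                   // clock of each thread's last read
};

// An access at clock value `clk` by thread `tid` happened before the current
// access iff the current thread's vector clock has already absorbed it.
static bool happenedBefore(unsigned clk, unsigned tid, const VectorClock& now) {
    return clk <= now[tid];
}

// Returns true if this write races with some earlier, unordered access.
bool checkWrite(LocationMeta& m, unsigned tid, const VectorClock& now) {
    for (unsigned t = 0; t < now.size(); ++t) {
        if (t == tid) continue;
        if (m.lastWrites[t] && !happenedBefore(m.lastWrites[t], t, now))
            return true;                     // write-write race
        if (m.lastReads[t] && !happenedBefore(m.lastReads[t], t, now))
            return true;                     // read-write race
    }
    m.lastWrites[tid] = now[tid];            // record this write
    return false;
}

int main() {
    LocationMeta m{VectorClock(2, 0), VectorClock(2, 0)};
    VectorClock t0{1, 0}, t1{0, 1};          // two threads, no synchronization between them
    checkWrite(m, 0, t0);                    // thread 0 writes first
    std::printf("race: %s\n", checkWrite(m, 1, t1) ? "yes" : "no");  // unordered write -> race
}
```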

    LASER: Light, Accurate Sharing dEtection and Repair

    Contention for shared memory, in the forms of true sharing and false sharing, is a challenging performance bug to discover and to repair. Understanding cache contention requires global knowledge of the program's actual sharing behavior; contention can even arise invisibly due to the opaque decisions of the memory allocator. Previous schemes have focused only on false sharing, and impose significant performance penalties or require non-trivial alterations to the operating system or runtime system environment. This paper presents the Light, Accurate Sharing dEtection and Repair (LASER) system, which leverages new performance counter capabilities available on Intel's Haswell architecture that identify the source of expensive cache coherence events. Using records of these events generated by the hardware, we build a system for online contention detection and repair that operates with low performance overhead and does not require any invasive program, compiler or operating system changes. Our experiments show that LASER imposes just 2% average runtime overhead on the Phoenix, Parsec and Splash2x benchmarks. LASER can automatically improve the performance of programs by up to 19% on commodity hardware.
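
    The contention LASER looks for is easy to reproduce: two threads repeatedly updating counters that share a cache line invalidate each other's copy of the line on every store. A minimal sketch of the bug and the classic source-level repair follows; the 64-byte line size and the padding fix are assumptions for illustration, since LASER itself detects and repairs contention online without such source changes.

```cpp
#include <cstdio>
#include <thread>

// Two counters that land on the same cache line exhibit false sharing:
// every increment invalidates the other core's copy of the line.
struct Shared {
    long a;
    long b;
};

// The classic source-level repair: align each hot field onto its own line.
// (64 bytes is an assumption about typical x86 cache line size.)
struct Padded {
    alignas(64) long a;
    alignas(64) long b;
};

template <typename T>
void run(const char* label) {
    T s{};
    std::thread t1([&] { for (long i = 0; i < 50000000; ++i) s.a++; });
    std::thread t2([&] { for (long i = 0; i < 50000000; ++i) s.b++; });
    t1.join();
    t2.join();
    std::printf("%s: a=%ld b=%ld\n", label, s.a, s.b);
}

int main() {
    run<Shared>("false sharing");   // typically much slower on multicore hardware
    run<Padded>("padded");          // same work, contention removed
}
```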

    Technical Report: Anytime Computation and Control for Autonomous Systems

    The correct and timely completion of the sensing and action loop is of utmost importance in safety-critical autonomous systems. A crucial part of the performance of this feedback control loop is the computation time and accuracy of the estimator, which produces the state estimates used by the controller. These state estimators, especially those used for localization, often rely on computationally expensive perception algorithms like visual object tracking. Because on-board computers on autonomous robots are computationally limited, the computation time of a perception-based estimation algorithm can at times be high enough to result in poor control performance. In this work, we develop a framework for co-design of anytime estimation and robust control algorithms that takes into account computation delays and estimation inaccuracies. This is achieved by constructing a perception-based anytime estimator from an off-the-shelf perception-based estimation algorithm, and in the process we obtain a trade-off curve for its computation time versus estimation error. This information is used in the design of a robust predictive control algorithm that at run-time decides a contract for the estimator, i.e. the estimator's mode of operation, while trying to achieve its control objectives at a reduced computation energy cost. In cases where the estimation delay could degrade control performance, we provide an optimal way for the controller to use this trade-off curve to reduce estimation delay at the cost of higher inaccuracy, all while guaranteeing that control objectives are robustly satisfied. Through experiments on a hexrotor platform running a visual odometry algorithm for state estimation, we show how our method yields up to a 10% improvement in control performance while saving 5-6% in computation energy compared to a method that does not leverage the co-design.
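
    The run-time decision can be pictured as choosing a point on the measured trade-off curve: each contract pairs an estimation delay with an error bound and an energy cost, and the controller picks the cheapest contract that still lets the control objectives be met robustly. The sketch below shows only that selection step; the contract numbers and the feasibility test are placeholders, not the paper's robust predictive-control formulation.

```cpp
#include <cstdio>
#include <vector>

// One operating mode of the anytime estimator; all values are illustrative.
struct Contract {
    double delay_s;   // estimation computation time
    double error_m;   // worst-case estimation error
    double energy_j;  // computation energy per estimate
};

// Placeholder robustness test: stands in for the controller's check that a
// given delay/error combination still allows the objectives to be met.
bool robustlyFeasible(const Contract& c, double maxDelay, double maxError) {
    return c.delay_s <= maxDelay && c.error_m <= maxError;
}

// Pick the cheapest feasible contract; fall back to the most accurate mode
// (first entry on this curve) if nothing is feasible.
Contract chooseContract(const std::vector<Contract>& curve,
                        double maxDelay, double maxError) {
    const Contract* best = nullptr;
    for (const auto& c : curve) {
        if (!robustlyFeasible(c, maxDelay, maxError)) continue;
        if (!best || c.energy_j < best->energy_j) best = &c;
    }
    return best ? *best : curve.front();
}

int main() {
    // Trade-off curve obtained by profiling the perception pipeline (made-up numbers).
    std::vector<Contract> curve = {
        {0.05, 0.02, 1.0},   // slow, accurate, expensive
        {0.02, 0.05, 0.6},
        {0.01, 0.12, 0.4},   // fast, inaccurate, cheap
    };
    Contract c = chooseContract(curve, /*maxDelay=*/0.03, /*maxError=*/0.10);
    std::printf("chosen: delay=%.3fs error=%.2fm energy=%.1fJ\n",
                c.delay_s, c.error_m, c.energy_j);
}
```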

    GPUDet: A Deterministic GPU Architecture


    Deterministic Execution for Arbitrary Multithreaded Programs

    Thesis (Ph.D.)--University of Washington, 2012.
    Nondeterminism is one of the main reasons that parallel programming is so difficult. Bugs can vanish when programs are rerun or run under a debugger, thwarting attempts at their removal. Stress-testing is a common practice to flush out rare defects though it consumes extensive processing power and offers no real guarantees of discovering bugs. Deployment can similarly expose new issues that are difficult to reproduce. Finally, nondeterminism frustrates replicating multithreaded programs for fault-tolerance or performance as the replicas can diverge silently. Determinism eliminates these problems, making debugging and replication possible and making testing more valuable and efficient. Previous efforts at providing determinism required programs to be (re-)written in restrictive languages. In contrast to these language-level determinism schemes, this dissertation shows how execution-level determinism can be provided for arbitrary parallel programs, even programs that contain concurrency errors. First, we employ a hardware-based approach to provide determinism for unmodified binaries running on a deterministic multiprocessor system. Second, we show that memory consistency relaxations both enable a pure software-based implementation of execution-level determinism for arbitrary programs and also admit a simpler deterministic multiprocessor design. Finally, we describe a hybrid mechanism that integrates execution-level and language-level determinism techniques to provide determinism for arbitrary programs with higher performance than an execution-level approach alone.
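
    The role relaxed memory consistency plays in the software schemes can be sketched concretely: during a quantum each thread writes only to a private store buffer, and the buffers are published to shared memory in a fixed thread order at a deterministic commit point, so no thread observes another's stores at a timing-dependent moment. The following sketch illustrates just that idea; the quantum structure and types are assumptions for illustration, not the dissertation's actual runtime, and the threads are simulated sequentially to keep it short.

```cpp
#include <cstdio>
#include <map>
#include <vector>

// Illustrative deterministic quantum with buffered stores. A real runtime
// would execute quanta in parallel and serialize only the commit step.
using Addr = int;
using Val  = int;

struct BufferedThread {
    std::map<Addr, Val> storeBuffer;                   // private writes this quantum
    void write(Addr a, Val v) { storeBuffer[a] = v; }
    Val read(const std::map<Addr, Val>& mem, Addr a) const {
        auto it = storeBuffer.find(a);                 // see own buffered stores first
        if (it != storeBuffer.end()) return it->second;
        auto m = mem.find(a);                          // otherwise the last committed value
        return m != mem.end() ? m->second : 0;
    }
};

// Deterministic commit: buffers drain into shared memory in thread-id order,
// so the final value of each address never depends on thread timing.
void commitQuantum(std::vector<BufferedThread>& threads, std::map<Addr, Val>& mem) {
    for (auto& t : threads) {
        for (const auto& kv : t.storeBuffer) mem[kv.first] = kv.second;
        t.storeBuffer.clear();
    }
}

int main() {
    std::map<Addr, Val> mem;
    std::vector<BufferedThread> threads(2);
    threads[0].write(42, 1);       // both threads write the same address in one quantum
    threads[1].write(42, 2);
    commitQuantum(threads, mem);   // thread 1 commits last in the fixed order
    std::printf("mem[42] = %d\n", mem[42]);  // prints 2 on every run
}
```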

    DMP: Deterministic Shared Memory Multiprocessing

    Current shared memory multicore and multiprocessor systems are nondeterministic. Each time these systems execute a multithreaded application, even if supplied with the same input, they can produce a different output. This frustrates debugging and limits the ability to properly test multithreaded code, becoming a major stumbling block to the much-needed widespread adoption of parallel programming. In this paper we make the case for fully deterministic shared memory multiprocessing (DMP). The behavior of an arbitrary multithreaded program on a DMP system is only a function of its inputs. The core idea is to make inter-thread communication fully deterministic. Previous approaches to coping with nondeterminism in multithreaded programs have focused on replay, a technique useful only for debugging. In contrast, while DMP systems are directly useful for debugging by offering repeatability by default, we argue that parallel programs should execute deterministically in the field as well. This has the potential to make testing more assuring and increase the reliability of deployed multithreaded software. We propose a range of approaches to enforcing determinism and discuss their implementation trade-offs. We show that determinism can be provided with little performance cost using our architecture proposals on future hardware, and that software-only approaches can be utilized on existing systems.
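
    The simplest way to make inter-thread communication deterministic is to serialize it: threads take turns touching shared memory in a fixed order, each running a bounded quantum before passing a token to the next thread. A minimal software sketch of that serialized baseline follows; the quantum length and token protocol are illustrative assumptions, not the paper's hardware proposals.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Deterministic token: shared-memory accesses are allowed only while a thread
// holds its turn, so their global order is fixed by thread id, not timing.
std::atomic<int> turn{0};
constexpr int kThreads = 4;
constexpr int kQuantum = 1000;   // shared accesses per turn (illustrative)

long sharedCounter = 0;          // ordinary (non-atomic) shared state

void worker(int tid, int rounds) {
    for (int r = 0; r < rounds; ++r) {
        while (turn.load(std::memory_order_acquire) != tid) { /* spin for the token */ }
        // Quantum: this thread's shared accesses occur at a deterministic
        // point in the global order.
        for (int i = 0; i < kQuantum; ++i) sharedCounter++;
        // Pass the token to the next thread in a fixed round-robin order.
        turn.store((tid + 1) % kThreads, std::memory_order_release);
    }
}

int main() {
    std::vector<std::thread> ts;
    for (int t = 0; t < kThreads; ++t) ts.emplace_back(worker, t, 5);
    for (auto& th : ts) th.join();
    std::printf("sharedCounter = %ld\n", sharedCounter);  // same value on every run
}
```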