
    LASER: Light, Accurate Sharing dEtection and Repair

    Contention for shared memory, in the form of either true sharing or false sharing, is a challenging performance bug to discover and to repair. Understanding cache contention requires global knowledge of the program's actual sharing behavior, and contention can even arise invisibly through the opaque placement decisions of the memory allocator. Previous schemes have focused only on false sharing, and they impose significant performance penalties or require non-trivial alterations to the operating system or runtime environment. This paper presents the Light, Accurate Sharing dEtection and Repair (LASER) system, which leverages new performance counter capabilities on Intel's Haswell architecture that identify the source of expensive cache coherence events. Using records of these events generated by the hardware, we build a system for online contention detection and repair that operates with low performance overhead and requires no invasive program, compiler, or operating system changes. Our experiments show that LASER imposes just 2% average runtime overhead on the Phoenix, Parsec, and Splash2x benchmarks, and can automatically improve the performance of programs by up to 19% on commodity hardware.
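    As a concrete illustration of the class of bug LASER targets (a sketch of ours, not code from the paper), the C fragment below shows the classic false-sharing pattern, two threads updating counters that land on the same cache line, together with the padding repair that LASER-style systems apply; the 64-byte line size and all names here are assumptions.

```c
/* Illustrative sketch only: the false-sharing pattern LASER is designed
 * to detect, and the padding repair it automates. Not from the paper;
 * the counter names and the 64-byte line size are assumptions. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

/* False sharing: both counters occupy one 64-byte cache line, so each
 * thread's writes invalidate the other thread's cached copy. */
struct { long a; long b; } shared_hot;

/* Repair: pad so each counter sits on its own cache line. */
struct { long a; char pad[64 - sizeof(long)]; long b; } shared_padded;

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) shared_padded.a++;
    return NULL;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) shared_padded.b++;
    return NULL;
}

int main(void) {
    /* Pointing the loops at shared_hot instead reproduces the coherence
     * traffic that Haswell's counters would attribute to these lines. */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", shared_padded.a, shared_padded.b);
    return 0;
}
```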

    CODAG: Characterizing and Optimizing Decompression Algorithms for GPUs

    Data compression and decompression have become vital components of big-data applications to manage the exponential growth in the amount of data collected and stored. Furthermore, big-data applications have increasingly adopted GPUs due to their high compute throughput and memory bandwidth. Prior work presumes that decompression is memory-bound, dedicating most of the GPU's threads to data movement and adopting complex software techniques to hide memory latency for reading compressed data and writing uncompressed data. This paper shows that these techniques lead to poor GPU resource utilization, as most threads end up waiting for the few decoding threads, exposing compute and synchronization latencies. Based on this observation, we propose CODAG, a novel and simple kernel architecture for high-throughput decompression on GPUs. CODAG eliminates the use of specialized groups of threads, frees up compute resources to increase the number of parallel decompression streams, and leverages the abundant compute activity and the GPU's hardware scheduler to tolerate synchronization, compute, and memory latencies. Furthermore, CODAG provides a framework for users to easily incorporate new decompression algorithms without being burdened with implementing complex optimizations to hide memory latency. We validate our proposed architecture with three different encoding techniques, RLE v1, RLE v2, and Deflate, and a wide range of large datasets from different domains. We show that CODAG provides speedups of 13.46x, 5.69x, and 1.18x for RLE v1, RLE v2, and Deflate, respectively, compared to the state-of-the-art decompressors from NVIDIA RAPIDS.
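    To make the encoding side concrete, here is a minimal sequential sketch of RLE-style decompression, the simplest of the three schemes CODAG was validated on. The (run length, value) byte pairing is our assumed layout, and this CPU version illustrates only the algorithm, not CODAG's contribution of mapping many such decompression streams onto GPU threads.

```c
/* Minimal sketch of run-length decoding. The (count, value) byte pairing
 * is an assumed layout for illustration; CODAG's contribution is how such
 * decoders are organized into parallel streams on a GPU, which this
 * sequential version does not show. */
#include <stddef.h>

/* Decode (run_length, value) byte pairs into `out`; returns the number of
 * bytes written, or 0 if `out` is too small. A trailing odd byte, if any,
 * is ignored. */
size_t rle_decode(const unsigned char *in, size_t in_len,
                  unsigned char *out, size_t out_cap) {
    size_t w = 0;
    for (size_t r = 0; r + 1 < in_len; r += 2) {
        unsigned run = in[r];
        unsigned char value = in[r + 1];
        if (w + run > out_cap) return 0;
        for (unsigned i = 0; i < run; i++) out[w++] = value;
    }
    return w;
}
```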

    Inadequate prenatal care and its association with adverse pregnancy outcomes: A comparison of indices

    BACKGROUND: The objectives of this study were to determine rates of prenatal care utilization in Winnipeg, Manitoba, Canada from 1991 to 2000; to compare two indices of prenatal care utilization in identifying the proportion of the population receiving inadequate prenatal care; to determine the association between inadequate prenatal care and adverse pregnancy outcomes (preterm birth, low birth weight [LBW], and small-for-gestational age [SGA]), using each of the indices; and to assess whether or not, and to what extent, gestational age modifies this association. METHODS: We conducted a population-based study of women having a hospital-based singleton live birth from 1991 to 2000 (N = 80,989). Data sources consisted of a linked mother-baby database and a physician claims file maintained by Manitoba Health. Rates of inadequate prenatal care were calculated using two indices, the R-GINDEX and the APNCU. Logistic regression analysis was used to determine the association between inadequate prenatal care and adverse pregnancy outcomes. Stratified analysis was then used to determine whether the association between inadequate prenatal care and LBW or SGA differed by gestational age. RESULTS: Rates of inadequate/no prenatal care ranged from 8.3% using the APNCU to 8.9% using the R-GINDEX. The association between inadequate prenatal care and preterm birth and LBW varied depending on the index used, with adjusted odds ratios (AOR) ranging from 1.0 to 1.3. In contrast, both indices revealed the same strength of association of inadequate prenatal care with SGA (AOR 1.4). Both indices demonstrated heterogeneity (non-uniformity) across gestational age strata, indicating the presence of effect modification by gestational age. CONCLUSION: Selection of a prenatal care utilization index requires careful consideration of its methodological underpinnings and limitations. The two indices compared in this study revealed different patterns of utilization of prenatal care, and should not be used interchangeably. Use of these indices to study the association between utilization of prenatal care and pregnancy outcomes affected by the duration of pregnancy should be approached cautiously.

    Achieving sustainable quality in maternity services – using audit of incontinence and dyspareunia to identify shortfalls in meeting standards

    BACKGROUND: Some complications of childbirth (for example, faecal incontinence) are a source of social embarrassment for women and are often under-reported, so it was felt important to determine levels of complications (against established standards) and to consider obstetric measures aimed at reducing them. METHODS: Clinical information was collected on 1036 primiparous women delivering at North and South Staffordshire Acute and Community Trusts over a 5-month period in 1997. A questionnaire was sent to 970 of these women, which included self-assessment of levels of incontinence and dyspareunia prior to pregnancy, at 6 weeks post delivery, and at 9 to 14 months post delivery. RESULTS: The response rate was 48% (470/970). Relatively high levels of obstetric intervention were found, and the rates of instrumental delivery differed between the two hospitals. Postnatal symptoms peaked at 6 weeks, but for many women problems were still present at the time of the survey. At 9–14 months, high rates of dyspareunia (29%; 102/347) and urinary incontinence (35%; 133/382) were reported, and seventeen women (4%) complained of faecal incontinence at this time. Similar rates of urinary incontinence and dyspareunia were seen regardless of mode of delivery. CONCLUSION: Further work should be undertaken to reduce obstetric interventions, especially instrumental deliveries. Improvements should be made in a number of areas of care, including patient information, professional communication, and the professional recognition and management of third-degree tears. It is likely that these measures would lead to a reduction in incontinence and dyspareunia after childbirth.

    Programming Abstractions for Data Locality

    The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computation is the most expensive component, but we are rapidly moving into an era in which computation is cheap and massively parallel while data movement dominates energy and performance costs. To respond to exascale systems (the next generation of high performance computing systems), the scientific computing community needs to refactor its applications to align with the emerging data-centric paradigm: applications must evolve to express information about data locality. Unfortunately, current programming environments offer few ways to do so. They ignore the cost of communication and simply rely on hardware cache coherence to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models must support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume that all processing elements are equidistant from each other. To take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data-centric and should allow developers to describe how to decompose data and how to lay it out in memory. Fortunately, many emerging concepts address these needs, such as constructs for tiling, data layout, array views, task and thread affinity, and topology-aware communication libraries for managing data locality. There is an opportunity to identify commonalities among these strategies, enabling us to combine the best of these concepts into a comprehensive approach to expressing and managing data locality on exascale programming systems. These programming-model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, with candidate techniques ranging from template libraries all the way to completely new languages.
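    As one concrete example of the constructs the report surveys, the sketch below shows hand-written loop tiling in C for a dense matrix multiply, the kind of locality transformation that tiling and data-layout abstractions aim to let programmers state declaratively instead of coding by hand. The tile size, matrix dimension, and function name are our assumptions.

```c
/* Hand-written loop tiling for dense matrix multiply: a minimal sketch of
 * the locality transformation that the report's proposed abstractions
 * (tiling constructs, data layouts, array views) would express
 * declaratively. N and the tile edge T are assumptions to be tuned so
 * that three TxT tiles fit in cache. C must be zero-initialized. */
#define N 1024
#define T 64

void matmul_tiled(const double A[N][N], const double B[N][N],
                  double C[N][N]) {
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                /* Work on one TxT block at a time so the tiles of A, B,
                 * and C stay cache-resident across the inner loops. */
                for (int i = ii; i < ii + T; i++)
                    for (int k = kk; k < kk + T; k++) {
                        double a = A[i][k];
                        for (int j = jj; j < jj + T; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```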

    Iteration Mapping: Loop Software Pipelining on an XIMD

    The multiple instruction streams, low synchronization cost, and synchronous nature of the XIMD (variable instruction stream, multiple data stream) architecture create an opportunity for a new architecture-compiler interface. As an extension of the VLIW (Very Long Instruction Word) architecture, the XIMD can exploit all VLIW scheduling techniques, but these do not take full advantage of its unique features. A new loop scheduling method for the XIMD, Iteration Mapping, is proposed that can exceed the performance of VLIW loop scheduling techniques on an XIMD. The medium-grained parallelism between loop iterations has been selected as the target for the first XIMD compiler implementation. This paper discusses how the relationship between loop characteristics and the architecture affects scheduling, presents the Iteration Mapping scheduling technique, reports performance data for loops scheduled using the new technique and compares it with recent results on software pipelining for the VLIW, and demonstrates the applicability of the technique to loop-intensive code. The weighted harmonic mean of the speedup for 4 functional units on the first fourteen Livermore Loops is currently 3.5, and the weighted harmonic mean of the speedup on 16 functional units, for selected loops to which Iteration Mapping is particularly well suited, is 14.5, more than 90% of linear speedup. The concepts behind Iteration Mapping are clean and understandable, leading to a compiler that is straightforward, fast, and easy to implement.
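    The abstract does not give the scheduling algorithm itself, but the core idea of assigning whole loop iterations to separate instruction streams can be sketched loosely in software. In the C analogy below, each POSIX thread stands in for one XIMD instruction stream and takes iterations round-robin; the stream count and loop body are assumptions, and real XIMD streams synchronize far more cheaply than OS threads.

```c
/* Loose software analogy, not the paper's implementation: Iteration
 * Mapping assigns whole loop iterations to separate XIMD instruction
 * streams. Here each pthread stands in for one stream and executes
 * iterations round-robin; STREAMS, N, and the loop body are assumptions. */
#include <pthread.h>

#define STREAMS 4
#define N 1024

static double a[N], b[N], c[N];

static void *stream_body(void *arg) {
    long id = (long)arg;
    /* Stream `id` executes iterations id, id+STREAMS, id+2*STREAMS, ... */
    for (long i = id; i < N; i += STREAMS)
        c[i] = a[i] * b[i] + c[i];
    return NULL;
}

int main(void) {
    pthread_t t[STREAMS];
    for (long s = 0; s < STREAMS; s++)
        pthread_create(&t[s], NULL, stream_body, (void *)s);
    for (long s = 0; s < STREAMS; s++)
        pthread_join(t[s], NULL);
    return 0;
}
```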

    Node Labeling

    This document describes a scheme for labeling a program dependence graph (PDG) or control flow graph (CFG) in order to codify the hierarchical control dependence structure of a procedure. Section 1 describes the scheme and its features. Section 2 provides some applications for the labels. Section 3 proves the correctness of the label relations described in Subsection 2.1. Section 4 shows statistics about labels and program structure for several SPEC benchmarks.

    A PDG-Based Tool and Its Use in Analyzing Program Control Dependences

    This paper explores the potential of a program representation called the program dependence graph (PDG) for representing and exposing programs' hierarchical control dependence structure. It presents several extensions to current PDG designs, including a node labeling scheme that simplifies and generalizes PDG traversal. A post-pass PDG-based tool called PEDIGREE has been implemented and used to generate and analyze the PDGs for several benchmarks, including the SPEC92 suite. In particular, initial results characterize the control dependence structure of these programs to provide insight into the scheduling benefits of employing speculative execution and exploiting control equivalence information. Some of the benefits of using the PDG instead of the CFG are demonstrated. Our ultimate aim is to use this tool for exploiting multi-grained parallelism.
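    Neither this abstract nor the Node Labeling report above spells out the label encoding, so the following C sketch is only one plausible reading: hierarchical path-style labels attached to PDG/CFG nodes, with identical labels read as control equivalence and label prefixes as containment in the control dependence hierarchy. The struct layout and encoding are our assumptions, not the papers' design.

```c
/* Hedged sketch, not the papers' actual scheme: one plausible way to
 * attach hierarchical control-dependence labels to PDG/CFG nodes and
 * query them. The path-string encoding and all names are assumptions. */
#include <string.h>

#define MAX_LABEL 32

struct pdg_node {
    int id;
    char label[MAX_LABEL]; /* e.g. "1.2.1": the node's position in the
                              control dependence hierarchy, root to node */
};

/* Under this encoding, two nodes with identical labels have identical
 * control dependences, so they are control equivalent: whenever one
 * executes, the other does too. */
int control_equivalent(const struct pdg_node *x, const struct pdg_node *y) {
    return strcmp(x->label, y->label) == 0;
}

/* One node encloses another in the hierarchy if its label is a proper
 * dot-terminated prefix of the other node's label. */
int encloses(const struct pdg_node *outer, const struct pdg_node *inner) {
    size_t n = strlen(outer->label);
    return strncmp(outer->label, inner->label, n) == 0
        && inner->label[n] == '.';
}
```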