LASER: Light, Accurate Sharing dEtection and Repair
Contention for shared memory, in the forms of true sharing and false sharing, is a challenging performance bug to discover and to repair. Understanding cache contention requires global knowledge of the program's actual sharing behavior, and contention can even arise invisibly in the program due to the opaque decisions of the memory allocator. Previous schemes have focused only on false sharing, and impose significant performance penalties or require non-trivial alterations to the operating system or runtime system environment.
This paper presents the Light, Accurate Sharing dEtection and Repair (LASER) system, which leverages new performance counter capabilities available on Intel's Haswell architecture that identify the source of expensive cache coherence events. Using records of these events generated by the hardware, we build a system for online contention detection and repair that operates with low performance overhead and does not require any invasive program, compiler, or operating system changes. Our experiments show that LASER imposes just 2% average runtime overhead on the Phoenix, Parsec, and Splash2x benchmarks. LASER can automatically improve the performance of programs by up to 19% on commodity hardware.
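The abstract includes no code, but a minimal sketch of the kind of hardware event counting LASER builds on, using the Linux perf_event_open interface, is shown below. The generic PERF_COUNT_HW_CACHE_MISSES event stands in for the Haswell-specific coherence (HITM) events the paper relies on, whose raw encodings are model-specific; this is an illustrative assumption, not LASER's actual detector.

    #include <linux/perf_event.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Thin wrapper: glibc exposes no perf_event_open() symbol. */
    static long perf_open(struct perf_event_attr *attr) {
        return syscall(__NR_perf_event_open, attr, 0 /* self */,
                       -1 /* any cpu */, -1 /* no group */, 0);
    }

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        /* Stand-in event; LASER samples Haswell HITM coherence events,
           whose raw encodings are model-specific. */
        attr.config = PERF_COUNT_HW_CACHE_MISSES;
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = (int)perf_open(&attr);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... run the region of interest here ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count = 0;
        if (read(fd, &count, sizeof(count)) == sizeof(count))
            printf("cache-miss events: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }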
CODAG: Characterizing and Optimizing Decompression Algorithms for GPUs
Data compression and decompression have become vital components of big-data applications to manage the exponential growth in the amount of data collected and stored. Furthermore, big-data applications have increasingly adopted GPUs due to their high compute throughput and memory bandwidth. Prior works presume that decompression is memory-bound, and have dedicated most of the GPU's threads to data movement and adopted complex software techniques to hide memory latency for reading compressed data and writing uncompressed data. This paper shows that these techniques lead to poor GPU resource utilization, as most threads end up waiting for the few decoding threads, exposing compute and synchronization latencies.
Based on this observation, we propose CODAG, a novel and simple kernel architecture for high-throughput decompression on GPUs. CODAG eliminates the use of specialized groups of threads, frees up compute resources to increase the number of parallel decompression streams, and leverages the ample compute work and the GPU's hardware scheduler to tolerate synchronization, compute, and memory latencies. Furthermore, CODAG provides a framework for users to easily incorporate new decompression algorithms without being burdened with implementing complex optimizations to hide memory latency. We validate our proposed architecture with three different encoding techniques, RLE v1, RLE v2, and Deflate, and a wide range of large datasets from different domains. We show that CODAG provides 13.46x, 5.69x, and 1.18x speedups for RLE v1, RLE v2, and Deflate, respectively, when compared to the state-of-the-art decompressors from NVIDIA RAPIDS.
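As a rough illustration of the per-stream decode work that a CODAG-style kernel assigns to each thread block, the sketch below decodes a deliberately simplified run-length format of (count, value) byte pairs in plain C. It is not the actual RLE v1 encoding or CODAG's kernel code; in the paper's design, many such independent chunks are decoded concurrently, one per thread block.

    #include <stdint.h>
    #include <stdio.h>

    /* Decode a toy RLE stream of (count, value) byte pairs. In a
       CODAG-style design, each independently compressed chunk would be
       handed to one GPU thread block; this is the sequential per-chunk
       loop that each block would run. Returns the decoded length. */
    static size_t rle_decode(const uint8_t *in, size_t in_len,
                             uint8_t *out, size_t out_cap) {
        size_t o = 0;
        for (size_t i = 0; i + 1 < in_len; i += 2) {
            uint8_t count = in[i], value = in[i + 1];
            for (uint8_t k = 0; k < count && o < out_cap; k++)
                out[o++] = value;
        }
        return o;
    }

    int main(void) {
        /* "3 x A, 2 x B" in the toy (count, value) format. */
        const uint8_t in[] = {3, 'A', 2, 'B'};
        uint8_t out[16];
        size_t n = rle_decode(in, sizeof(in), out, sizeof(out));
        fwrite(out, 1, n, stdout);  /* prints AAABB */
        putchar('\n');
        return 0;
    }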
Inadequate prenatal care and its association with adverse pregnancy outcomes: A comparison of indices
BACKGROUND: The objectives of this study were to determine rates of prenatal care utilization in Winnipeg, Manitoba, Canada from 1991 to 2000; to compare two indices of prenatal care utilization in identifying the proportion of the population receiving inadequate prenatal care; to determine the association between inadequate prenatal care and adverse pregnancy outcomes (preterm birth, low birth weight [LBW], and small-for-gestational-age [SGA]), using each of the indices; and to assess whether or not, and to what extent, gestational age modifies this association. METHODS: We conducted a population-based study of women having a hospital-based singleton live birth from 1991 to 2000 (N = 80,989). Data sources consisted of a linked mother-baby database and a physician claims file maintained by Manitoba Health. Rates of inadequate prenatal care were calculated using two indices, the R-GINDEX and the APNCU. Logistic regression analysis was used to determine the association between inadequate prenatal care and adverse pregnancy outcomes. Stratified analysis was then used to determine whether the association between inadequate prenatal care and LBW or SGA differed by gestational age. RESULTS: Rates of inadequate/no prenatal care ranged from 8.3% using the APNCU to 8.9% using the R-GINDEX. The association between inadequate prenatal care and preterm birth and LBW varied depending on the index used, with adjusted odds ratios (AOR) ranging from 1.0 to 1.3. In contrast, both indices revealed the same strength of association of inadequate prenatal care with SGA (AOR 1.4). Both indices demonstrated heterogeneity (non-uniformity) across gestational age strata, indicating the presence of effect modification by gestational age. CONCLUSION: Selection of a prenatal care utilization index requires careful consideration of its methodological underpinnings and limitations. The two indices compared in this study revealed different patterns of utilization of prenatal care, and should not be used interchangeably. Use of these indices to study the association between utilization of prenatal care and pregnancy outcomes affected by the duration of pregnancy should be approached cautiously.
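For orientation on the reported statistic: an odds ratio compares the odds of an outcome between exposed and unexposed groups, and the study's AORs are additionally adjusted for covariates via logistic regression. The sketch below computes only the crude (unadjusted) ratio from a 2x2 table; the counts are invented for illustration and do not come from the study.

    #include <stdio.h>

    /* Crude (unadjusted) odds ratio from a 2x2 table.
       a: inadequate care, outcome present   b: inadequate care, outcome absent
       c: adequate care,   outcome present   d: adequate care,   outcome absent
       Illustrative only; the study reports covariate-adjusted ORs from
       logistic regression, which this sketch does not reproduce. */
    static double odds_ratio(double a, double b, double c, double d) {
        return (a * d) / (b * c);
    }

    int main(void) {
        /* Hypothetical counts, not taken from the study. */
        printf("crude OR = %.2f\n", odds_ratio(120, 880, 900, 9100));
        return 0;
    }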
Achieving sustainable quality in maternity services – using audit of incontinence and dyspareunia to identify shortfalls in meeting standards
BACKGROUND: Some complications of childbirth (for example, faecal incontinence) are a source of social embarrassment for women, and are often under-reported. Therefore, it was felt important to determine levels of complications (against established standards) and to consider obstetric measures aimed at reducing them. METHODS: Clinical information was collected on 1036 primiparous women delivering at North and South Staffordshire Acute and Community Trusts over a 5-month period in 1997. A questionnaire was sent to 970 women which included self-assessment of levels of incontinence and dyspareunia prior to pregnancy, at 6 weeks post delivery, and 9 to 14 months post delivery. RESULTS: The response rate was 48% (470/970). Relatively high levels of obstetric interventions were found. In addition, the rates of instrumental deliveries differed between the two hospitals. The highest rates of postnatal symptoms occurred at 6 weeks, but for many women problems were still present at the time of the survey. At 9–14 months, high rates of dyspareunia (29%; 102/347) and urinary incontinence (35%; 133/382) were reported. Seventeen women (4%) complained of faecal incontinence at this time. Similar rates of urinary incontinence and dyspareunia were seen regardless of mode of delivery. CONCLUSION: Further work should be undertaken to reduce obstetric interventions, especially instrumental deliveries. Improvements in a number of areas of care should be undertaken, including improved patient information, improved professional communication, and improved professional recognition and management of third-degree tears. It is likely that these measures would lead to a reduction in incontinence and dyspareunia after childbirth.
Programming Abstractions for Data Locality
The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component; we are rapidly moving to an era in which computing is cheap and massively parallel while data movement dominates energy and performance costs. In order to respond to exascale systems (the next generation of high performance computing systems), the scientific computing community needs to refactor its applications to align with the emerging data-centric paradigm. Our applications must be evolved to express information about data locality; unfortunately, current programming environments offer few ways to do so. They ignore the incurred cost of communication and simply rely on hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume all the processing elements are equidistant from each other. In order to take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data-centric and allow developers to describe how to decompose data and how to lay it out in memory. Fortunately, there are many emerging concepts for managing data locality, such as constructs for tiling, data layout, array views, task and thread affinity, and topology-aware communication libraries. There is an opportunity to identify commonalities in strategy, enabling us to combine the best of these concepts into a comprehensive approach to expressing and managing data locality on exascale programming systems. These programming-model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, spanning techniques that range from template libraries all the way to completely new languages.
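Of the constructs the report surveys, tiling is the most familiar; the sketch below shows, in plain C, the kind of locality-expressing loop structure such abstractions aim to capture. The tile size is an assumed value, and the report's point is that this decomposition should be expressible to the compiler and runtime rather than hand-coded as here.

    #define N 1024
    #define B 64  /* tile edge; assumed to let one tile fit in cache */

    /* Tiled matrix transpose: the four-deep loop nest walks B x B tiles so
       that each tile of both arrays stays cache-resident while it is used,
       instead of streaming whole rows and columns through the cache. */
    void transpose_tiled(const double *a, double *t) {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        t[j * N + i] = a[i * N + j];
    }

    static double a[N * N], t[N * N];

    int main(void) {
        a[3 * N + 5] = 1.0;
        transpose_tiled(a, t);
        return t[5 * N + 3] == 1.0 ? 0 : 1;  /* sanity check */
    }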
Iteration Mapping: Loop Software Pipelining on an XIMD
The multiple instruction streams, low synchronization cost, and synchronous nature of the XIMD (variable instruction stream, multiple data stream) architecture create an opportunity for a new architecture-compiler interface. As an extension of the VLIW (Very Long Instruction Word) architecture, the XIMD can exploit all VLIW scheduling techniques, but these do not take full advantage of the XIMD's unique features. A new loop scheduling method for the XIMD, Iteration Mapping, is proposed that can exceed the performance of VLIW loop scheduling techniques on an XIMD. The medium-grained parallelism between loop iterations has been selected as the target for the first XIMD compiler implementation. This paper discusses how the relationship between the characteristics of loops and the architecture affects scheduling, presents the Iteration Mapping scheduling technique, presents performance data for loops scheduled using this new technique and compares it with recent results on software pipelining for the VLIW, and demonstrates the applicability of the technique for loop-intensive code. The weighted harmonic mean of the speedup for 4 functional units on the first fourteen Livermore Loops is currently 3.5, and the weighted harmonic mean of the speedup on 16 functional units for selected loops to which Iteration Mapping is particularly well suited is 14.5, more than 90% of linear speedup. The concepts behind Iteration Mapping are clean and understandable, leading to a compiler that is straightforward, fast, and easy to implement.
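The summary statistic quoted above can be reproduced with a short routine; the sketch below computes a weighted harmonic mean of per-loop speedups, assuming (as is conventional, though the abstract does not say) that each loop's weight is its share of total baseline execution time.

    #include <stdio.h>

    /* Weighted harmonic mean of speedups: sum(w) / sum(w / s).
       s[i] is loop i's speedup; w[i] is its weight, assumed here to be the
       loop's fraction of total baseline run time. */
    static double weighted_harmonic_mean(const double *s, const double *w,
                                         int n) {
        double wsum = 0.0, denom = 0.0;
        for (int i = 0; i < n; i++) {
            wsum  += w[i];
            denom += w[i] / s[i];
        }
        return wsum / denom;
    }

    int main(void) {
        /* Hypothetical speedups and time-share weights, for illustration. */
        double s[3] = {4.0, 3.0, 2.5};
        double w[3] = {0.5, 0.3, 0.2};
        printf("weighted harmonic mean speedup = %.2f\n",
               weighted_harmonic_mean(s, w, 3));
        return 0;
    }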
Node Labeling
This document describes a scheme for labeling a program dependence graph (PDG) or control flow graph (CFG) in order to codify the hierarchical control dependence structure of a procedure. Section 1 describes the scheme and its features. Section 2 presents applications of the labels. Section 3 proves the correctness of the label relations described in Subsection 2.1. Section 4 shows statistics about labels and program structure for several SPEC benchmarks.
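The abstract does not reproduce the labeling rules themselves. As a generic analogue of how labels can codify hierarchical structure, the sketch below uses standard pre/post-order interval labels on a rooted tree to answer ancestor queries in constant time; this illustrates the general idea only, not the scheme the document defines.

    #include <stdio.h>

    /* Pre/post-order interval labeling of a rooted tree: node u is an
       ancestor of v iff pre[u] <= pre[v] and post[v] <= post[u]. A generic
       analogue of hierarchy-codifying labels, NOT the document's scheme. */
    #define MAXN 64
    int first_child[MAXN], next_sib[MAXN];  /* -1 terminates child lists */
    int pre[MAXN], post[MAXN], clk;

    void label(int u) {
        pre[u] = clk++;
        for (int c = first_child[u]; c != -1; c = next_sib[c])
            label(c);
        post[u] = clk++;
    }

    int is_ancestor(int u, int v) {
        return pre[u] <= pre[v] && post[v] <= post[u];
    }

    int main(void) {
        /* Tiny tree: 0 -> {1, 2}, 1 -> {3}. */
        for (int i = 0; i < 4; i++) first_child[i] = next_sib[i] = -1;
        first_child[0] = 1; next_sib[1] = 2; first_child[1] = 3;
        label(0);
        printf("0 anc 3? %d   2 anc 3? %d\n",
               is_ancestor(0, 3), is_ancestor(2, 3));  /* prints 1 and 0 */
        return 0;
    }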
A PDG-Based Tool and Its Use in Analyzing Program Control Dependences
This paper explores the potential of a program representation called the program dependence graph (PDG) for representing and exposing programs' hierarchical control dependence structure. It presents several extensions to current PDG designs, including a node labeling scheme that simplifies and generalizes PDG traversal. A post-pass PDG-based tool called PEDIGREE has been implemented. It is used to generate and analyze the PDGs for several benchmarks, including the SPEC92 suite. In particular, initial results characterize the control dependence structure of these programs to provide insight into the scheduling benefits of employing speculative execution and exploiting control equivalence information. Some of the benefits of using the PDG instead of the CFG are demonstrated. Our ultimate aim is to use this tool for exploiting multi-grained parallelism.