161 research outputs found

    A case for asymmetric-cell cache memories


    Reactive NUMA: A design for unifying S-COMA and CC-NUMA

    This paper proposes and evaluates a new approach to directory-based cache coherence protocols called Reactive NUMA (R-NUMA). An R-NUMA system combines a conventional CC-NUMA coherence protocol with a more recent Simple-COMA (S-COMA) protocol. What makes R-NUMA novel is the way it dynamically reacts to program and system behavior to switch between CC-NUMA and S-COMA and exploit the best aspects of both protocols. This reactive behavior allows each node in an R-NUMA system to independently choose the best protocol for a particular page, thus providing much greater performance stability than either CC-NUMA or S-COMA alone. Our evaluation is both qualitative and quantitative. We first show the theoretical result that R-NUMA's worst-case performance is bounded within a small constant factor (i.e., two to three times) of the best of CC-NUMA and S-COMA. We then use detailed execution-driven simulation to show that, in practice, R-NUMA usually performs better than either a pure CC-NUMA or pure S-COMA protocol, and no more than 57% worse than the best of CC-NUMA and S-COMA, for our benchmarks and base system assumptions.
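
As a rough illustration of the reactive behaviour described above, the per-page decision can be sketched as a simple refetch-count threshold; the counter placement, threshold value, and names below are assumptions made for illustration, not taken from the paper.

```python
# Illustrative per-page reactive policy (not the paper's implementation): a
# node counts refetches of remote blocks it has already cached and evicted;
# once a page crosses a threshold, the node switches that page from CC-NUMA
# (cache remote blocks only) to S-COMA (replicate the page in local memory).

REFETCH_THRESHOLD = 64  # assumed value, purely for illustration

class PageState:
    def __init__(self):
        self.protocol = "CC-NUMA"
        self.refetches = 0

def on_remote_miss(pages, page_id, is_refetch):
    state = pages.setdefault(page_id, PageState())
    if state.protocol == "CC-NUMA" and is_refetch:
        state.refetches += 1
        if state.refetches > REFETCH_THRESHOLD:
            state.protocol = "S-COMA"   # reactive switch for this page only
    return state.protocol
```

Because each node keeps its own counters, two nodes can legitimately end up treating the same page differently, which is the source of the per-node protocol choice the abstract highlights.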

    Scheduling communication on an SMP node parallel machine

    Distributed-memory parallel computers and networks of workstations (NOWs) both rely on efficient communication over increasingly high-speed networks. Software communication protocols are often the performance bottleneck. Several current and proposed parallel systems address this problem by dedicating one general-purpose processor in a symmetric multiprocessor (SMP) node specifically for protocol processing. This scheduling convention reduces communication latency and increases effective bandwidth, but also reduces peak performance since the dedicated processor no longer performs computation. In this paper, we study a parallel machine with SMP nodes and compare two protocol processing policies: Fixed, which uses a dedicated protocol processor; and Floating, where all processors perform both computation and protocol processing. The results from synthetic microbenchmarks and five macrobenchmarks show that: (i) a dedicated protocol processor benefits light-weight protocols much more than heavy-weight protocols; (ii) Fixed improves performance over Floating when communication becomes the bottleneck, which is more likely when the application is very communication-intensive, overheads are very high, or there are multiple (i.e., more than two) processors per node; (iii) a system with optimal cost-effectiveness is likely to include a dedicated protocol processor, at least for light-weight protocols.
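
The Fixed/Floating distinction comes down to which processors on a node are allowed to run protocol handlers; a minimal sketch of that trade-off, with names and structure of my own choosing rather than the paper's:

```python
# Hypothetical sketch contrasting the two policies on one SMP node.

def protocol_processors(n_procs, policy):
    # Fixed: processor 0 is reserved for protocol handlers and never runs
    # application code, trading peak compute for lower message latency.
    # Floating: every processor interleaves computation with protocol work.
    if policy == "fixed":
        return [0]
    if policy == "floating":
        return list(range(n_procs))
    raise ValueError(policy)

def compute_processors(n_procs, policy):
    # The other side of the same trade-off: Fixed gives up one compute processor.
    return list(range(1, n_procs)) if policy == "fixed" else list(range(n_procs))
```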

    Cost/performance of a parallel computer simulator

    This paper examines the cost/performance of simulating a hypothetical target parallel computer using a commercial host parallel computer. We address the question of whether parallel simulation is simply faster than sequential simulation, or whether it is also more cost-effective. To answer this, we develop a performance model of the Wisconsin Wind Tunnel (WWT), a system that simulates cache-coherent shared-memory machines on a message-passing Thinking Machines CM-5. The performance model uses Kruskal and Weiss's fork-join model to account for the effect of event processing time variability on WWT's conservative fixed-window simulation algorithm.
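
The conservative fixed-window algorithm mentioned here follows a well-known quantum-based pattern; the sketch below is a generic version of that pattern, where the window size, class names, and omitted details are illustrative assumptions rather than WWT's actual code.

```python
# Generic sketch of conservative fixed-window ("quantum") parallel simulation:
# each host processor advances its target nodes to the next window boundary,
# then all hosts synchronize before exchanging cross-node events. This is
# conservative because the window never exceeds the minimum target network
# latency, so no event can arrive "in the past".

WINDOW = 100  # target cycles; assumed not to exceed the minimum message latency

class TargetNode:
    def __init__(self):
        self.now = 0
        self.inbox = []               # events sent to this node last window

    def run_until(self, t):
        # ...process this node's local events up to time t (omitted)...
        self.now = t

def host_loop(my_nodes, barrier, end_time):
    t = 0
    while t < end_time:
        for node in my_nodes:
            node.run_until(t + WINDOW)   # simulate inside the window
        barrier.wait()                   # all hosts reach the boundary
        # ...exchange messages generated during this window (omitted)...
        barrier.wait()
        t += WINDOW
```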

    Parallel Dispatch Queue: a queue-based programming abstraction to parallelize fine-grain communication protocols

    This paper proposes a novel queue-based programming abstraction, Parallel Dispatch Queue (PDQ), that enables efficient parallel execution of fine-grain software communication protocols. Parallel systems often use fine-grain software handlers to integrate a network message into computation. Executing such handlers in parallel requires access synchronization around resources. Much as a monitor construct in a concurrent language protects accesses to a set of data structures, PDQ allows messages to include a synchronization key protecting handler accesses to a group of protocol resources. By simply synchronizing messages in a queue prior to dispatch, PDQ not only eliminates the overhead of acquiring/releasing synchronization primitives but also prevents busy-waiting within handlers. In this paper, we study PDQ's impact on software protocol performance in the context of fine-grain distributed shared memory (DSM) on an SMP cluster. Simulation results for shared-memory applications indicate that: (i) parallel software protocol execution using PDQ significantly improves performance in fine-grain DSM, (ii) tight integration of PDQ and embedded processors into a single custom device can offer performance competitive with or better than an all-hardware DSM, and (iii) PDQ best benefits cost-effective systems that use idle SMP processors (rather than custom embedded processors) to execute protocols. On a cluster of four 16-way SMPs, a PDQ-based parallel protocol running on idle SMP processors improves application performance by a factor of 2.6 over a system running a serial protocol on a single dedicated processor.
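
The core of the abstraction, as described above, is that the queue itself serializes messages sharing a synchronization key, so handlers run lock-free. A minimal sketch under that reading, with all names illustrative:

```python
# Illustrative Parallel Dispatch Queue behavior: messages tagged with the same
# synchronization key are serialized by the dispatcher itself, so handlers for
# different keys run in parallel without acquiring locks or busy-waiting.

class ParallelDispatchQueue:
    def __init__(self):
        self.pending = []          # (key, handler, args) awaiting dispatch
        self.active_keys = set()   # keys whose handler is currently running

    def enqueue(self, key, handler, args):
        self.pending.append((key, handler, args))

    def dispatch_next(self):
        # Hand out the oldest message whose key is not already active; messages
        # that share a key with a running handler stay queued, so handlers
        # never need locks or busy-wait loops of their own.
        for i, entry in enumerate(self.pending):
            key = entry[0]
            if key not in self.active_keys:
                self.active_keys.add(key)
                return self.pending.pop(i)
        return None  # nothing safe to dispatch right now

    def handler_done(self, key):
        self.active_keys.discard(key)
```

An idle processor would repeatedly call `dispatch_next`, run the returned handler, and then call `handler_done`, which is how the same queue can feed several processors at once.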

    Modeling cost/performance of a parallel computer simulator

    This article examines the cost/performance of simulating a hypothetical target parallel computer using a commercial host parallel computer. We address the question of whether parallel simulation is simply faster than sequential simulation, or if it is also more cost-effective. To answer this, we develop a performance model of the Wisconsin Wind Tunnel (WWT), a system that simulates cache-coherent shared-memory machines on a message-passing Thinking Machines CM-5. The performance model uses Kruskal and Weiss's fork-join model to account for the effect of event processing time variability on WWT's conservative fixed-window simulation algorithm. A generalization of Thiebaut and Stone's footprint model accurately predicts the effect of cache interference on the CM-5. The model is calibrated using parameters extracted from a fully parallel simulation (p = N), and validated by measuring the speedup as the number of processors (p) ranges from 1 to the number of target nodes (N). Together with simple cost models, the performance model indicates that for target system sizes of 32 nodes and larger, parallel simulation is more cost-effective than sequential simulation. The key intuition behind this result is that large simulations require large memories, which dominate the cost of a uniprocessor; parallel computers allow multiple processors to simultaneously access this large memory.
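
The cost-effectiveness argument can be made concrete with a toy cost times runtime comparison; the numbers below are invented purely for illustration and are not the article's measurements or cost model.

```python
# Toy cost-effectiveness comparison between sequential and parallel simulation.
# The key point from the article: a large target needs a large memory either
# way, so memory dominates uniprocessor cost, while a parallel host amortizes
# that memory across processors that also shorten the run.

def cost_times_time(cpu_cost, mem_cost, runtime):
    # Lower cost x runtime means better cost/performance.
    return (cpu_cost + mem_cost) * runtime

# Hypothetical numbers for a 32-node target needing a large simulation memory.
mem_cost = 32_000   # same memory cost in both configurations
seq = cost_times_time(cpu_cost=5_000, mem_cost=mem_cost, runtime=10.0)
par = cost_times_time(cpu_cost=5_000 * 16, mem_cost=mem_cost, runtime=10.0 / 12)

print(f"sequential cost*time = {seq:,.0f}")
print(f"parallel   cost*time = {par:,.0f}  (more cost-effective: {par < seq})")
```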

    Whole blood transcriptional responses of very preterm infants during late-onset sepsis

    Background. Host immune responses during late-onset sepsis (LOS) in very preterm infants are poorly characterised due to a complex and dynamic pathophysiology and challenges in working with small available blood volumes. We present here an unbiased transcriptomic analysis of whole peripheral blood from very preterm infants at the time of LOS. Methods. RNA-Seq was performed on peripheral blood samples (6–29 days postnatal age) taken at the time of suspected LOS from very preterm infants <30 weeks gestational age. Infants were classified based on blood culture positivity and elevated C-reactive protein concentrations as having confirmed LOS (n = 5), possible LOS (n = 4) or no LOS (n = 9). Bioinformatics and statistical analyses included pathway over-representation and protein-protein interaction network analyses. Plasma cytokine immunoassays were performed to validate differentially expressed cytokine pathways. Results. The blood leukocyte transcriptional responses of infants with confirmed LOS differed significantly from infants without LOS (1,317 differentially expressed genes). However, infants with possible LOS could not be distinguished from infants with no LOS or confirmed LOS. Transcriptional alterations associated with LOS included genes involved in pathogen recognition (mainly TLR pathways), cytokine signalling (both pro-inflammatory and inhibitory responses), immune and haematological regulation (including cell death pathways), and metabolism (altered cholesterol biosynthesis). At the transcriptional level, cytokine responses during LOS were characterised by over-representation of IFN-α/β, IFN-γ, IL-1 and IL-6 signalling pathways and up-regulation of genes for inflammatory responses. Infants with confirmed LOS had significantly higher levels of IL-1α and IL-6 in their plasma. Conclusions. Blood responses in very preterm infants with LOS are characterised by altered host immune responses that appear to reflect unbalanced immuno-metabolic homeostasis.
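
Pathway over-representation analysis of the kind named in the Methods is commonly a hypergeometric test; the generic sketch below uses scipy and is not the authors' pipeline. The pathway size and overlap count are invented; only the 1,317 differentially expressed (DE) gene figure comes from the abstract.

```python
# Generic pathway over-representation test: given the set of DE genes, ask
# whether a pathway's genes appear among them more often than chance predicts,
# using the hypergeometric distribution.

from scipy.stats import hypergeom

def overrepresentation_p(n_background, n_de, n_pathway, n_overlap):
    # P(X >= n_overlap) when drawing n_de genes from n_background,
    # of which n_pathway belong to the pathway.
    return hypergeom.sf(n_overlap - 1, n_background, n_pathway, n_de)

# Invented example: 20,000 background genes, 1,317 DE genes, and a 150-gene
# pathway with 35 of its genes among the DE set.
p = overrepresentation_p(20_000, 1_317, 150, 35)
print(f"over-representation p-value = {p:.2e}")
```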

    Coherent network interfaces for fine-grain communication

    Using coherence can improve performance by facilitating burst transfers of whole cache blocks and reducing control overheads. This paper explores network interfaces that use coherence, i.e., coherent network interfaces (CNIs), to improve communication performance. First, it reports on the development and optimization of two mechanisms that CNIs use to communicate with processors. A taxonomy and comparison of four CNIs with a more conventional NI are then presented.
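
The abstract does not name the two processor-NI mechanisms, but the benefit it cites, burst transfers of whole cache blocks, can be illustrated with a hypothetical receive queue kept in cacheable memory. This is entirely my own sketch, not the paper's design.

```python
# My own illustration: a receive queue in cacheable memory lets the processor
# pull an entire cache-block-sized message with ordinary loads, so coherence
# hardware moves the block in one burst instead of the processor reading the
# NI one word at a time through uncached device registers.

BLOCK_WORDS = 8  # assumed 64-byte cache block of 8-byte words

class CoherentReceiveQueue:
    def __init__(self, n_slots):
        self.slots = [[0] * BLOCK_WORDS for _ in range(n_slots)]
        self.valid = [False] * n_slots
        self.head = 0

    def ni_deposit(self, slot, message_words):
        # The NI writes a whole block, then marks it valid; the block later
        # migrates to the consuming processor's cache as a single transfer.
        self.slots[slot][:] = message_words
        self.valid[slot] = True

    def cpu_poll(self):
        if not self.valid[self.head]:
            return None
        msg = list(self.slots[self.head])  # one burst, not per-word device reads
        self.valid[self.head] = False
        self.head = (self.head + 1) % len(self.slots)
        return msg
```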

    Longer-Term Postcure Measurement of Cuspal Deformation Induced by Dimensional Changes in Dental Materials

    Aim. This paper presents a simple, versatile in vitro methodology that enables indirect quantification of shrinkage and expansion stresses under clinically relevant conditions without the need for a dedicated instrument. Methods. For shrinkage effects, the resulting cusp deformation of aluminum blocks with MOD-type cavities, filled with novel filling compositions and commercial cements, was measured using a bench-top micrometer and an instrument based on a Linear Variable Differential Transformer (LVDT, a displacement transducer). Results. The results demonstrated the validity of the proposed simple methodology. The technique was successfully used in longer-term measurements of shrinkage and expansion stress for several dental compositions. Conclusions. In contrast to in situ techniques, where a measuring instrument is dedicated to the sample and its data collection, the proposed methodology allows the samples to be transferred to the environment of choice for storage and conditioning. The presented technique can be reliably used to quantify stress development of curing materials under clinically relevant (oral) conditions. This enables direct examination and comparison of structural properties corresponding to the final stage of formed networks. The proposed methodology is directly applicable to the study of self-curing systems, as they require mouth-type conditions (temperature and humidity) to achieve their designed kinetics and reactions.
    • …