147 research outputs found

    Resolve: Enabling Accurate Parallel Monitoring under Relaxed Memory Models

    Hardware-assisted instruction-grain monitoring frameworks provide high-coverage, low-overhead debugging support for parallel programs. Unfortunately, existing frameworks are ill-suited to the relaxed memory models employed by nearly all modern processor architectures, e.g., TSO (x86, SPARC), RMO (SPARC), and Weak Consistency (ARMv7). For TSO, prior proposals hint at a solution, but provide no implementation or evaluation, and fail to correctly handle important corner cases such as byte-level dependences. For more relaxed memory models such as RMO and Weak Consistency, prior frameworks deadlock, rendering them unable to detect any bugs past the first deadlock! This paper presents Resolve, the first hardware-assisted instruction-grain monitoring framework that is complete, correct, and deadlock-free under relaxed memory models. Resolve is based on the observation that while relaxed memory models can produce cycles of dependences that deadlock prior approaches, these cycles can be overcome by consulting the dataflow graph of the application threads being monitored instead of their program order. Resolve handles all possible cycles arising in relaxed memory models through a careful approach that uses both dataflow-based processing and versioning of monitoring state, as appropriate. Moreover, we provide the first quantitative characterization of the cycles arising under RMO, demonstrating that such cycles are prevalent and persistent, and hence that deadlock is a real problem that must be addressed. Yet the cycles are not so frequent or complex that handling them imposes significant cost, so Resolve's overheads are negligible. Finally, we present a simple and novel hardware mechanism for properly synchronizing updates to monitoring state under relaxed memory models, improving performance by up to 35% over the judicious use of memory fences.
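    The abstract's core observation lends itself to a small illustration. The sketch below (plain Python with hypothetical event names; not Resolve's hardware mechanism, which additionally versions monitoring state so that out-of-program-order processing stays correct) consumes per-thread monitoring events in program order until every thread blocks on a cross-thread dependence, then falls back to a topological order over the dataflow edges alone, which the sketch assumes is acyclic even when program order plus observed dependences form a cycle.

        from graphlib import TopologicalSorter

        def process_events(threads, data_deps):
            """threads: {tid: [event, ...]} in per-thread program order.
            data_deps: {event: {producer events it reads from}} -- the dataflow graph."""
            done, order = set(), []
            pos = {tid: 0 for tid in threads}
            while True:
                progressed = False
                for tid, evs in threads.items():
                    # Consume events in program order while their producers are done.
                    while pos[tid] < len(evs) and data_deps.get(evs[pos[tid]], set()) <= done:
                        ev = evs[pos[tid]]
                        done.add(ev); order.append(ev); pos[tid] += 1
                        progressed = True
                remaining = [ev for tid, evs in threads.items() for ev in evs[pos[tid]:]]
                if not remaining:
                    return order
                if not progressed:
                    # Program-order processing is deadlocked on a dependence cycle.
                    # Fall back to dataflow order: the data-dependence graph is
                    # acyclic, so a topological sort exists. (Real hardware would
                    # version monitoring state at this point.)
                    graph = {ev: data_deps.get(ev, set()) & set(remaining) for ev in remaining}
                    order += list(TopologicalSorter(graph).static_order())
                    return order

        # A two-thread cycle that deadlocks purely program-order processing:
        threads = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
        deps = {"a1": {"b2"}, "b1": {"a2"}}  # each thread's first event reads the other's second
        print(process_events(threads, deps))  # e.g. ['a2', 'b2', 'a1', 'b1']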

    Flexible hardware acceleration for instruction-grain program monitoring

    Instruction-grain program monitoring tools, which check and analyze executing programs at the granularity of individual instructions, are invaluable for quickly detecting bugs and security attacks and then limiting their damage (via containment and/or recovery). Unfortunately, their fine-grain nature implies very high monitoring overheads for software-only tools, which are typically based on dynamic binary instrumentation. Previous hardware proposals either focus on mechanisms that target specific bugs or address only the cost of binary instrumentation. In this paper, we propose a flexible hardware solution for accelerating a wide range of instruction-grain monitoring tools. By examining a number of diverse tools (for memory checking, security tracking, and data race detection), we identify three significant common sources of overhead and propose three novel hardware techniques for addressing them: Inheritance Tracking, Idempotent Filters, and Metadata-TLBs. Together, these constitute a general-purpose hardware acceleration framework. Experimental results show our framework reduces overheads by 2-3X over the previous state of the art, while supporting the needed flexibility.
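    Of the three techniques, the Idempotent Filter is the easiest to convey in a few lines. The sketch below is a software model under assumed semantics (the paper implements this in hardware; the class and method names here are illustrative, not the paper's interface): an event is dropped when an identical event has already been forwarded and no intervening update could have changed the lifeguard's answer.

        class IdempotentFilter:
            """Drop monitoring events whose re-processing cannot change metadata."""

            def __init__(self):
                self.seen = set()  # (pc, addr, kind) events already forwarded

            def should_forward(self, pc, addr, kind):
                """Return True only for events the lifeguard must still process."""
                key = (pc, addr, kind)
                if key in self.seen:
                    return False   # idempotent repeat: filter it out
                self.seen.add(key)
                return True

            def invalidate(self, addr):
                """Metadata for addr changed (e.g., a store): prior events at
                that address are no longer redundant, so forget them."""
                self.seen = {k for k in self.seen if k[1] != addr}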

    ParaLog: enabling and accelerating online parallel monitoring of multithreaded applications

    Instruction-grain lifeguards monitor the events of a running application at the level of individual instructions in order to identify and help mitigate application bugs and security exploits. Because such lifeguards impose a 10-100X slowdown on existing platforms, previous studies have proposed hardware designs to accelerate lifeguard processing. However, these accelerators are either tailored to a specific class of lifeguards or suitable only for monitoring single-threaded programs. We present ParaLog, the first design of a system enabling fast online parallel monitoring of multithreaded parallel applications. ParaLog supports a broad class of software-defined lifeguards. We show how three existing accelerators can be enhanced to support online multithreaded monitoring, dramatically reducing lifeguard overheads. We identify and solve several challenges in monitoring parallel applications and/or parallelizing these accelerators, including (i) enforcing inter-thread data dependences, (ii) dealing with inter-thread effects that are not reflected in coherence traffic, (iii) dealing with unmonitored operating system activity, and (iv) ensuring lifeguards can access shared metadata with negligible synchronization overheads. We present our system design for both Sequentially Consistent and Total Store Ordering processors. We implement and evaluate our design on a simulated 16-core CMP, using benchmarks from SPLASH-2 and PARSEC and two lifeguards: a dataflow-tracking lifeguard and a memory-access checker lifeguard. Our results show that (i) our parallel accelerators improve performance by 2-9X and 1.13-3.4X for our two lifeguards, respectively, (ii) we are 5-126X faster than the time-slicing approach required by existing techniques, and (iii) our average overheads for applications with eight threads are 51% and 28% for the two lifeguards, respectively.
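    To make "dataflow-tracking lifeguard" concrete, the sketch below shows the classic per-instruction taint-propagation rules such a lifeguard applies (illustrative Python, not ParaLog's interface; ParaLog's contribution is running many such handlers in parallel with correctly ordered metadata updates):

        class TaintLifeguard:
            """One bit of shadow state per location, propagated source -> sink."""

            def __init__(self):
                self.shadow = {}  # location (address or register name) -> tainted?

            def on_input(self, dst):              # untrusted source, e.g. recv()
                self.shadow[dst] = True

            def on_move(self, dst, src):          # dst = src
                self.shadow[dst] = self.shadow.get(src, False)

            def on_binop(self, dst, src1, src2):  # dst = src1 op src2
                self.shadow[dst] = (self.shadow.get(src1, False)
                                    or self.shadow.get(src2, False))

            def on_indirect_jump(self, target):   # sink: control transfer
                if self.shadow.get(target, False):
                    raise RuntimeError("tainted jump target: possible exploit")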

    Parallel depth first vs. work stealing schedulers on CMP architectures

    In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this brief announcement, we highlight our ongoing study [4] comparing the performance of two schedulers designed for fine-grained multithreaded programs: Parallel Depth First (PDF) [2], which is designed for constructive sharing, and Work Stealing (WS) [3], which takes a more traditional approach.

    Overview of schedulers. In PDF, processing cores are allocated ready-to-execute program tasks such that higher scheduling priority is given to those tasks the sequential program would have executed earlier. As a result, PDF tends to co-schedule threads in a way that tracks the sequential execution. Hence, the aggregate working set is (provably) not much larger than the single-thread working set [1]. In WS, each processing core maintains a local work queue of ready-to-execute threads. Whenever its local queue is empty, the core steals a thread from the bottom of the first non-empty queue it finds (a code sketch of this policy follows the abstract). WS is an attractive scheduling policy because when there is plenty of parallelism, stealing is quite rare. However, WS is not designed for constructive cache sharing, because the cores tend to have disjoint working sets.

    CMP configurations studied. We evaluated the performance of PDF and WS across a range of simulated CMP configurations. We focused on designs that have fixed-size private L1 caches and a shared L2 cache on chip. For a fixed die size (240 mm²), we varied the number of cores from 1 to 32. For a given number of cores, we used a (default) configuration based on current CMPs and realistic projections of future CMPs, as process technologies decrease from 90 nm to 32 nm.

    Summary of findings. We studied a variety of benchmark programs and report the following findings. For several application classes, PDF enables significant constructive sharing between threads, leading to better utilization of the on-chip caches and reduced off-chip traffic compared to WS. In particular, bandwidth-limited irregular programs and parallel divide-and-conquer programs show a relative speedup of 1.3-1.6X over WS, with a 13-41% reduction in off-chip traffic. An example is shown in Figure 1, for parallel merge sort: for each scheduler, the number of L2 misses (i.e., the off-chip traffic) is shown on the left and the speedup over running on one core is shown on the right, for 1 to 32 cores. Note that reducing the off-chip traffic has the additional benefit of reducing power consumption. Moreover, PDF's smaller working sets provide opportunities to power down segments of the cache without increasing the running time. Furthermore, when multiple programs are active concurrently, the PDF version is less of a cache hog, and its smaller working set is more likely to remain in the cache across context switches. For several other application classes, PDF and WS have roughly the same execution times, either because there is only limited data reuse that can be exploited or because the programs are not limited by off-chip bandwidth. In the latter case, the constructive sharing PDF enables still provides the power and multiprogramming benefits discussed above. Finally, most parallel benchmarks to date, written for SMPs, use such coarse-grained threading that they cannot exploit the constructive cache behavior inherent in PDF. We find that mechanisms for decomposing multithreaded applications into fine-grained tasks are crucial to achieving good performance on CMPs.
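    The WS policy described in the overview above reduces to a few lines. The sketch below models it in plain Python (illustrative only; production runtimes use lock-free deques and often randomized victim selection, and this single-threaded model ignores synchronization entirely): owners pop LIFO from the top of their own deque, while thieves take from the bottom of the first non-empty victim queue.

        from collections import deque

        class Core:
            def __init__(self, cid):
                self.cid = cid
                self.queue = deque()  # local ready queue; right end is the "top"

            def push(self, task):
                self.queue.append(task)            # owner pushes at the top

            def next_task(self, cores):
                if self.queue:
                    return self.queue.pop()        # owner pops LIFO from the top
                for victim in cores:               # scan for first non-empty queue
                    if victim is not self and victim.queue:
                        return victim.queue.popleft()  # steal from the bottom
                return None                        # no work anywhere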

    Murid herpesvirus-4 lacking thymidine kinase reveals route-dependent requirements for host colonization

    Gammaherpesviruses infect at least 90% of the world's population. Infection control is difficult, in part because some fundamental features of host colonization remain unknown, for example whether normal latency establishment requires viral lytic functions. Since human gammaherpesviruses have narrow species tropisms, answering such questions requires animal models, and Murid herpesvirus-4 (MuHV-4) provides one of the most tractable. MuHV-4 genomes delivered to the lung or peritoneum persist without lytic replication. However, they fail to disseminate systemically, suggesting that the outcome is inoculation route-dependent. After upper respiratory tract inoculation, MuHV-4 infects mice without involving the lungs or peritoneum. We examined whether host entry by this less invasive route requires the viral thymidine kinase (TK), a gene classically essential for lytic replication in terminally differentiated cells. MuHV-4 TK knockouts delivered to the lung or peritoneum were attenuated but still reached lymphoid tissue. In contrast, TK knockouts delivered to the upper respiratory tract largely failed to establish a detectable infection. Therefore TK, and by implication lytic replication, is required for MuHV-4 to establish a significant infection by a non-invasive route.

    Study of exclusive one-pion and one-eta production using hadron and dielectron channels in pp reactions at kinetic beam energies of 1.25 GeV and 2.2 GeV with HADES

    We present measurements of exclusive π+, π0, and η production in pp reactions at 1.25 GeV and 2.2 GeV beam kinetic energy in hadron and dielectron channels. In the case of π+ and π0, high-statistics invariant-mass and angular distributions are obtained within the HADES acceptance, as well as acceptance-corrected distributions, which are compared to a resonance model. The sensitivity of the data to the yield and production angular distribution of the Δ(1232) and higher-lying baryon resonances is shown, and an improved parameterization is proposed. The extracted cross sections are of special interest in the case of pp → ppη, since controversial data exist at 2.0 GeV; we find σ = 0.142 ± 0.022 mb. Using the dielectron channels, the π0 and η Dalitz decay signals are reconstructed with yields fully consistent with the hadronic channels. The dielectron invariant-mass and acceptance-corrected helicity-angle distributions are found to be in good agreement with model predictions.

    Scheduling threads for constructive cache sharing on CMPs

    In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3-1.6X performance improvement relative to WS for several fine-grained parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.
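    The grain-size selection idea admits a simple reading: profile the working set at each candidate granularity and choose the coarsest grain that still fits the shared cache. The sketch below is only that reading (the helper names and the fit-the-L2 criterion are assumptions, not the paper's algorithm; the paper's fast working set profiler is treated here as a black box):

        def pick_grain(candidates, working_set_bytes, l2_bytes):
            """candidates: iterable of grain sizes (e.g., elements per task).
            working_set_bytes(g): profiled working-set size in bytes at grain g."""
            fitting = [g for g in candidates if working_set_bytes(g) <= l2_bytes]
            # The coarsest fitting grain minimizes scheduling overhead while
            # keeping the aggregate working set cache-resident; if nothing
            # fits, fall back to the finest grain to maximize sharing.
            return max(fitting) if fitting else min(candidates)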