147 research outputs found

    Resolve: Enabling Accurate Parallel Monitoring under Relaxed Memory Models

    Hardware-assisted instruction-grain monitoring frameworks provide high-coverage, low-overhead debugging support for parallel programs. Unfortunately, existing frameworks are ill-suited to the relaxed memory models employed by nearly all modern processor architectures, e.g., TSO (x86, SPARC), RMO (SPARC), and Weak Consistency (ARMv7). For TSO, prior proposals hint at a solution, but provide no implementation or evaluation, and fail to correctly handle important corner cases such as byte-level dependences. For more relaxed memory models such as RMO and Weak Consistency, prior frameworks deadlock, rendering them unable to detect any bugs past the first deadlock! This paper presents Resolve, the first hardware-assisted instruction-grain monitoring framework that is complete, correct, and deadlock-free under relaxed memory models. Resolve is based on the observation that while relaxed memory models can produce cycles of dependences that deadlock prior approaches, these cycles can be overcome by consulting the dataflow graph of the application threads being monitored instead of their program order. Resolve handles all possible cycles arising in relaxed memory models through a careful approach that uses both dataflow-based processing and versioning of monitoring state, as appropriate. Moreover, we provide the first quantitative characterization of the cycles arising under RMO, demonstrating that such cycles are prevalent and persistent, and hence that deadlock is a real problem that must be addressed. Yet the cycles are not so frequent or complex that handling them imposes significant cost, so Resolve's overheads are negligible. Finally, we present a simple and novel hardware mechanism for properly synchronizing updates to monitoring state under relaxed memory models, improving performance by up to 35% over the judicious use of memory fences.
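    The abstract's core observation lends itself to a small illustration. The sketch below (plain Python with hypothetical event names; not Resolve's hardware mechanism, which additionally versions monitoring state so that out-of-program-order processing stays correct) consumes per-thread monitoring events in program order until every thread blocks on a cross-thread dependence, then falls back to a topological order over the dataflow edges alone, which the sketch assumes is acyclic even when program order plus observed dependences form a cycle.

        from graphlib import TopologicalSorter

        def process_events(threads, data_deps):
            """threads: {tid: [event, ...]} in per-thread program order.
            data_deps: {event: {producer events it reads from}} -- the dataflow graph."""
            done, order = set(), []
            pos = {tid: 0 for tid in threads}
            while True:
                progressed = False
                for tid, evs in threads.items():
                    # Consume events in program order while their producers are done.
                    while pos[tid] < len(evs) and data_deps.get(evs[pos[tid]], set()) <= done:
                        ev = evs[pos[tid]]
                        done.add(ev); order.append(ev); pos[tid] += 1
                        progressed = True
                remaining = [ev for tid, evs in threads.items() for ev in evs[pos[tid]:]]
                if not remaining:
                    return order
                if not progressed:
                    # Program-order processing is deadlocked on a dependence cycle.
                    # Fall back to dataflow order: the data-dependence graph is
                    # acyclic, so a topological sort exists. (Real hardware would
                    # version monitoring state at this point.)
                    graph = {ev: data_deps.get(ev, set()) & set(remaining) for ev in remaining}
                    order += list(TopologicalSorter(graph).static_order())
                    return order

        # A two-thread cycle that deadlocks purely program-order processing:
        threads = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
        deps = {"a1": {"b2"}, "b1": {"a2"}}  # each thread's first event reads the other's second
        print(process_events(threads, deps))  # e.g. ['a2', 'b2', 'a1', 'b1']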

    Flexible hardware acceleration for instruction-grain program monitoring

    Instruction-grain program monitoring tools, which check and analyze executing programs at the granularity of individual instructions, are invaluable for quickly detecting bugs and security attacks and then limiting their damage (via containment and/or recovery). Unfortunately, their fine-grain nature implies very high monitoring overheads for software-only tools, which are typically based on dynamic binary instrumentation. Previous hardware proposals either focus on mechanisms that target specific bugs or address only the cost of binary instrumentation. In this paper, we propose a flexible hardware solution for accelerating a wide range of instruction-grain monitoring tools. By examining a number of diverse tools (for memory checking, security tracking, and data race detection), we identify three significant common sources of overhead and propose three novel hardware techniques for addressing them: Inheritance Tracking, Idempotent Filters, and Metadata-TLBs. Together, these constitute a general-purpose hardware acceleration framework. Experimental results show our framework reduces overheads by 2-3X over the previous state of the art, while supporting the needed flexibility.
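    Of the three techniques, the Idempotent Filter is the easiest to convey in a few lines. The sketch below is a software model under assumed semantics (the paper implements this in hardware; the class and method names here are illustrative, not the paper's interface): an event is dropped when an identical event has already been forwarded and no intervening update could have changed the lifeguard's answer.

        class IdempotentFilter:
            """Drop monitoring events whose re-processing cannot change metadata."""

            def __init__(self):
                self.seen = set()  # (pc, addr, kind) events already forwarded

            def should_forward(self, pc, addr, kind):
                """Return True only for events the lifeguard must still process."""
                key = (pc, addr, kind)
                if key in self.seen:
                    return False   # idempotent repeat: filter it out
                self.seen.add(key)
                return True

            def invalidate(self, addr):
                """Metadata for addr changed (e.g., a store): prior events at
                that address are no longer redundant, so forget them."""
                self.seen = {k for k in self.seen if k[1] != addr}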

    ParaLog: enabling and accelerating online parallel monitoring of multithreaded applications

    Instruction-grain lifeguards monitor the events of a running application at the level of individual instructions in order to identify and help mitigate application bugs and security exploits. Because such lifeguards impose a 10-100X slowdown on existing platforms, previous studies have proposed hardware designs to accelerate lifeguard processing. However, these accelerators are either tailored to a specific class of lifeguards or suitable only for monitoring single-threaded programs. We present ParaLog, the first design of a system enabling fast online parallel monitoring of multithreaded parallel applications. ParaLog supports a broad class of software-defined lifeguards. We show how three existing accelerators can be enhanced to support online multithreaded monitoring, dramatically reducing lifeguard overheads. We identify and solve several challenges in monitoring parallel applications and/or parallelizing these accelerators, including (i) enforcing inter-thread data dependences, (ii) dealing with inter-thread effects that are not reflected in coherence traffic, (iii) dealing with unmonitored operating system activity, and (iv) ensuring lifeguards can access shared metadata with negligible synchronization overheads. We present our system design for both Sequentially Consistent and Total Store Ordering processors. We implement and evaluate our design on a simulated 16-core CMP, using benchmarks from SPLASH-2 and PARSEC and two lifeguards: a dataflow-tracking lifeguard and a memory-access checker lifeguard. Our results show that (i) our parallel accelerators improve performance by 2-9X and 1.13-3.4X for our two lifeguards, respectively, (ii) we are 5-126X faster than the time-slicing approach required by existing techniques, and (iii) our average overheads for applications with eight threads are 51% and 28% for the two lifeguards, respectively.
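    To make "dataflow-tracking lifeguard" concrete, the sketch below shows the classic per-instruction taint-propagation rules such a lifeguard applies (illustrative Python, not ParaLog's interface; ParaLog's contribution is running many such handlers in parallel with correctly ordered metadata updates):

        class TaintLifeguard:
            """One bit of shadow state per location, propagated source -> sink."""

            def __init__(self):
                self.shadow = {}  # location (address or register name) -> tainted?

            def on_input(self, dst):              # untrusted source, e.g. recv()
                self.shadow[dst] = True

            def on_move(self, dst, src):          # dst = src
                self.shadow[dst] = self.shadow.get(src, False)

            def on_binop(self, dst, src1, src2):  # dst = src1 op src2
                self.shadow[dst] = (self.shadow.get(src1, False)
                                    or self.shadow.get(src2, False))

            def on_indirect_jump(self, target):   # sink: control transfer
                if self.shadow.get(target, False):
                    raise RuntimeError("tainted jump target: possible exploit")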

    Parallel depth first vs. work stealing schedulers on CMP architectures

    In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this brief announcement, we highlight our ongoing study [4] comparing the performance of two schedulers designed for fine-grained multithreaded programs: Parallel Depth First (PDF) [2], which is designed for constructive sharing, and Work Stealing (WS) [3], which takes a more traditional approach.

    Overview of schedulers. In PDF, processing cores are allocated ready-to-execute program tasks such that higher scheduling priority is given to those tasks the sequential program would have executed earlier. As a result, PDF tends to co-schedule threads in a way that tracks the sequential execution. Hence, the aggregate working set is (provably) not much larger than the single-thread working set [1]. In WS, each processing core maintains a local work queue of ready-to-execute threads. Whenever its local queue is empty, the core steals a thread from the bottom of the first non-empty queue it finds (a code sketch of this policy follows the abstract). WS is an attractive scheduling policy because when there is plenty of parallelism, stealing is quite rare. However, WS is not designed for constructive cache sharing, because the cores tend to have disjoint working sets.

    CMP configurations studied. We evaluated the performance of PDF and WS across a range of simulated CMP configurations. We focused on designs that have fixed-size private L1 caches and a shared L2 cache on chip. For a fixed die size (240 mm²), we varied the number of cores from 1 to 32. For a given number of cores, we used a (default) configuration based on current CMPs and realistic projections of future CMPs, as process technologies decrease from 90 nm to 32 nm.

    Summary of findings. We studied a variety of benchmark programs and report the following findings. For several application classes, PDF enables significant constructive sharing between threads, leading to better utilization of the on-chip caches and reduced off-chip traffic compared to WS. In particular, bandwidth-limited irregular programs and parallel divide-and-conquer programs show a relative speedup of 1.3-1.6X over WS, with a 13-41% reduction in off-chip traffic. An example is shown in Figure 1, for parallel merge sort: for each scheduler, the number of L2 misses (i.e., the off-chip traffic) is shown on the left and the speedup over running on one core is shown on the right, for 1 to 32 cores. Note that reducing the off-chip traffic has the additional benefit of reducing power consumption. Moreover, PDF's smaller working sets provide opportunities to power down segments of the cache without increasing the running time. Furthermore, when multiple programs are active concurrently, the PDF version is less of a cache hog, and its smaller working set is more likely to remain in the cache across context switches. For several other application classes, PDF and WS have roughly the same execution times, either because there is only limited data reuse that can be exploited or because the programs are not limited by off-chip bandwidth. In the latter case, the constructive sharing PDF enables still provides the power and multiprogramming benefits discussed above. Finally, most parallel benchmarks to date, written for SMPs, use such coarse-grained threading that they cannot exploit the constructive cache behavior inherent in PDF. We find that mechanisms for decomposing multithreaded applications into fine-grained tasks are crucial to achieving good performance on CMPs.
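    The WS policy described in the overview above reduces to a few lines. The sketch below models it in plain Python (illustrative only; production runtimes use lock-free deques and often randomized victim selection, and this single-threaded model ignores synchronization entirely): owners pop LIFO from the top of their own deque, while thieves take from the bottom of the first non-empty victim queue.

        from collections import deque

        class Core:
            def __init__(self, cid):
                self.cid = cid
                self.queue = deque()  # local ready queue; right end is the "top"

            def push(self, task):
                self.queue.append(task)            # owner pushes at the top

            def next_task(self, cores):
                if self.queue:
                    return self.queue.pop()        # owner pops LIFO from the top
                for victim in cores:               # scan for first non-empty queue
                    if victim is not self and victim.queue:
                        return victim.queue.popleft()  # steal from the bottom
                return None                        # no work anywhere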

    Murid herpesvirus-4 lacking thymidine kinase reveals route-dependent requirements for host colonization

    Gammaherpesviruses infect at least 90% of the world's population. Infection control is difficult, in part because some fundamental features of host colonization remain unknown, for example whether normal latency establishment requires viral lytic functions. Since human gammaherpesviruses have narrow species tropisms, answering such questions requires animal models, and Murid herpesvirus-4 (MuHV-4) provides one of the most tractable. MuHV-4 genomes delivered to the lung or peritoneum persist without lytic replication. However, they fail to disseminate systemically, suggesting that the outcome is inoculation route-dependent. After upper respiratory tract inoculation, MuHV-4 infects mice without involving the lungs or peritoneum. We examined whether host entry by this less invasive route requires the viral thymidine kinase (TK), a gene classically essential for lytic replication in terminally differentiated cells. MuHV-4 TK knockouts delivered to the lung or peritoneum were attenuated but still reached lymphoid tissue. In contrast, TK knockouts delivered to the upper respiratory tract largely failed to establish a detectable infection. Therefore TK, and by implication lytic replication, is required for MuHV-4 to establish a significant infection by a non-invasive route.

    Study of exclusive one-pion and one-eta production using hadron and dielectron channels in pp reactions at kinetic beam energies of 1.25 GeV and 2.2 GeV with HADES

    We present measurements of exclusive π+, π0, and η production in pp reactions at 1.25 GeV and 2.2 GeV beam kinetic energy in hadron and dielectron channels. In the case of π+ and π0, high-statistics invariant-mass and angular distributions are obtained within the HADES acceptance, as well as acceptance-corrected distributions, which are compared to a resonance model. The sensitivity of the data to the yield and production angular distribution of the Δ(1232) and higher-lying baryon resonances is shown, and an improved parameterization is proposed. The extracted cross sections are of special interest in the case of pp → ppη, since controversial data exist at 2.0 GeV; we find σ = 0.142 ± 0.022 mb. Using the dielectron channels, the π0 and η Dalitz decay signals are reconstructed with yields fully consistent with the hadronic channels. The dielectron invariant-mass and acceptance-corrected helicity-angle distributions are found to be in good agreement with model predictions.

    Scheduling threads for constructive cache sharing on CMPs

    In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3-1.6X performance improvement relative to WS for several fine-grained parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.
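    The grain-size selection idea admits a simple reading: profile the working set at each candidate granularity and choose the coarsest grain that still fits the shared cache. The sketch below is only that reading (the helper names and the fit-the-L2 criterion are assumptions, not the paper's algorithm; the paper's fast working set profiler is treated here as a black box):

        def pick_grain(candidates, working_set_bytes, l2_bytes):
            """candidates: iterable of grain sizes (e.g., elements per task).
            working_set_bytes(g): profiled working-set size in bytes at grain g."""
            fitting = [g for g in candidates if working_set_bytes(g) <= l2_bytes]
            # The coarsest fitting grain minimizes scheduling overhead while
            # keeping the aggregate working set cache-resident; if nothing
            # fits, fall back to the finest grain to maximize sharing.
            return max(fitting) if fitting else min(candidates)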