526 research outputs found

Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load/Store Optimization

Author: Roth Amir
Publication venue: ScholarlyCommons
Publication date: 01/01/2004

A high-bandwidth, low-latency load-store unit is a critical component of a dynamically scheduled processor. Unfortunately, it is also one of the most complex and non-scalable components. Recently, several researchers have proposed techniques that simplify the core load-store unit and improve its scalability in exchange for the in-order pre-retirement re-execution of some subset of the loads in the program. We call such techniques load/store optimizations. One recent optimization attacks load queue (LQ) scalability by replacing the expensive associative search that is used to enforce intra- and inter- thread ordering with load re-execution. A second attacks store queue (SQ) scalability by speculatively filtering some load accesses and some store entries from it. The speculatively accessed, speculatively populated SQ can be made smaller and faster, but load re-execution is required to verify the speculation. A third uses a hardware table to identify redundant loads and skip their execution altogether. Redundant load elimination is highly accurate but not 100%, so re-execution is needed to flag false eliminations. Unfortunately, the inherent benefits of load/store optimizations are mitigated by re-execution itself. Re-execution contends for cache bandwidths with store retirement, and serializes load re-execution with subsequent store retirement. If a particular technique requires a sufficient number of load re-executions, the cost of these re-executions will outweigh the benefits of the technique entirely and may even produce drastic slowdowns. This is the case for the SQ technique. Store Vulnerability Window (SVW) is a new mechanism that reduces the re-execution requirements of a given load/store optimization significantly, by an average of 85% across the three load/store optimizations we study. 
This reduction relieves cache port contention and removes many of the dynamic serialization events that contribute the bulk of re-execution’s cost, and allows these techniques to perform up to their full potential. For the scalable SQ optimization, this means the chance to perform at all. Without SVW, this technique posts significant slowdowns. SVW is a simple scheme based on monotonic store sequence numbering and a novel application of Bloom Filtering. The cost of an effective SVW implementation is a 1KB buffer and a 2B field per LQ entry.
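
The filtering idea in this abstract (monotonic store sequence numbers plus a Bloom-filter-like table) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the class name, the hash function, and the table size of 512 two-byte entries (one way to read "a 1KB buffer") are all assumptions made for illustration.

```python
class StoreVulnerabilityFilter:
    """Toy model of SVW-style re-execution filtering (illustrative assumptions only)."""

    def __init__(self, num_entries=512):
        # Hypothetical sizing: 512 entries, each holding one store sequence number (SSN).
        self.num_entries = num_entries
        self.table = [0] * num_entries
        self.next_ssn = 1  # SSNs increase monotonically as stores retire

    def _index(self, address):
        # Hypothetical hash: 8-byte granularity, then modulo the table size.
        return (address >> 3) % self.num_entries

    def retire_store(self, address):
        """Record the SSN of the youngest retired store that maps to this table entry."""
        ssn = self.next_ssn
        self.next_ssn += 1
        self.table[self._index(address)] = ssn
        return ssn

    def must_reexecute(self, address, vulnerable_after_ssn):
        """A load is re-executed only if some store younger than the point it is
        vulnerable from may have written its address. Aliasing in the table can
        cause extra re-executions, but never a missed one."""
        return self.table[self._index(address)] > vulnerable_after_ssn


filt = StoreVulnerabilityFilter()
ssn = filt.retire_store(0x1000)
print(filt.must_reexecute(0x1000, 0))    # True: a store wrote this address after SSN 0
print(filt.must_reexecute(0x1000, ssn))  # False: the load is not vulnerable to that store
print(filt.must_reexecute(0x1008, 0))    # False: no store maps to this entry
```

Because several addresses share one table entry, the filter errs on the side of re-executing, which keeps the scheme safe while still skipping most re-executions.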

ScholarlyCommons@Penn

Physical Register Reference Counting

Author: Roth Amir
Publication venue: ScholarlyCommons
Publication date: 01/01/2008

Several recently proposed techniques including CPR (Checkpoint Processing and Recovery) and NoSQ (No Store Queue) rely on reference counting to manage physical registers. However, the register reference counting mechanism itself has received surprisingly little attention. This paper fills this gap by describing potential register reference counting schemes for NoSQ, CPR, and a hypothetical NoSQ/CPR hybrid. Although previously described in terms of binary counters, we find that reference counts are actually more naturally represented as matrices. Binary representations can be used as an optimization in specific situations
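
One way to picture reference counts "represented as matrices" is a bit matrix with one row per physical register and one column per potential holder of a reference. The Python sketch below is only an illustration of that reading; the abstract does not specify the layout, and the class name, the notion of a "holder" (for example a checkpoint or an in-flight instruction), and all method names are hypothetical.

```python
class RefCountMatrix:
    """Bit-matrix reference counts: row = physical register, column = holder.
    Layout and naming are assumptions for illustration, not the paper's design."""

    def __init__(self, num_registers, num_holders):
        self.num_holders = num_holders
        self.rows = [0] * num_registers  # each row is a bitmask over holders

    def add_ref(self, reg, holder):
        self.rows[reg] |= 1 << holder

    def count(self, reg):
        # The conventional binary count is the number of set bits in the row.
        return bin(self.rows[reg]).count("1")

    def is_free(self, reg):
        return self.rows[reg] == 0

    def release_holder(self, holder):
        """Clear one column (e.g. a holder retires or is squashed) and return
        the registers whose rows became all-zero as a result."""
        bit = 1 << holder
        freed = []
        for reg, row in enumerate(self.rows):
            if row & bit:
                self.rows[reg] = row & ~bit
                if self.rows[reg] == 0:
                    freed.append(reg)
        return freed


m = RefCountMatrix(num_registers=4, num_holders=3)
m.add_ref(2, 0)
m.add_ref(2, 1)
print(m.count(2))           # 2
print(m.release_holder(0))  # []
print(m.release_holder(1))  # [2]
```

In this reading, releasing a holder is a single column clear, whereas a binary counter would need one decrement per referenced register.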

ScholarlyCommons@Penn

Energy-Effectiveness of Pre-Execution and Energy-Aware P-Thread Selection

Author: Petric Vlad
Roth Amir
Publication venue: ScholarlyCommons
Publication date: 04/06/2005

Pre-execution removes the microarchitectural latency of problem loads from a program’s critical path by redundantly executing copies of their computations in parallel with the main program. There have been several proposed pre-execution systems, a quantitative framework (PTHSEL) for analytical pre-execution thread (p-thread) selection, and even a research prototype. To date, however, the energy aspects of pre-execution have not been studied. Cycle-level performance and energy simulations on SPEC2000 integer benchmarks that suffer from L2 misses show that energy-blind pre-execution naturally has a linear latency/energy trade-off, improving performance by 13.8% while increasing energy consumption by 11.9%. To improve this trade-off, we propose two extensions to PTHSEL. First, we replace the flat cycle-for-cycle load cost model with a model based on a critical-path estimation. This extension increases p-thread efficiency in an energy-independent way. Second, we add a parameterized energy model to PTHSEL (forming PTHSEL+E) that allows it to actively select p-threads that reduce energy rather than (or in combination with) execution latency. Experiments show that PTHSEL+E manipulates pre-execution’s latency/energy trade-off more effectively. Latency-targeted selection benefits from the improved load cost model: its performance improvements grow to an average of 16.4% while energy costs drop to 8.7%. ED-targeted selection produces p-threads that improve performance by only 12.9%, but ED by 8.8%. Targeting p-thread selection for energy reduction results in energy-free pre-execution, with an average speedup of 5.4% and a small decrease in total energy consumption (0.7%).

ScholarlyCommons@Penn

A Quantitative Framework for Automated Pre-Execution Thread Selection

Author: Roth Amir
Sohi Gurindar S.
Publication venue: ScholarlyCommons
Publication date: 01/01/2002

Pre-execution attacks cache misses for which conventional address-prediction driven prefetching is ineffective. In pre-execution, copies of cache miss computations are isolated from the main program and launched as separate threads called p-threads whenever the processor anticipates an upcoming miss. P-thread selection is the task of deciding what computations should execute on p-threads and when they should be launched such that total execution time is minimized. P-thread selection is central to the success of pre-execution. We introduce a framework for automated static p-thread selection, a static p-thread being one whose dynamic instances are repeatedly launched during the course of program execution. Our approach is to formalize the problem quantitatively and then apply standard techniques to solve it analytically. The framework has two novel components. The slice tree is a new data structure that compactly represents the space of all possible static p-threads. Aggregate advantage is a formula that uses raw program statistics and computation structure to assign each candidate static p-thread a numeric score based on estimated latency tolerance and overhead aggregated over its expected dynamic executions. Our framework finds the set of p-threads whose aggregate advantages sum to a maximum. The framework is simple and intuitively parameterized to model the salient microarchitecture features. We apply our framework to the task of choosing p-threads that cover L2 cache misses. Using detailed simulation, we study the effectiveness of our framework, and pre-execution in general, under different conditions. We measure the effect of constraining p-thread length, of adding localized optimization to p-threads, and of using various program samples as a statistical basis for the p-thread selection, and show that our framework responds to these changes in an intuitive way. 
In the microarchitecture dimension, we measure the effect of varying memory latency and processor width and observe that our framework adapts well to these changes. Each experiment includes a validation component which checks that the formal model presented to our framework correctly represents actual execution

CiteSeerX

ScholarlyCommons@Penn

Encoding Mini-Graphs With Handle Prefix Outlining

Author: Bracy Anne W.
Roth Amir
Publication venue: ScholarlyCommons
Publication date: 01/01/2008

Recently proposed techniques like mini-graphs, CCA-subgraphs, and static strands exploit application-specific compound or fused instructions to reduce execution time, energy consumption, and/or processor complexity. To achieve their full potential, these techniques rely on static tools to identify common instruction sequences that make good fusion candidates. As a result, they also rely on ISA extension facilities that can encode these chosen instruction groups in a way that supports efficient execution on fusion-enabled hardware as well as compatibility across different implementations, including fusion-agnostic implementations. This paper describes handle prefix outlining, the ISA extension scheme used by mini-graph processors. Handle prefix outlining can be thought of as a hybrid of the encoding scheme used by three previous instruction aggregation techniques: PRISC, static strands, and CCA-subgraphs. It combines the best features of each scheme to deliver both full compatibility and execution efficiency on fusion-enabled processors

ScholarlyCommons@Penn

NoSQ: Store-Load Communication without a Store Queue

Author: Martin Milo
Roth Amir
Sha Tingting
Publication venue: ScholarlyCommons
Publication date: 01/12/2006

This paper presents NoSQ (short for No Store Queue), a microarchitecture that performs store-load communication without a store queue and without executing stores in the out-of-order engine. NoSQ implements store-load communication using speculative memory bypassing (SMB), the dynamic short-circuiting of DEF-store-load-USE chains to DEF-USE chains. Whereas previous proposals used SMB as an opportunistic complement to conventional store queue-based forwarding, NoSQ uses SMB as a store queue replacement. NoSQ relies on two supporting mechanisms. The first is an advanced store-load bypassing predictor that for a given dynamic load can predict whether that load will bypass and the identity of the communicating store. The second is an efficient verification mechanism for both bypassed and non-bypassed loads using in-order load re-execution with an SMB-aware store vulnerability window (SVW) filter. The primary benefit of NoSQ is a simple, fast datapath that does not contain store-load forwarding hardware; all loads get their values either from the data cache or from the register file. Experiments show that this simpler design - despite being more speculative - slightly outperforms a conventional store-queue based design on most benchmarks (by 2% on average)
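
The short-circuiting of DEF-store-load-USE chains into DEF-USE chains can be pictured as a register-renaming step. The Python sketch below is a simplified illustration, not NoSQ's actual design: the instruction tuples, the `rename` function, and the `bypass_prediction` dictionary (standing in for the bypassing predictor) are hypothetical, and the verification by in-order load re-execution is not modeled.

```python
def rename(instrs, bypass_prediction):
    """Rename a tiny instruction stream, applying speculative memory bypassing.

    instrs: list of ("def", reg), ("store", reg), ("load", reg), ("use", reg).
    bypass_prediction: maps a load's position to the position of the store it is
    predicted to read from (hypothetical stand-in for a bypassing predictor).
    Returns a trace of (operation, physical_register) pairs.
    """
    rename_map = {}       # architectural register -> physical register
    store_data_preg = {}  # store position -> physical register holding its data
    next_preg = 0
    trace = []
    for pos, (op, reg) in enumerate(instrs):
        if op == "def":
            rename_map[reg] = next_preg
            trace.append(("def", next_preg))
            next_preg += 1
        elif op == "store":
            store_data_preg[pos] = rename_map[reg]
            trace.append(("store", rename_map[reg]))
        elif op == "load":
            src_store = bypass_prediction.get(pos)
            if src_store is not None:
                # Bypassed load: its destination shares the store's data register,
                # so the later USE reads the DEF's value directly.
                rename_map[reg] = store_data_preg[src_store]
                trace.append(("load-bypassed", rename_map[reg]))
            else:
                # Non-bypassed load: value comes from the data cache.
                rename_map[reg] = next_preg
                trace.append(("load-cache", next_preg))
                next_preg += 1
        elif op == "use":
            trace.append(("use", rename_map[reg]))
    return trace


program = [("def", "r1"), ("store", "r1"), ("load", "r2"), ("use", "r2")]
print(rename(program, {2: 1}))
# [('def', 0), ('store', 0), ('load-bypassed', 0), ('use', 0)]
```

Sharing one physical register between the DEF and the bypassed load is also why schemes of this kind depend on register reference counting.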

Crossref

ScholarlyCommons@Penn

Source Localization of Brain States Associated with Canonical Neuroimaging Postures

Author: Lifshitz Michael
Raz Amir
Roth Raquel R.
Thibault Robert T.
Publication venue: Chapman University Digital Commons
Publication date: 31/05/2017

Cognitive neuroscientists rarely consider the influence that body position exerts on brain activity; yet, postural variation holds important implications for the acquisition and interpretation of neuroimaging data. Whereas participants in most behavioral and EEG experiments sit upright, many prominent brain imaging techniques (e.g., fMRI) require participants to lie supine. Here we demonstrate that physical comportment profoundly alters baseline brain activity as measured by magnetoencephalography (MEG)—an imaging modality that permits multipostural acquisition. We collected resting-state MEG data from 12 healthy participants in three postures (lying supine, reclining at 45°, and sitting upright). Source-modeling analysis revealed a broadly distributed influence of posture on resting brain function. Sitting upright versus lying supine was associated with greater high-frequency (i.e., beta and gamma) activity in widespread parieto-occipital cortex. Moreover, sitting upright and reclining postures correlated with dampened activity in prefrontal regions across a range of bandwidths (i.e., from alpha to low gamma). The observed effects were large, with a mean Cohen's d of 0.95 (SD = 0.23). In addition to neural activity, physiological parameters such as muscle tension and eye blinks may have contributed to these posture-dependent changes in brain signal. Regardless of the underlying mechanisms, however, the present results have important implications for the acquisition and interpretation of multimodal imaging data (e.g., studies combining fMRI or PET with EEG or MEG). More broadly, our findings indicate that generalizing results—from supine neuroimaging measurements to erect positions typical of ecological human behavior—would call for considering the influence that posture wields on brain dynamics.

Crossref

Chapman University Digital Commons

Neurofeedback with fMRI: A Critical Systematic Review

Author: Lifshitz Michael
MacPherson Amanda
Raz Amir
Roth Raquel R.
Thibault Robert T.
Publication venue: Chapman University Digital Commons
Publication date: 27/12/2017

Neurofeedback relying on functional magnetic resonance imaging (fMRI-nf) heralds new prospects for self-regulating brain and behavior. Here we provide the first comprehensive review of the fMRI-nf literature and the first systematic database of fMRI-nf findings. We synthesize information from 99 fMRI-nf experiments—the bulk of currently available data. The vast majority of fMRI-nf findings suggest that self-regulation of specific brain signatures seems viable; however, replication of concomitant behavioral outcomes remains sparse. To disentangle placebo influences and establish the specific effects of neurofeedback, we highlight the need for double-blind placebo-controlled studies alongside rigorous and standardized statistical analyses. Before fMRI-nf can join the clinical armamentarium, research must first confirm the sustainability, transferability, and feasibility of fMRI-nf in patients as well as in healthy individuals. Whereas modulating specific brain activity promises to mold cognition, emotion, thought, and action, reducing complex mental health issues to circumscribed brain regions may represent a tenuous goal. We can certainly change brain activity with fMRI-nf. However, it remains unclear whether such changes translate into meaningful behavioral improvements in the clinical domain

Crossref

Chapman University Digital Commons

Short-range correlations in nuclear matter using Green's functions within a discrete pole approximation

Author: Amir-Azimi-Nili
Benhar
Bozek
Ciofi degli Atti
D. Van Neck
de Jong
Dewulf
Dickhoff
Gearhart
M. Waroquier
Müther
Ramos
Roth
Stoks
Vonderfecht
Y. Dewulf
Publication venue: 'Elsevier BV'
Publication date: 01/01/2001

We treat short-range correlations in nuclear matter, induced by the repulsive core of the nucleon-nucleon potential, within the framework of a self-consistent Green's function theory. The effective in-medium interaction sums the ladder diagrams of both the particle-particle and hole-hole type. The demand of self-consistency results in a set of nonlinear equations which must be solved by iteration. We explore the possibility of approximating the single-particle Green's function by a limited number of poles and residues. Comment: 9 pages, 3 eps-figures; added two tables dealing with calculations including larger sets of BAGEL-pole

arXiv.org e-Print Archive

Crossref

Ghent University Academic Bibliography

CERN Document Server

A Time Projection Chamber with GEM-Based Readout

Author: Attié David
Behnke Ties
Bellerive Alain
Bezshyyko Oleg
Bhattacharya Deb Sankar
Bhattacharya Purba
Bhattacharya Sudeb
Caiazza Stefano
Colas Paul
De Lentdecker Gilles
Dehmelt Klaus
Desch Klaus
Diener Ralf
Dixit Madhu
Fleck Ivor
Fujii Keisuke
Fusayasu Takahiro
Ganjour Serguei
Gao Yuanning
Gros Philippe
Hayman Peter
Hedberg Vincent
Ikematsu Katsumasa
Jönsson Leif
Kaminski Jochen
Kato Yukihiro
Kawada Shin-ichi
Killenberg Martin
Kleinwort Claus
Kobayashi Makoto
Krylov Vladyslav
Li Bo
Li Yulan
Lundberg Björn
Lupberger Michael
Majumdar Nayana
Matsuda Takeshi
Mehdiyev Rashid
Mjörnmark Ulf
Mukhopadhyay Supratik
Müller Felix
Münnich Astrid
Ogawa Tomohisa
Oskarsson Anders
Peterson Daniel
Riallot Marc
Rosemann Christoph
Roth Stefan
Schade Peter
Schäfer Oliver
Settles Ronald Dean
Shirazi Amir Noori
Smirnova Oxana
Sugiyama Akira
Takahashi Tohru
The LCTPC Collaboration
Tian Junping
Timmermans Jan
Titov Maksym
Tsionou Dimitra
Vauth Annika
Wang Wenxin
Watanabe Takashi
Werthenbach Ulrich
Yang Yifan
Yang Zhenwei
Yonamine Ryo
Zenker Klaus
Zhang Fan
Österman Lennart
Publication date: 01/01/2016

For the International Large Detector concept at the planned International Linear Collider, the use of time projection chambers (TPC) with micro-pattern gas detector readout as the main tracking detector is investigated. In this paper, results from a prototype TPC, placed in a 1 T solenoidal field and read out with three independent GEM-based readout modules, are reported. The TPC was exposed to a 6 GeV electron beam at the DESY II synchrotron. The efficiency for reconstructing hits, the measurement of the drift velocity, the space point resolution and the control of field inhomogeneities are presented. Comment: 22 pages, 19 figures

arXiv.org e-Print Archive

DESY Publication Database

Lund University Publications

HAL-IN2P3

Crossref

Carleton University's Institutional Repository