A comprehensive approach to DRAM power management
This paper describes a comprehensive approach for using the memory controller to improve DRAM energy efficiency and manage DRAM power. We make three contributions: (1) we describe a simple power-down policy for exploiting low power modes of modern DRAMs; (2) we show how the idea of adaptive history-based memory schedulers can be naturally extended to manage power and energy; and (3) for situations in which additional DRAM power reduction is needed, we present a throttling approach that reduces DRAM activity by an arbitrary amount by delaying the issuance of memory commands. Using detailed microarchitectural simulators of the IBM Power5+ and a DDR2-533 SDRAM, we show that our first two techniques combine to increase DRAM energy efficiency by an average of 18.2%, 21.7%, 46.1%, and 37.1% for the Stream, NAS, SPEC2006fp, and commercial benchmarks, respectively. We also show that our throttling approach provides performance that is within 4.4% of an idealized oracular approach.
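A timeout-driven power-down policy of the kind the first contribution describes can be sketched as follows. This is a minimal illustration only; the threshold, wake-up penalty, and class names are assumptions, not the paper's actual controller logic:

```python
# Sketch of a timeout-based DRAM power-down policy: a rank enters a
# low-power state after IDLE_THRESHOLD idle memory-controller cycles
# and pays a wake-up penalty on the next request.

IDLE_THRESHOLD = 100   # idle cycles before power-down (assumed value)
WAKEUP_PENALTY = 6     # extra cycles to exit power-down (assumed value)

class Rank:
    def __init__(self):
        self.idle_cycles = 0
        self.powered_down = False

    def tick(self):
        """Called once per cycle in which no command targets this rank."""
        self.idle_cycles += 1
        if not self.powered_down and self.idle_cycles >= IDLE_THRESHOLD:
            self.powered_down = True   # enter low-power mode

    def access(self):
        """Returns the extra latency paid for this access."""
        penalty = WAKEUP_PENALTY if self.powered_down else 0
        self.powered_down = False
        self.idle_cycles = 0
        return penalty

rank = Rank()
for _ in range(150):          # rank sits idle past the threshold
    rank.tick()
assert rank.powered_down
print(rank.access())          # -> 6 (wake-up penalty paid)
```

The policy trades a small wake-up latency on the next access against the energy saved while the rank idles, which is why the paper pairs it with a scheduler that shapes when ranks go idle.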
Accurate and Scalable Many-Node Simulation
Accurate performance estimation of future many-node machines is challenging
because it requires detailed simulation models of both node and network.
However, simulating the full system in detail is infeasible in terms of compute
and memory resources. State-of-the-art techniques use a two-phase approach that
combines detailed simulation of a single node with network-only simulation of
the full system. We show that these techniques, where the detailed node
simulation is done in isolation, are inaccurate because they ignore two
important node-level effects: compute time variability, and inter-node
communication.
We propose a novel three-stage simulation method to allow scalable and
accurate many-node simulation, combining native profiling, detailed node
simulation and high-level network simulation. By including timing variability
and the impact of external nodes, our method leads to more accurate estimates.
We validate our technique against measurements on a multi-node cluster, and
report an average 6.7% error on 64 nodes (maximum error of 12%), compared to on
average 27% error and up to 54% when timing variability and the scaling
overhead are ignored. At higher node counts, the prediction error of ignoring
variable timings and scaling overhead continues to increase compared to our
technique, and may lead to selecting the wrong optimal cluster configuration.
Using our technique, we are able to accurately project performance to
thousands of nodes within a day of simulation time, using only a single or a
few simulation hosts. Our method can be used to quickly explore large many-node
design spaces, including node micro-architecture, node count and network
configuration.
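The impact of the first ignored effect, compute-time variability, can be illustrated with a toy model. With barrier-style synchronization, each step is gated by the slowest node, so a deterministic single-node estimate systematically undershoots the parallel runtime. The distribution and its parameters below are assumptions chosen for illustration, not values from the paper:

```python
# Toy model: mean per-node compute time underestimates parallel
# runtime when barriers wait for the slowest of many nodes.
import random

random.seed(42)
NODES, STEPS = 64, 100
MEAN, JITTER = 10.0, 1.0   # ms per compute phase (assumed)

# What a single-node-in-isolation model predicts:
deterministic = STEPS * MEAN

# What a model with per-node timing variability predicts:
with_variability = 0.0
for _ in range(STEPS):
    # each node draws its own compute time; the barrier waits for the max
    with_variability += max(random.gauss(MEAN, JITTER) for _ in range(NODES))

print(f"deterministic:    {deterministic:.0f} ms")
print(f"with variability: {with_variability:.0f} ms")  # strictly larger
```

The gap grows with node count, which matches the abstract's observation that the prediction error of ignoring variable timings keeps increasing at higher node counts.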
FaulTM: Fault-tolerance using hardware transactional memory
Fault-tolerance has become an essential concern for processor designers due to increasing soft-error rates. In this study, we are motivated by the fact that Transactional Memory (TM) hardware provides an ideal base upon which to build a fault-tolerant system. We show how it is possible to provide low-cost fault-tolerance for serial programs by using a minimally-modified Hardware Transactional Memory (HTM) that features lazy conflict detection and lazy data versioning. This scheme, called FaulTM, employs a hybrid hardware-software fault-tolerance technique. On the software side, the FaulTM programming model provides the flexibility for programmers to decide between performance and reliability. Our experimental results indicate that FaulTM incurs relatively little performance overhead by reducing the number of comparisons and by leveraging already-proposed TM hardware. We also conduct experiments which indicate that the baseline FaulTM design has good error coverage. To the best of our knowledge, this is the first architectural fault-tolerance proposal using Hardware Transactional Memory.
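The core idea, redundant execution with lazy comparison of buffered write sets before commit, can be sketched in a few lines. This is a simplified software analogy of the hardware scheme, with assumed function and variable names:

```python
# Sketch of FaulTM-style redundant execution: run the transaction
# body twice in isolation and commit only if the two buffered write
# sets match; a mismatch signals a soft error and forces an abort.

def faultm_commit(body, memory):
    ws1, ws2 = {}, {}
    body(memory, ws1)   # first speculative execution
    body(memory, ws2)   # redundant execution
    if ws1 != ws2:      # lazy comparison of buffered write sets
        return False    # soft error detected: abort and re-execute
    memory.update(ws1)  # identical results: safe to commit
    return True

mem = {"x": 1}
def txn(mem, ws):
    ws["x"] = mem["x"] + 1
print(faultm_commit(txn, mem), mem["x"])   # -> True 2
```

Comparing whole write sets once at commit, rather than every instruction result, is what keeps the number of comparisons, and hence the overhead, low.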
Automatic SMT threading for OpenMP applications on the Intel Xeon Phi co-processor
Simultaneous multithreading is a technique that can improve performance when running parallel applications on the Intel Xeon Phi co-processor. Selecting the most efficient thread count is, however, non-trivial, as the potential increase in efficiency has to be balanced against other, potentially negative factors such as inter-thread competition for cache capacity and increased synchronization overheads. In this paper, we extend CRUST (ClusteR-aware Under-subscribed Scheduling of Threads), a technique for finding the optimum thread count of OpenMP applications running on clustered cache architectures, to take the behavior of simultaneous multithreading on the Xeon Phi into account. CRUST can automatically find the optimum thread count at sub-application granularity by exploiting application phase behavior at OpenMP parallel section boundaries, and uses hardware performance counter information to gain insight into the application's behavior. We implement a CRUST prototype inside the Intel OpenMP runtime library and show its efficiency running on real Xeon Phi hardware.
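Per-region explore-then-exploit tuning of the thread count, in the spirit of the approach above, can be sketched as follows. The candidate counts, the class, and the reporting hook are illustrative assumptions, not the Intel OpenMP runtime's actual API:

```python
# Sketch of per-parallel-region thread-count tuning: each region
# tries every candidate count once, then settles on the fastest.

CANDIDATES = [1, 2, 3, 4]   # hardware threads per core on Xeon Phi

class RegionTuner:
    def __init__(self):
        self.timings = {}            # thread count -> measured time

    def next_count(self):
        for c in CANDIDATES:         # explore untried counts first
            if c not in self.timings:
                return c
        return min(self.timings, key=self.timings.get)  # then exploit

    def report(self, count, elapsed):
        self.timings[count] = elapsed

tuner = RegionTuner()
# Fake per-invocation timings for one region (lower is better):
fake_time = {1: 4.0, 2: 2.5, 3: 2.1, 4: 2.8}
for _ in range(6):
    c = tuner.next_count()
    tuner.report(c, fake_time[c])
print(tuner.next_count())   # -> 3, the best-measured count
```

Keeping one tuner per OpenMP parallel region is what gives the sub-application granularity: regions with different cache or synchronization behavior can settle on different thread counts. The real system uses hardware performance counters, not wall-clock time alone, to judge each setting.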
Circuit design of a dual-versioning L1 data cache for optimistic concurrency
This paper proposes a novel L1 data cache design with dual-versioning SRAM cells (dvSRAM) for chip multi-processors (CMP) that implement optimistic concurrency proposals. In this new cache architecture, each dvSRAM cell consists of two cells, a main cell and a secondary cell, which keep two versions of the same data. These values can be accessed, modified, and moved back and forth between the main and secondary cells within the access time of the cache. We design and simulate a 32-KB dual-versioning L1 data cache, which we describe in detail, with 45nm CMOS technology at 2GHz processor frequency and 1V supply voltage. We also introduce three well-known use cases that make use of optimistic concurrency execution and that can benefit from our proposed design. Moreover, we evaluate one of the use cases to show the impact of the dual-versioning cell on both performance and energy consumption. Our experiments show that large speedups can be achieved with acceptable overall energy dissipation.
From Plasma to BeeFarm: Design experience of an FPGA-based multicore prototype
In this paper, we take a MIPS-based open-source uniprocessor soft core, Plasma, and extend it to obtain the BeeFarm infrastructure for FPGA-based multiprocessor emulation, a popular research topic of the last few years in both the FPGA and the computer architecture communities. We discuss various design tradeoffs and demonstrate superior scalability through experimental results compared to traditional software instruction set simulators. Based on our experience of designing and building a complete FPGA-based multiprocessor emulation system that supports run-time and compiler infrastructure, and on the actual executions of our experiments running Software Transactional Memory (STM) benchmarks, we comment on the pros, cons, and future trends of using hardware-based emulation for research.
Hardware transactional memory with software-defined conflicts
In this paper we propose conflict-defined blocks, a programming language construct that allows programmers to change the concept of conflict from one transaction to another, or even throughout the course of the same transaction. Defining conflicts in software makes possible the removal of dependencies which, though not necessary for the correct execution of the transactions, arise as a result of the coarse synchronization style encouraged by TM. Programmers take advantage of their knowledge about the problem and specify through conflict-defined blocks what types of dependencies are superfluous in a certain part of the transaction, in order to extract more performance out of coarse-grained transactions without having to write minimally synchronized code. Our experiments with several transactional benchmarks reveal that using software-defined conflicts, the programmer achieves significant reductions in the number of aborted transactions and improves scalability.
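The flavor of the construct can be conveyed with a small sketch: inside a block, the programmer tells the TM runtime which data conflicts are benign and may be ignored. The runtime class and its methods below are hypothetical, invented purely for illustration:

```python
# Sketch of a conflict-defined block: within the block, concurrent
# writes to the named fields no longer count as conflicts, so the
# transaction need not abort on them.

class TMRuntime:
    def __init__(self):
        self.ignored = set()         # fields whose conflicts are benign

    def ignore_conflicts_on(self, *fields):
        runtime, names = self, set(fields)
        class Block:                 # scoped via the with-statement
            def __enter__(self):
                runtime.ignored |= names
            def __exit__(self, *exc):
                runtime.ignored -= names
        return Block()

    def is_conflict(self, field):
        # A concurrent write to `field` aborts the transaction only
        # if the programmer has not declared it benign here.
        return field not in self.ignored

tm = TMRuntime()
with tm.ignore_conflicts_on("stats_counter"):
    print(tm.is_conflict("stats_counter"))   # -> False: no abort
print(tm.is_conflict("stats_counter"))       # -> True again outside
```

Scoping the relaxation to a block is the key design point: the weaker conflict definition applies only where the programmer has reasoned that the dependency is superfluous, and the default strict semantics return as soon as the block ends.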
Robust orienting to protofacial stimuli in autism
Newborn infants exhibit a remarkable tendency to orient to faces. This behavior is thought to be mediated by a subcortical mechanism tuned to the protoface stimulus: a face-like configuration comprising three dark areas on a lighter background. When this unique stimulus translates across their visual field, neurotypical infants will change their gaze or head direction to track the protoface [1–3]. Orienting to this low spatial frequency pattern is thought to encourage infants to attend to faces, despite their poor visual acuity [2,3]. By biasing the input into the newborn’s visual system, this primitive instinct may serve to ‘canalize’ the development of more sophisticated face representation. Leading accounts attribute deficits of face perception associated with Autism Spectrum Disorders (ASD) [4] to abnormalities within this orienting mechanism. If infants who are later diagnosed with ASD exhibit reduced protoface orienting, this may compromise the emergence of perceptual expertise for faces [5]. Here we report a novel effect that confirms that the protoface stimulus captures adults’ attention via an involuntary, exogenous process (Experiment 1). Contrary to leading developmental accounts of face perception deficits in ASD, we go on to show that this orienting response is intact in autistic individuals (Experiment 2).
Neutral Higgs bosons in the MNMSSM with explicit CP violation
Within the framework of the minimal non-minimal supersymmetric standard model
(MNMSSM) with tadpole terms, CP violation effects in the Higgs sector are
investigated at the one-loop level, where the radiative corrections from the
loops of the quark and squarks of the third generation are taken into account.
Assuming that the squark masses are not degenerate, the radiative corrections
due to the stop and sbottom quarks give rise to CP phases, which trigger the CP
violation explicitly in the Higgs sector of the MNMSSM. The masses, the
branching ratios for dominant decay channels, and the total decay widths of the
five neutral Higgs bosons in the MNMSSM are calculated in the presence of the
explicit CP violation. The dependence of these quantities on the CP phases is
clearly discernible for the given parameter values.