133 research outputs found

    A Quantitative Framework for Automated Pre-Execution Thread Selection

    Get PDF
    Pre-execution attacks cache misses for which conventional address-prediction driven prefetching is ineffective. In pre-execution, copies of cache miss computations are isolated from the main program and launched as separate threads called p-threads whenever the processor anticipates an upcoming miss. P-thread selection is the task of deciding what computations should execute on p-threads and when they should be launched such that total execution time is minimized. P-thread selection is central to the success of pre-execution. We introduce a framework for automated static p-thread selection, a static p-thread being one whose dynamic instances are repeatedly launched during the course of program execution. Our approach is to formalize the problem quantitatively and then apply standard techniques to solve it analytically. The framework has two novel components. The slice tree is a new data structure that compactly represents the space of all possible static p-threads. Aggregate advantage is a formula that uses raw program statistics and computation structure to assign each candidate static p-thread a numeric score based on estimated latency tolerance and overhead aggregated over its expected dynamic executions. Our framework finds the set of p-threads whose aggregate advantages sum to a maximum. The framework is simple and intuitively parameterized to model the salient microarchitecture features. We apply our framework to the task of choosing p-threads that cover L2 cache misses. Using detailed simulation, we study the effectiveness of our framework, and pre-execution in general, under difference conditions. We measure the effect of constraining p-thread length, of adding localized optimization to p-threads, and of using various program samples as a statistical basis for the p-thread selection, and show that our framework responds to these changes in an intuitive way. In the microarchitecture dimension, we measure the effect of varying memory latency and processor width and observe that our framework adapts well to these changes. Each experiment includes a validation component which checks that the formal model presented to our framework correctly represents actual execution

    Mixed-mode multicore reliability

    Get PDF
    Future processors are expected to observe increasing rates of hardware faults. Using Dual-Modular Redundancy (DMR), two cores of a multicore can be loosely coupled to redundantly execute a single software thread, providing very high coverage from many difference sources of faults. This reliability, however, comes at a high price in terms of per-thread IPC and overall system throughput. We make the observation that a user may want to run both applications requiring high reliability, such as financial software, and more fault tolerant applications requiring high performance, such as media or web software, on the same machine at the same time. Yet a traditional DMR system must fully operate in redundant mode whenever any application requires high reliability. This paper proposes a Mixed-Mode Multicore (MMM), which enables most applications, including the system software, to run with high reliability in DMR mode, while applications that need high performance can avoid the penalty of DMR. Though conceptually simple, two key challenges arise: 1) care must be taken to protect reliable applications from any faults occurring to applications running in high performance mode, and 2) the desire to execute additional independent software threads for a performance application complicates the scheduling of computation to cores. After solving these issues, an MMM is shown to improve overall system performance, compared to a traditional DMR system, by approximately 2X when one reliable and one performance application are concurrently executing

    Adaptive, efficient, parallel execution of parallel programs

    Get PDF
    Abstract Future multicore processors will be heterogeneous, be increasingly less reliable, and operate in dynamically changing operating conditions. Such environments will result in a constantly varying pool of hardware resources which can greatly complicate the task of efficiently exposing a program's parallelism onto these resources. Coupled with this uncertainty is the diverse set of efficiency metrics that users may desire. This paper proposes Varuna, a system that dynamically, continuously, rapidly and transparently adapts a program's parallelism to best match the instantaneous capabilities of the hardware resources while satisfying different efficiency metrics. Varuna is applicable to both multithreaded and task-based programs and can be seamlessly inserted between the program and the operating system without needing to change the source code of either. We demonstrate Varuna's effectiveness in diverse execution environments using unaltered C/C++ parallel programs from various benchmark suites. Regardless of the execution environment, Varuna always outperformed the state-of-the-art approaches for the efficiency metrics considered

    ATM: approximate task memoization in the runtime system

    Get PDF
    Redundant computations appear during the execution of real programs. Multiple factors contribute to these unnecessary computations, such as repetitive inputs and patterns, calling functions with the same parameters or bad programming habits. Compilers minimize non useful code with static analysis. However, redundant execution might be dynamic and there are no current approaches to reduce these inefficiencies. Additionally, many algorithms can be computed with different levels of accuracy. Approximate computing exploits this fact to reduce execution time at the cost of slightly less accurate results. In this case, expert developers determine the desired tradeoff between performance and accuracy for each application. In this paper, we present Approximate Task Memoization (ATM), a novel approach in the runtime system that transparently exploits both dynamic redundancy and approximation at the task granularity of a parallel application. Memoization of previous task executions allows predicting the results of future tasks without having to execute them and without losing accuracy. To further increase performance improvements, the runtime system can memoize similar tasks, which leads to task approximate computing. By defining how to measure task similarity and correctness, we present an adaptive algorithm in the runtime system that automatically decides if task approximation is beneficial or not. When evaluated on a real 8-core processor with applications from different domains (financial analysis, stencil-computation, machine-learning and linear-algebra), ATM achieves a 1.4x average speedup when only applying memoization techniques. When adding task approximation, ATM achieves a 2.5x average speedup with an average 0.7% accuracy loss (maximum of 3.2%).This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the Spanish Government (grant SEV2015-0493 of the Severo Ochoa Program), by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and the European HiPEAC Network of Excellence. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243). I. Brumar has been partially supported by the Spanish Ministry of Education, Culture and Sports under grant FPU2015/12849.Peer ReviewedPostprint (author's final draft

    Use-Based Register Caching with Decoupled Indexing

    Full text link

    Multiple metal contamination from house paints: consequences of power sanding and paint scraping in New Orleans.

    Get PDF
    Power sanding exterior paint is a common practice during repainting of old houses in New Orleans, Louisiana, that triggers lead poisoning and releases more than Pb. In this study we quantified the Pb, zinc, cadmium, manganese, nickel, copper, cobalt, chromium, and vanadium in exterior paint samples collected from New Orleans homes (n = 31). We used interior dust wipes to compare two exterior house-painting projects. House 1 was measured in response to the plight of a family after a paint contractor power sanded all exterior paint from the weatherboards. The Pb content (approximately 130,000 microg Pb/g) was first realized when the family pet died; the children were hospitalized, the family was displaced, and cleanup costs were high. To determine the quantity of dust generated by power sanding and the benefits of reducing Pb-contaminated dust, we tested a case study house (house 2) for Pb (approximately 90,000 microg/g) before the project was started; the house was then dry scraped and the paint chips were collected. Although the hazards of Pb-based paints are well known, there are other problems as well, because other toxic metals exist in old paints. If house 2 had been power sanded to bare wood like house 1, the repainting project would have released as dust about 7.4 kg Pb, 3.5 kg Zn, 9.7 g Cd, 14.8 g Cu, 8.8 g Mn, 1.5 g Ni, 5.4 g Co, 2.4 g Cr, and 0.3 g V. The total tolerable daily intake (TTDI) for a child under 6 years of age is 6 microg Pb from all sources. Converting 7.4 kg Pb to this scale is vexing--more than 1 billion (10(9)) times the TTDI. Also for perspective, the one-time release of 7.4 x 10(9) microg of Pb dust from sanding compares to 50 x 10(9) microg of Pb dust emitted annually per 0.1 mile (0.16 km) from street traffic during the peak use of leaded gasoline. In this paper, we broaden the discussion to include an array of metals in paint and underscore the need and possibilities for curtailing the release of metal dust

    The Effects of Faults in Cache Memories

    No full text
    corecore