2,786 research outputs found

    Deadlock-free fine-grained thread migration

    Get PDF
    Several recent studies have proposed fine-grained, hardware-level thread migration in multicores as a solution to power, reliability, and memory coherence problems. The need for fast thread migration has been well documented, however, a fast, deadlock-free migration protocol is sorely lacking: existing solutions either deadlock or are too slow and cumbersome to ensure performance with frequent, fine-grained thread migrations. In this study, we introduce the Exclusive Native Context (ENC) protocol, a general, provably deadlock-free migration protocol for instruction-level thread migration architectures. Simple to implement, ENC does not require additional hardware beyond common migration-based architectures. Our evaluation using synthetic migrations and the SPLASH-2 application suite shows that ENC offers performance within 11.7% of an idealized deadlock-free migration protocol with infinite resources

    Coarse-grained reconfigurable array architectures

    Get PDF
    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code

    Concrete resource analysis of the quantum linear system algorithm used to compute the electromagnetic scattering cross section of a 2D target

    Get PDF
    We provide a detailed estimate for the logical resource requirements of the quantum linear system algorithm (QLSA) [Phys. Rev. Lett. 103, 150502 (2009)] including the recently described elaborations [Phys. Rev. Lett. 110, 250504 (2013)]. Our resource estimates are based on the standard quantum-circuit model of quantum computation; they comprise circuit width, circuit depth, the number of qubits and ancilla qubits employed, and the overall number of elementary quantum gate operations as well as more specific gate counts for each elementary fault-tolerant gate from the standard set {X, Y, Z, H, S, T, CNOT}. To perform these estimates, we used an approach that combines manual analysis with automated estimates generated via the Quipper quantum programming language and compiler. Our estimates pertain to the example problem size N=332,020,680 beyond which, according to a crude big-O complexity comparison, QLSA is expected to run faster than the best known classical linear-system solving algorithm. For this problem size, a desired calculation accuracy 0.01 requires an approximate circuit width 340 and circuit depth of order 102510^{25} if oracle costs are excluded, and a circuit width and depth of order 10810^8 and 102910^{29}, respectively, if oracle costs are included, indicating that the commonly ignored oracle resources are considerable. In addition to providing detailed logical resource estimates, it is also the purpose of this paper to demonstrate explicitly how these impressively large numbers arise with an actual circuit implementation of a quantum algorithm. While our estimates may prove to be conservative as more efficient advanced quantum-computation techniques are developed, they nevertheless provide a valid baseline for research targeting a reduction of the resource requirements, implying that a reduction by many orders of magnitude is necessary for the algorithm to become practical.Comment: 37 pages, 40 figure

    An input centric paradigm for program dynamic optimizations and lifetime evolvement

    Get PDF
    Accurately predicting program behaviors (e.g., memory locality, method calling frequency) is fundamental for program optimizations and runtime adaptations. Despite decades of remarkable progress, prior studies have not systematically exploited the use of program inputs, a deciding factor of program behaviors, to help in program dynamic optimizations. Triggered by the strong and predictive correlations between program inputs and program behaviors that recent studies have uncovered, the dissertation work aims to bring program inputs into the focus of program behavior analysis and program dynamic optimization, cultivating a new paradigm named input-centric program behavior analysis and dynamic optimization.;The new optimization paradigm consists of three components, forming a three-layer pyramid. at the base is program input characterization, a component for resolving the complexity in program raw inputs and extracting important features. In the middle is input-behavior modeling, a component for recognizing and modeling the correlations between characterized input features and program behaviors. These two components constitute input-centric program behavior analysis, which (ideally) is able to predict the large-scope behaviors of a program\u27s execution as soon as the execution starts. The top layer is input-centric adaptation, which capitalizes on the novel opportunities created by the first two components to facilitate proactive adaptation for program optimizations.;This dissertation aims to develop this paradigm in two stages. In the first stage, we concentrate on exploring the implications of program inputs for program behaviors and dynamic optimization. We construct the basic input-centric optimization framework based on of line training to realize the basic functionalities of the three major components of the paradigm. For the second stage, we focus on making the paradigm practical by addressing multi-facet issues in handling input complexities, transparent training data collection, predictive model evolvement across production runs. The techniques proposed in this stage together cultivate a lifelong continuous optimization scheme with cross-input adaptivity.;Fundamentally the new optimization paradigm provides a brand new solution for program dynamic optimization. The techniques proposed in the dissertation together resolve the adaptivity-proactivity dilemma that has been limiting the effectiveness of existing optimization techniques. its benefits are demonstrated through proactive dynamic optimizations in Jikes RVM and version selection using IBM XL C Compiler, yielding significant performance improvement on a set of Java and C/C++ programs. It may open new opportunities for a broad range of runtime optimizations and adaptations. The evaluation results on both Java and C/C++ applications demonstrate the new paradigm is promising in advancing the current state of program optimizations

    Measuring Thread Timing to Assess the Feasibility of Early-bird Message Delivery

    Full text link
    Early-bird communication is a communication/computation overlap technique that combines fine-grained communication with partitioned communication to improve application run-time. Communication is divided among the compute threads such that each individual thread can initiate transmission of its portion of the data as soon as it is complete rather than waiting for all of the threads. However, the benefit of early-bird communication depends on the completion timing of the individual threads. In this paper, we measure and evaluate the potential overlap, the idle time each thread experiences between finishing their computation and the final thread finishing. These measurements help us understand whether a given application could benefit from early-bird communication. We present our technique for gathering this data and evaluate data collected from three proxy applications: MiniFE, MiniMD, and MiniQMC. To characterize the behavior of these workloads, we study the thread timings at both a macro level, i.e., across all threads across all runs of an application, and a micro level, i.e., within a single process of a single run. We observe that these applications exhibit significantly different behavior. While MiniFE and MiniQMC appear to be well-suited for early-bird communication because of their wider thread distribution and more frequent laggard threads, the behavior of MiniMD may limit its ability to leverage early-bird communication
    corecore