Deadlock-free fine-grained thread migration
Several recent studies have proposed fine-grained, hardware-level thread migration in multicores as a solution to power, reliability, and memory coherence problems. The need for fast thread migration has been well documented; however, a fast, deadlock-free migration protocol is sorely lacking: existing solutions either deadlock or are too slow and cumbersome to ensure performance with frequent, fine-grained thread migrations.
In this study, we introduce the Exclusive Native Context (ENC) protocol, a general, provably deadlock-free migration protocol for instruction-level thread migration architectures. Simple to implement, ENC does not require additional hardware beyond common migration-based architectures. Our evaluation using synthetic migrations and the SPLASH-2 application suite shows that ENC offers performance within 11.7% of an idealized deadlock-free migration protocol with infinite resources.
Coarse-grained reconfigurable array architectures
Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support, and for the manual fine-tuning of source code.
Concrete resource analysis of the quantum linear system algorithm used to compute the electromagnetic scattering cross section of a 2D target
We provide a detailed estimate for the logical resource requirements of the
quantum linear system algorithm (QLSA) [Phys. Rev. Lett. 103, 150502 (2009)]
including the recently described elaborations [Phys. Rev. Lett. 110, 250504
(2013)]. Our resource estimates are based on the standard quantum-circuit model
of quantum computation; they comprise circuit width, circuit depth, the number
of qubits and ancilla qubits employed, and the overall number of elementary
quantum gate operations as well as more specific gate counts for each
elementary fault-tolerant gate from the standard set {X, Y, Z, H, S, T, CNOT}.
To perform these estimates, we used an approach that combines manual analysis
with automated estimates generated via the Quipper quantum programming language
and compiler. Our estimates pertain to the example problem size N=332,020,680
beyond which, according to a crude big-O complexity comparison, QLSA is
expected to run faster than the best known classical linear-system solving
algorithm. For this problem size, a desired calculation accuracy 0.01 requires
an approximate circuit width 340 and circuit depth of order $10^{25}$ if oracle
costs are excluded, and a circuit width and depth of order $10^8$ and
$10^{29}$, respectively, if oracle costs are included, indicating that the
commonly ignored oracle resources are considerable. In addition to providing
detailed logical resource estimates, it is also the purpose of this paper to
demonstrate explicitly how these impressively large numbers arise with an
actual circuit implementation of a quantum algorithm. While our estimates may
prove to be conservative as more efficient advanced quantum-computation
techniques are developed, they nevertheless provide a valid baseline for
research targeting a reduction of the resource requirements, implying that a
reduction by many orders of magnitude is necessary for the algorithm to become
practical. Comment: 37 pages, 40 figures
An input centric paradigm for program dynamic optimizations and lifetime evolvement
Accurately predicting program behaviors (e.g., memory locality, method calling frequency) is fundamental for program optimizations and runtime adaptations. Despite decades of remarkable progress, prior studies have not systematically exploited the use of program inputs, a deciding factor of program behaviors, to help in program dynamic optimizations. Triggered by the strong and predictive correlations between program inputs and program behaviors that recent studies have uncovered, the dissertation work aims to bring program inputs into the focus of program behavior analysis and program dynamic optimization, cultivating a new paradigm named input-centric program behavior analysis and dynamic optimization. The new optimization paradigm consists of three components, forming a three-layer pyramid. At the base is program input characterization, a component for resolving the complexity in program raw inputs and extracting important features. In the middle is input-behavior modeling, a component for recognizing and modeling the correlations between characterized input features and program behaviors. These two components constitute input-centric program behavior analysis, which (ideally) is able to predict the large-scope behaviors of a program's execution as soon as the execution starts. The top layer is input-centric adaptation, which capitalizes on the novel opportunities created by the first two components to facilitate proactive adaptation for program optimizations. This dissertation aims to develop this paradigm in two stages. In the first stage, we concentrate on exploring the implications of program inputs for program behaviors and dynamic optimization. We construct the basic input-centric optimization framework based on offline training to realize the basic functionalities of the three major components of the paradigm.
For the second stage, we focus on making the paradigm practical by addressing multifaceted issues in handling input complexities, transparent training data collection, and predictive model evolvement across production runs. The techniques proposed in this stage together cultivate a lifelong continuous optimization scheme with cross-input adaptivity. Fundamentally, the new optimization paradigm provides a brand new solution for program dynamic optimization. The techniques proposed in the dissertation together resolve the adaptivity-proactivity dilemma that has been limiting the effectiveness of existing optimization techniques. Its benefits are demonstrated through proactive dynamic optimizations in Jikes RVM and version selection using IBM XL C Compiler, yielding significant performance improvement on a set of Java and C/C++ programs. It may open new opportunities for a broad range of runtime optimizations and adaptations. The evaluation results on both Java and C/C++ applications demonstrate that the new paradigm is promising in advancing the current state of program optimizations.
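The three-layer pyramid described in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the dissertation's implementation: the function names, the two input features (length and value range), the nearest-neighbor model, and the version labels "insertion" and "builtin" are all hypothetical.

```python
# Hypothetical sketch of the three layers; not the dissertation's actual code.

def characterize(raw_input):
    """Layer 1: reduce a non-empty raw input (a list of numbers) to features."""
    return (len(raw_input), max(raw_input) - min(raw_input))

def train(profiled_runs):
    """Layer 2, offline: record (features, best_version) for each profiled run."""
    return [(characterize(raw), best) for raw, best in profiled_runs]

def predict(model, features):
    """Layer 2, at run start: reuse the label of the nearest training input."""
    def squared_distance(entry):
        return sum((a - b) ** 2 for a, b in zip(entry[0], features))
    return min(model, key=squared_distance)[1]

def adapt(model, raw_input):
    """Layer 3: choose a code version proactively, before the main work runs."""
    return predict(model, characterize(raw_input))

# Offline training data: which version was fastest for which input (made up).
model = train([([3, 1, 2], "insertion"), (list(range(1000)), "builtin")])

print(adapt(model, [5, 4, 9, 1]))       # insertion
print(adapt(model, list(range(800))))   # builtin
```

The point of the sketch is the ordering: the prediction is made from the input alone as execution starts, so the adaptation does not have to wait for runtime profiling to observe the behavior first.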
Measuring Thread Timing to Assess the Feasibility of Early-bird Message Delivery
Early-bird communication is a communication/computation overlap technique
that combines fine-grained communication with partitioned communication to
improve application run-time. Communication is divided among the compute
threads such that each individual thread can initiate transmission of its
portion of the data as soon as it is complete rather than waiting for all of
the threads. However, the benefit of early-bird communication depends on the
completion timing of the individual threads. In this paper, we measure and
evaluate the potential overlap, i.e., the idle time each thread experiences between
finishing its computation and the final thread finishing. These measurements
help us understand whether a given application could benefit from early-bird
communication. We present our technique for gathering this data and evaluate
data collected from three proxy applications: MiniFE, MiniMD, and MiniQMC. To
characterize the behavior of these workloads, we study the thread timings at
both a macro level, i.e., across all threads across all runs of an application,
and a micro level, i.e., within a single process of a single run. We observe
that these applications exhibit significantly different behavior. While MiniFE
and MiniQMC appear to be well-suited for early-bird communication because of
their wider thread distribution and more frequent laggard threads, the behavior
of MiniMD may limit its ability to leverage early-bird communication.
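The potential-overlap metric defined in this abstract (the gap between each thread's completion and the final thread's completion) can be computed as in the minimal sketch below. The function name and the timestamps are hypothetical; the paper's actual data-gathering technique is not described in the abstract and is not reproduced here.

```python
from statistics import mean

def potential_overlap(completion_times):
    """Idle time per thread: last thread's finish time minus this thread's."""
    last = max(completion_times)
    return [last - t for t in completion_times]

# Made-up completion timestamps (seconds) for four compute threads
# in one parallel region of one process.
finish = [1.00, 1.25, 1.50, 2.00]
idle = potential_overlap(finish)

print(idle)        # [1.0, 0.75, 0.5, 0.0]
print(mean(idle))  # 0.5625
```

A wide spread of idle times, as in this made-up sample, is the situation in which a thread could start sending its partition early; if all threads finish at nearly the same time, every entry is close to zero and there is little to overlap.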
- …