2,287 research outputs found

    Partitioning loops with variable dependence distances

    Get PDF
    A new technique to parallelize loops with variable distance vectors is presented. The method extends previous methods in two ways. First, it allows array subscripts to be any linear combination of all loop indices. The solutions to the linear dependence equations established from such array subscripts are characterized by a pseudo distance matrix (PDM). Second, it allows loop parallelism to be exploited from the PDM by applying unimodular and partitioning transformations that preserve the lexicographical order of the dependent iterations. The algorithms to derive the PDM, to find a suitable loop transformation and to generate parallel code are described, showing that it is possible to parallelize a wider range of loops automatically.
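    As an illustration of the kind of loop targeted here, the following sketch (our own example, not taken from the paper) has subscripts that are linear combinations of both loop indices, so the dependence distance varies across the iteration space; the pseudo distance matrix characterizes the solutions of the resulting linear dependence equation.

    /* Hypothetical example: the write A[2*i + j] and the read A[i + 2*j]
     * induce a flow dependence whenever 2*i1 + j1 = i2 + 2*j2, so the
     * distance (i2 - i1, j2 - j1) is not a constant vector.
     * A must hold at least 3*N + 1 elements. */
    void kernel(double *A, int N)
    {
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++)
                A[2*i + j] = A[i + 2*j] + 1.0;  /* variable-distance dependence */
    }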

    Non-uniform dependences partitioned by recurrence chains

    Get PDF
    Non-uniform distance loop dependences are a known obstacle to finding parallel iterations. To find the outermost loop parallelism in these "irregular" loops, a novel method based on recurrence chains is presented. The scheme organizes non-uniformly dependent iterations into lexicographically ordered monotonic chains. While the initial and final iterations of monotonic chains form two parallel sets, the remaining iterations form an intermediate set that can be partitioned further. When there is only one pair of coupled array references, the non-uniform dependences are represented by a single recurrence equation. In that case, the chains in the intermediate set do not bifurcate and each can be executed as a WHILE loop. The independent iterations and the initial iterations of monotonic dependence chains constitute the outermost parallelism. The proposed approach compares favorably with other treatments of non-uniform dependences in the literature. When there are multiple recurrence equations, a dataflow parallel execution can be scheduled by applying the technique exhaustively to find maximum loop parallelism.
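    The chain-based execution can be pictured with the following sketch, assuming a single recurrence equation; chain_start, next_in_chain and body are hypothetical placeholders, not the paper's notation. The initial iterations of the chains run in parallel, and each chain is then walked sequentially as a WHILE loop because the distance to the next dependent iteration varies.

    void run_chains(const int *chain_start, int num_chains,
                    int (*next_in_chain)(int), void (*body)(int))
    {
        #pragma omp parallel for
        for (int c = 0; c < num_chains; c++) {
            int i = chain_start[c];       /* initial iterations: a parallel set     */
            while (i >= 0) {
                body(i);                  /* execute this dependent iteration       */
                i = next_in_chain(i);     /* follow the monotonic chain; -1 ends it */
            }
        }
    }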

    Compiler Optimization Techniques for Scheduling and Reducing Overhead

    Get PDF
    Exploiting parallelism in loops in programs is an important factor in realizing the potential performance of processors today. This dissertation develops and evaluates several compiler optimizations aimed at improving the performance of loops on processors. An important feature of a class of scientific computing problems is the regularity exhibited by their access patterns. Chapter 2 presents an approach to optimizing the address generation of these problems that results in the following: (i) elimination of redundant arithmetic computation by recognizing and exploiting the presence of common sub-expressions across different iterations in stencil codes; and (ii) conversion of as many array references to scalar accesses as possible, which leads to reduced execution time, decreased address arithmetic overhead, and access to data in registers as opposed to caches. With the advent of VLIW processors, the exploitation of fine-grain instruction-level parallelism has become a major challenge for optimizing compilers. Fine-grain scheduling of inner loops has received a lot of attention, but little work has been done on applying it to nested loops. Chapter 3 presents an approach to fine-grain scheduling of nested loops by formulating the problem of finding the minimum iteration initiation interval as one of finding a rational affine schedule for each statement in the body of a perfectly nested loop, which is then solved using linear programming. Frequent synchronization on multiprocessors is expensive. Chapter 4 presents a method for eliminating redundant synchronization for nested loops. In nested loops, a dependence may be redundant in only a portion of the iteration space. A characterization of the non-uniformity of the redundancy of a dependence is developed in terms of the relation between the dependences and the shape and size of the iteration space. Exploiting locality is critical for achieving a high level of performance on a parallel machine. Chapter 5 presents an approach that uses the concept of affinity regions to find transformations such that a suitable iteration-to-processor mapping can be found for a sequence of loop nests accessing shared arrays. This not only improves data locality but also significantly reduces communication overhead.
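    The Chapter 2 idea of eliminating redundant loads and address arithmetic can be illustrated on a one-dimensional stencil; the example below is ours and only sketches the effect of the transformation, not the dissertation's algorithm. Operands that overlap between consecutive iterations are kept in rotating scalars, so each iteration issues a single new array access.

    void smooth(const double *a, double *b, int n)
    {
        /* naive form: b[i] = a[i-1] + a[i] + a[i+1]; three loads per iteration */
        if (n < 3) return;
        double left = a[0], mid = a[1];
        for (int i = 1; i < n - 1; i++) {
            double right = a[i + 1];      /* the only new array access per iteration */
            b[i] = left + mid + right;
            left = mid;                   /* reuse values loaded in earlier iterations */
            mid  = right;
        }
    }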

    A theoretical foundation for program transformations to reduce cache thrashing due to true data sharing

    Get PDF
    Cache thrashing due to true data sharing can degrade the performance of parallel programs significantly. Our previous work showed that parallel task alignment via program transformations can be quite effective for the reduction of such cache thrashing. In this paper, we present a theoretical foundation for such program transformations. Based on linear algebra and the theory of numbers, our work analyzes the data dependences among the tasks created by a fork-join parallel program and determines at compile time how these tasks should be assigned to processors in order to reduce cache thrashing due to true data sharing. Our analysis and program transformations can be easily performed by compilers for parallel computers.
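    A simplified illustration of the intended effect (our example, not the paper's formalism): if two fork-join parallel loops that share an array use the same iteration-to-processor assignment, iteration i of both loops runs on the same processor, so the cache line holding a[i] is reused locally instead of migrating between caches.

    void phase1(double *a, int n) {
        #pragma omp parallel for schedule(static)   /* block assignment of iterations */
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * a[i];
    }

    void phase2(const double *a, double *b, int n) {
        /* same static schedule: a[i] stays in the cache of the thread that wrote it */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++)
            b[i] = a[i] + 1.0;
    }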

    Self correction requires Energy Barrier for Abelian quantum doubles

    Get PDF
    We rigorously establish an Arrhenius law for the mixing time of quantum doubles based on any Abelian group \mathbb{Z}_d. We make the concept of the energy barrier therein mathematically well-defined: it is related to the minimum energy cost the environment has to provide to the system in order to produce a generalized Pauli error, maximized over all generalized Pauli errors, not only logical operators. We evaluate this generalized energy barrier in Abelian quantum double models and find it to be a constant independent of system size. Thus, we rule out the possibility of entropic protection for this broad group of models. Comment: 18 pages, 6 figures.
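    Schematically, an Arrhenius law relates the mixing (memory) time to the inverse temperature \beta and the generalized energy barrier \bar{\epsilon} defined in the paper; the form below suppresses prefactors and is only meant to fix notation.

    t_{\mathrm{mix}} \sim e^{\beta \bar{\epsilon}}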

    Iterative Schedule Optimization for Parallelization in the Polyhedron Model

    Get PDF
    In high-performance computing, one primary objective is to exploit the performance that the given target hardware can deliver to the fullest. Compilers that have the ability to automatically optimize programs for a specific target hardware can be highly useful in this context. Iterative (or search-based) compilation requires little or no prior knowledge and can adapt more easily to concrete programs and target hardware than static cost models and heuristics. Thereby, iterative compilation helps in situations in which static heuristics do not reflect the combination of input program and target hardware well. Moreover, iterative compilation may enable the derivation of more accurate cost models and heuristics for optimizing compilers. In this context, the polyhedron model is of help, as it provides not only a mathematical representation of programs but, more importantly, a uniform representation of complex sequences of program transformations by schedule functions. The latter facilitates the systematic exploration of the set of legal transformations of a given program. Early approaches to purely iterative schedule optimization in the polyhedron model do not limit their search to schedules that preserve program semantics and thereby suffer from the need to explore large numbers of illegal schedules. More recent research ensures the legality of program transformations but presumes a sequential rather than a parallel execution of the transformed program. Other approaches do not perform a purely iterative optimization. We propose an approach to iterative schedule optimization for parallelization and tiling in the polyhedron model. Our approach targets loop programs that profit from data locality optimization and coarse-grained loop parallelization. The schedule search space can be explored either randomly or by means of a genetic algorithm. To determine a schedule's profitability, we rely primarily on measuring the transformed code's execution time. While benchmarking is accurate, it increases the time and resource consumption of program optimization tremendously and can even make it impractical. We address this limitation by proposing to learn surrogate models from schedules generated and evaluated in previous runs of the iterative optimization and to replace benchmarking by performance prediction to the extent possible. Our evaluation on the PolyBench 4.1 benchmark set reveals that, in a given setting, iterative schedule optimization yields significantly higher speedups in the execution of the program to be optimized. Surrogate performance models learned from training data that was generated during previous iterative optimizations can reduce the benchmarking effort without strongly impairing the optimization result. A prerequisite for this approach is a sufficient similarity between the training programs and the program to be optimized.
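    The core loop of such an iterative optimization can be sketched as follows; the types and functions are hypothetical placeholders, not the actual interface of the tool described above. Candidate schedules are drawn from the space of legal transformations and evaluated either by benchmarking the transformed code or, once a surrogate model is available, by predicting its execution time.

    typedef struct Schedule Schedule;

    Schedule *search(int iterations,
                     Schedule *(*random_legal_schedule)(void),
                     double (*evaluate)(const Schedule *))   /* benchmark or surrogate */
    {
        Schedule *best = NULL;
        double best_time = 0.0;
        for (int k = 0; k < iterations; k++) {
            Schedule *cand = random_legal_schedule();
            double t = evaluate(cand);            /* measured or predicted time */
            if (best == NULL || t < best_time) {
                best = cand;
                best_time = t;
            }
        }
        return best;   /* in practice the winner is re-validated by real benchmarking */
    }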

    Quasi-particle Statistics and Braiding from Ground State Entanglement

    Get PDF
    Topologically ordered phases are gapped states, defined by the properties of their excitations when taken around one another. Here we demonstrate a method to extract the statistics and braiding of excitations, given just the set of ground-state wave functions on a torus. This is achieved by studying the Topological Entanglement Entropy (TEE) on partitioning the torus into two cylinders. In this setting, general considerations dictate that the TEE generally differs from that in trivial partitions and depends on the chosen ground state. Central to our scheme is the identification of ground states with minimum entanglement entropy, which reflect the quasi-particle excitations of the topological phase. The transformation of these states allows for a determination of the modular S and U matrices, which encode quasi-particle properties. We demonstrate our method by extracting the modular S matrix of an SU(2) spin-symmetric chiral spin liquid phase using a Monte Carlo scheme to calculate the TEE, and prove that the quasi-particles obey semionic statistics. This method offers a route to a nearly complete determination of the topological order in certain cases. Comment: revised for clarity; 17 pages, 9 figures, 1 table.
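    For orientation, the quantity being exploited is the standard universal constant term in the entanglement entropy of a region with boundary length L in a topologically ordered phase,

    S_A = \alpha L - \gamma, \qquad \gamma = \log \mathcal{D},

    where \mathcal{D} is the total quantum dimension; in the cylinder partition of the torus used here, the value of \gamma depends on the chosen ground state, which is what allows the minimum-entropy states and the modular matrices to be identified.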

    LQG for the Bewildered

    Full text link
    We present a pedagogical introduction to the notions underlying the connection formulation of General Relativity - Loop Quantum Gravity (LQG) - with an emphasis on the physical aspects of the framework. We begin by reviewing General Relativity and Quantum Field Theory, to emphasise the similarities between them which establish a foundation upon which to build a theory of quantum gravity. We then explain, in a concise and clear manner, the steps leading from the Einstein-Hilbert action for gravity to the construction of the quantum states of geometry, known as \emph{spin-networks}, which provide the basis for the kinematical Hilbert space of quantum general relativity. Along the way we introduce the various associated concepts of \emph{tetrads}, \emph{spin-connection} and \emph{holonomies}, which are a prerequisite for understanding the LQG formalism. Having provided a minimal introduction to the LQG framework, we discuss its applications to the problems of black hole entropy and of quantum cosmology. A list of the most common criticisms of LQG is presented, which are then tackled one by one in order to convince the reader of the physical viability of the theory. An extensive set of appendices provide accessible introductions to several key notions such as the \emph{Peter-Weyl theorem}, \emph{duality} of differential forms and \emph{Regge calculus}, among others. The presentation is aimed at graduate students and researchers who have some familiarity with the tools of quantum mechanics and field theory and/or General Relativity, but are intimidated by the seeming technical prowess required to browse through the existing LQG literature. Our hope is to make the formalism appear a little less bewildering to the uninitiated and to help lower the barrier to entry into the field. Comment: 87 pages, 15 figures, manuscript submitted for publication.
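    For reference, the classical starting point mentioned above is the Einstein-Hilbert action (with metric determinant g, Ricci scalar R, Newton's constant G, and the cosmological constant omitted):

    S_{EH} = \frac{1}{16\pi G} \int d^4x \, \sqrt{-g}\, R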

    A Comprehensive Methodology for Algorithm Characterization, Regularization and Mapping Into Optimal VLSI Arrays.

    Get PDF
    This dissertation provides a fairly comprehensive treatment of a broad class of algorithms as it pertains to systolic implementation. We describe some formal algorithmic transformations that can be utilized to map regular and some irregular compute-bound algorithms into best-fit, time-optimal systolic architectures. The resulting architectures can be one-dimensional, two-dimensional, three-dimensional or nonplanar. The methodology detailed in the dissertation employs, like other methods, the concept of the dependence vector to order, in space and time, the index points representing the algorithm. However, by differentiating between two types of dependence vectors, the ordering procedure is allowed to be flexible and time-optimal. Furthermore, unlike other methodologies, the approach reported here does not put constraints on the topology or dimensionality of the target architecture. The ordered index points are represented by nodes in a diagram called the Systolic Precedence Diagram (SPD). The SPD is a form of precedence graph that takes into account the systolic operation requirements of strictly local communications and regular data flow. Therefore, any algorithm with variable dependence vectors has to be transformed into a regular indexed set of computations with local dependencies. This can be done by replacing variable dependence vectors with sets of fixed dependence vectors. The SPD is transformed into an acyclic, labeled, directed graph called the Systolic Directed Graph (SDG). The SDG models the data flow as well as the timing for the execution of the given algorithm on a time-optimal array. The target architectures are obtained by projecting the SDG along defined directions. If more than one valid projection direction exists, different designs are obtained. The resulting architectures are then evaluated to determine whether an improvement in performance can be achieved by increasing PE fan-out. If so, the methodology provides the corresponding systolic implementation. By employing a new graph transformation, the SDG is manipulated so that it can be mapped onto fixed-size and fixed-depth multi-linear arrays. The latter is a new concept of systolic arrays that is adaptable to changes in the state of technology. It promises bounded clock skew, higher throughput and better performance than the linear implementation.
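    The space-time mapping that underlies the projection step can be illustrated with a toy two-dimensional example (our notation, not the dissertation's): projecting the index space along the direction (1, 0) assigns all points that differ only in i to the same PE, and a schedule vector (1, 1) gives the firing time of each node.

    typedef struct { int pe; int time; } Mapping;

    /* Map index point (i, j) of the dependence graph to a processor and a
     * time step: space by projection along (1, 0), time by the schedule
     * vector (1, 1).  A different valid projection direction yields a
     * different array design, as described above. */
    Mapping map_index_point(int i, int j)
    {
        Mapping m;
        m.pe   = j;
        m.time = i + j;
        return m;
    }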