Search CORE

11 research outputs found

Effect inference for deterministic parallelism

Author: Faxén Karl-Filip
Publication venue: Swedish Institute of Computer Science
Publication date: 01/01/2008
Field of study

In this report we sketch a polymorphic type and effect inference system for ensuring deterministic execution of parallel programs containing shared mutable state. It differs from that of Gifford and Lucassen in being based on Hindley Milner polymorphism and in formalizing the operational semantics of parallel and sequential computation

RISE – Research Institutes of Sweden

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Swedish Institute of Computer Science Publications Database

A Hybrid Approach to Proving Memory Reference Monotonicity

Author: C.E. Oancea
H. Yu
J. Hoeflinger
L. Rauchwerger
L. Rauchwerger
M. Berry
M.W. Hall
P. Feautrier
P. Feautrier
S. Rus
T. Fahringer
W. Blume
W. Blume
W. Pugh
Y. Lin
Y. Lin
Y. Paek
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013
Field of study

Crossref

Logical Inference Techniques for Loop Parallelization

Author: Cosmin E. Oancea
Publication venue
Publication date
Field of study

This paper presents a fully automatic approach to loop parallelization that integrates the use of static and run-time analysis and thus overcomes many known difficulties such as nonlinear and indirect array indexing and complex control flow. Our hybrid analysis framework validates the parallelization transformation by verifying the independence of the loop’s memory references. To this end it represents array references using the USR (uniform set representation) language and expresses the independence condition as an equation, S = ∅, where S is a set expression representing array indexes. Using a language instead of an array-abstraction representation for S results in a smaller number of conservative approximations but exhibits a potentially-high runtime cost. To alleviate this cost we introduce a language translation F from the USR set-expression language to an equally rich language of predicates (F(S) ⇒ S = ∅). Loop parallelization is then validated using a novel logic inference algorithm that factorizes the obtained complex predicates (F(S)) into a sequence of sufficient-independence conditions that are evaluated first statically and, when needed, dynamically, in increasing order of their estimated complexities. We evaluate our automated solution on 26 benchmarks from PERFECT-CLUB and SPEC suites and show that our approach is effective in parallelizing large, complex loops and obtains much better full program speedups than the Intel and IBM Fortran compilers

CiteSeerX

Doctor of Philosophy

Author: Venkat Anand
Publication venue: University of Utah
Publication date: 01/01/2016
Field of study

dissertationSparse matrix codes are found in numerous applications ranging from iterative numerical solvers to graph analytics. Achieving high performance on these codes has however been a significant challenge, mainly due to array access indirection, for example, of the form A[B[i]]. Indirect accesses make precise dependence analysis impossible at compile-time, and hence prevent many parallelizing and locality optimizing transformations from being applied. The expert user relies on manually written libraries to tailor the sparse code and data representations best suited to the target architecture from a general sparse matrix representation. However libraries have limited composability, address very specific optimization strategies, and have to be rewritten as new architectures emerge. In this dissertation, we explore the use of the inspector/executor methodology to accomplish the code and data transformations to tailor high performance sparse matrix representations. We devise and embed abstractions for such inspector/executor transformations within a compiler framework so that they can be composed with a rich set of existing polyhedral compiler transformations to derive complex transformation sequences for high performance. We demonstrate the automatic generation of inspector/executor code, which orchestrates code and data transformations to derive high performance representations for the Sparse Matrix Vector Multiply kernel in particular. We also show how the same transformations may be integrated into sparse matrix and graph applications such as Sparse Matrix Matrix Multiply and Stochastic Gradient Descent, respectively. The specific constraints of these applications, such as problem size and dependence structure, necessitate unique sparse matrix representations that can be realized using our transformations. Computations such as Gauss Seidel, with loop carried dependences at the outer most loop necessitate different strategies for high performance. Specifically, we organize the computation into level sets or wavefronts of irregular size, such that iterations of a wavefront may be scheduled in parallel but different wavefronts have to be synchronized. We demonstrate automatic code generation of high performance inspectors that do explicit dependence testing and level set construction at runtime, as well as high performance executors, which are the actual parallelized computations. For the above sparse matrix applications, we automatically generate inspector/executor code comparable in performance to manually tuned libraries

The University of Utah: J. Willard Marriott Digital Library

Compiler and runtime techniques for bulk-synchronous programming models on CPU architectures

Author: Kim Hee-Seok
Publication venue
Publication date: 01/12/2015
Field of study

The rising pressure to simultaneously improve performance and reduce power consumption is driving more heterogeneity into all aspects of computing devices. However, wide adoption of specialized computing devices such as GPUs and Xeon Phis comes with a programming challenge. A carefully optimized program that is well matched to the target hardware can run many times faster and more energy efficiently than one that is not. Ideally, programmers should write their code using a single programming model, and the compiler would transform the program to run optimally on the target architecture. In practice, however, programmers have to expend great effort to translate performance enjoyed on one platform to another. As such, single-source code-based portability has gained substantial momentum and OpenCL, a bulk-synchronous programming language, has become a popular choice, among others, to fulfill the need for portability. The assumed computing model of these languages is inevitably loosely coupled with an underlying architecture, obligating a combined compiler and runtime to find an efficient execution mapping from the input program onto the architecture which best exploits the hardware for performance. In this dissertation, I argue and demonstrate that obtaining high performance from executing OpenCL programs on CPU is feasible. In order to achieve the goal, I present compiler and runtime techniques to execute OpenCL programs on CPU architectures. First, I propose a compiler technique in which the execution of fine-grained parallel threads, called work-items, is collectively analyzed to consider the impact of scheduling them with respect to data locality. By analyzing the memory addresses accessed in a kernel, the technique can make better decisions on how to schedule work-items to construct better memory access patterns, thereby improving performance. The approach achieves geomean speedups of 3.32x over AMD's and 1.71x over Intel's state-of-the-art implementations on Parboil and Rodinia benchmarks. Second, I propose a runtime that allows a compiler to deposit differently optimized kernels to mitigate the stress on the compiler in deriving the most optimal code. The runtime systematically deploys candidate kernels on a small portion of the actual data to determine which achieves the best performance for the hardware-data combination. It exploits the fact that OpenCL programs typically come with a large number of independent work-groups, a feature that amortizes the cost of profiling execution of a few work-items, while the overhead is further reduced by retaining the profiling execution result to constitute the final execution output. The proposed runtime performs with an average overhead of 3% compared to an ideal/oracular runtime in execution time

Illinois Digital Environment for Access to Learning and Scholarship Repository

Recommended from our members

Dynamic Trace Analysis with Zero-Suppressed BDDs

Author: Price Graham David
Publication venue: CU Scholar
Publication date: 01/01/2011
Field of study

Instruction level parallelism (ILP) limitations have forced processor manufacturers to develop multi-core platforms with the expectation that programs will be able to exploit thread level parallelism (TLP). Multi-core programming shifts the burden of locating additional performance away from computer hardware to the software developers, who often attempt high-level redesigns focused on exposing thread level parallelism, as well as explore aggressive optimizations for sequential codes. Precise dynamic analysis can provide useful guidance for program optimization efforts, including efforts to find and extract thread level parallelism. Unfortunately, finding regions of code amenable to further optimization efforts requires analyzing traces that can quickly grow in size. Analysis of large dynamic traces (e.g. one billion instructions or more) is often impractical for commodity hardware. An ideal representation for dynamic trace data would provide compression. However, decompressing large software traces, even if decompressed data is never permanently stored, would make many analysis impractical. A better solution would allow analysis of the compressed data, without a costly decompression step. Prior works have developed trace compressors that generate an analyzable representation, but often limit the precision or scope of analyses. Zero-suppressed binary decision diagram (ZDDs) exhibit many of the desired properties of an ideal trace representation. This thesis shows: (1) dynamic trace data may be represented by zero-suppressed binary decision diagrams (ZDDs); (2) ZDDs allow many analyses to scale; (3) encoding traces as ZDDs can be performed in a reasonable amount of time; and, (4) ZDD-based analyses, such as irrelevant instruction detection and potential coarse-grained thread level parallelism extraction, can reveal a number of performanc

CU Scholar Institutional Repository

Verifiable early-reply with C++

Author: Cook Stephen Wendell
Publication venue: Texas A&M University
Publication date: 17/09/2007
Field of study

Concurrent programming can improve performance. However, it comes with two drawbacks. First, concurrent programs can be more difficult to design and reason about than their sequential counterparts. Second, error conditions that do not exist in sequential programs, such as data race conditions and deadlock, can make concurrent programs more unreliable. To make concurrent programming simpler and more reliable, while still providing sufficient performance gains, we present a concurrency framework based on an existing concurrency initiation mechanism called Ã¢ÂÂEarly-ReplyÃ¢ÂÂ. Early-Reply is based on the idea that some functions can produce final return values long before they terminate. Concurrent execution begins when return value of a function is returned to the caller, allowing the rest of the work of the function to be done on an auxiliary thread. The simpler sequential programming model can be used by the caller, because the concurrency is initiated and hidden within the function body. Pike and Sridhar recognized Early-Reply as a way for sequential programs to get the benefits of concurrent execution. They also discussed using object-oriented programming to serialize access to data that needs synchronization. Our work expands on their approach and provides an actual C++ implementation of an Early-Reply based framework. Our framework simplifies concurrent programming for both users and implementers by allowing developers to use sequential reasoning, and by providing a minimal framework interface. Concurrent programming is made more reliable by combining the concurrency synchronization and initiation into one mechanism within the framework, which isolates where race conditions and deadlock can occur. Furthermore, this isolation facilitates the development of a simple set of coding guidelines that can be used by developers (through inspection) or static analysis tools (through verification) to eliminate race conditions and deadlocks. As a motivating example, we parallelize an instructional compiler that processes multiple input source files. For each input file; the parsing and semantic analysis execute on the calling thread, while the code optimization and object code generation execute on an auxiliary thread. Speedups of 1.5 to 1.7 were observed on a dual processor confirming that sufficient performance gains are possible

Texas A&M Repository

Autotuning for Automatic Parallelization on Heterogeneous Systems

Author: Pfaffe Philip
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2020
Field of study

KITopen

Techniques d'exploration architecturale de design à usage spécifique pour l'accélération de boucles

Author: Mbaye Mame Maria
Publication venue
Publication date: 01/08/2010
Field of study

RÉSUMÉ De nos jours, les industriels privilégient les architectures flexibles afin de réduire le temps et les coûts de conception d’un système. Les processeurs à usage spécifique (ASIP) fournissent beaucoup de flexibilité, tout en atteignant des performances élevées. Une tendance qui a de plus en plus de succès dans le processus de conception d’un système sur puce consiste à spécifier le comportement du système en langage évolué tel que le C, SystemC, etc. La spécification est ensuite utilisée durant le partitionement pour déterminer les composantes logicielles et matérielles du système. Avec la maturité des générateurs automatiques de ASIP, les concepteurs peuvent rajouter dans leurs boîtes à outils un nouveau type d’architecture, à savoir les ASIP, en sachant que ces derniers sont conçus à partir d’une spécification décrite en langage évolué. D’un autre côté, dans le monde matériel, et cela depuis très longtemps, les chercheurs ont vu l’avantage de baser le processus de conception sur un langage évolué. Cette recherche a abouti à l’avénement de générateurs automatiques de matériel sur le marché qui sont des outils d’aide à la conception comme CapatultC, Forte’s Cynthetizer, etc. Ainsi, avec tous ces outils basés sur le langage C, les concepteurs ont un choix de types de design élargi mais, d’un autre côté, les options de designs possibles explosent, ce qui peut allonger au lieu de réduire le temps de conception. C’est dans ce cadre que notre thèse doctorale s’inscrit, puisqu’elle présente des méthodologies d’exploration architecturale de design à usage spécifique pour l’accélération de boucles afin de réduire le temps de conception, entre autres. Cette thèse a débuté par l’exploration de designs de ASIP. Les boucles de traitement sont de bonnes candidates à l’accélération, si elles comportent de bonnes possibilités de parallélisme et si ces dernières sont bien exploitées. Le matériel est très efficace à profiter des possibilités de parallélisme au niveau instruction, donc, une méthode de conception a été proposée. Cette dernière extrait le parallélisme d’une boucle afin d’exécuter plus d’opérations concurrentes dans des instructions spécialisées. Notre méthode se base aussi sur l’optimisation des données dans l’architecture du processeur.---------- ABSTRACT Time to market is a very important concern in industry. That is why the industry always looks for new CAD tools that contribute to reducing design time. Application-specific instruction-set processors (ASIPs) provide flexibility and they allow reaching good performance if they are well designed. One trend that gains more and more success is C-based design that uses a high level language such as C, SystemC, etc. The C-based specification is used during the partitionning phase to determine the software and hardware components of the system. Since automatic processor generators are mature now, designers have a new type of tool they can rely on during architecture design. In the hardware world, high level synthesis was and is still a hot research topic. The advances in ESL lead to commercial high-level synthesis tools such as CapatultC, Forte’s Cynthetizer, etc. The designers have more tools in their box but they have more solutions to explore, thus their use can have a reverse effect since the design time can increase instead of being reduced. Our doctoral research tackles this issue by proposing new methodologies for design space exploration of application specific architecture for loop acceleration in order to reduce the design time while reaching some targeted performances. Our thesis starts with the exploration of ASIP design. We propose a method that targets loop acceleration with highly coupled specialized-instructions executing loop operations. Loops are good candidates for acceleration when the parallelism they offer is well exploited (if they have any parallelization opportunities). Hardware components such as specialized-instructions can leverage parallelization opportunities at low level. Thus, we propose to extract loop parallelization opportunities and to execute more concurrent operations in specialized-instructions. The main contribution of this method is a new approach to specialized-instruction (SI) design based on loop acceleration where loop optimization and transformation are done in SIs directly, instead of optimizing the software code. Another contribution is the design of tightly-coupled specialized-instructions associated with loops based on a 5-pattern representation

PolyPublie

Hybrid analysis of memory references and its application to automatic parallelization

Author: Rus Silvius Vasile
Publication venue
Publication date: 15/05/2009
Field of study

Executing sequential code in parallel on a multithreaded machine has been an elusive goal of the academic and industrial research communities for many years. It has recently become more important due to the widespread introduction of multicores in PCs. Automatic multithreading has not been achieved because classic, static compiler analysis was not powerful enough and program behavior was found to be, in many cases, input dependent. Speculative thread level parallelization was a welcome avenue for advancing parallelization coverage but its performance was not always optimal due to the sometimes unnecessary overhead of checking every dynamic memory reference. In this dissertation we introduce a novel analysis technique, Hybrid Analysis, which unifies static and dynamic memory reference techniques into a seamless compiler framework which extracts almost maximum available parallelism from scientific codes and incurs close to the minimum necessary run time overhead. We present how to extract maximum information from the quantities that could not be sufficiently analyzed through static compiler methods, and how to generate sufficient conditions which, when evaluated dynamically, can validate optimizations. Our techniques have been fully implemented in the Polaris compiler and resulted in whole program speedups on a large number of industry standard benchmark applications

Texas A&M Repository