    Algorithm Libraries for Multi-Core Processors

    By providing parallelized versions of established algorithm libraries, we make it easier for programmers to exploit the multiple cores of modern processors. The Multi-Core STL provides basic algorithms for internal memory, while the parallelized STXXL enables multi-core acceleration for algorithms on large data sets stored on disk. Some parallelized geometric algorithms are introduced into CGAL. Further, we design and implement sorting algorithms for huge data sets in distributed external memory.
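
    To illustrate the drop-in style of use such libraries aim for, here is a minimal sketch assuming the GNU libstdc++ parallel mode, the descendant of the Multi-Core STL described above; the original MCSTL interface may differ in detail.

        // Sketch only: sorting a large in-memory array with the parallelized
        // drop-in replacement for std::sort. Compile with: g++ -fopenmp sort.cpp
        #include <parallel/algorithm>   // __gnu_parallel::sort
        #include <random>
        #include <vector>

        int main() {
            std::vector<unsigned> v(1u << 24);   // 16M elements in internal memory
            std::mt19937 gen(42);
            for (auto& x : v) x = gen();
            // Same interface as std::sort, but the work is split across
            // the available cores (via OpenMP) behind the scenes.
            __gnu_parallel::sort(v.begin(), v.end());
        }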

    On Extracting Coarse-Grained Function Parallelism from C Programs

    To efficiently utilize emerging heterogeneous multi-core architectures, it is essential to exploit the inherent coarse-grained parallelism in applications. In addition to data parallelism, applications like telecommunication, multimedia, and gaming can also benefit from the exploitation of coarse-grained function parallelism. To exploit coarse-grained function parallelism, the common wisdom is to rely on programmers to explicitly express the coarse-grained data-flow between coarse-grained functions using data-flow or streaming languages. This research explores another approach to exploiting coarse-grained function parallelism: relying on the compiler to extract coarse-grained data-flow from imperative programs. We believe imperative languages and the von Neumann programming model will remain dominant in the future. This dissertation discusses the design and implementation of a memory data-flow analysis system which extracts coarse-grained data-flow from C programs. The memory data-flow analysis system partitions a C program into a hierarchy of program regions. It then traverses the program region hierarchy from the bottom up, summarizing the exposed memory access patterns of each program region while deriving conservative producer-consumer relations between program regions. An ensuing top-down traversal of the program region hierarchy refines the producer-consumer relations by pruning spurious ones. We built an inlining-based prototype of the memory data-flow analysis system on top of the IMPACT compiler infrastructure and applied it to analyze the memory data-flow of several MediaBench programs. The experimental results showed that while the prototype performed reasonably well for the tested programs, the inlining-based implementation may not be efficient for larger programs. There is also still room to improve the effectiveness of the memory data-flow analysis system. We performed a root-cause analysis of the inaccuracies in the memory data-flow analysis results, which provided insights into how to improve the system in the future.
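
    As a hypothetical illustration (not taken from the dissertation), consider the kind of C fragment such a memory data-flow analysis targets: two coarse-grained functions communicating through memory, whose producer-consumer relation the compiler must discover from their summarized access patterns.

        #define N 4096

        static int frame[N];   /* written by decode(), read by filter() */
        static int out[N];

        /* Producer region: its summarized access pattern exposes a write
         * of frame[0..N-1]. */
        void decode(const unsigned char* bits) {
            for (int i = 0; i < N; ++i)
                frame[i] = bits[i % 1024] * 3;   /* stand-in for real decoding work */
        }

        /* Consumer region: reads frame[0..N-1], writes out[0..N-1]. */
        void filter(void) {
            for (int i = 0; i < N; ++i)
                out[i] = (frame[i] + (i ? frame[i - 1] : 0)) / 2;
        }

        void process_stream(const unsigned char* bits) {
            /* Bottom-up summarization of the two call regions yields the
             * conservative data-flow edge decode -> filter over frame[];
             * the top-down pass would prune any spurious reverse edge,
             * exposing the calls as a two-stage pipeline over a stream. */
            decode(bits);
            filter();
        }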

    Compilation techniques for irregular problems on parallel machines

    Massively parallel computers have ushered in the era of teraflop computing. Even though large and powerful machines are being built, they are used by only a fraction of the computing community. The fundamental reason for this situation is that parallel machines are difficult to program. Development of compilers that automatically parallelize programs will greatly increase the use of these machines. A large class of scientific problems can be categorized as irregular computations. In this class of computation, the data access patterns are known only at runtime, creating significant difficulties for a parallelizing compiler to generate efficient parallel code. Some compilers with very limited abilities to parallelize simple irregular computations exist, but the methods used by these compilers fail for any non-trivial application code. This research presents the development of compiler transformation techniques that can be used to effectively parallelize an important class of irregular programs. A central aim of these transformation techniques is to generate code that aggressively prefetches data. Program slicing methods are used as part of the code generation process. In this approach, a program written in a data-parallel language, such as HPF, is transformed so that it can be executed on a distributed-memory machine. An efficient compiler runtime support system has been developed that performs data movement and software caching.
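
    A common shape for such runtime support is the inspector-executor pattern. The sketch below is an illustration under assumed names, not the thesis's actual interface: the inspector scans the runtime index array once, and the gathered copy stands in for the data movement and software caching a distributed-memory runtime would perform.

        #include <cstddef>
        #include <vector>

        // Irregular loop x[i] += y[idx[i]], where idx[] is known only at
        // run time (assumes idx.size() == x.size() for simplicity).
        void irregular_update(std::vector<double>& x,
                              const std::vector<double>& y,
                              const std::vector<int>& idx) {
            // Inspector: discover which elements of y the loop will touch.
            // On a distributed-memory machine this identifies the
            // off-processor elements to fetch (a real inspector would also
            // deduplicate indices and translate them to local ones).
            std::vector<double> gathered(idx.size());
            for (std::size_t k = 0; k < idx.size(); ++k)
                gathered[k] = y[idx[k]];   // the "communication"/prefetch step

            // Executor: run the original loop against the prefetched copies,
            // so the compute phase never stalls waiting for remote data.
            for (std::size_t i = 0; i < x.size(); ++i)
                x[i] += gathered[i];
        }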

    Evaluating Speedup in Parallel Compilers

    Parallel programming is prevalent in every field, mainly to speed up computation. Advancements in multiprocessor technology fuel this trend toward parallel programming. However, modern compilers are still largely single-threaded and do not take advantage of the machine resources available to them. There has been a lot of work on compilers that add parallel constructs to the programs they compile, enabling those programs to exploit parallelism at run time; auto-parallelization of loops by a compiler is one such example. Researchers have done very little work toward parallelizing the compilation process itself. The research done here focuses on parallel compilers that target computation speedup by parallelizing the process of program compilation during the lexical analysis and semantic analysis phases. Parallelization brings with it issues like synchronization, concurrency, and communication overhead. In the semantic analysis phase, these issues are of particular relevance during the construction of the symbol table. Research on a concurrent compiler developed at the University of Toronto in 1991 proposed three techniques to address the generation of the symbol table [Seshadri91]. The goal here is to implement a parallel compiler using concepts from those techniques as references. The research done here will augment that earlier work and measure the performance speedup obtained.
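
    For intuition, here is one minimal way to make symbol-table construction safe for concurrent semantic-analysis threads: a sharded hash table with per-shard locks. This is an assumed design for illustration only, not one of the three techniques from [Seshadri91].

        #include <array>
        #include <cstddef>
        #include <mutex>
        #include <string>
        #include <unordered_map>

        struct SymbolInfo { std::string type; int scope_depth; };

        class ConcurrentSymbolTable {
            static constexpr std::size_t kShards = 16;
            struct Shard {
                std::mutex mu;
                std::unordered_map<std::string, SymbolInfo> table;
            };
            std::array<Shard, kShards> shards_;

            // Hash the name once to pick a shard, so unrelated symbols
            // rarely contend on the same lock.
            Shard& shard_for(const std::string& name) {
                return shards_[std::hash<std::string>{}(name) % kShards];
            }

        public:
            // Returns false if the symbol was already declared (a semantic error).
            bool declare(const std::string& name, SymbolInfo info) {
                Shard& s = shard_for(name);
                std::lock_guard<std::mutex> lock(s.mu);
                return s.table.emplace(name, std::move(info)).second;
            }

            // Returns nullptr if the symbol is unknown. The pointer stays
            // valid because unordered_map never invalidates references on
            // insert and this table never erases.
            const SymbolInfo* lookup(const std::string& name) {
                Shard& s = shard_for(name);
                std::lock_guard<std::mutex> lock(s.mu);
                auto it = s.table.find(name);
                return it == s.table.end() ? nullptr : &it->second;
            }
        };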

    Scalability Validation of Parallel Sorting Algorithms

    As the single-core performance of processors is no longer improving significantly, the computer industry is moving toward increasing the number of cores per processor or, in the case of large-scale computers, installing more processors per computer. Applications now need to scale in accordance with this increase in parallel computing power, and software developers need to take advantage of it. Parallel sorting algorithms are basic building blocks for many complex applications. In this thesis, we validate the expected execution time complexities of five state-of-the-art parallel sorting algorithms, implemented in C using MPI for parallelization, by using a scalability validation framework based on Score-P and Extra-P. For each of the parallel sorting algorithms, we create a performance model. These models allow us to compare their scalability behaviour to the expectations. Furthermore, we attempt to parallelize the local sorting step of the splitter-based parallel sorting algorithms via C++11 threads, OpenMP tasks, and CUDA acceleration. We construct the performance models, on which we base our evaluations, using uniformly randomly generated data. For most of the parallel sorting algorithms, we show that the created models match the expectations; we discuss any discrepancies in detail.
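
    A minimal sketch of the C++11-thread variant of that local sorting step, under the assumption that splitting the local array in two and merging is acceptable; the thesis implementations may partition the work differently.

        #include <algorithm>
        #include <thread>
        #include <vector>

        // Sort the locally held data with two threads: sort the halves
        // concurrently, then merge the two sorted runs in place.
        void local_sort_two_threads(std::vector<int>& data) {
            auto mid = data.begin() + data.size() / 2;

            // A helper thread sorts the first half while this thread
            // sorts the second half.
            std::thread helper([&] { std::sort(data.begin(), mid); });
            std::sort(mid, data.end());
            helper.join();

            std::inplace_merge(data.begin(), mid, data.end());
        }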