141 research outputs found

    Parallel computing 2011, ParCo 2011: book of abstracts

    Get PDF
    This book contains the abstracts of the presentations at the conference Parallel Computing 2011, 30 August - 2 September 2011, Ghent, Belgiu

    Structured parallelism discovery with hybrid static-dynamic analysis and evaluation technique

    Get PDF
    Parallel computer architectures have dominated the computing landscape for the past two decades; a trend that is only expected to continue and intensify, with increasing specialization and heterogeneity. This creates huge pressure across the software stack to produce programming languages, libraries, frameworks and tools which will efficiently exploit the capabilities of parallel computers, not only for new software, but also revitalizing existing sequential code. Automatic parallelization, despite decades of research, has had limited success in transforming sequential software to take advantage of efficient parallel execution. This thesis investigates three approaches that use commutativity analysis as the enabler for parallelization. This has the potential to overcome limitations of traditional techniques. We introduce the concept of liveness-based commutativity for sequential loops. We examine the use of a practical analysis utilizing liveness-based commutativity in a symbolic execution framework. Symbolic execution represents input values as groups of constraints, consequently deriving the output as a function of the input and enabling the identification of further program properties. We employ this feature to develop an analysis and discern commutativity properties between loop iterations. We study the application of this approach on loops taken from real-world programs in the OLDEN and NAS Parallel Benchmark (NPB) suites, and identify its limitations and related overheads. Informed by these findings, we develop Dynamic Commutativity Analysis (DCA), a new technique that leverages profiling information from program execution with specific input sets. Using profiling information, we track liveness information and detect loop commutativity by examining the code’s live-out values. We evaluate DCA against almost 1400 loops of the NPB suite, discovering 86% of them as parallelizable. Comparing our results against dependence-based methods, we match the detection efficacy of two dynamic and outperform three static approaches, respectively. Additionally, DCA is able to automatically detect parallelism in loops which iterate over Pointer-Linked Data Structures (PLDSs), taken from wide range of benchmarks used in the literature, where all other techniques we considered failed. Parallelizing the discovered loops, our methodology achieves an average speedup of 3.6× across NPB (and up to 55×) and up to 36.9× for the PLDS-based loops on a 72-core host. We also demonstrate that our methodology, despite relying on specific input values for profiling each program, is able to correctly identify parallelism that is valid for all potential input sets. Lastly, we develop a methodology to utilize liveness-based commutativity, as implemented in DCA, to detect latent loop parallelism in the shape of patterns. Our approach applies a series of transformations which subsequently enable multiple applications of DCA over the generated multi-loop code section and match its loop commutativity outcomes against the expected criteria for each pattern. Applying our methodology on sets of sequential loops, we are able to identify well-known parallel patterns (i.e., maps, reduction and scans). This extends the scope of parallelism detection to loops, such as those performing scan operations, which cannot be determined as parallelizable by simply evaluating liveness-based commutativity conditions on their original form

    Effective data parallel computing on multicore processors

    Get PDF
    The rise of chip multiprocessing or the integration of multiple general purpose processing cores on a single chip (multicores), has impacted all computing platforms including high performance, servers, desktops, mobile, and embedded processors. Programmers can no longer expect continued increases in software performance without developing parallel, memory hierarchy friendly software that can effectively exploit the chip level multiprocessing paradigm of multicores. The goal of this dissertation is to demonstrate a design process for data parallel problems that starts with a sequential algorithm and ends with a high performance implementation on a multicore platform. Our design process combines theoretical algorithm analysis with practical optimization techniques. Our target multicores are quad-core processors from Intel and the eight-SPE IBM Cell B.E. Target applications include Matrix Multiplications (MM), Finite Difference Time Domain (FDTD), LU Decomposition (LUD), and Power Flow Solver based on Gauss-Seidel (PFS-GS) algorithms. These applications are popular computation methods in science and engineering problems and are characterized by unit-stride (MM, LUD, and PFS-GS) or 2-point stencil (FDTD) memory access pattern. The main contributions of this dissertation include a cache- and space-efficient algorithm model, integrated data pre-fetching and caching strategies, and in-core optimization techniques. Our multicore efficient implementations of the above described applications outperform nai¨ve parallel implementations by at least 2x and scales well with problem size and with the number of processing cores

    SPOT: A DSL for Extending Fortran Programs with Metaprogramming

    Get PDF

    Veröffentlichungen und Vorträge 2009 der Mitglieder der Fakultät für Informatik

    Get PDF

    3rd Many-core Applications Research Community (MARC) Symposium. (KIT Scientific Reports ; 7598)

    Get PDF
    This manuscript includes recent scientific work regarding the Intel Single Chip Cloud computer and describes approaches for novel approaches for programming and run-time organization
    corecore