96 research outputs found

    Run-time parallelization and scheduling of loops

    Get PDF
    Run time methods are studied to automatically parallelize and schedule iterations of a do loop in certain cases, where compile-time information is inadequate. The methods presented involve execution time preprocessing of the loop. At compile-time, these methods set up the framework for performing a loop dependency analysis. At run time, wave fronts of concurrently executable loop iterations are identified. Using this wavefront information, loop iterations are reordered for increased parallelism. Symbolic transformation rules are used to produce: inspector procedures that perform execution time preprocessing and executors or transformed versions of source code loop structures. These transformed loop structures carry out the calculations planned in the inspector procedures. Performance results are presented from experiments conducted on the Encore Multimax. These results illustrate that run time reordering of loop indices can have a significant impact on performance. Furthermore, the overheads associated with this type of reordering are amortized when the loop is executed several times with the same dependency structure

    Run-time parallelization and scheduling of loops

    Get PDF
    The class of problems that can be effectively compiled by parallelizing compilers is discussed. This is accomplished with the doconsider construct which would allow these compilers to parallelize many problems in which substantial loop-level parallelism is available but cannot be detected by standard compile-time analysis. We describe and experimentally analyze mechanisms used to parallelize the work required for these types of loops. In each of these methods, a new loop structure is produced by modifying the loop to be parallelized. We also present the rules by which these loop transformations may be automated in order that they be included in language compilers. The main application area of the research involves problems in scientific computations and engineering. The workload used in our experiment includes a mixture of real problems as well as synthetically generated inputs. From our extensive tests on the Encore Multimax/320, we have reached the conclusion that for the types of workloads we have investigated, self-execution almost always performs better than pre-scheduling. Further, the improvement in performance that accrues as a result of global topological sorting of indices as opposed to the less expensive local sorting, is not very significant in the case of self-execution

    Efficient Generation of Parallel Spin-images Using Dynamic Loop Scheduling

    Get PDF
    High performance computing (HPC) systems underwent a significant increase in their processing capabilities. Modern HPC systems combine large numbers of homogeneous and heterogeneous computing resources. Scalability is, therefore, an essential aspect of scientific applications to efficiently exploit the massive parallelism of modern HPC systems. This work introduces an efficient version of the parallel spin-image algorithm (PSIA), called EPSIA. The PSIA is a parallel version of the spin-image algorithm (SIA). The (P)SIA is used in various domains, such as 3D object recognition, categorization, and 3D face recognition. EPSIA refers to the extended version of the PSIA that integrates various well-known dynamic loop scheduling (DLS) techniques. The present work: (1) Proposes EPSIA, a novel flexible version of PSIA; (2) Showcases the benefits of applying DLS techniques for optimizing the performance of the PSIA; (3) Assesses the performance of the proposed EPSIA by conducting several scalability experiments. The performance results are promising and show that using well-known DLS techniques, the performance of the EPSIA outperforms the performance of the PSIA by a factor of 1.2 and 2 for homogeneous and heterogeneous computing resources, respectively

    Enhancing the performance of Decoupled Software Pipeline through Backward Slicing

    Get PDF
    The rapidly increasing number of cores available in multicore processors does not necessarily lead directly to a commensurate increase in performance: programs written in conventional languages, such as C, need careful restructuring, preferably automatically, before the benefits can be observed in improved run-times. Even then, much depends upon the intrinsic capacity of the original program for concurrent execution. The subject of this paper is the performance gains from the combined effect of the complementary techniques of the Decoupled Software Pipeline (DSWP) and (backward) slicing. DSWP extracts threadlevel parallelism from the body of a loop by breaking it into stages which are then executed pipeline style: in effect cutting across the control chain. Slicing, on the other hand, cuts the program along the control chain, teasing out finer threads that depend on different variables (or locations). parts that depend on different variables. The main contribution of this paper is to demonstrate that the application of DSWP, followed by slicing offers notable improvements over DSWP alone, especially when there is a loop-carried dependence that prevents the application of the simpler DOALL optimization. Experimental results show an improvement of a factor of ?1.6 for DSWP + slicing over DSWP alone and a factor of ?2.4 for DSWP + slicing over the original sequential code

    A Survey on Thread-Level Speculation Techniques

    Get PDF
    Producción CientíficaThread-Level Speculation (TLS) is a promising technique that allows the parallel execution of sequential code without relying on a prior, compile-time-dependence analysis. In this work, we introduce the technique, present a taxonomy of TLS solutions, and summarize and put into perspective the most relevant advances in this field.MICINN (Spain) and ERDF program of the European Union: HomProg-HetSys project (TIN2014-58876-P), CAPAP-H5 network (TIN2014-53522-REDT), and COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS)

    Tiling Optimization For Nested Loops On Gpus

    Get PDF
    Optimizing nested loops has been considered as an important topic and widely studied in parallel programming. With the development of GPU architectures, the performance of these computations can be significantly boosted with the massively parallel hardware. General matrix-matrix multiplication is a typical example where executing such an algorithm on GPUs outperforms the performance obtained on other multicore CPUs. However, achieving ideal performance on GPUs usually requires a lot of human effort to manage the massively parallel computation resources. Therefore, the efficient implementation of optimizing nested loops on GPUs became a popular topic in recent years. We present our work based on the tiling strategy in this dissertation to address three kinds of popular problems. Different kinds of computations bring in different latency issues where dependencies in the computation may result in insufficient parallelism and the performance of computations without dependencies may be degraded due to intensive memory accesses. In this thesis, we tackle the challenges for each kind of problem and believe that other computations performed in nested loops can also benefit from the presented techniques. We improve a parallel approximation algorithm for the problem of scheduling jobs on parallel identical machines to minimize makespan with a high-dimensional tiling method. The algorithm is designed and optimized for solving this kind of problem efficiently on GPUs. Because the algorithm is based on a higher-dimensional dynamic programming approach, where dimensionality refers to the number of variables in the dynamic programming equation characterizing the problem, the existing implementation suffers from the pain of dimensionality and cannot fully utilize GPU resources. We design a novel data-partitioning technique to accelerate the higher-dimensional dynamic programming component of the algorithm. Both the load imbalance and exceeding memory capacity issues are addressed in our GPU solution. We present performance results to demonstrate how our proposed design improves the GPU utilization and makes it possible to solve large higher-dimensional dynamic programming problems within the limited GPU memory. Experimental results show that the GPU implementation achieves up to 25X speedup compared to the best existing OpenMP implementation. In addition, we focus on optimizing wavefront parallelism on GPUs. Wavefront parallelism is a well-known technique for exploiting the concurrency of applications that execute nested loops with uniform data dependencies. Recent research on such applications, which range from sequence alignment tools to partial differential equation solvers, has used GPUs to benefit from the massively parallel computing resources. Wavefront parallelism faces the load imbalance issue because the parallelism is passing along the diagonal. The tiling method has been introduced as a popular solution to address this issue. However, the use of hyperplane tiles increases the cost of synchronization and leads to poor data locality. In this paper, we present a highly optimized implementation of the wavefront parallelism technique that harnesses the GPU architecture. A balanced workload and maximum resource utilization are achieved with an extremely low synchronization overhead. We design the kernel configuration to significantly reduce the minimum number of synchronizations required and also introduce an inter-block lock to minimize the overhead of each synchronization. We evaluate the performance of our proposed technique for four different applications: Sequence Alignment, Edit Distance, Summed-Area Table, and 2DSOR. The performance results demonstrate that our method achieves speedups of up to six times compared to the previous best-known hyperplane tiling-based GPU implementation. Finally, we extend the hyperplane tiling to high order 2D stencil computations. Unlike wavefront parallelism that has dependence in the spatial dimension, dependence remains only across two adjacent time steps along the temporal dimension in stencil computations. Even if the no-dependence property significantly increases the parallelism obtained in the spatial dimensions, full parallelism may not be efficient on GPUs. Due to the limited cache capacity owned by each streaming multiprocessor, full parallelism can be obtained on global memory only, which has high latency to access. Therefore, the tiling technique can be applied to improve the memory efficiency by caching the small tiled blocks. Because the widely studied tiling methods, like overlapped tiling and split tiling, have considerable computation overhead caused by load imbalance or extra operations, we propose a time skewed tiling method, which is designed upon the GPU architecture. We work around the serialized computation issue and coordinate the intra-tile parallelism and inter-tile parallelism to minimize the load imbalance caused by pipelined processing. Moreover, we address the high-order stencil computations in our development, which has not been comprehensively studied. The proposed method achieves up to 3.5X performance improvement when the stencil computation is performed on a Moore neighborhood pattern

    A Theoretical Approach Involving Recurrence Resolution, Dependence Cycle Statement Ordering and Subroutine Transformation for the Exploitation of Parallelism in Sequential Code.

    Get PDF
    To exploit parallelism in Fortran code, this dissertation consists of a study of the following three issues: (1) recurrence resolution in Do-loops for vector processing, (2) dependence cycle statement ordering in Do-loops for parallel processing, and (3) sub-routine parallelization. For recurrence resolution, the major findings include: (1) the node splitting algorithm cannot be used directly to break an essential antidependence link, of which the source variable that results in antidependence is itself the sink variable of another true dependence so a correction method is proposed, (2) a sink variable renaming technique is capable of breaking an antidependence and/or output-dependence link, (3) for recurrences formed by only true dependences, a dynamic dependence concept and the derived technique are powerful, and (4) by integrating related techniques, an algorithm for resolving a general multistatement recurrence is developed. The performance of a parallel loop is determined by the level of parallelism and the time delay due to interprocessor communication and synchronization. For a dependence cycle of a single parallel loop executed in a general synchronization mode, the parallelism exposed varies with the alignment of statements. Statements are reordered on the basis of execution-time of the loop as estimated at compile-time. An improved timing formula and a derived statement ordering algorithm are proposed. Further extension of this algorithm to multiple perfectly nested Do-loops with simple global dependence cycle is also presented. The subroutine is a potential source for parallel processing. Several problems must be solved for subroutine parallelization: (1) the precedence of parallel executions of subroutines, (2) identification of the optimum execution mode for each subroutine and (3) the restructuring of a serial program. A five-step approach to parallelize called subroutines for a calling subroutine is proposed: (1) computation of control dependence, (2) approximation of the global effects of subroutines, (3) analysis of data dependence, (4) identification of execution mode, and (5) restructuring of calling and called subroutines. Application of these five steps in a recursive manner to different levels of calling subroutines in a program addresses the parallelization of subroutines

    GPU-TLS: an efficient runtime for speculative loop parallelization on GPUs

    Get PDF
    Recently GPUs have risen as one important parallel platform for general purpose applications, both in HPC and cloud environments. Due to the special execution model, developing programs for GPUs is difficult even with the recent introduction of high-level languages like CUDA and OpenCL. To ease the programming efforts, some research has proposed automatically generating parallel GPU codes by complex compile-time techniques. However, this approach can only parallelize loops 100% free of inter-iteration dependencies (i.e., DOALL loops). To exploit runtime parallelism, which cannot be proven by static analysis, in this work, we propose GPU-TLS, a runtime system to speculatively parallelize possibly-parallel loops in sequential programs on GPUs. GPU-TLS parallelizes a possibly-parallel loop by chopping it into smaller sub-loops, each of which is executed in parallel by a GPU kernel, speculating that no inter-iteration dependencies exist. After dependency checking, the buffered writes of iterations without mis-speculations are copied to the master memory while iterations encountering mis-speculations are re-executed. GPU-TLS addresses several key problems of speculative loop parallelization on GPUs: (1) The larger mis-speculation rate caused by larger number of threads is reduced by three approaches: the loop chopping parallelization approach, the deferred memory update scheme and intra-warp value forwarding method. (2) The larger overhead of dependency checking is reduced by a hybrid scheme: eager intra-warp dependency checking combined with lazy inter-warp dependency checking. (3) The bottleneck of serial commit is alleviated by a parallel commit scheme, which allows different iterations to enter the commit phase out of order but still guarantees sequential semantics. Extensive evaluations using both microbenchmarks and reallife applications on two recent NVIDIA GPU cards show that speculative loop parallelization using GPU-TLS can achieve speedups ranging from 5 to 160 for sequential programs with possibly-parallel loops. © 2013 IEEE.published_or_final_versio

    Paralelização de laços doacross usando anotações de componentes e probabilidade de Loop-Carried

    Get PDF
    Orientadores: Guido Costa Souza de Araújo, Márcio Machado PereiraDissertação (mestrado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: A paralelização de laços é usada para se obter melhor desempenho em algoritmos intensivos, entretando, não são todos os laços que podem ser facilmente paralelizados. Os laços chamados de DOACROSS possuem dependências entre iterações, i.e. uma iteração calcula um dado que é usado por outra iteração futura. Este tipo de dependência é chamada de loop-carried e não pode ser paralelizada trivialmente porque a ordem de execução das iterações deve ser respeitada. Algumas técnicas podem ser usadas para paralelizar este tipo de laço, porém o programador deve entender como funciona o algoritmo e deve escolher quais instruções podem ser executadas em paralelo e quais instruções devem ser executadas sequencialmente. Estas componentes sequenciais e paralelas precisam ser separadas manualmente pelo programador e a comunicação entre as componentes deve ser incluída, a fim de respeitar as dependências entre componentes e as dependências entre iterações. Implementar essas técnicas é um trabalho laborioso que requer uma certa experiência do programador para separar as componentes e encontrar as dependências para implementar a comunicação entre as componentes/threads. Esta comunicação pode ser feita através de filas ou buffers, dependendo do algoritmo de paralelização escolhido. Uma das técnicas de paralelização é o algoritmo mais tradicional, chamado de DOACROSS que foi implementado no OpenMP 4.5 através da cláusula depend da diretiva ordered. Este pragma deve ser usado dentro da região de um laço paralelo do OpenMP a fim de separar as componentes que devem ser sequenciais. A comunicação e a sincronização são implementadas automaticamente utilizando a biblioteca de runtime do OpenMP. Este método remove do programador o trabalho de programação, entretando, ainda é necessário delimitar explicitamente as componentes sequenciais. Outro algoritmo de paralelização estudado foi o Batched DoAcross (BDX). Este algoritmo pode ser usado para reduzir o overhead da comunicação entre componentes, entretanto, a implementação deve ser feita manualmente pelo programador e requer que o programador separe as componentes sequenciais e paralelas, crie barreiras de sincronização para as componentes sequenciais, crie buffers para a comunicação entre componentes e crie variáveis compartilhadas para a comunicação entre as threads (dependências entre iterações). Nos experimentos, foi percebido que a escolha do algoritmo de paralelização depende de alguns fatores, i.e. a estrutura do algoritmo, a proporção das dependências entre iterações, o número de iterações do laço e o tamanho do laço. Foi criada então uma nova cláusula para o OpenMP que, quando usada juntamente com a diretiva ordered, consegue separar as componentes sequenciais e paralelas e implementar essas técnicas de forma automática. Esta cláusula, chamada de use, deve receber um parâmetro que especifica qual técnica o programador quer utilizar para paralelizar o laçoAbstract: Loop parallelization can be used to achieve better performance on intensive algorithms, however, not all loops can be easily parallelized. The called 'DOACROSS' loops have dependences between different iterations, i.e. some iteration computes a data which is used in a later iteration. This kind of dependence is called loop-carried dependence and cannot be simply parallelized because iterations execution order must be respected. Some techniques can be used to parallelize this kind of loop, however, the programmer must understand how the algorithm works and choose which instructions can be executed in parallel and which instructions need to be serialized. These serial and parallel components need to be manually separated by programmer and communication between components must be included to respect dependences inside loop body and between threads to respect loop-carried dependences. Implementing these techniques is a laborious work that requires a certain expertise from programmer to separate loop components and find dependences to implement communication between components/threads. This communication can be done by using a queue or a buffer, depending on the algorithm used to parallelize. One of these parallelization techniques is the traditional DOACROSS, which was implemented by using depend clause for the ordered directive in OpenMP 4.5. This OpenMP construct is used within OpenMP loop region to separate serial and parallel components, then, communication and synchronization are automatically implemented by OpenMP Runtime. This method removes most of the programming work from the programmer, however still requires to explicitly delimit serial region. Another studied parallelization technique is the Batched DoAcross (BDX). This algorithm can be used to reduce the communication overhead of synchronization between components, however, the implementation must be done manually by programmer, which requires for the programmer to separate serial and parallel components, create barriers to synchronization in serial components, create buffers for communication between components and create the shared variables for communication between threads (loop-carried dependences). In our experiments, we noticed that some factors must be taken for the choice of parallelization technique, i.e. algorithm structure, loop-carried ratio, number of loop iterations and loop size. We created a new OpenMP clause that, used together with the ordered directive, can separate these components and implement these techniques automatically. This clause, is called use, receive a parameter for specifying which parallelization technique the programmer want to be implementedMestradoCiência da ComputaçãoMestre em Ciência da Computaçã
    corecore