Structured parallelism discovery with hybrid static-dynamic analysis and evaluation technique
Parallel computer architectures have dominated the computing landscape for the
past two decades, a trend that is only expected to continue and intensify, with increasing specialization and heterogeneity. This creates huge pressure across the software
stack to produce programming languages, libraries, frameworks and tools which will
efficiently exploit the capabilities of parallel computers, not only for new software, but
also for revitalizing existing sequential code. Automatic parallelization, despite decades of
research, has had limited success in transforming sequential software to take advantage
of efficient parallel execution. This thesis investigates three approaches that use commutativity analysis as the enabler for parallelization. This has the potential to overcome
limitations of traditional techniques.
We introduce the concept of liveness-based commutativity for sequential loops.
We examine the use of a practical analysis utilizing liveness-based commutativity in a
symbolic execution framework. Symbolic execution represents input values as groups
of constraints, consequently deriving the output as a function of the input and enabling
the identification of further program properties. We employ this feature to develop an
analysis and discern commutativity properties between loop iterations. We study the
application of this approach on loops taken from real-world programs in the OLDEN
and NAS Parallel Benchmark (NPB) suites, and identify its limitations and related
overheads.
Informed by these findings, we develop Dynamic Commutativity Analysis (DCA), a
new technique that leverages profiling information from program execution with specific
input sets. Using profiling information, we track liveness information and detect loop
commutativity by examining the code’s live-out values. We evaluate DCA against almost
1400 loops of the NPB suite, discovering 86% of them as parallelizable. Comparing
our results against dependence-based methods, we match the detection efficacy of two
dynamic approaches and outperform three static approaches. Additionally, DCA is
able to automatically detect parallelism in loops which iterate over Pointer-Linked
Data Structures (PLDSs), taken from a wide range of benchmarks used in the literature,
where all other techniques we considered failed. Parallelizing the discovered loops, our
methodology achieves an average speedup of 3.6× across NPB (and up to 55×) and up
to 36.9× for the PLDS-based loops on a 72-core host. We also demonstrate that our
methodology, despite relying on specific input values for profiling each program, is able
to correctly identify parallelism that is valid for all potential input sets.
Lastly, we develop a methodology to utilize liveness-based commutativity, as implemented in DCA, to detect latent loop parallelism in the shape of patterns. Our approach
applies a series of transformations which subsequently enable multiple applications
of DCA over the generated multi-loop code section and match its loop commutativity
outcomes against the expected criteria for each pattern. Applying our methodology on
sets of sequential loops, we are able to identify well-known parallel patterns (i.e., maps,
reductions and scans). This extends the scope of parallelism detection to loops, such
as those performing scan operations, which cannot be determined as parallelizable by
simply evaluating liveness-based commutativity conditions on their original form.
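The idea of judging loop commutativity from live-out values can be illustrated with a small sketch. This is not the DCA implementation: the permutation-based test, the dictionary-based program state, and the names `run_loop`, `looks_commutative`, `sum_body` and `scan_body` are all assumptions made for illustration.

```python
import random
from copy import deepcopy

def run_loop(body, state, order):
    """Run body(state, i) for each iteration index in `order` on a copy of state."""
    s = deepcopy(state)
    for i in order:
        body(s, i)
    return s

def looks_commutative(body, state, n_iters, live_out, trials=5, seed=0):
    """Compare live-out values of the original order against permuted orders.

    Only the variables named in `live_out` are compared, so differences in
    dead temporaries do not count against the loop.
    """
    rng = random.Random(seed)
    ref = run_loop(body, state, range(n_iters))
    orders = [list(reversed(range(n_iters)))]   # one deterministic permutation
    for _ in range(trials):
        order = list(range(n_iters))
        rng.shuffle(order)
        orders.append(order)
    for order in orders:
        out = run_loop(body, state, order)
        if any(out[k] != ref[k] for k in live_out):
            return False
    return True

def sum_body(s, i):
    # Reduction: the live-out `total` does not depend on iteration order.
    s["total"] += s["a"][i]
    s["tmp"] = s["a"][i]        # order-dependent, but assumed dead after the loop

def scan_body(s, i):
    # Prefix sum in its original form: the live-out `out` depends on order.
    s["acc"] += s["a"][i]
    s["out"][i] = s["acc"]
```

With `a = [1, 2, 3, 4]`, the reduction passes the check when only `total` is live-out, while the scan fails it, which matches the abstract's remark that scans are not detected from their original form.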
Advances in the Automatic Detection of Optimization Opportunities in Computer Programs
Massively parallel and heterogeneous systems, together with their APIs, have been used for various applications. To achieve high-performance software, the programmer should develop optimized algorithms that maximize the system's resource utilization. However, designing such algorithms is challenging and time-consuming. Therefore, optimizing compilers are developed to share the programmer's optimization burden. Developing effective optimizing compilers is an active area of research. Specifically, because loop nests are usually the hot spots in a program, their optimization has been the main subject of many optimization algorithms. This thesis aims to improve the scope and applicability of performance optimization algorithms used in the compiler optimization phase. In the first two chapters, we focus on the parts of programs with for-loop nests. We take advantage of the polyhedral model and scalar evolution to develop algorithms that can automatically discover new optimization opportunities in computer programs. Our functions operate at the intermediate representation level and are implemented as part of the LLVM infrastructure. In the final chapter, we improve the performance of the Fourier-Motzkin elimination method, which is an underlying algorithm in polyhedral theory.
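For readers unfamiliar with Fourier-Motzkin elimination, the following is a minimal textbook-style sketch of one elimination step. It is not the improved method from the thesis. The representation of each inequality as a `(coeffs, bound)` pair meaning `sum(coeffs[i] * x[i]) <= bound`, and the function name `fm_eliminate`, are assumptions.

```python
from fractions import Fraction

def fm_eliminate(rows, k):
    """Eliminate variable k from a system of inequalities a . x <= b.

    Each row is (coeffs, bound). Returns rows in which the coefficient of
    x[k] is zero and which describe the projection of the original system.
    """
    pos, neg, rest = [], [], []
    for coeffs, bound in rows:
        a = [Fraction(c) for c in coeffs]
        b = Fraction(bound)
        if a[k] > 0:
            pos.append((a, b))      # gives an upper bound on x[k]
        elif a[k] < 0:
            neg.append((a, b))      # gives a lower bound on x[k]
        else:
            rest.append((a, b))     # does not mention x[k]
    out = list(rest)
    # Pair every upper bound with every lower bound; scaling both rows so
    # the x[k] coefficients are +1 and -1 makes them cancel when added.
    for ap, bp in pos:
        for an, bn in neg:
            sp = 1 / ap[k]
            sn = 1 / (-an[k])
            a = [sp * x + sn * y for x, y in zip(ap, an)]
            b = sp * bp + sn * bn
            out.append((a, b))
    return out
```

For example, eliminating `x` from `x - y <= 0`, `-x <= -1`, `y <= 5` leaves `y <= 5` and `-y <= -1`. The number of generated rows is the product of the upper- and lower-bound counts, which is why the cost of this step matters in practice.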
Translation of Array-based Loop Programs to Optimized SQL-based Distributed Programs
Many data analysis programs are expressed in terms of array operations in sequential loops. However, these programs do not scale well to large amounts of data that cannot fit in the memory of a single computer, and they have to be rewritten to work on Big Data analysis platforms, such as Map-Reduce and Spark. We present a novel framework, called SQLgen, that automatically translates sequential loops on arrays to distributed data-parallel programs, specifically Spark SQL programs. We further extend this framework by introducing OSQLgen, which automatically parallelizes array-based loop programs to distributed data-parallel programs on block arrays. Our framework first translates the sequential loops on arrays to monoid comprehensions and then to Spark SQL. For SQLgen, the SQL is over coordinate arrays, while for OSQLgen it is over block arrays. As block arrays are more compact than coordinate arrays, computations on block matrices are significantly faster than on arrays in the coordinate format. Since not all array-based loops can be translated to SQL on block arrays, we focus on certain patterns of loops that match an algebraic structure known as a semiring. Many linear algebra operations, such as the matrix multiplication required in many machine learning algorithms, as well as many graph programs that are equivalent to a semiring, can be translated to distributed data-parallel programs on block arrays using OSQLgen, thus giving us a substantial performance gain. Finally, to evaluate our framework, we compare the performance of OSQLgen with GraphX, GraphFrames, MLlib, and hand-written Spark SQL programs on coordinate and block arrays on various real-world problems.
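The semiring loop pattern mentioned above can be sketched as a sequential triple loop parameterized by its "add" and "multiply" operations. This only illustrates the kind of loop being matched. It does not show SQLgen or OSQLgen output, and the `Semiring` class and `matmul` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    add: Callable[[Any, Any], Any]   # associative, commutative, identity `zero`
    mul: Callable[[Any, Any], Any]   # distributes over `add`
    zero: Any

def matmul(A, B, sr):
    """C[i][j] = add over k of mul(A[i][k], B[k][j]), as a plain loop nest."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[sr.zero] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] = sr.add(C[i][j], sr.mul(A[i][k], B[k][j]))
    return C

INF = float("inf")
ARITH = Semiring(add=lambda x, y: x + y, mul=lambda x, y: x * y, zero=0)
MIN_PLUS = Semiring(add=min, mul=lambda x, y: x + y, zero=INF)
```

With `ARITH` the loop is ordinary matrix multiplication. With `MIN_PLUS` on a distance matrix, the same loop computes shortest paths of at most two edges, which is one way a graph program can share the structure of a linear algebra operation.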
A Unified Framework for Parallel Anisotropic Mesh Adaptation
Finite-element methods are a critical component of the design and analysis procedures of many (bio-)engineering applications. Mesh adaptation is one of the most crucial components since it discretizes the physics of the application at a relatively low cost to the solver. Highly scalable parallel mesh adaptation methods for High-Performance Computing (HPC) are essential to meet the ever-growing demand for higher fidelity simulations. Moreover, the continuous growth in the complexity of HPC systems requires a systematic approach to exploit their full potential. Anisotropic mesh adaptation captures features of the solution at multiple scales while minimizing the required number of elements. However, it also introduces new challenges on top of mesh generation. Also, the increased complexity of the targeted cases requires departing from traditional surface-constrained approaches and utilizing CAD (Computer-Aided Design) kernels. Alongside the functionality requirements is the need to take advantage of ubiquitous multi-core machines. More importantly, the parallel implementation needs to handle the ever-increasing complexity of the mesh adaptation code.
In this work, we develop a parallel mesh adaptation method that utilizes a metric-based approach for generating anisotropic meshes. Moreover, we enhance our method by interfacing with a CAD kernel, thus enabling its use on complex geometries. We evaluate our method both with fixed-resolution benchmarks and within a simulation pipeline, where the resolution of the discretization increases incrementally. With the Telescopic Approach for scalable mesh generation as a guide, we propose a parallel method at the node (multi-core) level for mesh adaptation that is expected to scale up efficiently to the upcoming exascale machines. To facilitate an effective implementation, we introduce an abstract layer between the application and the runtime system that enables the use of task-based parallelism for concurrent mesh operations. Our evaluation indicates results comparable to state-of-the-art methods for fixed-resolution meshes both in terms of performance and quality. The integration with an adaptive pipeline offers promising results for the capability of the proposed method to function as part of an adaptive simulation. Moreover, our abstract tasking layer allows the separation of different aspects of the implementation without any impact on the functionality of the method.
Runtime Dependence Computation and Execution of Loops on Heterogeneous Systems
GPUs have been used for parallel execution of DOALL loops. However, loops with indirect array references can potentially cause cross-iteration dependences, which are hard to detect using existing compilation techniques. Applications with such loops cannot easily use the GPU and hence do not benefit from the tremendous compute capabilities of GPUs. In this paper, we present an algorithm to compute at runtime the cross-iteration dependences in such loops. The algorithm uses both the CPU and the GPU to compute the dependences. Specifically, it uses the compute capabilities of the GPU to quickly collect the memory accesses performed by the iterations by executing the slice functions generated for the indirect array accesses. Using the dependence information, the loop iterations are levelized such that each level contains independent iterations which can be executed in parallel. Another interesting aspect of the proposed solution is that it pipelines the dependence computation of the future level with the actual computation of the current level to effectively utilize the resources available in the GPU. We use an NVIDIA Tesla C2070 to evaluate our implementation using benchmarks from the Polybench suite and some synthetic benchmarks. Our experiments show that the proposed technique can achieve an average speedup of 6.4x on loops with a reasonable number of cross-iteration dependences.
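Levelization can be sketched sequentially for a loop of the assumed form `a[writes[i]] = f(a[reads[i]])`. This is a CPU-only illustration of grouping iterations into dependence levels. It is not the paper's CPU/GPU algorithm, and the one-read, one-write loop shape and the name `levelize` are assumptions.

```python
from collections import defaultdict

def levelize(reads, writes):
    """Group iterations into levels of mutually independent iterations.

    Iteration i is assumed to read location reads[i] and write location
    writes[i]. An iteration is placed one level above the latest earlier
    iteration it conflicts with (read-after-write, write-after-write or
    write-after-read on the same location).
    """
    last_w = defaultdict(lambda: -1)   # highest level that wrote each location
    last_r = defaultdict(lambda: -1)   # highest level that read each location
    levels = []
    for r, w in zip(reads, writes):
        lvl = 1 + max(last_w[r], last_w[w], last_r[w])
        levels.append(lvl)
        last_w[w] = max(last_w[w], lvl)
        last_r[r] = max(last_r[r], lvl)
    groups = defaultdict(list)
    for i, lvl in enumerate(levels):
        groups[lvl].append(i)
    return [groups[lvl] for lvl in sorted(groups)]
```

For `reads = [4, 5, 0, 1]` and `writes = [0, 1, 2, 0]`, iterations 0 and 1 share the first level, iteration 2 must wait for iteration 0's write, and iteration 3 must wait for iteration 2's read of location 0.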
Tiling Optimization for Nested Loops on GPUs
Optimizing nested loops has long been an important and widely studied topic in parallel programming. With the development of GPU architectures, the performance of these computations can be significantly boosted with massively parallel hardware. General matrix-matrix multiplication is a typical example where executing such an algorithm on GPUs outperforms execution on multicore CPUs. However, achieving ideal performance on GPUs usually requires a lot of human effort to manage the massively parallel computation resources. Therefore, the efficient implementation of nested loops on GPUs has become a popular topic in recent years. In this dissertation, we present our work based on the tiling strategy to address three kinds of popular problems. Different kinds of computations bring different latency issues: dependencies in the computation may result in insufficient parallelism, and the performance of computations without dependencies may be degraded by intensive memory accesses. In this thesis, we tackle the challenges for each kind of problem and believe that other computations performed in nested loops can also benefit from the presented techniques.
We improve a parallel approximation algorithm for the problem of scheduling jobs on parallel identical machines to minimize makespan with a high-dimensional tiling method. The algorithm is designed and optimized for solving this kind of problem efficiently on GPUs. Because the algorithm is based on a higher-dimensional dynamic programming approach, where dimensionality refers to the number of variables in the dynamic programming equation characterizing the problem, the existing implementation suffers from the curse of dimensionality and cannot fully utilize GPU resources. We design a novel data-partitioning technique to accelerate the higher-dimensional dynamic programming component of the algorithm. Our GPU solution addresses both load imbalance and the risk of exceeding memory capacity. We present performance results to demonstrate how our proposed design improves GPU utilization and makes it possible to solve large higher-dimensional dynamic programming problems within the limited GPU memory. Experimental results show that the GPU implementation achieves up to 25X speedup compared to the best existing OpenMP implementation.
In addition, we focus on optimizing wavefront parallelism on GPUs. Wavefront parallelism is a well-known technique for exploiting the concurrency of applications that execute nested loops with uniform data dependencies. Recent research on such applications, which range from sequence alignment tools to partial differential equation solvers, has used GPUs to benefit from their massively parallel computing resources. Wavefront parallelism faces a load imbalance issue because the parallelism advances along the diagonal. The tiling method has been introduced as a popular solution to address this issue. However, the use of hyperplane tiles increases the cost of synchronization and leads to poor data locality. In this paper, we present a highly optimized implementation of the wavefront parallelism technique that harnesses the GPU architecture. A balanced workload and maximum resource utilization are achieved with an extremely low synchronization overhead. We design the kernel configuration to significantly reduce the minimum number of synchronizations required and also introduce an inter-block lock to minimize the overhead of each synchronization. We evaluate the performance of our proposed technique for four different applications: Sequence Alignment, Edit Distance, Summed-Area Table, and 2DSOR. The performance results demonstrate that our method achieves speedups of up to six times compared to the previous best-known hyperplane tiling-based GPU implementation.
Finally, we extend hyperplane tiling to high-order 2D stencil computations. Unlike wavefront parallelism, which has dependences in the spatial dimension, stencil computations have dependences only across two adjacent time steps along the temporal dimension. Even though the absence of spatial dependences significantly increases the parallelism obtained in the spatial dimensions, full parallelism may not be efficient on GPUs. Due to the limited cache capacity of each streaming multiprocessor, full parallelism can be obtained on global memory only, which has high access latency. Therefore, the tiling technique can be applied to improve memory efficiency by caching the small tiled blocks. Because the widely studied tiling methods, like overlapped tiling and split tiling, have considerable computation overhead caused by load imbalance or extra operations, we propose a time-skewed tiling method designed for the GPU architecture. We work around the serialized computation issue and coordinate the intra-tile parallelism and inter-tile parallelism to minimize the load imbalance caused by pipelined processing. Moreover, we address high-order stencil computations in our development, which have not been comprehensively studied. The proposed method achieves up to 3.5X performance improvement when the stencil computation is performed on a Moore neighborhood pattern.
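The dependence structure behind wavefront parallelism can be seen in a small sequential sketch of edit distance, one of the four applications named above, traversed by anti-diagonals. It shows only which cells are independent. It does not reproduce the dissertation's GPU tiling or synchronization scheme, and the function name is hypothetical.

```python
def edit_distance_wavefront(s, t):
    """Levenshtein distance computed one anti-diagonal at a time.

    Cell (i, j) depends on (i-1, j), (i, j-1) and (i-1, j-1), all of which
    lie on earlier anti-diagonals, so the cells of one anti-diagonal could
    be computed in parallel.
    """
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    # Anti-diagonal d holds the interior cells (i, j) with i + j == d.
    for d in range(2, n + m + 1):
        lo, hi = max(1, d - m), min(n, d - 1)
        for i in range(lo, hi + 1):      # independent cells: the wavefront
            j = d - i
            cost = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # substitution or match
    return D[n][m]
```

The inner loop's trip count grows and then shrinks as `d` advances, which is the load imbalance the abstract refers to.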
Refactoring GrPPI: Generic Refactoring for Generic Parallelism in C++
Funding: EU Horizon 2020 project, TeamPlay (https://www.teamplay-xh2020.eu), Grant Number 779882, UK EPSRC Discovery, grant number EP/P020631/1, and Madrid Regional Government, CABAHLA-CM (ConvergenciA Big dAta-Hpc: de Los sensores a las Aplicaciones) Grant Number S2018/TCS-4423.
The Generic Reusable Parallel Pattern Interface (GrPPI) is a very useful abstraction over different parallel pattern libraries, allowing the programmer to write generic patterned parallel code that can easily be compiled to different backends such as FastFlow, OpenMP, Intel TBB and C++ threads. However, rewriting legacy code to use GrPPI still involves code transformations that can be highly non-trivial, especially for programmers who are not experts in parallelism. This paper describes software refactorings to semi-automatically introduce instances of GrPPI patterns into sequential C++ code, as well as safety checking static analysis mechanisms which verify that introducing patterns into the code does not introduce concurrency-related bugs such as race conditions. We demonstrate the refactorings and safety-checking mechanisms on four simple benchmark applications, showing that we are able to obtain, with little effort, GrPPI-based parallel versions that accomplish good speedups (comparable to those of manually-produced parallel versions) using different pattern backends.
Publisher PDF. Peer reviewed.
High Performance Collision Cross Section (HPCCS): using HPC techniques to accelerate the calculation of the collision cross section
Advisor: Guido Costa Souza de Araújo. Dissertation (master's), Universidade Estadual de Campinas, Instituto de Computação.
Abstract: Ion Mobility coupled to Mass Spectrometry (IM-MS) has been used by research and analysis laboratories since 2003, when the first commercial instruments were introduced. It is used as a tool for molecular separation, as a chromatographic technique, and to obtain structural information about molecular ions. Interpreting the resulting data is still a challenge, as it depends on calculating the collision cross section (CCS) against a buffer gas. This work presents new software, High Performance Collision Cross Section (HPCCS), which is based on the trajectory method and performs CCS calculations using High Performance Computing techniques such as parallelization, vectorization and optimization. With HPCCS it is now possible to calculate the CCS efficiently, from small organic molecules to protein complexes with a larger number of atoms. Compared with the software currently in use (MOBCAL), the results show an average speedup of 78 times on a cluster node with 24 cores and 48 threads, using Simultaneous Multithreading (SMT).
Degree: Master's in Computer Science (Mestre em Ciência da Computação). Grants: 2012/24750-6, 2013/08293-7, 2016/04963-6 FAPES
Parallelization of DOACROSS loops using component annotations and loop-carried probability
Advisors: Guido Costa Souza de Araújo, Márcio Machado Pereira. Dissertation (master's), Universidade Estadual de Campinas, Instituto de Computação.
Abstract: Loop parallelization can be used to achieve better performance in intensive algorithms; however, not all loops can be easily parallelized. So-called DOACROSS loops have dependences between different iterations, i.e. one iteration computes a value that is used by a later iteration. This kind of dependence is called a loop-carried dependence, and such a loop cannot be trivially parallelized because the execution order of the iterations must be respected. Some techniques can be used to parallelize this kind of loop, but the programmer must understand how the algorithm works and choose which instructions can be executed in parallel and which must be executed sequentially. These sequential and parallel components need to be separated manually by the programmer, and communication between components must be added in order to respect both the dependences between components inside the loop body and the loop-carried dependences between threads.
Implementing these techniques is laborious work that requires a certain expertise from the programmer to separate the loop components and find the dependences needed to implement communication between components/threads. This communication can be done using queues or buffers, depending on the parallelization algorithm chosen. One of these techniques is the traditional DOACROSS algorithm, which was implemented in OpenMP 4.5 through the depend clause of the ordered directive. This construct must be used within an OpenMP parallel loop region to separate the components that must be sequential; communication and synchronization are then implemented automatically by the OpenMP runtime library. This method removes most of the programming work from the programmer, but it still requires the sequential components to be delimited explicitly.
Another parallelization technique studied is Batched DoAcross (BDX). This algorithm can be used to reduce the overhead of communication between components, but it must be implemented manually: the programmer has to separate the sequential and parallel components, create synchronization barriers for the sequential components, create buffers for communication between components, and create shared variables for communication between threads (loop-carried dependences). In our experiments, we observed that the choice of parallelization technique depends on several factors, namely the structure of the algorithm, the loop-carried ratio, the number of loop iterations and the loop size. We therefore created a new OpenMP clause that, when used together with the ordered directive, can separate the sequential and parallel components and implement these techniques automatically. This clause, called use, takes a parameter that specifies which technique the programmer wants to use to parallelize the loop.
Degree: Master's in Computer Science (Mestre em Ciência da Computação).