    Scalability and Performance Analysis of OpenMP Codes Using the Periscope Toolkit

    In this paper, we present two new approaches, together with the necessary extensions to Periscope, for scalability and performance analysis of OpenMP codes. Periscope is an online performance analysis toolkit consisting of a user-defined number of analysis agents that automatically search for performance properties while the application is running. To detect scalability and performance bottlenecks in OpenMP codes with Periscope, we formalize several newly defined performance properties and meta-properties. We demonstrate our implementation by evaluating the NAS OpenMP benchmarks. Our results show that the approach identifies code regions that do not scale well, as well as other performance problems such as load imbalance, in the NAS parallel benchmarks.
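As a rough illustration of the kind of quantities such properties are built on, the sketch below computes a load-imbalance ratio and a parallel efficiency from timings. The formulas are common textbook definitions and the function names and timing values are hypothetical; none of this is taken from Periscope's actual property definitions.

```python
from statistics import mean


def load_imbalance(thread_times):
    """Fraction of the region's wall time lost to waiting: (max - mean) / max."""
    longest = max(thread_times)
    if longest == 0:
        return 0.0
    return (longest - mean(thread_times)) / longest


def parallel_efficiency(t_serial, t_parallel, n_threads):
    """Speedup divided by thread count; 1.0 would mean perfect scaling."""
    return t_serial / (t_parallel * n_threads)


# Hypothetical per-thread times (seconds) for one parallel region:
# three threads finish early and wait for the fourth.
times = [1.0, 1.0, 1.0, 2.0]
imbalance = load_imbalance(times)                # (2.0 - 1.25) / 2.0
efficiency = parallel_efficiency(5.0, 2.0, 4)    # 5.0 / (2.0 * 4)
print(f"imbalance={imbalance:.3f} efficiency={efficiency:.3f}")
```

A region whose efficiency drops as the thread count grows is a candidate for "does not scale well", and a high imbalance ratio points at uneven work distribution as one possible cause.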

    Discovery of Potential Parallelism in Sequential Programs

    In the era of multicore processors, the responsibility for performance gains has been shifted onto software developers. Once improvements of the sequential algorithm have been exhausted, software-managed parallelism is the only option left. However, writing parallel code is still difficult, especially when parallelizing sequential code written by someone else. A key task in this process is the identification of suitable parallelization targets in the source code. Parallelism discovery tools help developers to find such targets automatically. Unfortunately, tools that identify parallelism during compilation are usually conservative due to the lack of runtime information, and tools relying on runtime information primarily suffer from high overhead in terms of both time and memory. This dissertation presents a generic framework for parallelism discovery based on dynamic program analysis, supporting various types of parallelism while incurring practically affordable overhead. The framework contains two main components: an efficient data-dependence profiler and a set of parallelism discovery algorithms based on a language-independent concept called Computational Unit. The data-dependence profiler serves as the foundation of the parallelism discovery framework. Traditional dependence profiling approaches introduce a tremendous amount of time and memory overhead. To lower the overhead, current methods limit their scope to the subset of the dependence information needed for the analysis they have been created for, sacrificing generality and discouraging reuse. In contrast, the profiler shown in this thesis addresses the problem via signature-based memory management and a lock-free parallel design. It produces detailed dependences not only for sequential but also for multi-threaded code without causing prohibitive overhead, allowing it to serve as a generic base for various program analysis techniques. 
Computational Units (CUs) provide a language-independent foundation for parallelism discovery. CUs are computations that follow the read-compute-write pattern. Unlike other concepts, they are not restricted to predefined language constructs. A program is represented as a CU graph, in which vertices are CUs and edges are data dependences. This allows parallelism to be detected that spreads across multiple language constructs, taking code refactoring into consideration. The parallelism discovery algorithms cover both loop and task parallelism. Results of our experiments show that 1) the efficient data-dependence profiler has a very competitive average slowdown of around 80× with accuracy higher than 99.6%; 2) the framework discovers parallelism with high accuracy, identifying 92.5% of the parallel loops in NAS benchmarks; 3) when parallelizing well-known open-source software following the outputs of the framework, reasonable speedups are obtained. Finally, use cases beyond parallelism discovery are briefly demonstrated to show the generality of the framework.
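To make the idea of signature-based dependence profiling concrete, here is a minimal sketch, assuming a signature is a fixed-size table indexed by a hash of the memory address that remembers the source line of the last access. The class name, the modulo hash, and the toy trace are all hypothetical; the profiler described in the thesis is lock-free and parallel, which this single-threaded sketch does not attempt to reproduce.

```python
class SignatureProfiler:
    """Toy data-dependence profiler with bounded memory (hypothetical design)."""

    def __init__(self, slots=1024):
        self.slots = slots
        self.last_write = [None] * slots  # source line of last write per slot
        self.last_read = [None] * slots   # source line of last read per slot
        self.deps = set()                 # (kind, source_line, sink_line)

    def _slot(self, addr):
        # Assumed hash: plain modulo. Distinct addresses may share a slot.
        return addr % self.slots

    def read(self, addr, line):
        s = self._slot(addr)
        if self.last_write[s] is not None:
            self.deps.add(("RAW", self.last_write[s], line))
        self.last_read[s] = line

    def write(self, addr, line):
        s = self._slot(addr)
        if self.last_write[s] is not None:
            self.deps.add(("WAW", self.last_write[s], line))
        if self.last_read[s] is not None:
            self.deps.add(("WAR", self.last_read[s], line))
        self.last_write[s] = line
        self.last_read[s] = None


def run_trace(profiler):
    # Trace of:  line 1: x = ...   line 2: y = x + 1   line 3: x = ...
    # with x at address 100 and y at address 104 (made-up addresses).
    profiler.write(100, 1)
    profiler.read(100, 2)
    profiler.write(104, 2)
    profiler.write(100, 3)
    return profiler.deps


exact = run_trace(SignatureProfiler(slots=1024))
small = run_trace(SignatureProfiler(slots=4))  # 100 and 104 collide here
print(sorted(exact))
print(sorted(small))
```

Memory use is fixed by `slots` regardless of how many addresses the program touches. The price is that colliding addresses can yield spurious or missed dependences, as the four-slot run shows, so the table size trades memory against accuracy.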

    Abstraction Raising in General-Purpose Compilers

    Investigating tools and techniques for improving software performance on multiprocessor computer systems

    The availability of modern commodity multicore processors and multiprocessor computer systems has resulted in the widespread adoption of parallel computers in a variety of environments, ranging from the home to workstation and server environments in particular. Unfortunately, parallel programming is harder and requires more expertise than the traditional sequential programming model. The variety of tools and parallel programming models available to the programmer further complicates the issue. The primary goal of this research was to identify and describe a selection of parallel programming tools and techniques to aid novice parallel programmers in the process of developing efficient parallel C/C++ programs for the Linux platform. This was achieved by highlighting and describing the key concepts and hardware factors that affect parallel programming, providing a brief survey of commonly available software development tools and parallel programming models and libraries, and presenting structured approaches to software performance tuning and parallel programming. Finally, the performance of several parallel programming models and libraries was investigated, along with the programming effort required to implement solutions using the respective models. A quantitative research methodology was applied to the investigation of the performance and programming effort associated with the selected parallel programming models and libraries, which included automatic parallelisation by the compiler, Boost Threads, Cilk Plus, OpenMP, POSIX threads (Pthreads), and Threading Building Blocks (TBB). Additionally, the performance of the GNU C/C++ and Intel C/C++ compilers was examined. The results revealed that the choice of parallel programming model or library is dependent on the type of problem being solved and that there is no overall best choice for all classes of problem. 
However, the results also indicate that parallel programming models with higher levels of abstraction require less programming effort and provide similar performance compared to explicit threading models. The principal conclusion was that problem analysis and parallel design are important factors in the selection of the parallel programming model and tools, but that models with higher levels of abstraction, such as OpenMP and Threading Building Blocks, are favoured.

    Compile-time support for thread-level speculation

    One of the main concerns of computer science is the study of the parallel capabilities of both programs and the processors that execute them. Several reasons make the development of techniques that automatically parallelize code highly desirable. Among them are the immense number of existing sequential programs, the complexity of parallel programming languages, and the knowledge required to parallelize code. However, the automatic parallelization mechanisms currently implemented in commercial compilers are unable to parallelize most of the loops in a code [1], owing to the data dependences among them [2]. It is therefore necessary to seek new techniques, such as speculative parallelization [3-5], that exploit the potential parallel capabilities of current hardware and multiprocessor architectures. However, this and other techniques require manual intervention by experienced programmers. Before offering alternative solutions, the parallelization capabilities of commercial compilers were evaluated, exposing the limitations of the automatic parallelization mechanisms they implement. The study reveals that these mechanisms achieve only a 19% speedup on average for the SPEC CPU2006 benchmarks [6], a result significantly lower than that obtained by speculative parallelization techniques [7]. Speculative parallelization, however, requires extensive manual modification of the code by programmers. This thesis addresses the problem by defining a new OpenMP clause [8], called "speculative", which makes it possible to mark the variables that may lead to a dependence violation. 
In addition, this thesis proposes a compile-time system that, using the information on variable accesses provided by the OpenMP clauses, automatically adds all the code needed to manage the speculative execution of a program. This frees the programmer from modifying the code manually, avoiding possible errors and a tedious task. The code generated by our system links against the speculative parallel execution library developed by Estebanez, García-Yagüez, Llanos and Gonzalez-Escribano [9,10]. Departamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos)

    MetaFork: A Compilation Framework for Concurrency Models Targeting Hardware Accelerators

    Parallel programming is gaining ground in various domains due to the tremendous computational power that it brings; however, it also requires a substantial code crafting effort to achieve performance improvement. Unfortunately, in most cases, performance tuning has to be accomplished manually by programmers. We argue that automated tuning is necessary due to the combination of the following factors. First, code optimization is machine-dependent. That is, an optimization preferred on one machine may not be suitable for another machine. Second, as the possible optimization search space increases, manually finding an optimized configuration becomes hard. Therefore, developing new compiler techniques for optimizing applications is of considerable interest. This thesis aims at generating new techniques that will help programmers develop efficient algorithms and code targeting hardware acceleration technologies in a more effective manner. Our work is organized around a compilation framework, called MetaFork, for concurrency platforms and its application to automatic parallelization. MetaFork is a high-level programming language extending C/C++, which combines several models of concurrency including fork-join, SIMD and pipelining parallelism. MetaFork is also a compilation framework which aims at facilitating the design and implementation of concurrent programs through four key features which make MetaFork unique and novel: (1) Perform automatic code translation between concurrency platforms targeting multi-core architectures. (2) Provide a high-level language for expressing concurrency as in the fork-join model, the SIMD paradigm and the pipelining parallelism. (3) Generate parallel code from serial code with an emphasis on code depending on machine or program parameters (e.g. cache size, number of processors, number of threads per thread block). (4) Optimize code depending on parameters that are unknown at compile-time.
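For readers unfamiliar with the fork-join model mentioned above, the following is a minimal sketch using Python's standard `concurrent.futures` module. It is not MetaFork syntax and says nothing about how MetaFork translates or optimizes code; the function name `chunked_sum` and the worker count are illustrative assumptions. The worker count stands in for a machine parameter that is only known at run time.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked_sum(data, n_workers):
    """Sum `data` by forking one task per chunk and joining the partial sums."""
    # Ceiling division so that at most n_workers chunks are created.
    chunk = max(1, -(-len(data) // n_workers))
    pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(sum, piece) for piece in pieces]  # fork
        return sum(f.result() for f in futures)                  # join


total = chunked_sum(list(range(1000)), n_workers=4)
print(total)
```

The chunk size is derived from `n_workers` rather than hard-coded, which is a small-scale version of code that depends on a machine parameter; in CPython the threads illustrate the structure rather than deliver a speedup for this workload.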