
    High-level compiler analysis for OpenMP

    Nowadays, applications from dissimilar domains, such as high-performance computing and high-integrity systems, require levels of performance that can only be achieved by means of sophisticated heterogeneous architectures. However, the complex nature of such architectures hinders the production of efficient code at acceptable levels of time and cost. Moreover, the need for exploiting parallelism adds complications of its own (e.g., deadlocks, race conditions). In this context, compiler analysis is fundamental for optimizing parallel programs. There is, however, a trade-off between complexity and profit: low-complexity analyses (e.g., reaching definitions) provide information that may be insufficient for many relevant transformations, and complex analyses based on mathematical representations (e.g., the polyhedral model) give accurate results at a high computational cost.

    A range of parallel programming models providing different levels of programmability, performance and portability enable the exploitation of current architectures. However, OpenMP has demonstrated many advantages over its competitors: 1) it delivers levels of performance comparable to highly tunable models such as CUDA and MPI, and better robustness than low-level libraries such as Pthreads; 2) the extensions included in the latest specification meet the characteristics of current heterogeneous architectures (i.e., the coupling of a host processor to one or more accelerators, and the capability of expressing fine-grained, highly dynamic task parallelism, both structured and unstructured); 3) OpenMP is widely implemented by several chip (e.g., Kalray MPPA, Intel) and compiler (e.g., GNU, Intel) vendors; and 4) although the model currently lacks resiliency and reliability mechanisms, many works, including this thesis, pursue their introduction into the specification.

    This thesis addresses the study of compiler analysis techniques for OpenMP with two main purposes: 1) to enhance the programmability and reliability of OpenMP, and 2) to prove OpenMP a suitable model for exploiting parallelism in safety-critical domains. In particular, the thesis focuses on the tasking model because it offers the flexibility to tackle the parallelization of algorithms with load imbalance, recursion, and kernels based on uncountable loops. Additionally, recent work has proved the time-predictability of this model, shortening the distance towards its adoption in safety-critical domains.

    To enable the analysis of applications using the OpenMP tasking model, the first contribution of this thesis is the extension of a set of classic compiler techniques with support for OpenMP. As a basis for including reliability mechanisms, the second contribution consists of the development of a series of algorithms to statically detect situations involving OpenMP tasks that may lead to a loss of performance, non-deterministic results, or run-time failures. A well-known compiler-related problem in parallel processing is the static scheduling of a program represented by a directed graph. Although the literature on static scheduling techniques is extensive, work on generating the task graph at compile time is very scant. Compilers are limited by the knowledge they can extract, which depends on the application and the programming model.
    The third contribution of this thesis is the generation of a predicated task dependency graph for OpenMP that can be interpreted by the runtime in such a way that the cost of solving dependences is reduced to a minimum. With the previous contributions as a basis for determining the functional safety of OpenMP, the final contribution of this thesis is the adaptation of OpenMP to the safety-critical domain along two directions: 1) indicating how OpenMP can be safely used in such a domain, and 2) integrating OpenMP into Ada, a language widely used in the safety-critical domain.
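    As a concrete illustration of the hazards targeted by the second contribution, consider a minimal, hypothetical C example (not taken from the thesis): without depend clauses, the two tasks below may run concurrently and race on x, which is exactly the kind of situation a static task analysis aims to flag.

        #include <stdio.h>

        int main(void) {
            int x = 0;
            #pragma omp parallel
            #pragma omp single
            {
                /* The depend clauses order the consumer after the producer;
                 * removing them introduces a data race on x. */
                #pragma omp task depend(out: x)   /* producer */
                x = 42;
                #pragma omp task depend(in: x)    /* consumer */
                printf("x = %d\n", x);
            }
            return 0;
        }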

    Programming Abstractions for Data Locality

    The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computation is the most expensive component; however, we are rapidly moving to an era in which computation is cheap and massively parallel while data movement dominates energy and performance costs. In order to respond to exascale systems (the next generation of high-performance computing systems), the scientific computing community needs to refactor its applications to align with the emerging data-centric paradigm. Our applications must evolve to express information about data locality. Unfortunately, current programming environments offer few ways to do so. They ignore the cost of communication and simply rely on hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume that all processing elements are equidistant from each other. In order to take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data-centric and allow developers to describe how to decompose and how to lay out data in memory. Fortunately, there are many emerging concepts, such as constructs for tiling, data layout, array views, task and thread affinity, and topology-aware communication libraries, for managing data locality. There is an opportunity to identify commonalities in strategy that enable us to combine the best of these concepts and develop a comprehensive approach to expressing and managing data locality on exascale programming systems. These programming-model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, which includes techniques that range from template libraries all the way to completely new languages, to achieve this goal.
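    To make one of these constructs concrete, the sketch below shows loop tiling in plain C (an illustrative example, not taken from the report): the tiled loop nest keeps each block of the arrays cache-resident so that data is reused before it is evicted.

        #include <stddef.h>

        #define N    1024
        #define TILE 64   /* tile edge chosen so a block fits in cache */

        /* Tiled matrix transpose: each TILE x TILE block is processed
         * while it is cache-resident, unlike the naive loop, which
         * strides through b by N elements on every iteration. */
        void transpose_tiled(const double a[N][N], double b[N][N])
        {
            for (size_t ii = 0; ii < N; ii += TILE)
                for (size_t jj = 0; jj < N; jj += TILE)
                    for (size_t i = ii; i < ii + TILE; ++i)
                        for (size_t j = jj; j < jj + TILE; ++j)
                            b[j][i] = a[i][j];
        }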

    Evaluation of the parallel computational capabilities of embedded platforms for critical systems

    Modern critical systems need higher performance, which cannot be delivered by the simple architectures used so far. The latest embedded architectures feature multi-cores and GPUs, which can be used to satisfy this need. In this thesis we parallelise relevant applications from multiple critical domains represented in the GPU4S benchmark suite, and perform a comparison of the parallel capabilities of candidate platforms for use in critical systems. In particular, we port the open-source GPU4S Bench benchmarking suite to the OpenMP programming model, and we benchmark the candidate embedded heterogeneous multi-core platforms of the H2020 UP2DATE project, the NVIDIA TX2, NVIDIA Xavier and Xilinx Zynq UltraScale+, in order to drive the selection of the research platform to be used in the next phases of the project. Our results indicate that, in terms of CPU and GPU performance, the NVIDIA Xavier is the highest-performing platform.
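    For context on what such a port involves, the sketch below shows how a simple GPU kernel can be expressed in OpenMP's accelerator model (a generic, hypothetical example; the kernel and names are not from GPU4S Bench): the target directive offloads the loop and the map clauses manage host-device data movement.

        /* Offload a vector addition to an accelerator using OpenMP. */
        void vector_add(const float *a, const float *b, float *c, int n)
        {
            #pragma omp target teams distribute parallel for \
                    map(to: a[0:n], b[0:n]) map(from: c[0:n])
            for (int i = 0; i < n; ++i)
                c[i] = a[i] + b[i];
        }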

    Directive-based Approach to Heterogeneous Computing

    The world of high-performance computing is undergoing major changes that notably increase its complexity. The inability of single-processor, and even multiprocessor, systems to sustain the growth in computing power needed by the scientific community has forced the emergence of massively parallel hardware architectures and of units specialized for particular operations. A good example of this kind of device is the GPU (Graphics Processing Unit). These devices, traditionally dedicated to graphics programming, have recently become an ideal platform for massively parallel computation. The combination of GPUs for compute-intensive tasks with multiprocessors for less intensive tasks with more complex control logic has become, in recent years, one of the most common platforms for low-cost scientific computing, since the compute power deployed can in many cases match that of small or medium-sized clusters, at notably lower initial and maintenance costs. Adding GPUs to clusters has also increased their capacity. However, the complexity of GPU programming, and of integrating GPU code with existing codes, greatly hinders the adoption of these technologies by less expert users. In this thesis we explore the use of directive-based programming models for these kinds of environments (multi-core, many-core, GPUs and clusters), where the average user's productivity is notably reduced by the difficulty of programming. To explore the best way to apply directives in these environments, we developed a set of highly flexible software tools (a compiler and a runtime) that allow diverse techniques to be explored with relatively little effort. The emergence of the OpenACC directive-based programming standard allowed us to demonstrate the capability of these tools by producing an experimental implementation of the standard (accULL) in very little time and with far from negligible performance. The computational results reported allow us to demonstrate: (a) the reduction in programming effort enabled by directive-based approaches, (b) the capability and flexibility of the tools designed during this thesis for exploring such approaches, and finally (c) the potential of accULL for future development as an experimental OpenACC tool, based on its current performance compared with other commercial approaches.
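    To illustrate the directive-based style that accULL implements, the sketch below is a generic OpenACC example (not code from the thesis): annotating the loop is essentially all that is required to offload it, with data movement handled by the copy clauses.

        /* Generic OpenACC SAXPY: the same source runs on CPU or GPU. */
        void saxpy(int n, float alpha, const float *x, float *y)
        {
            #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
            for (int i = 0; i < n; ++i)
                y[i] = alpha * x[i] + y[i];
        }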

    Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework

    The efficient parallel execution of scientific applications is a key challenge in high-performance computing (HPC). With growing parallelism and heterogeneity of compute resources as well as increasingly complex software, performance analysis has become an indispensable tool in the development and optimization of parallel programs. This thesis presents a framework for systematic performance analysis of scalable, heterogeneous applications. Based on event traces, it automatically detects the critical path and inefficiencies that result in waiting or idle time, e.g., due to load imbalances between parallel execution streams. As a prerequisite for the analysis of heterogeneous programs, this thesis specifies inefficiency patterns for computation offloading. Furthermore, an essential contribution was made to the development of tool interfaces for OpenACC and OpenMP, which enable portable data acquisition and subsequent analysis for programs with offload directives. At present, these interfaces are already part of the latest OpenACC and OpenMP API specifications. The aforementioned work, existing preliminary work, and established analysis methods are combined into a generic analysis process, which can be applied across programming models. Based on the detection of wait or idle states, which can propagate over several levels of parallelism, the analysis identifies wasted computing resources and their root cause as well as the critical-path share for each program region. Thus, it determines the influence of program regions on the load balancing between execution streams and the program runtime. The analysis results include a summary of the detected inefficiency patterns and a program trace, enhanced with information about wait states, their cause, and the critical path. In addition, a ranking, based on the amount of waiting time a program region caused on the critical path, highlights program regions that are relevant for program optimization. The scalability of the proposed performance analysis and its implementation is demonstrated using High-Performance Linpack (HPL), while the analysis results are validated with synthetic programs. A scientific application that uses MPI, OpenMP, and CUDA simultaneously is investigated in order to show the applicability of the analysis.
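    The OpenMP half of the tool interface mentioned above is OMPT, standardized in OpenMP 5.0. As a rough sketch using only the standard entry points (independent of the framework described in this thesis): the runtime looks up ompt_start_tool, calls the tool's initializer, and the tool registers callbacks for the events it wants to trace.

        #include <stdio.h>
        #include <omp-tools.h>

        /* Example event callback: invoked when the runtime starts a thread. */
        static void on_thread_begin(ompt_thread_t thread_type,
                                    ompt_data_t *thread_data)
        {
            fprintf(stderr, "thread of type %d started\n", (int)thread_type);
        }

        static int tool_initialize(ompt_function_lookup_t lookup,
                                   int initial_device_num,
                                   ompt_data_t *tool_data)
        {
            ompt_set_callback_t set_callback =
                (ompt_set_callback_t)lookup("ompt_set_callback");
            set_callback(ompt_callback_thread_begin,
                         (ompt_callback_t)&on_thread_begin);
            return 1; /* nonzero keeps the tool active */
        }

        static void tool_finalize(ompt_data_t *tool_data) {}

        /* The runtime looks up this symbol to activate a first-party tool. */
        ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                                  const char *runtime_version)
        {
            static ompt_start_tool_result_t result = {
                &tool_initialize, &tool_finalize, {0}
            };
            return &result;
        }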

    Enhancing productivity and performance portability of OpenCL applications on heterogeneous systems using runtime optimizations

    Initially driven by a strong need for increased computational performance in science and engineering, heterogeneous systems have become ubiquitous and they are getting increasingly complex. The single-processor era has been replaced with multi-core processors, which have quickly been surrounded by satellite devices aiming to increase the throughput of the entire system. These auxiliary devices, such as Graphics Processing Units, Field Programmable Gate Arrays or other specialized processors, have very different architectures. This puts an enormous strain on programming models and software developers to take full advantage of the computing power at hand. Because of this diversity, and because the flexibility and portability needed to optimize for each target individually are unachievable, heterogeneous systems typically remain vastly under-utilized. In this thesis, we explore two distinct ways to tackle this problem. Providing automated, non-intrusive methods in the form of compiler tools, and implementing efficient abstractions to automatically tune parameters for a restricted domain, are two complementary approaches investigated to better utilize compute resources in heterogeneous systems. First, we explore a fully automated compiler-based approach, where a runtime system analyzes the computation flow of an OpenCL application and optimizes it across multiple compute kernels. This method can be deployed on any existing application transparently and replaces significant software engineering effort spent tuning an application for a particular system. We show that this technique achieves speedups of up to 3x over unoptimized code and an average of 1.4x over manually optimized code for highly dynamic applications. Second, a library-based approach is designed to provide a high-level abstraction for complex problems in a specific domain: stencil computation. Using domain-specific techniques, the underlying framework optimizes the code aggressively. We show that even in a restricted domain, automatic tuning mechanisms and robust architectural abstraction are necessary to improve performance. Using the abstraction layer, we demonstrate strong scaling of various applications to multiple GPUs, with a speedup of up to 1.9x on two GPUs and 3.6x on four.
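    As a reference point for what such a stencil abstraction automates, the sketch below shows the underlying computation pattern in plain C (a generic 5-point Jacobi step, not code from the thesis); the framework's job is to tile this loop nest, tune its parameters, and partition it across devices.

        /* One Jacobi iteration of a 5-point stencil on an n x n grid
         * (row-major layout, halo of one cell on each border). */
        void jacobi_step(int n, const float *in, float *out)
        {
            for (int i = 1; i < n - 1; ++i)
                for (int j = 1; j < n - 1; ++j)
                    out[i * n + j] = 0.25f * (in[(i - 1) * n + j] +
                                              in[(i + 1) * n + j] +
                                              in[i * n + j - 1] +
                                              in[i * n + j + 1]);
        }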

    X10 for high-performance scientific computing

    High-performance computing is a key technology that enables large-scale physical simulation in modern science. While great advances have been made in methods and algorithms for scientific computing, the most commonly used programming models encourage a fragmented view of computation that maps poorly to the underlying computer architecture. Scientific applications typically manifest physical locality, which means that interactions between entities or events that are nearby in space or time are stronger than more distant interactions. Linear-scaling methods exploit physical locality by approximating distant interactions, to reduce computational complexity so that cost is proportional to system size. In these methods, the computation required for each portion of the system differs depending on that portion's contribution to the overall result. To support productive development, application programmers need programming models that cleanly map aspects of the physical system being simulated to the underlying computer architecture while also supporting the irregular workloads that arise from the fragmentation of a physical system. X10 is a new programming language for high-performance computing that uses the asynchronous partitioned global address space (APGAS) model, which combines explicit representation of locality with asynchronous task parallelism. This thesis argues that the X10 language is well suited to expressing the algorithmic properties of locality and irregular parallelism that are common to many methods for physical simulation. The work reported in this thesis was part of a co-design effort involving researchers at IBM and ANU in which two significant computational chemistry codes were developed in X10, with an aim to improve the expressiveness and performance of the language. The first is a Hartree–Fock electronic structure code, implemented using the novel Resolution of the Coulomb Operator approach. The second evaluates electrostatic interactions between point charges, using either the smooth particle mesh Ewald method or the fast multipole method, with the latter used to simulate ion interactions in a Fourier Transform Ion Cyclotron Resonance mass spectrometer. We compare the performance of both X10 applications to state-of-the-art software packages written in other languages. This thesis presents improvements to the X10 language and runtime libraries for managing and visualizing the data locality of parallel tasks, communication using active messages, and efficient implementation of distributed arrays. We evaluate these improvements in the context of computational chemistry application examples. This work demonstrates that X10 can achieve performance comparable to established programming languages when running on a single core. More importantly, X10 programs can achieve high parallel efficiency on a multithreaded architecture, given a divide-and-conquer pattern of parallel tasks and appropriate use of worker-local data. For distributed-memory architectures, X10 supports the use of active messages to construct local, asynchronous communication patterns which outperform global, synchronous patterns. Although point-to-point active messages may be implemented efficiently, productive application development also requires collective communications; more work is required to integrate both forms of communication in the X10 language. The exploitation of locality is the key insight in both linear-scaling methods and the APGAS programming model; their combination represents an attractive opportunity for future co-design efforts.
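    The divide-and-conquer task pattern credited above for X10's multithreaded efficiency is language-agnostic; as a rough analogue (sketched here in C with OpenMP tasks rather than X10's async/finish constructs, and not code from the thesis), a parallel reduction recurses into subranges, joins, and combines the partial results.

        /* Divide-and-conquer parallel sum: async-style recursion with a
         * join, analogous to X10's async/finish pattern. Call from inside
         * an enclosing "#pragma omp parallel" + "#pragma omp single". */
        static long sum(const long *a, int lo, int hi)
        {
            if (hi - lo < 1024) {            /* sequential cutoff */
                long s = 0;
                for (int i = lo; i < hi; ++i)
                    s += a[i];
                return s;
            }
            int mid = lo + (hi - lo) / 2;
            long left;
            #pragma omp task shared(left)    /* spawn left half */
            left = sum(a, lo, mid);
            long right = sum(a, mid, hi);    /* compute right half here */
            #pragma omp taskwait             /* join, like X10's finish */
            return left + right;
        }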