74 research outputs found

    Automatic skeleton-driven performance optimizations for transactional memory

    Get PDF
    The recent shift toward multi -core chips has pushed the burden of extracting performance to the programmer. In fact, programmers now have to be able to uncover more coarse -grain parallelism with every new generation of processors, or the performance of their applications will remain roughly the same or even degrade. Unfortunately, parallel programming is still hard and error prone. This has driven the development of many new parallel programming models that aim to make this process efficient.This thesis first combines the skeleton -based and transactional memory programming models in a new framework, called OpenSkel, in order to improve performance and programmability of parallel applications. This framework provides a single skeleton that allows the implementation of transactional worklist applications. Skeleton or pattern-based programming allows parallel programs to be expressed as specialized instances of generic communication and computation patterns. This leaves the programmer with only the implementation of the particular operations required to solve the problem at hand. Thus, this programming approach simplifies parallel programming by eliminating some of the major challenges of parallel programming, namely thread communication, scheduling and orchestration. However, the application programmer has still to correctly synchronize threads on data races. This commonly requires the use of locks to guarantee atomic access to shared data. In particular, lock programming is vulnerable to deadlocks and also limits coarse grain parallelism by blocking threads that could be potentially executed in parallel.Transactional Memory (TM) thus emerges as an attractive alternative model to simplify parallel programming by removing this burden of handling data races explicitly. This model allows programmers to write parallel code as transactions, which are then guaranteed by the runtime system to execute atomically and in isolation regardless of eventual data races. TM programming thus frees the application from deadlocks and enables the exploitation of coarse grain parallelism when transactions do not conflict very often. Nevertheless, thread management and orchestration are left for the application programmer. Fortunately, this can be naturally handled by a skeleton framework. This fact makes the combination of skeleton -based and transactional programming a natural step to improve programmability since these models complement each other. In fact, this combination releases the application programmer from dealing with thread management and data races, and also inherits the performance improvements of both models. In addition to it, a skeleton framework is also amenable to skeleton - driven iii performance optimizations that exploits the application pattern and system information.This thesis thus also presents a set of pattern- oriented optimizations that are automatically selected and applied in a significant subset of transactional memory applications that shares a common pattern called worklist. These optimizations exploit the knowledge about the worklist pattern and the TM nature of the applications to avoid transaction conflicts, to prefetch data, to reduce contention etc. Using a novel autotuning mechanism, OpenSkel dynamically selects the most suitable set of these patternoriented performance optimizations for each application and adjusts them accordingly. Experimental results on a subset of five applications from the STAMP benchmark suite show that the proposed autotuning mechanism can achieve performance improvements within 2 %, on average, of a static oracle for a 16 -core UMA (Uniform Memory Access) platform and surpasses it by 7% on average for a 32 -core NUMA (Non -Uniform Memory Access) platform.Finally, this thesis also investigates skeleton -driven system- oriented performance optimizations such as thread mapping and memory page allocation. In order to do it, the OpenSkel system and also the autotuning mechanism are extended to accommodate these optimizations. The conducted experimental results on a subset of five applications from the STAMP benchmark show that the OpenSkel framework with the extended autotuning mechanism driving both pattern and system- oriented optimizations can achieve performance improvements of up to 88 %, with an average of 46 %, over a baseline version for a 16 -core UMA platform and up to 162 %, with an average of 91 %, for a 32 -core NUMA platform

    Automatic skeleton-driven performance optimizations for transactional memory

    Get PDF
    The recent shift toward multi-core chips has pushed the burden of extracting performance to the programmer. In fact, programmers now have to be able to uncover more coarse-grain parallelism with every new generation of processors, or the performance of their applications will remain roughly the same or even degrade. Unfortunately, parallel programming is still hard and error prone. This has driven the development of many new parallel programming models that aim to make this process efficient. This thesis first combines the skeleton-based and transactional memory programming models in a new framework, called OpenSkel, in order to improve performance and programmability of parallel applications. This framework provides a single skeleton that allows the implementation of transactional worklist applications. Skeleton or pattern-based programming allows parallel programs to be expressed as specialized instances of generic communication and computation patterns. This leaves the programmer with only the implementation of the particular operations required to solve the problem at hand. Thus, this programming approach simplifies parallel programming by eliminating some of the major challenges of parallel programming, namely thread communication, scheduling and orchestration. However, the application programmer has still to correctly synchronize threads on data races. This commonly requires the use of locks to guarantee atomic access to shared data. In particular, lock programming is vulnerable to deadlocks and also limits coarse grain parallelism by blocking threads that could be potentially executed in parallel. Transactional Memory (TM) thus emerges as an attractive alternative model to simplify parallel programming by removing this burden of handling data races explicitly. This model allows programmers to write parallel code as transactions, which are then guaranteed by the runtime system to execute atomically and in isolation regardless of eventual data races. TM programming thus frees the application from deadlocks and enables the exploitation of coarse grain parallelism when transactions do not conflict very often. Nevertheless, thread management and orchestration are left for the application programmer. Fortunately, this can be naturally handled by a skeleton framework. This fact makes the combination of skeleton-based and transactional programming a natural step to improve programmability since these models complement each other. In fact, this combination releases the application programmer from dealing with thread management and data races, and also inherits the performance improvements of both models. In addition to it, a skeleton framework is also amenable to skeleton-driven performance optimizations that exploits the application pattern and system information. This thesis thus also presents a set of pattern-oriented optimizations that are automatically selected and applied in a significant subset of transactional memory applications that shares a common pattern called worklist. These optimizations exploit the knowledge about the worklist pattern and the TM nature of the applications to avoid transaction conflicts, to prefetch data, to reduce contention etc. Using a novel autotuning mechanism, OpenSkel dynamically selects the most suitable set of these pattern-oriented performance optimizations for each application and adjusts them accordingly. Experimental results on a subset of five applications from the STAMP benchmark suite show that the proposed autotuning mechanism can achieve performance improvements within 2%, on average, of a static oracle for a 16-core UMA (Uniform Memory Access) platform and surpasses it by 7% on average for a 32-core NUMA (Non-Uniform Memory Access) platform. Finally, this thesis also investigates skeleton-driven system-oriented performance optimizations such as thread mapping and memory page allocation. In order to do it, the OpenSkel system and also the autotuning mechanism are extended to accommodate these optimizations. The conducted experimental results on a subset of five applications from the STAMP benchmark show that the OpenSkel framework with the extended autotuning mechanism driving both pattern and system-oriented optimizations can achieve performance improvements of up to 88%, with an average of 46%, over a baseline version for a 16-core UMA platform and up to 162%, with an average of 91%, for a 32-core NUMA platform

    Polyhedral+Dataflow Graphs

    Get PDF
    This research presents an intermediate compiler representation that is designed for optimization, and emphasizes the temporary storage requirements and execution schedule of a given computation to guide optimization decisions. The representation is expressed as a dataflow graph that describes computational statements and data mappings within the polyhedral compilation model. The targeted applications include both the regular and irregular scientific domains. The intermediate representation can be integrated into existing compiler infrastructures. A specification language implemented as a domain specific language in C++ describes the graph components and the transformations that can be applied. The visual representation allows users to reason about optimizations. Graph variants can be translated into source code or other representation. The language, intermediate representation, and associated transformations have been applied to improve the performance of differential equation solvers, or sparse matrix operations, tensor decomposition, and structured multigrid methods

    Autotuning for Automatic Parallelization on Heterogeneous Systems

    Get PDF

    Aceleración de algoritmos de procesamiento de imágenes para el análisis de partículas individuales con microscopia electrónica

    Full text link
    Tesis Doctoral inédita cotutelada por la Masaryk University (República Checa) y la Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Fecha de Lectura: 24-10-2022Cryogenic Electron Microscopy (Cryo-EM) is a vital field in current structural biology. Unlike X-ray crystallography and Nuclear Magnetic Resonance, it can be used to analyze membrane proteins and other samples with overlapping spectral peaks. However, one of the significant limitations of Cryo-EM is the computational complexity. Modern electron microscopes can produce terabytes of data per single session, from which hundreds of thousands of particles must be extracted and processed to obtain a near-atomic resolution of the original sample. Many existing software solutions use high-Performance Computing (HPC) techniques to bring these computations to the realm of practical usability. The common approach to acceleration is parallelization of the processing, but in praxis, we face many complications, such as problem decomposition, data distribution, load scheduling, balancing, and synchronization. Utilization of various accelerators further complicates the situation, as heterogeneous hardware brings additional caveats, for example, limited portability, under-utilization due to synchronization, and sub-optimal code performance due to missing specialization. This dissertation, structured as a compendium of articles, aims to improve the algorithms used in Cryo-EM, esp. the SPA (Single Particle Analysis). We focus on the single-node performance optimizations, using the techniques either available or developed in the HPC field, such as heterogeneous computing or autotuning, which potentially needs the formulation of novel algorithms. The secondary goal of the dissertation is to identify the limitations of state-of-the-art HPC techniques. Since the Cryo-EM pipeline consists of multiple distinct steps targetting different types of data, there is no single bottleneck to be solved. As such, the presented articles show a holistic approach to performance optimization. First, we give details on the GPU acceleration of the specific programs. The achieved speedup is due to the higher performance of the GPU, adjustments of the original algorithm to it, and application of the novel algorithms. More specifically, we provide implementation details of programs for movie alignment, 2D classification, and 3D reconstruction that have been sped up by order of magnitude compared to their original multi-CPU implementation or sufficiently the be used on-the-fly. In addition to these three programs, multiple other programs from an actively used, open-source software package XMIPP have been accelerated and improved. Second, we discuss our contribution to HPC in the form of autotuning. Autotuning is the ability of software to adapt to a changing environment, i.e., input or executing hardware. Towards that goal, we present cuFFTAdvisor, a tool that proposes and, through autotuning, finds the best configuration of the cuFFT library for given constraints of input size and plan settings. We also introduce a benchmark set of ten autotunable kernels for important computational problems implemented in OpenCL or CUDA, together with the introduction of complex dynamic autotuning to the KTT tool. Third, we propose an image processing framework Umpalumpa, which combines a task-based runtime system, data-centric architecture, and dynamic autotuning. The proposed framework allows for writing complex workflows which automatically use available HW resources and adjust to different HW and data but at the same time are easy to maintainThe project that gave rise to these results received the support of a fellowship from the “la Caixa” Foundation (ID 100010434). The fellowship code is LCF/BQ/DI18/11660021. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 71367

    Empirical Installation of Linear Algebra Shared-Memory Subroutines for Auto-Tuning

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/s10766-013-0249-6The introduction of auto-tuning techniques in linear algebra shared-memory routines is analyzed. Information obtained in the installation of the routines is used at running time to take some decisions to reduce the total execution time. The study is carried out with routines at different levels (matrix multiplication, LU and Cholesky factorizations and linear systems symmetric or general routines) and with calls to routines in the LAPACK and PLASMA libraries with multithread implementations. Medium NUMA and large cc-NUMA systems are used in the experiments. This variety of routines, libraries and systems allows us to obtain general conclusions about the methodology to use for linear algebra shared-memory routines auto-tuning. Satisfactory execution times are obtained with the proposed methodology.Partially supported by Fundacion Seneca, Consejeria de Educacion de la Region de Murcia, 08763/PI/08, PROMETEO/2009/013 from Generalitat Valenciana, the Spanish Ministry of Education and Science through TIN2012-38341-C04-03, and the High-Performance Computing Network on Parallel Heterogeneus Architectures (CAPAP-H). The authors gratefully acknowledge the computer resources and assistance provided by the Supercomputing Centre of the Scientific Park Foundation of Murcia and by the Centre de Supercomputacio de Catalunya.Cámara, J.; Cuenca, J.; Giménez, D.; García, LP.; Vidal Maciá, AM. (2014). Empirical Installation of Linear Algebra Shared-Memory Subroutines for Auto-Tuning. International Journal of Parallel Programming. 42(3):408-434. https://doi.org/10.1007/s10766-013-0249-6S408434423Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J., Langou, J., Ltaief, H., Luszczek, P., Tomov, S.: Numerical linear algebra on emerging architectures: the PLASMA and MAGMA projects. J. Phys. Conf. Ser. 180(1), 1–5 (2009)Alberti, P., Alonso, P., Vidal, A.M., Cuenca, J., Giménez, D.: Designing polylibraries to speed up linear algebra computations. Int. J. High Perform. Comput. Netw. 1/2/3(1), 75–84 (2004)Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J.J., Du Croz, J., Grenbaum, A., Hammarling, S., McKenney, A., Ostrouchov, S., D. Sorensen, S.: LAPACK User’s Guide. Society for Industrial and Applied Mathematics, Philadelphia (1995)Bernabé, G., Cuenca, J., Giménez, D.: Optimization techniques for 3D-FWT on systems with manycore GPUs and multicore CPUs. In: ICCS (2013)Buttari, A., Langou, J., Kurzak, J., Dongarra, J.: A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Comput. 35(1), 38–53 (2009)Cámara, J., Cuenca, J., Giménez, D., Vidal. A.M.: Empirical autotuning of two-level parallel linear algebra routines on large cc-NUMA systems. In: ISPA (2012)Caron, E., Desprez, F., Suter, F.: Parallel extension of a dynamic performance forecasting tool. Scalable Comput. Pract. Exp. 6(1), 57–69 (2005)Chen, Z., Dongarra, J., Luszczek, P., Roche, K.: Self adapting software for numerical linear algebra and LAPACK for clusters. Parallel Comput. 29, 1723–1743 (2003)Cuenca, J., Giménez, D., González, J.: Achitecture of an automatic tuned linear algebra library. Parallel Comput. 30(2), 187–220 (2004)Cuenca, J., García, L.P., Giménez, D.: Improving linear algebra computation on NUMA platforms through auto-tuned nested parallelism. In: Proceedings of the 2012 EUROMICRO Conference on Parallel, Distributed and Network Processing (2012)Frigo, M.: FFTW: An adaptive software architecture for the FFT. In: Proceedings of the ICASSP Conference, vol. 3, p. 1381 (1998)Golub, G., Van Loan, C.F.: Matrix Computations, 3rd edn. The John Hopkins University Press, Baltimore (1996)Im, E.-J., Yelick, K., Vuduc, R.: Sparsity: optimization framework for sparse matrix kernels. Int. J. High Perform. Comput. Appl. (IJHPCA) 18(1), 135–158 (2004)Intel MKL web page.: http://software.intel.com/en-us/intel-mkl/Jerez, S., Montávez, J.-P., Giménez, D.: Optimizing the execution of a parallel meteorology simulation code. In: Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium. IEEE (2009)Katagiri, T., Kise, K., Honda, H., Yuba, T.: Fiber: a generalized framework for auto-tuning software. Springer LNCS 2858, 146–159 (2003)Katagiri, T., Kise, K., Honda, H., Yuba, T.: ABCLib-DRSSED: a parallel eigensolver with an auto-tuning facility. Parallel Comput. 32(3), 231–250 (2006)Kurzak, J., Tomov, S., Dongarra, J.: Autotuning gemm kernels for the FERMI GPU. IEEE Trans. Parallel Distrib. Syst. 23(11), 2045–2057 (2012)Lastovetsky, A.L., Reddy, R., Higgins, R.: Building the functional performance model of a processor. In: SAC, pp. 746–753 (2006)Li, J., Skjellum, A., Falgout, R.D.: A poly-algorithm for parallel dense matrix multiplication on two-dimensional process grid topologies. Concurrency Pract. Exp. 9(5), 345–389 (1997)Naono, K., Teranishi, K., Cavazos, J., Suda, R., (eds.): Software Automatic Tuning. From Concepts to State-of-the-Art Results. Springer, Berlin (2010)Nath, R., Tomov, S., Dongarra, J.: An improved MAGMA gemm for FERMI graphics processing units. IJHPCA 24(4), 511–515 (2010)Petitet, A., Blackford, L.S., Dongarra, J., Ellis, B., Fagg, G.E., Roche, K., Vadhiyar, S.S.: Numerical libraries and the grid. IJHPCA 15(4), 359–374 (2001)PLASMA.: http://icl.cs.utk.edu/plasma/Püschel, M., Moura, J.M.F., Singer, B., Xiong, J., Johnson, J.R., Padua, D.A., Veloso, M.M., Johnson, R.W.: Spiral: a generator for platform-adapted libraries of signal processing algorithms. IJHPCA 18(1), 21–45 (2004)Seshagiri, L., Wu, M.-S., Sosonkina, M., Zhang, Z., Gordon, M.S., Schmidt, M.W.: Enhancing adaptive middleware for quantum chemistry applications with a database framework. In: IPDPS Workshops, pp. 1–8 (2010)Tanaka, T., Katagiri, T., Yuba, T.: d-Spline based incremental parameter estimation in automatic performance tuning. In: PARA, pp. 986–995 (2006)Vuduc, R., Demmel, J., Bilmes, J.: Statistical models for automatic performance tuning. In: International Conference on Computational Science (1), pp. 117–126 (2001)Whaley, R.C., Petitet, A., Dongarra, J.: Automated empirical optimizations of software and the ATLAS project. Parallel Comput. 27(1–2), 3–35 (2001

    Learning Intermediate Representations using Graph Neural Networks for NUMA and Prefetchers Optimization

    Get PDF
    International audienceThere is a large space of NUMA and hardware prefetcher configurations that can significantly impact the performance of an application. Previous studies have demonstrated how a model can automatically select configurations based on the dynamic properties of the code to achieve speedups. This paper demonstrates how the static Intermediate Representation (IR) of the code can guide NUMA/prefetcher optimizations without the prohibitive cost of performance profiling. We propose a method to create a comprehensive dataset that includes a diverse set of intermediate representations along with optimum configurations. We then apply a graph neural network model in order to validate this dataset. We show that our static intermediate representation based model achieves 80% of the performance gains provided by expensive dynamic performance profiling based strategies. We further develop a hybrid model that uses both static and dynamic information. Our hybrid model achieves the same gains as the dynamic models but at a reduced cost by only profiling 30% of the programs

    Devito: Towards a generic Finite Difference DSL using Symbolic Python

    Full text link
    Domain specific languages (DSL) have been used in a variety of fields to express complex scientific problems in a concise manner and provide automated performance optimization for a range of computational architectures. As such DSLs provide a powerful mechanism to speed up scientific Python computation that goes beyond traditional vectorization and pre-compilation approaches, while allowing domain scientists to build applications within the comforts of the Python software ecosystem. In this paper we present Devito, a new finite difference DSL that provides optimized stencil computation from high-level problem specifications based on symbolic Python expressions. We demonstrate Devito's symbolic API and performance advantages over traditional Python acceleration methods before highlighting its use in the scientific context of seismic inversion problems.Comment: pyHPC 2016 conference submissio

    Programmer-transparent efficient parallelism with skeletons

    Get PDF
    Parallel and heterogeneous systems are ubiquitous. Unfortunately, both require significant complexity at the software level to the detriment of programmer productivity. To produce correct and efficient code programmers not only have to manage synchronisation and communication but also be aware of low-level hardware details. It is foresee able that the problem is becoming worse because systems are increasingly parallel and heterogeneous. Building on earlier work, this thesis further investigates the contribution which algorithmic skeletons can make towards solving this problem. Skeletons are high-level abstractions for typical parallel computations. They hide low-level hardware details from programmers and, in addition, encode information about the computations that they implement, which runtime systems and library developers can use for automatic optimisations. We present two novel case studies in this respect. First, we provide scheduling flexibility on heterogeneous CPU + GPU systems in a programmer transparent way similar to the freedom OS schedulers have on CPUs. Thanks to the high-level nature of skeletons we automatically switch between CPU and GPU implementations of kernels and use semantic information encoded in skeletons to find execution time points at which switches can occur. In more detail, kernel iteration spaces are processed in slices and migration is considered on a slice-by-slice basis. We show that slice sizes choices that introduce negligible overheads can be learned by predictive models. We show that in a simple deployment scenario mid-kernel migration achieves speedups of up to 1.30x and 1.08x on average. Our mechanism introduces negligible overheads of 2.34% if a kernel does not actually migrate. Second, we propose skeletons to simplify the programming of parallel hard real-time systems. We combine information encoded in task farms with real-time systems user code analysis to automatically choose thread counts and an optimisation parameter related to farm internal communication. Both parameters are chosen so that real-time deadlines are met with minimum resource usage. We show that our approach achieves 1.22x speedup over unoptimised code, selects the best parameter settings in 83% of cases, and never chooses parameters that cause deadline misses
    corecore