
    A Study of Energy and Locality Effects using Space-filling Curves

    The cost of energy is becoming an increasingly important driver for the operating cost of HPC systems, adding yet another facet to the challenge of producing efficient code. In this paper, we investigate the energy implications of trading computation for locality using Hilbert and Morton space-filling curves with dense matrix-matrix multiplication. The advantage of these curves is that they exhibit an inherent tiling effect without requiring specific architecture tuning. By accessing the matrices in the order determined by the space-filling curves, we can trade computation for locality. The index computation overhead of the Morton curve is found to be balanced against its locality and energy efficiency, while the overhead of the Hilbert curve outweighs its improvements on our test system.
    Comment: Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW)
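    As a rough illustration of the Morton-order idea above, the sketch below interleaves block coordinates into a Z-order key and visits the blocks of a matrix product along that curve; the function names, the row-major layout, and the blocking scheme are assumptions made for the example, not code from the paper.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Interleave the bits of x and y to obtain the Z-order (Morton) index of a block.
// Classic bit-spreading trick, valid for 16-bit block coordinates.
static std::uint32_t morton2d(std::uint16_t x, std::uint16_t y) {
    auto spread = [](std::uint32_t v) {
        v = (v | (v << 8)) & 0x00FF00FFu;
        v = (v | (v << 4)) & 0x0F0F0F0Fu;
        v = (v | (v << 2)) & 0x33333333u;
        v = (v | (v << 1)) & 0x55555555u;
        return v;
    };
    return spread(x) | (spread(y) << 1);
}

// C += A * B for n x n row-major matrices, visiting the b x b blocks of C in
// Morton order.  Blocks that are neighbours on the curve touch nearby memory,
// which gives the implicit tiling effect without architecture-specific tuning.
void morton_block_gemm(const std::vector<double>& A, const std::vector<double>& B,
                       std::vector<double>& C, std::size_t n, std::size_t b) {
    const std::size_t nb = n / b;                       // blocks per dimension (assumes b divides n)
    std::vector<std::pair<std::uint32_t, std::pair<std::size_t, std::size_t>>> order;
    for (std::size_t bi = 0; bi < nb; ++bi)
        for (std::size_t bj = 0; bj < nb; ++bj)
            order.push_back({morton2d(std::uint16_t(bi), std::uint16_t(bj)), {bi, bj}});
    std::sort(order.begin(), order.end());              // sort block indices along the curve

    for (const auto& e : order) {
        const std::size_t i0 = e.second.first * b, j0 = e.second.second * b;
        for (std::size_t k0 = 0; k0 < n; k0 += b)       // sweep the inner dimension blockwise
            for (std::size_t i = i0; i < i0 + b; ++i)
                for (std::size_t k = k0; k < k0 + b; ++k) {
                    const double a = A[i * n + k];
                    for (std::size_t j = j0; j < j0 + b; ++j)
                        C[i * n + j] += a * B[k * n + j];
                }
    }
}
```
    A Hilbert curve could drive the same traversal, only with a more expensive index computation, which is exactly the overhead-versus-locality trade-off the paper measures.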

    Parallel Multiscale Contact Dynamics for Rigid Non-spherical Bodies

    The simulation of large numbers of rigid bodies of non-analytical shapes or vastly varying sizes which collide with each other is computationally challenging. The fundamental problem is the identification of all contact points between all particles at every time step. In the Discrete Element Method (DEM), this is particularly difficult for particles of arbitrary geometry that exhibit sharp features (e.g. rock granulates). While most codes avoid non-spherical or non-analytical shapes due to the computational complexity, we introduce an iterative contact detection method for triangulated geometries. The new method is an improvement over a naive brute-force approach which checks all possible geometric constellations of contact and thus exhibits a lot of execution branching. Our iterative approach has limited branching and a high number of floating-point operations per processed byte. It is thus suitable for modern Single Instruction Multiple Data (SIMD) CPU hardware. As only the naive brute-force approach is robust and always yields a correct solution, we propose a hybrid solution that combines the best of both worlds to produce fast and robust contacts. In terms of the DEM workflow, we furthermore propose a multilevel tree-based data structure strategy that holds all particles in the domain on multiple scales in grids. Grids reduce the total computational complexity of the simulation. The data structure is combined with the DEM phases to form a single-touch tree-based traversal that identifies contact points between particle pairs and introduces concurrency to the system during particle comparisons in one multiscale grid sweep. Finally, a reluctant adaptivity variant is introduced which enables us to realise an improved time stepping scheme with larger time steps than standard adaptivity, while still minimising the grid administration overhead. Four different parallelisation strategies that exploit multicore architectures are discussed for the triad of methodological ingredients. Each parallelisation scheme exhibits unique behaviour depending on the grid and particle geometry at hand. The fusion of them into a task-based parallelisation workflow yields promising speedups. Our work shows that new computer architectures can push the boundary of DEM computability, but only if the right data structures and algorithms are chosen.
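    The broad-phase sketch below conveys the flavour of the grid-based candidate search described above, reduced to a single uniform grid over bounding spheres; the multiscale tree, the triangulated narrow phase and the hybrid brute-force fallback of the paper are not reproduced, and all names are illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Particles are modelled by bounding spheres and binned into a uniform grid whose
// cell width should match the largest particle diameter, so that any touching pair
// must lie in the same or a neighbouring cell.  This is a single-level simplification
// of the multiscale grids described above.
struct Particle { double x, y, z, r; };

std::vector<std::pair<std::size_t, std::size_t>>
candidate_pairs(const std::vector<Particle>& ps, double cellWidth) {
    auto cell = [&](double v) { return (long long)std::floor(v / cellWidth); };
    auto key  = [](long long i, long long j, long long k) {
        return (i * 73856093LL) ^ (j * 19349663LL) ^ (k * 83492791LL);  // spatial hash
    };
    std::unordered_map<long long, std::vector<std::size_t>> grid;
    for (std::size_t p = 0; p < ps.size(); ++p)
        grid[key(cell(ps[p].x), cell(ps[p].y), cell(ps[p].z))].push_back(p);

    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t p = 0; p < ps.size(); ++p) {
        const long long ci = cell(ps[p].x), cj = cell(ps[p].y), ck = cell(ps[p].z);
        for (long long di = -1; di <= 1; ++di)           // scan the 27 neighbouring cells
            for (long long dj = -1; dj <= 1; ++dj)
                for (long long dk = -1; dk <= 1; ++dk) {
                    auto it = grid.find(key(ci + di, cj + dj, ck + dk));
                    if (it == grid.end()) continue;
                    for (std::size_t q : it->second) {
                        if (q <= p) continue;            // report each pair once
                        const double dx = ps[p].x - ps[q].x, dy = ps[p].y - ps[q].y,
                                     dz = ps[p].z - ps[q].z;
                        const double rr = ps[p].r + ps[q].r;
                        if (dx * dx + dy * dy + dz * dz <= rr * rr)
                            pairs.push_back({p, q});     // hand off to narrow-phase contact
                    }
                }
    }
    return pairs;
}
```
    A multiscale variant in the spirit of the paper would keep one such grid per particle-size class and compare particles only against grids of comparable or coarser resolution, before running the (iterative or brute-force) narrow-phase check on the surviving pairs.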

    A framework for efficient execution of matrix computations

    Matrix computations lie at the heart of most scientific computational tasks. The solution of linear systems of equations is a very frequent operation in many fields of science, engineering, surveying, physics and others. Other matrix operations occur frequently in many other fields such as pattern recognition and classification, or multimedia applications. Therefore, it is important to perform matrix operations efficiently. The work in this thesis focuses on the efficient execution on commodity processors of matrix operations which arise frequently in different fields. We study some important operations which appear in the solution of real-world problems: some sparse and dense linear algebra codes and a classification algorithm. In particular, we focus our attention on the efficient execution of the following operations: sparse Cholesky factorization, dense matrix multiplication, dense Cholesky factorization, and Nearest Neighbor classification. A lot of research has been conducted on the efficient parallelization of numerical algorithms. However, the efficiency of a parallel algorithm depends ultimately on the performance obtained from the computations performed on each node. The work presented in this thesis focuses on sequential execution on a single processor. There exist a number of data structures for sparse computations which can be used to avoid the storage of, and computation on, zero elements. We work with a hierarchical data structure known as the hypermatrix. A matrix is subdivided recursively an arbitrary number of times. Several pointer matrices are used to store the location of submatrices at each level. The last level consists of data submatrices which are dealt with as dense submatrices. When the block size of these dense submatrices is small, the number of stored zeros can be greatly reduced; however, the performance obtained from BLAS3 routines drops heavily. Consequently, there is a trade-off in the size of the data submatrices used for a sparse Cholesky factorization with the hypermatrix scheme. Our goal is to reduce the overhead introduced by unnecessary operations on zeros when a hypermatrix data structure is used to compute a sparse Cholesky factorization. In this work we study several techniques for reducing such overhead in order to obtain high performance. One of our goals is the creation of codes which work efficiently on different platforms when operating on dense matrices. To obtain high performance, the resources offered by the CPU must be properly utilized, and at the same time the memory hierarchy must be exploited to tolerate increasing memory latencies. To achieve the former, we produce inner kernels which use the CPU very efficiently. To achieve the latter, we investigate nonlinear data layouts, which can contribute to the effective use of the memory system. The use of highly optimized inner kernels is of paramount importance for obtaining efficient numerical algorithms. Often, such kernels are created by hand. However, we want to create efficient inner kernels for a variety of processors using a general approach, avoiding hand-written assembly language. In this work, we present an alternative way to produce efficient kernels automatically, based on a set of simple codes written in a high-level language which can be parameterized at compilation time. The advantage of our method lies in the ability to generate very efficient inner kernels by means of a good compiler. Working on regular codes for small matrices, most of the compilers we used on different platforms generated very efficient inner kernels for matrix multiplication. Using the resulting kernels, we have been able to produce high-performance sparse and dense linear algebra codes on a variety of platforms. In this work we also show that techniques used in linear algebra codes can be useful in other fields; we present our work on the optimization of Nearest Neighbor classification, focusing on the speed of the classification process. Tuning several codes for different problems and machines can become a heavy and time-consuming task. For this reason we have developed an environment for the development and automatic benchmarking of codes, which is presented in this thesis. As a practical result of this work, we have been able to create efficient codes for several matrix operations on a variety of platforms. Our codes are highly competitive with other state-of-the-art codes for some problems.
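    A minimal sketch of the compile-time parameterization idea, assuming a C++ register-blocked micro-kernel whose tile shape is fixed by template parameters; the names MR, NR and KC and the row-major panel layout are illustrative, not taken from the thesis.

```cpp
#include <cstddef>

// Inner kernel whose register-block sizes MR and NR are template parameters,
// so a good compiler can fully unroll the tile loops and keep the C tile in
// registers -- the "simple code parameterized at compilation time" idea.
template <std::size_t MR, std::size_t NR>
void micro_kernel(const double* A,   // MR x KC panel, row-major, leading dimension KC
                  const double* B,   // KC x NR panel, row-major, leading dimension NR
                  double* C,         // MR x NR tile inside a matrix with leading dimension ldc
                  std::size_t KC, std::size_t ldc) {
    double acc[MR][NR] = {};                         // accumulators the compiler can promote to registers
    for (std::size_t k = 0; k < KC; ++k)
        for (std::size_t i = 0; i < MR; ++i)
            for (std::size_t j = 0; j < NR; ++j)
                acc[i][j] += A[i * KC + k] * B[k * NR + j];
    for (std::size_t i = 0; i < MR; ++i)             // write the tile back once
        for (std::size_t j = 0; j < NR; ++j)
            C[i * ldc + j] += acc[i][j];
}
```
    Instantiating, say, micro_kernel<4, 8> fixes the tile shape at compile time, which is what lets an ordinary compiler generate the kind of efficient inner kernel the thesis relies on instead of hand-written assembly.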

    Effective data parallel computing on multicore processors

    The rise of chip multiprocessing, or the integration of multiple general-purpose processing cores on a single chip (multicores), has impacted all computing platforms including high-performance, server, desktop, mobile, and embedded processors. Programmers can no longer expect continued increases in software performance without developing parallel, memory-hierarchy-friendly software that can effectively exploit the chip-level multiprocessing paradigm of multicores. The goal of this dissertation is to demonstrate a design process for data parallel problems that starts with a sequential algorithm and ends with a high performance implementation on a multicore platform. Our design process combines theoretical algorithm analysis with practical optimization techniques. Our target multicores are quad-core processors from Intel and the eight-SPE IBM Cell B.E. Target applications include Matrix Multiplication (MM), Finite Difference Time Domain (FDTD), LU Decomposition (LUD), and Power Flow Solver based on Gauss-Seidel (PFS-GS) algorithms. These applications are popular computation methods in science and engineering problems and are characterized by unit-stride (MM, LUD, and PFS-GS) or 2-point stencil (FDTD) memory access patterns. The main contributions of this dissertation include a cache- and space-efficient algorithm model, integrated data pre-fetching and caching strategies, and in-core optimization techniques. Our multicore-efficient implementations of the applications described above outperform naïve parallel implementations by at least 2x and scale well with problem size and with the number of processing cores.
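    As one concrete instance of the 2-point-stencil pattern mentioned above, here is a minimal data-parallel 1D FDTD-style update using OpenMP; the field names, update constants and the use of OpenMP are assumptions for the example, while the dissertation itself targets Intel quad-cores and the Cell B.E. with its own optimization layers.

```cpp
#include <cstddef>
#include <vector>

// 1D FDTD-style sweep: a 2-point stencil with unit-stride access.  Each time
// step is split across cores with a data-parallel loop.  Assumes e has n >= 2
// entries and h has at least n - 1 entries; ce and ch are illustrative constants.
void fdtd_1d(std::vector<double>& e, std::vector<double>& h,
             std::size_t steps, double ce = 0.5, double ch = 0.5) {
    const std::size_t n = e.size();
    for (std::size_t t = 0; t < steps; ++t) {
        #pragma omp parallel for               // data-parallel sweep over the H field
        for (std::size_t i = 0; i < n - 1; ++i)
            h[i] += ch * (e[i + 1] - e[i]);
        #pragma omp parallel for               // then over the interior E field
        for (std::size_t i = 1; i < n - 1; ++i)
            e[i] += ce * (h[i] - h[i - 1]);
    }
}
```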

    Efficient representations of large radiosity matrices

    The radiosity equation can be expressed as a linear system in which light interactions between patches of the scene are considered. Its resolution has been one of the main subjects in computer graphics, which has led to the development of methods focused on different goals. For instance, in inverse lighting problems, it is convenient to solve the radiosity equation thousands of times for static geometries. Also, this calculation needs to consider many (or infinite) light bounces to achieve accurate global illumination results. Several methods have been developed to solve the linear system by finding approximations or other representations of the radiosity matrix, because the full storage of this matrix is memory demanding. Some examples are hierarchical radiosity, progressive refinement approaches, or wavelet radiosity. Even though these methods are memory efficient, they may become slow for many light bounces due to their iterative nature. Recently, efficient methods have been developed for the direct resolution of the radiosity equation. In this case, the challenge is to reduce the memory requirements of the radiosity matrix and of its inverse. The main objective of this thesis is to exploit the properties of specific problems to reduce the memory requirements of the radiosity problem. Two types of problems are analyzed. The first is to solve radiosity for scenes with high spatial coherence, as happens in some architectural models. The second involves scenes with a high occlusion factor between patches. For the high-spatial-coherence case, a novel and efficient error-bounded factorization method is presented. It is based on the use of multiple singular value decompositions along with a space-filling curve, which makes it possible to exploit spatial coherence. This technique accelerates the factorization of in-core matrices and allows out-of-core matrices to be processed in a single pass. In the experimental analysis, the presented method is applied to scenes of up to 163K patches. After a precomputation stage, it is used to solve the radiosity equation for fixed geometries and infinite bounces at interactive rates. For the high-occlusion problem, city models are used. In this case, the sparsity of the radiosity matrix is exploited. An approach for radiative exchange computation is proposed in which the inverse of the radiosity matrix is approximated. In this calculation, near-zero elements are removed, leading to a highly sparse result. This technique is applied to simulate daylight in urban environments composed of up to 140K patches.
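    A heavily simplified sketch of the sparse approximate-inverse idea for the high-occlusion case: the inverse is accumulated as a truncated Neumann series (one term per light bounce) and near-zero entries are dropped after every term so the result stays sparse. The dense storage, the series expansion and the threshold eps are illustrative assumptions, not the thesis' actual algorithm.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Approximate (I - M)^{-1} as I + M + M^2 + ... + M^bounces, where M holds the
// reflectance-weighted form factors of the radiosity system.  Entries smaller
// than eps are zeroed after each term, keeping the accumulated inverse sparse.
Matrix approximate_inverse(const Matrix& M, std::size_t bounces, double eps) {
    const std::size_t n = M.size();
    Matrix inv(n, std::vector<double>(n, 0.0));
    Matrix term(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) { inv[i][i] = 1.0; term[i][i] = 1.0; }  // k = 0 term

    for (std::size_t k = 1; k <= bounces; ++k) {
        Matrix next(n, std::vector<double>(n, 0.0));
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t l = 0; l < n; ++l) {
                const double t = term[i][l];
                if (t == 0.0) continue;                   // skip entries already dropped
                for (std::size_t j = 0; j < n; ++j)
                    next[i][j] += t * M[l][j];            // term <- term * M (one more bounce)
            }
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                if (std::abs(next[i][j]) < eps) next[i][j] = 0.0;  // drop near-zero elements
                inv[i][j] += next[i][j];                  // accumulate this bounce
            }
        term.swap(next);
    }
    return inv;
}
```
    Once such a sparse approximate inverse is precomputed for a fixed geometry, re-lighting the scene reduces to a sparse matrix-vector product per emission vector, which is what makes repeated solves cheap.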

    Task-based Runtime Optimizations Towards High Performance Computing Applications

    The last decades have witnessed a rapid improvement of computational capabilities in high-performance computing (HPC) platforms thanks to hardware technology scaling. HPC architectures benefit from mainstream hardware advances, with many-core systems, deep hierarchical memory subsystems, non-uniform memory access, and an ever-increasing gap between computational power and memory bandwidth. This has necessitated continuous adaptations across the software stack to maintain high hardware utilization. In this HPC landscape of potentially million-way parallelism, task-based programming models associated with dynamic runtime systems are becoming more popular, as they foster developer productivity at extreme scale by abstracting the underlying hardware complexity. In this context, this dissertation highlights how a software bundle powered by a task-based programming model can address the heterogeneous workloads engendered by HPC applications, namely data redistribution, geostatistical modeling, and 3D unstructured mesh deformation. Data redistribution aims to reshuffle data to optimize some objective for an algorithm; the objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal of increasing efficiency and therefore reducing the time-to-solution of the algorithm. Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Meshing the deformable contour of moving 3D bodies is an expensive operation that can cause huge computational challenges in fluid-structure interaction (FSI) applications. Therefore, in this dissertation, Redistribute-PaRSEC, ExaGeoStat-PaRSEC and HiCMA-PaRSEC are proposed to efficiently tackle these HPC applications respectively at extreme scale, and they are evaluated on multiple HPC clusters, including AMD-based, Intel-based and Arm-based CPU systems and an IBM-based multi-GPU system. This multidisciplinary work emphasizes the need for runtime systems to go beyond their primary responsibility of task scheduling on massively parallel hardware systems in order to service the next-generation scientific applications.
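    The fragment below illustrates, in a hedged way, the dependence-driven task style that such runtimes build on, expressed with generic OpenMP task dependences rather than the PaRSEC API (which is not reproduced here); the function, tile size and per-tile kernels are invented for the example.

```cpp
#include <cstddef>
#include <vector>

// Each tile update becomes a task; the runtime releases a task as soon as the
// tiles it depends on are ready, instead of synchronising the whole array.
double scale_then_sum(std::vector<double>& data, std::size_t tile, double alpha) {
    const std::size_t ntiles = data.size() / tile;      // assumes tile divides the size
    std::vector<double> partial(ntiles, 0.0);

    #pragma omp parallel
    #pragma omp single
    for (std::size_t t = 0; t < ntiles; ++t) {
        double* blk = &data[t * tile];
        double* acc = &partial[t];
        // Task 1: scale one tile in place (writes blk).
        #pragma omp task depend(inout: blk[0:tile]) firstprivate(blk, tile, alpha)
        for (std::size_t i = 0; i < tile; ++i) blk[i] *= alpha;
        // Task 2: reduce the same tile (reads blk, writes acc).  The runtime only
        // starts it once task 1 on this tile has completed; tiles proceed independently.
        #pragma omp task depend(in: blk[0:tile]) depend(out: acc[0]) firstprivate(blk, acc, tile)
        for (std::size_t i = 0; i < tile; ++i) acc[0] += blk[i];
    }                                                   // implicit barrier: all tasks finish

    double sum = 0.0;
    for (double p : partial) sum += p;
    return sum;
}
```
    In a distributed runtime such as PaRSEC, per-tile dependences of this kind additionally drive scheduling and communication across nodes, which is where the dissertation's Redistribute-, ExaGeoStat- and HiCMA-based contributions sit.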