32 research outputs found

    Automated cache optimisations of stencil computations for partial differential equations

    This thesis focuses on numerical methods that solve partial differential equations. Our focal point is the finite difference method, which solves partial differential equations by approximating derivatives with explicit finite differences. These partial differential equation solvers consist of stencil computations on structured grids. Stencils in real-world applications are often characterised by many memory accesses and non-trivial arithmetic expressions, which lead to high computational costs compared to the simple stencils used in much prior proof-of-concept work. In addition, the loop nests that express stencils on structured grids are often complicated. This work is motivated by a specific domain of stencil computations where one of the challenges is operations that are not aligned to the structured grid ("off-the-grid" operations). These operations update neighbouring grid points through scatter and gather operations via non-affine memory accesses, such as A[B[i]]. In addition to this challenge, these practical stencils often include many computation fields (which require storing multiple grid copies), complex data dependencies and imperfect loop nests. In this work, we aim to increase the performance of stencil kernel execution. We study automated cache-memory-dependent optimisations for stencil computations. This work consists of two core parts with their respective contributions. The first part of our work aims to reduce the data movement in stencil computations of practical interest. Data movement is a dominant factor affecting the performance of high-performance computing applications. It has long been a target of optimisations due to its impact on execution time and energy consumption. This thesis aims to reduce this cost by applying temporal blocking optimisations, also known as time-tiling, to stencil computations. Temporal blocking is a well-known technique to enhance data reuse in stencil computations. 
However, it is rarely used in practical applications and appears mostly in theoretical examples that demonstrate its efficacy. Applying temporal blocking to scientific simulations is more complex. More specifically, in this work, we focus on the application context of seismic and medical imaging. In this area, we often encounter scatter and gather operations due to signal sources and receivers at arbitrary locations in the computational domain. These operations make the application of temporal blocking challenging. We present an approach to overcome this challenge and successfully apply temporal blocking. In the second part of our work, we extend the first part into an automated approach targeting a wide range of simulations modelled with partial differential equations. Since temporal blocking is error-prone, tedious to apply by hand and highly complex to master both theoretically and practically, we are motivated to automate its application and automatically generate code that benefits from it. We discuss algorithmic approaches and present a generalised compiler pipeline to automate the application of temporal blocking. These passes are implemented in the Devito compiler. They are used to accelerate the computation of stencil kernels in areas such as seismic and medical imaging, computational fluid dynamics and machine learning. Devito (www.devitoproject.org) is a Python package to implement optimised stencil computation (e.g., finite differences, image processing, machine learning) from high-level symbolic problem definitions. Devito builds on SymPy (www.sympy.org) and employs automated code generation and just-in-time compilation to execute optimised computational kernels on several computer platforms, including CPUs, GPUs, and clusters thereof. We show how we automate temporal blocking code generation without user intervention and often achieve better time-to-solution. 
We enable domain-specific optimisation through compiler passes and offer temporal blocking gains from a high-level symbolic abstraction. These automated optimisations benefit various computational kernels for solving real-world application problems. Open Acces
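
    The data-reuse idea behind temporal blocking can be illustrated with a minimal sketch. It assumes a 1D three-point explicit stencil with fixed end values and uses overlapped tiling (each tile recomputes a small halo), which is only one of several temporal blocking schemes and is not claimed to be the scheme used in the thesis. The function names, the tile size and the block depth `tb` are hypothetical, and the off-the-grid scatter and gather operations discussed in the abstract are not covered.

```python
def stencil_step(u, alpha):
    """One explicit 3-point update; the two end values are held fixed."""
    v = list(u)
    for i in range(1, len(u) - 1):
        v[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return v


def run_naive(u0, nt, alpha):
    """Reference: sweep the whole grid once per time step."""
    u = list(u0)
    for _ in range(nt):
        u = stencil_step(u, alpha)
    return u


def run_time_tiled(u0, nt, alpha, tile=8, tb=4):
    """Advance each spatial tile up to `tb` time steps before moving on."""
    u = list(u0)
    n = len(u)
    for t0 in range(0, nt, tb):
        steps = min(tb, nt - t0)
        out = [0.0] * n
        for start in range(0, n, tile):
            end = min(start + tile, n)
            # Halo of `steps` points per side, clipped at the domain edges.
            lo = max(start - steps, 0)
            hi = min(end + steps, n)
            local = u[lo:hi]
            for _ in range(steps):
                local = stencil_step(local, alpha)
            # Stale values creep inwards one point per step, so only the
            # tile itself is kept and the halo results are discarded.
            out[start:end] = local[start - lo:end - lo]
        u = out
    return u


# Hypothetical test problem: a unit pulse in the middle of a 40-point grid.
u0 = [0.0] * 40
u0[20] = 1.0
reference = run_naive(u0, 10, 0.1)
tiled = run_time_tiled(u0, 10, 0.1)
max_diff = max(abs(a - b) for a, b in zip(reference, tiled))
```

    In this sketch each tile stays in a small working set for `tb` consecutive time steps, at the price of redundant halo computation; `max_diff` compares the result against the step-by-step reference.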

    On the co-design of scientific applications and long vector architectures

    The landscape of High Performance Computing (HPC) system architectures keeps expanding with new technologies and increased complexity. To improve the efficiency of next-generation compute devices, architects are looking for solutions beyond the commodity CPU approach. In 2021, the five most powerful supercomputers in the world use either GP-GPU (General-purpose computing on graphics processing units) accelerators or a customized CPU specially designed to target HPC applications. This trend is expected to grow in the coming years, motivated by the compute demands of science and industry. As architectures evolve, the ecosystem of tools and applications must follow. Choices such as the number of cores per socket, the floating-point units per core and the bandwidth through the memory hierarchy have a large impact on the power consumption and compute capabilities of the devices. To balance CPU and accelerators, designers require accurate tools for analyzing and predicting the impact of new architectural features on the performance of complex scientific applications at scale. In such a large design space, capturing and modeling with simulators the complex interactions between the system software and hardware components is a daunting challenge. Moreover, applications must be able to exploit those designs with aggressive compute capabilities and memory bandwidth configurations. Algorithms and data structures will need to be redesigned accordingly to expose a high degree of data-level parallelism, allowing them to scale in large systems. Therefore, next-generation computing devices will be the result of a co-design effort in hardware and applications supported by advanced simulation tools. In this thesis, we focus our work on the co-design of scientific applications and long vector architectures. We significantly extend a multi-scale simulation toolchain, enabling accurate performance and power estimations of large-scale HPC systems. 
Through simulation, we explore the large design space in current HPC trends over a wide range of applications. We extract speedup and energy consumption figures, analyzing the trade-offs and optimal configurations for each of the applications. We describe in detail the optimization process of two challenging applications on real vector accelerators, achieving outstanding operation performance and full memory bandwidth utilization. Overall, we provide evidence-based architectural and programming recommendations that will serve as hardware and software co-design guidelines for the next generation of specialized compute devices. Postprint (published version

    Generalized sweeping preconditioners for domain decomposition methods applied to Helmholtz problems

    The main part of this thesis explores a family of generalized sweeping preconditioners for Helmholtz problems with a non-overlapping checkerboard partition of the computational domain. The domain decomposition procedure relies on high-order transmission conditions and cross-point treatments, which cannot scale without an efficient preconditioning technique when the number of subdomains increases. With the proposed approach, existing sweeping preconditioners, such as the symmetric Gauss-Seidel and parallel double sweep preconditioners, can be applied to checkerboard partitions with different sweeping directions (e.g. horizontal and diagonal). Several directions can be combined thanks to the flexible version of GMRES, allowing for the rapid transfer of information between the different zones of the computational domain and thereby accelerating the convergence of the final iterative solution procedure. Several two-dimensional finite element results are presented to study and compare the sweeping preconditioners, and to illustrate their performance on cases of increasing complexity.
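
    As a point of reference for the sweeping idea, the following minimal sketch shows a pointwise symmetric Gauss-Seidel sweep: one forward pass over the unknowns followed by one backward pass. It is an assumption-laden illustration only. In the thesis the sweeps run over subdomains and act as a preconditioner inside GMRES, whereas here the sweep is used as a stand-alone iteration on a small, diagonally dominant tridiagonal matrix chosen so that it converges (not a Helmholtz operator). All names are hypothetical.

```python
def symmetric_gauss_seidel_sweep(A, b, x):
    """One forward sweep followed by one backward sweep, updating x in place."""
    n = len(b)
    for order in (range(n), range(n - 1, -1, -1)):
        for i in order:
            off_diag = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - off_diag) / A[i][i]
    return x


def residual_norm(A, b, x):
    """Largest absolute entry of b - A x."""
    n = len(b)
    return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) for i in range(n))


# Hypothetical test system: tridiagonal matrix with 4 on the diagonal, -1 beside it.
n = 6
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
b = [1.0] * n
x = [0.0] * n
for _ in range(30):
    symmetric_gauss_seidel_sweep(A, b, x)
final_residual = residual_norm(A, b, x)
```

    The forward pass carries information from the first unknown towards the last and the backward pass returns it, which is the same motivation the abstract gives for combining several sweeping directions.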

    Modeling for inversion in exploration geophysics

    Seismic inversion, and more generally geophysical exploration, aims at better understanding the earth's subsurface, which is one of today's most important challenges. Firstly, the subsurface contains natural resources that are critical to our technologies, such as water, minerals, and oil and gas. Secondly, monitoring the subsurface in the context of CO2 sequestration, earthquake detection and global seismology is of major interest with regard to safety and environmental hazards. However, the technologies to monitor the subsurface or find resources are scientifically extremely challenging. Seismic inversion can be formulated as a mathematical optimization problem that minimizes the difference between field-recorded data and numerically modeled synthetic data. Solving this optimization problem requires numerically modeling, thousands of times, wave propagation in large three-dimensional representations of part of the earth's subsurface. The mathematical and computational complexity of this problem therefore calls for software design that abstracts these requirements and facilitates algorithm and software development. My thesis addresses some of the challenges that arise from these problems, mainly the computational cost and access to the right software for research and development. In the first part, I will discuss a performance metric that improves on the current runtime-only benchmarks in exploration geophysics. This metric, the roofline model, first provides insight at the hardware level into the performance of a given implementation relative to the maximum achievable performance. Second, this study demonstrates that the choice of numerical discretization has a major impact on the achievable performance depending on the hardware at hand, and shows that a framework that is flexible with respect to the discretization parameters is necessary. 
In the second part, I will introduce and describe Devito, a symbolic finite-difference DSL that provides a high-level interface to the definition of partial differential equations (PDEs) such as the wave equation. From the symbolic definition of PDEs, Devito generates and compiles highly optimized C code on the fly to compute the solution of the PDE. The combination of the high-level abstractions and the just-in-time compiler enables research for geophysical exploration and PDE-constrained optimization based on the paradigm of separation of concerns. This allows researchers to concentrate on their respective fields of study while having access to computationally performant solvers with a flexible and easy-to-use interface to successfully implement complex representations of the physics. The second part of my thesis will be split into two sub-parts: first describing the symbolic application programming interface (API), and then describing and benchmarking the just-in-time compiler. I will end my thesis with concluding remarks, the latest developments and a brief description of projects that were enabled by Devito. Ph.D
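
    The roofline model mentioned in this abstract bounds attainable performance by the smaller of the compute peak and the product of memory bandwidth and arithmetic intensity (floating-point operations per byte moved). The sketch below shows that bound only; the machine figures and names are hypothetical and are not taken from the thesis.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Roofline bound: min(compute peak, memory bandwidth * flops per byte)."""
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)


# Hypothetical machine: 1000 GFLOP/s compute peak, 100 GB/s memory bandwidth.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

# Arithmetic intensity at which a kernel stops being memory-bound.
ridge_point = PEAK_GFLOPS / BANDWIDTH_GBS

# A kernel doing 0.5 flops per byte is limited by memory traffic.
memory_bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, 0.5)

# A kernel doing 20 flops per byte is limited by the compute peak.
compute_bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, 20.0)
```

    Comparing a measured rate against this bound, rather than reporting runtime alone, indicates how far an implementation is from what the hardware allows for its arithmetic intensity.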

    Large-Scale Simulations of Complex Turbulent Flows: Modulation of Turbulent Boundary Layer Separation and Optimization of Discontinuous Galerkin Methods for Next-Generation HPC Platforms

    The separation of spatially evolving turbulent boundary layer flow near regions of adverse pressure gradients has been the subject of numerous studies in the context of flow control. Although many studies have demonstrated the efficacy of passive flow control devices, such as vortex generators (VGs), in reducing the size of the separated region, the interactions between the salient flow structures produced by the VG and those of the separated flow are not fully understood. Here, wall-resolved large-eddy simulation of a model problem of flow over a backward-facing ramp is studied with a submerged, wall-mounted cube being used as a canonical VG. In particular, the turbulent transport that results in the modulation of the separated flow over the ramp is investigated by varying the size, location of the VG, and the spanwise spacing between multiple VGs, which in turn are expected to modify the interactions between the VG-induced flow structures and those of the separated region. The horseshoe vortices produced by the cube entrain the freestream turbulent flow towards the plane of symmetry. These localized regions of high vorticity correspond to turbulent kinetic energy production regions, which effectively transfer energy from the freestream to the near-wall regions. 
Numerical simulations indicate that: (i) the gradients and the fluctuations scale with the size of the cube and thus lead to more effective modulation for large cubes; (ii) for a given cube height, the upstream cube position affects the behavior of the horseshoe vortex: when placed too close to the leading edge, the horseshoe vortex is not sufficiently strong to affect the large-scale structures of the separated region, and when placed too far, the dispersed core of the streamwise vortex is unable to modulate the flow over the ramp; (iii) if the spanwise spacing between neighboring VGs is too small, the counter-rotating vortices are not sufficiently strong to affect the large-scale structures of the separated region, and if the spacing is too large, the flow modulation is similar to that of an isolated VG. Turbulent boundary layer flows are inherently multiscale, and numerical simulations of such systems often require high spatial and temporal resolution to capture the unsteady flow dynamics accurately. While the innovations in computer hardware and distributed computing have enabled advances in the modeling of such large-scale systems, computations of many practical problems of interest are infeasible, even on the largest supercomputers. The need for high accuracy and the evolving heterogeneous architecture of the next-generation high-performance computing centers have driven interest in the development of high-order methods. While the new class of recovery-assisted discontinuous Galerkin (RADG) methods can provide arbitrarily high orders of accuracy, the large number of degrees of freedom increases the costs associated with the arithmetic operations performed and the amount of data transferred on-node. The purpose of the second part of this thesis is to explore optimization strategies to improve the parallel efficiency of RADG. 
A cache data-tiling strategy is investigated for polynomial orders 1 through 6, which enhances the arithmetic intensity of RADG to make better use of on-node floating-point capability. In addition, a power-aware compute framework is suggested by analyzing the power-performance trade-offs when changing from double to single-precision floating-point types. Energy savings of 5 W per node are observed, which suggests that a transprecision framework will likely offer a better power-performance balance on modern HPC platforms. PhD, Mechanical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163206/1/suyashtn_1.pd

    Flexible Modellerweiterung und Optimierung von Erdbebensimulationen [Flexible model extension and optimisation of earthquake simulations]

    Simulations of realistic earthquake scenarios require scalable software and extensive supercomputing resources. With increasing fidelity in simulations, advanced rheological and source models need to be incorporated. I introduce a domain-specific language in order to handle the model flexibility in combination with the high efficiency requirements. The contributions in this thesis enabled the largest and longest dynamic rupture simulation to date of the 2004 Sumatra earthquake.

    Productivity, performance, and portability for computational fluid dynamics applications

    Hardware trends over the last decade show increasing complexity and heterogeneity in high performance computing architectures, which presents developers of CFD applications with three key challenges: achieving good performance, being portable enough to utilise current and future hardware, and doing so in a productive manner. These three goals appear to contradict each other when using traditional programming approaches, but in recent years several strategies, such as template libraries and Domain Specific Languages, have emerged as potential solutions: by giving up generality and focusing on a narrower domain of problems, all three can be achieved. This paper gives an overview of the state of the art for delivering performance, portability, and productivity to CFD applications, ranging from high-level libraries that allow the symbolic description of PDEs to low-level techniques that target individual algorithmic patterns. We discuss advantages and challenges in using each approach, and review the performance benchmarking literature that compares implementations for hardware architectures and their programming methods, giving an overview of key applications and their comparative performance.

    Towards a real-time fully-coherent all-sky search for gravitational waves from compact binary coalescences using particle swarm optimization

    While a fully-coherent all-sky search is known to be optimal for detecting gravitational wave signals from compact binary coalescences, its high computational cost has limited current searches to less sensitive coincidence-based schemes. Following up on previous work that has demonstrated the effectiveness of particle swarm optimization (PSO) in reducing the computational cost of this search, we present an implementation that achieves near real-time computational speed. This is achieved by combining the search efficiency of PSO with a significantly revised and optimized numerical implementation of the underlying mathematical formalism, along with additional multithreaded parallelization layers in a distributed computing framework. For a network of four second-generation detectors with 60 min of data from each, the runtime of the implementation presented here ranges from ≈1.4 to ≈0.5 times the data duration for network signal-to-noise ratios (SNRs) of ≳10 and ≳12, respectively. The reduced runtimes are obtained with small to negligible losses in detection sensitivity: for a false alarm rate of ≃1 event per year in Gaussian stationary noise, the loss in detection probability is ≤5% and ≤2% for SNRs of 10 and 12, respectively. Using the fast implementation, we are able to quantify frequentist errors in parameter estimation for signals in the double neutron star mass range using a large number of simulated data realizations. A clear dependence of parameter estimation errors and detection sensitivity on the condition number of the network antenna pattern matrix is revealed. Combined with previous work, this paper securely establishes the effectiveness of PSO-based fully-coherent all-sky search across the entire binary inspiral mass range that is relevant to ground-based detectors.
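
    For readers unfamiliar with particle swarm optimization, a minimal textbook-style sketch follows. It is not the implementation described in the paper: the objective is a hypothetical two-parameter quadratic standing in for the search statistic, and the inertia weight, acceleration constants, swarm size and iteration count are assumed values chosen for illustration.

```python
import random


def pso_minimize(f, bounds, n_particles=30, n_iters=200,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO: each particle is pulled towards its own best
    position and the best position found by the whole swarm."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Keep the particle inside the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def toy_objective(p):
    """Hypothetical stand-in objective with its minimum at (1, 1)."""
    return sum((x - 1.0) ** 2 for x in p)


best_pos, best_val = pso_minimize(toy_objective, [(-5.0, 5.0), (-5.0, 5.0)])
```

    The appeal for an expensive search is that the objective is evaluated only at the particle positions, `n_particles * (n_iters + 1)` times in this sketch, rather than on a dense grid over the parameter space.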