The fast multipole method at exascale

Author: Chandramowlishwaran Aparna
Publication venue: Georgia Institute of Technology
Publication date: 13/01/2014

This thesis presents a top-to-bottom analysis of designing and implementing fast algorithms for current and future systems. We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N-body problems. We target the FMM because it is broadly applicable to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena, among others. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. As such, it is among the most attractive solutions for scalable particle simulation on future extreme-scale systems. We specifically address two key challenges. The first challenge is how to engineer fast code for today’s platforms. We present the first in-depth study of multicore optimizations and tuning for FMM, along with a systematic approach for transforming a conventionally parallelized FMM into a highly tuned one. We introduce novel optimizations that significantly improve the within-node scalability of the FMM, thereby enabling high performance on multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter-node communication costs. Using these models, we present results for choosing the optimal algorithmic tuning parameter. This analysis also yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs, if there are no significant changes, could cause it to become communication-bound as early as the year 2015. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of high-level algorithm-architecture co-design.
To demonstrate the scientific significance of FMM, we present two applications: direct simulation of blood, which is a multi-scale, multi-physics problem, and large-scale biomolecular electrostatics. MoBo (Moving Boundaries) is the infrastructure for the direct numerical simulation of blood. It comprises two key algorithmic components, of which FMM is one. We were able to simulate blood flow using Stokesian dynamics on 200,000 cores of Jaguar, a petaflop system, and achieve a sustained performance of 0.7 Petaflop/s. The second application, which we propose as future work in this thesis, is biomolecular electrostatics, where we solve for the electrical potential using the boundary-integral formulation discretized with boundary element methods (BEM). The computational kernel in solving the large linear system is a dense matrix-vector multiply, which we propose can be calculated using our scalable FMM. We propose to begin with the two-dielectric problem, where the electrostatic field is calculated using two continuum dielectric media, the solvent and the molecule. This is only a first step toward solving biologically challenging problems that have more than two dielectric media, ion-exclusion layers, and solvent-filled cavities. Finally, given the difficulty in producing high-performance scalable code, productivity is a key concern. Recently, numerical algorithms have been redesigned to take advantage of the architectural features of emerging multicore processors. These new classes of algorithms express fine-grained asynchronous parallelism and hence reduce the cost of synchronization. We performed the first extensive performance study of a recently proposed parallel programming model called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially ordered by semantic scheduling constraints.
The CnC model is well suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art multicore systems. Our implementations in CnC were able to match and in some cases even exceed competing vendor-tuned and domain-specific library codes. We combine these two distinct research efforts by expressing FMM in CnC; our approach tries to marry performance with productivity, which will be critical on future systems. Looking forward, we would like to extend this to distributed-memory machines, specifically by implementing FMM in the new distributed CnC, distCnC, to express fine-grained parallelism that would require significant effort in alternative models.
Ph.D.
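
Illustrative sketch (not taken from the thesis): the abstract describes the FMM as an asymptotically optimal method for N-body problems. For orientation, the quadratic-cost direct summation that such a method approximates might look like the following. The 1/r kernel, the function name `direct_potentials`, and the sample points are assumptions chosen only for illustration.

```python
import math

def direct_potentials(points, charges):
    """Direct O(N^2) evaluation of phi_i = sum over j != i of q_j / |x_i - x_j|.

    This is the brute-force baseline, not the FMM itself; the 1/r kernel
    is an assumption made for this example.
    """
    n = len(points)
    phi = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                phi[i] += charges[j] / math.dist(points[i], points[j])
    return phi

# Three hypothetical particles in 3D.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
charges = [1.0, 2.0, 4.0]
phi = direct_potentials(points, charges)
```

Every particle interacts with every other particle, so the work grows as N squared; that growth is what motivates hierarchical approximations such as the FMM.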

Scholarly Materials And Research @ Georgia Tech

Fast Linear Programming through Transprecision Computing on Small and Sparse Data

Author: Falkenstein Maximilian
Grosser Tobias
Hoefler Torsten
Kruse Michael
Pitchanathan Arjun
Rigger Manuel
Su Zhendong
Theodoridis Theodoros
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/11/2020

A plethora of program analysis and optimization techniques rely on linear programming at their heart. However, such techniques are often considered too slow for production use. While today’s best solvers are optimized for complex problems with thousands of dimensions, linear programming, as used in compilers, is typically applied to small and seemingly trivial problems, but to many instances in a single compilation run. As a result, compilers do not benefit from decades of research on optimizing large-scale linear programming. We design a simplex solver targeted at compilers. A novel theory of transprecision computation applied from individual elements to full data structures provides the computational foundation. By carefully combining it with optimized representations for small and sparse matrices and specialized small-coefficient algorithms, we (1) reduce memory traffic, (2) exploit wide vectors, and (3) use low-precision arithmetic units effectively. We evaluate our work by embedding our solver into a state-of-the-art integer set library and implement one essential operation, coalescing, on top of our transprecision solver. Our evaluation shows more than an order-of-magnitude speedup on the core simplex pivot operation and a mean speedup of 3.2x (vs. GMP) and 4.6x (vs. IMath) for the optimized coalescing operation. Our results demonstrate that our optimizations exploit the wide SIMD instructions of modern microarchitectures effectively. We expect our work to provide foundations for a future integer set library that uses transprecision arithmetic to accelerate compiler analyses.
ISSN:2475-142
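
Illustrative sketch (not the authors' solver): the abstract names the simplex pivot as the core operation. A minimal pivot on a small dense tableau, written with Python's exact `fractions.Fraction`, could look like this. It corresponds to the exact arbitrary-precision style of baseline rather than to the transprecision technique; the function name `pivot` and the sample tableau are hypothetical.

```python
from fractions import Fraction

def pivot(tableau, row, col):
    """Pivot in place on tableau[row][col] using exact rational arithmetic."""
    p = tableau[row][col]
    if p == 0:
        raise ValueError("pivot element must be nonzero")
    # Scale the pivot row so the pivot element becomes 1.
    tableau[row] = [v / p for v in tableau[row]]
    # Eliminate the pivot column from every other row.
    for r in range(len(tableau)):
        if r != row:
            f = tableau[r][col]
            if f != 0:
                tableau[r] = [a - f * b for a, b in zip(tableau[r], tableau[row])]
    return tableau

# Hypothetical tableau: two constraint rows followed by an objective row.
t = [[Fraction(v) for v in r] for r in (
    [1, 2, 1, 0, 4],
    [3, 1, 0, 1, 6],
    [-1, -1, 0, 0, 0],
)]
pivot(t, 1, 0)
```

Exact rationals keep every intermediate value correct at the cost of slow arithmetic, which is the overhead the abstract says matters when the problems are small but very numerous.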

Repository for Publications and Research Data

Edinburgh Research Explorer

High-performance and hardware-aware computing: proceedings of the first International Workshop on New Frontiers in High-performance and Hardware-aware Computing (HipHaC'08)

Author: Buchty Rainer
Weiß Jan-Philipp
Publication venue: KIT Scientific Publishing, Karlsruhe
Publication date: 01/01/2008

The HipHaC workshop aims at combining new aspects of parallel, heterogeneous, and reconfigurable microprocessor technologies with concepts of high-performance computing and, particularly, numerical solution methods. Compute- and memory-intensive applications can only benefit from the full hardware potential if all features on all levels are taken into account in a holistic approach.

KITopen

Code Generation for High Performance PDE Solvers on Modern Architectures

Author: Kempf Dominic
Publication date: 01/01/2019

Numerical simulation with partial differential equations is an important discipline in high performance computing. Notable application areas include geosciences, fluid dynamics, solid mechanics and electromagnetics. Recent hardware developments have made it increasingly hard to achieve very good performance. This is both due to a lack of numerical algorithms suited for the hardware and efficient implementations of these algorithms not being available. Modern CPUs require a sufficiently high arithmetic intensity in order to unfold their full potential. In this thesis, we use a numerical scheme that is well-suited for this scenario: The Discontinuous Galerkin Finite Element Method on cuboid meshes can be implemented with optimal complexity exploiting the tensor product structure of basis functions and quadrature formulae using a technique called sum factorization. A matrix-free implementation of this scheme significantly lowers the memory footprint of the method and delivers a fully compute-bound algorithm. An efficient implementation of this scheme for a modern CPU requires maximum use of the processor’s SIMD units. General purpose compilers are not capable of autovectorizing traditional PDE simulation codes, requiring high performance implementations to explicitly spell out SIMD instructions. With the SIMD width increasing in the last years (reaching its current peak at 512 bits in the Intel Skylake architecture) and programming languages not providing tools to directly target SIMD units, such code suffers from a performance portability issue. This work proposes generative programming as a solution to this issue. To this end, we develop a toolchain that translates a PDE problem expressed in a domain specific language into a piece of machine-dependent, optimized C++ code. This toolchain is embedded into the existing user workflow of the DUNE project, an open source framework for the numerical solution of PDEs. 
Compared to other such toolchains, special emphasis is put on an intermediate representation that enables performance-oriented transformations. Furthermore, this thesis defines a new class of SIMD vectorization strategies that operate on batches of subkernels within one integration kernel. The space of these vectorization strategies is explored systematically from within the code generator in an autotuning procedure. We demonstrate the performance of our vectorization strategies and their implementation by providing measurements on the Intel Haswell and Intel Skylake architectures. We present numbers for the diffusion-reaction equation, the Stokes equations and Maxwell’s equations, achieving up to 40% of the machine’s theoretical floating point performance for an application of the DG operator.
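
Illustrative sketch (not the thesis code): the abstract refers to sum factorization, which exploits tensor-product structure. The core idea can be shown in two dimensions by applying a Kronecker-product operator through two small matrix products instead of one large one. All names and the sample matrices below are hypothetical, and the example says nothing about the DG discretization or the DUNE toolchain.

```python
def matmul(P, Q):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(M):
    return [list(r) for r in zip(*M)]

def apply_kron_naive(A, B, x):
    """y = (A kron B) x, evaluated entry by entry: O(m^2 n^2) multiplications."""
    m, n = len(A), len(B)
    y = [0] * (m * n)
    for i in range(m):
        for k in range(n):
            for j in range(m):
                for l in range(n):
                    y[i * n + k] += A[i][j] * B[k][l] * x[j * n + l]
    return y

def apply_kron_factored(A, B, x):
    """Same result as A X B^T, one direction at a time: O(m n (m + n))."""
    m, n = len(A), len(B)
    X = [x[j * n:(j + 1) * n] for j in range(m)]  # reshape x to m-by-n, row-major
    T = matmul(X, transpose(B))                   # contract the second index
    Y = matmul(A, T)                              # contract the first index
    return [v for row in Y for v in row]

A = [[1, 2], [3, 4]]
B = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
x = [1, 2, 3, 4, 5, 6]
y_naive = apply_kron_naive(A, B, x)
y_fact = apply_kron_factored(A, B, x)
```

Both functions compute the same vector; the factored form never builds the full operator, which is the source of the complexity reduction.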

Heidelberger Dokumentenserver

AUTOMATING DATA-LAYOUT DECISIONS IN DOMAIN-SPECIFIC LANGUAGES

Author: Deb Diptorup
Publication venue: University of North Carolina at Chapel Hill Graduate School
Publication date: 01/01/2019

A long-standing challenge in High-Performance Computing (HPC) is the simultaneous achievement of programmer productivity and hardware computational efficiency. The challenge has been exacerbated by the onset of multi- and many-core CPUs and accelerators. Only a few expert programmers have been able to hand-code domain-specific data transformations and vectorization schemes needed to extract the best possible performance on such architectures. In this research, we examined the possibility of automating these methods by developing a Domain-Specific Language (DSL) framework. Our DSL approach extends C++14 by embedding into it a high-level data-parallel array language, and by using a domain-specific compiler to compile to hybrid-parallel code. We also implemented an array index-space transformation algebra within this high-level array language to manipulate array data-layouts and data-distributions. The compiler introduces a novel method for SIMD auto-vectorization based on array data-layouts. Our new auto-vectorization technique is shown to outperform the default auto-vectorization strategy by up to 40% for stencil computations. The compiler also automates distributed data movement with overlapping of local compute with remote data movement using polyhedral integer set analysis. Along with these main innovations, we developed a new technique using C++ template metaprogramming for developing embedded DSLs using C++. We also proposed a domain-specific compiler intermediate representation that simplifies data flow analysis of abstract DSL constructs. We evaluated our framework by constructing a DSL for the HPC grand-challenge domain of lattice quantum chromodynamics. Our DSL yielded performance gains of up to twice the flop rate over existing production C code for selected kernels. This gain in performance was obtained while using less than one-tenth the lines of code. 
The performance of this DSL was also competitive with the best hand-optimized and hand-vectorized code, and is an order of magnitude better than existing production DSLs.
Doctor of Philosophy

Carolina Digital Repository

Perception-motivated parallel algorithms for haptics

Author: Galvan Stefano
Publication venue: country:ITA
Publication date: 01/01/2010

In recent years haptic feedback has been used in several applications, from mobile phones to rehabilitation, from video games to robot-aided surgery. Haptic devices, the interfaces that create the stimulation and reproduce physical interaction with virtual or remote environments, have been studied, analyzed and developed in many ways. Every innovation in the mechanics, electronics and technical design of the device is valuable; however, it is important to keep the focus of the haptic interaction on the human being, who is the only user of force feedback.
In this thesis we worked on two main topics that are relevant to this aim: a perception-based manipulation of the force signal and the use of modern multicore architectures for the implementation of the haptic controller. With the help of a specific experimental setup and using a 6-DOF haptic device, we designed a psychophysical experiment aimed at identifying the force/torque differential thresholds of the hand-arm system. On the basis of the results obtained we determined a set of task-dependent scaling functions, one for each degree of freedom of three-dimensional space, that can be used to enhance the human ability to discriminate between different stimuli. The perception-based manipulation of the force feedback requires a fast, stable and configurable controller for the haptic interface. One solution is to use newly available multicore architectures for the implementation of the controller, but many consolidated algorithms have to be ported to these parallel systems. Focusing on a specific problem, matrix pseudoinversion, which is part of robot dynamics and kinematics computations, we showed that it is possible to migrate code that was already implemented in hardware, in particular old algorithms that were inherently parallel and thus not competitive on sequential processors. The main question that remains open is how much effort is required to write these algorithms, usually described in VLSI or schematics, in a modern programming language. We show that careful task decomposition and design permit a mapping of the code onto the available cores. In addition, the use of data parallelism on SIMD machines can give good performance when simple vector instructions such as add and shift operations are used. Since these instructions are also present in hardware implementations, the migration can be performed easily. We tested our approach on a Sony PlayStation 3 game console equipped with an IBM Cell Broadband Engine processor.

Catalogo dei prodotti della ricerca

Vectorization system for unstructured codes with a Data-parallel Compiler IR

Author: Moll Simon
Publication venue: Saarländische Universitäts- und Landesbibliothek
Publication date: 01/01/2021

With Dennard Scaling coming to an end, Single Instruction Multiple Data (SIMD) offers itself as a way to improve the compute throughput of CPUs. One fundamental technique in SIMD code generators is the vectorization of data-parallel code regions. This has applications in outer-loop vectorization, whole-function vectorization and vectorization of explicitly data-parallel languages. This thesis makes contributions to the reliable vectorization of data-parallel code regions with unstructured, reducible control flow. Reducibility is the case in practice where all control-flow loops have exactly one entry point. We present P-LLVM, a novel, full-featured, intermediate representation for vectorizers that provides a semantics for the code region at every stage of the vectorization pipeline. Partial control-flow linearization is a novel partial if-conversion scheme, an essential technique to vectorize divergent control flow. Unlike prior techniques, partial linearization has linear running time, does not insert additional branches or blocks and gives proved guarantees on the control flow retained. Divergence of control induces value divergence at join points in the control-flow graph (CFG). We present a novel control-divergence analysis for directed acyclic graphs with optimal running time and prove that it is correct and precise under common static assumptions. We extend this technique to obtain a quadratic-time, control-divergence analysis for arbitrary reducible CFGs. For this analysis, we show on a range of realistic examples how earlier approaches are either less precise or incorrect. We present a feature-complete divergence analysis for P-LLVM programs. The analysis is the first to analyze stack-allocated objects in an unstructured control setting. Finally, we generalize single-dimensional vectorization of outer loops to multi-dimensional tensorization of loop nests.
SIMD targets benefit from tensorization through more opportunities for re-use of loaded values and more efficient memory access behavior. The techniques were implemented in the Region Vectorizer (RV) for vectorization and TensorRV for loop-nest tensorization. Our evaluation validates that the general-purpose RV vectorization system matches the performance of more specialized approaches. RV performs on par with the ISPC compiler, which only supports its structured domain-specific language, on a range of tree traversal codes with complex control flow. RV is able to outperform the loop vectorizers of state-of-the-art compilers, as we show for the SPEC2017 nab_s benchmark and the XSBench proxy application.
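
Illustrative sketch (not the RV implementation): the abstract describes if-conversion as essential for vectorizing divergent control flow. The following shows plain, full if-conversion on a list standing in for SIMD lanes: both sides of a branch are computed for every lane and a mask selects the result. The thesis's partial linearization, which can retain some branches, is not reproduced here; the function names and the sample data are hypothetical.

```python
def branchy(xs):
    """Scalar reference: each element takes exactly one side of the branch."""
    out = []
    for x in xs:
        if x < 0:
            out.append(-x)
        else:
            out.append(2 * x)
    return out

def if_converted(xs):
    """Branch-free form: evaluate both sides for all lanes, then select by mask."""
    mask = [x < 0 for x in xs]        # per-lane branch condition
    then_vals = [-x for x in xs]      # 'then' side computed for every lane
    else_vals = [2 * x for x in xs]   # 'else' side computed for every lane
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

lanes = [-3, 0, 4, -1]
ref = branchy(lanes)
vec = if_converted(lanes)
```

The branch-free form does redundant work on every lane, which is why retaining branches where all lanes agree can pay off.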

Universaar

Green Wave: A Semi Custom Hardware Architecture for Reverse Time Migration

Author: Krüger Jens-Thomas
Publication date: 01/01/2012

Over the course of the last few decades the scientific community has benefited greatly from steady advances in compute performance. Until the early 2000s this performance improvement was achieved through rising clock rates. This enabled plug-and-play performance improvements for all codes. In 2005 the stagnation of CPU clock rates drove computing hardware manufacturers to attain future performance through explicit parallelism. Now the HPC community faces a new, even bigger challenge. So far performance gains were achieved through replication of general-purpose cores and nodes. Unfortunately, rising cluster sizes resulted in skyrocketing energy costs; a paradigm change in HPC architecture design is inevitable. In combination with the increasing costs of data movement, the HPC community started exploring alternatives like GPUs and large arrays of simple, low-power cores (e.g. BlueGene) to offer better performance per watt and greater scalability. As in science generally, the seismic community faces large-scale, complex computational challenges that can be solved only to a limited extent with available compute capabilities. Such challenges include the physically correct modeling of subsurface rock layers. This thesis analyzes the requirements and performance of isotropic (ISO), vertical transverse isotropic (VTI) and tilted transverse isotropic (TTI) wave propagation kernels as they appear in the Reverse Time Migration (RTM) imaging method. It finds that even with leading-edge, commercial off-the-shelf hardware, large-scale survey sizes cannot be imaged within reasonable time and power constraints. This thesis uses a novel architecture design method that applies a hardware/software co-design approach, adopted from the mobile and embedded markets, to HPC. The methodology tailors an architecture design to a class of applications without the loss of generality seen in full-custom designs.
This approach was first applied in the Green Flash project, which showed that the co-design approach has the potential for high energy efficiency gains. This thesis presents the novel Green Wave architecture, which is derived from the Green Flash project. Rather than focusing on climate codes, like Green Flash, Green Wave chooses RTM wave propagation kernels as its target application. Thus, the goal of the application-driven co-design Green Wave approach is to enable full programmability while allowing greater computational efficiency than general-purpose processors or GPUs by offering custom extensions to the processor's ISA and correctly sizing software-managed memories and an efficient on-chip network interconnect. The lowest-level building blocks of the Green Wave design are pre-verified IP components. This minimizes the amount of custom logic in the design, which in turn reduces verification costs and design uncertainty. In this thesis three Green Wave architecture designs derived from ISO, VTI and TTI kernel analysis are introduced. Further, a programming model capable of hiding all communication latencies is proposed. With production-strength, cycle-accurate hardware simulators, Green Wave's performance is benchmarked and compared to leading on-market systems from Intel, AMD and NVIDIA. Based on a large-scale example survey, the results show that Green Wave has the potential for an energy efficiency improvement of 5x compared to x86 and 1.4x-4x compared to GPU-based clusters for ISO, VTI and TTI kernels.

Heidelberger Dokumentenserver

Leveraging performance of 3D finite difference schemes in large scientific computing simulations

Author: De la Cruz Raúl
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2015

Gone are the days when engineers and scientists conducted most of their experiments empirically. During those decades, actual tests were carried out in order to assess the robustness and reliability of forthcoming product designs and to prove theoretical models. With the advent of the computational era, scientific computing has definitely become a feasible solution compared with empirical methods, in terms of effort, cost and reliability. Large and massively parallel computational resources have reduced simulation execution times and have improved numerical results due to the refinement of the sampled domain. Several numerical methods coexist for solving Partial Differential Equations (PDEs). Methods such as the Finite Element (FE) and the Finite Volume (FV) are especially well suited for dealing with problems where unstructured meshes are frequent. Unfortunately, this flexibility is not bestowed for free. These schemes entail higher memory latencies due to the handling of irregular data accesses. Conversely, the Finite Difference (FD) scheme has been shown to be an efficient solution for problems where structured meshes suit the domain requirements. Many scientific areas use this scheme due to its higher performance. This thesis focuses on improving FD schemes to leverage the performance of large scientific computing simulations. Different techniques are proposed, such as the Semi-stencil, a novel algorithm that increases the FLOP/Byte ratio for medium- and high-order stencil operators by reducing the accesses and promoting data reuse. The algorithm is orthogonal and can be combined with techniques such as spatial- or time-blocking, adding further improvement. New trends in Symmetric Multi-Processing (SMP) systems (where tens of cores are replicated on the same die) pose new challenges due to the exacerbation of the memory wall problem.
In order to alleviate this issue, our research is focused on different strategies to reduce pressure on the cache hierarchy, particularly when different threads are sharing resources due to Simultaneous Multi-Threading (SMT). Several domain decomposition schedulers for workload balance are introduced, ensuring quasi-optimal results without jeopardizing the overall performance. We combine these schedulers with spatial-blocking and auto-tuning techniques, exploring the parametric space and reducing misses in the last-level cache. As an alternative to the brute-force methods used in auto-tuning, where a huge parametric space must be traversed to find a suboptimal candidate, performance models are a feasible solution. Performance models can predict the performance on different architectures, selecting suboptimal parameters almost instantly. In this thesis, we devise a flexible and extensible performance model for stencils. The proposed model is capable of supporting multi- and many-core architectures including complex features such as hardware prefetchers, SMT context and algorithmic optimizations. Our model can be used not only to forecast execution time, but also to make decisions about the best algorithmic parameters. Moreover, it can be included in run-time optimizers to decide the best SMT configuration based on the execution environment. Some industries rely heavily on FD-based techniques for their codes. Nevertheless, many cumbersome aspects arising in industry are still scarcely considered in academic research. In this regard, we have collaborated in the implementation of an FD framework which covers the most important features that an HPC industrial application must include. Some of the node-level optimization techniques devised in this thesis have been included in the framework in order to contribute to the overall application performance.
We show results for a couple of strategic applications in industry: an atmospheric transport model that simulates the dispersal of volcanic ash and a seismic imaging model used in the Oil & Gas industry to identify hydrocarbon-rich reservoirs.
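
For readers unfamiliar with the terms used in the abstract, the following is a minimal sketch of a classical finite-difference stencil sweep and of a blocked traversal of the same sweep. It is not the thesis's Semi-stencil algorithm or its framework; the function names, the grid, and the block size are assumptions chosen purely for illustration. In one dimension, blocking only reorders the loop; the cache benefit the abstract refers to arises in 2D and 3D sweeps.

```python
# Illustrative only: a classical 1D finite-difference stencil,
# NOT the Semi-stencil algorithm described in the abstract.

# 4th-order central coefficients for the second derivative (radius 2).
COEFFS = [-1.0 / 12.0, 4.0 / 3.0, -5.0 / 2.0, 4.0 / 3.0, -1.0 / 12.0]
RADIUS = 2


def stencil_point(u, i, h):
    """Apply the stencil at interior index i of grid u with spacing h."""
    acc = 0.0
    for k, c in enumerate(COEFFS):
        acc += c * u[i + k - RADIUS]
    return acc / (h * h)


def apply_naive(u, h):
    """Sweep all interior points in one pass; boundary points stay 0."""
    out = [0.0] * len(u)
    for i in range(RADIUS, len(u) - RADIUS):
        out[i] = stencil_point(u, i, h)
    return out


def apply_blocked(u, h, block=8):
    """Same sweep, traversed block by block (block size is hypothetical)."""
    out = [0.0] * len(u)
    lo, hi = RADIUS, len(u) - RADIUS
    for start in range(lo, hi, block):
        for i in range(start, min(start + block, hi)):
            out[i] = stencil_point(u, i, h)
    return out


h = 0.1
grid = [(i * h) ** 2 for i in range(32)]  # u(x) = x^2, so u''(x) = 2
naive = apply_naive(grid, h)
blocked = apply_blocked(grid, h)
```

Both traversals perform the same arithmetic per point, so they produce identical results; blocking changes only the order in which grid points are visited.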

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa