57 research outputs found

    Optimization of Triangular and Banded Matrix Operations Using 2d-Packed Layouts

    Get PDF
    International audienceOver the past few years, multicore systems have become more and more powerful and, thereby, very useful in high-performance computing. However, many applications, such as some linear algebra algorithms, still cannot take full advantage of these systems. This is mainly due to the shortage of optimization techniques dealing with irregular control structures. In particular, the well-known polyhedral model fails to optimize loop nests whose bounds and/or array references are not affine functions. This is more likely to occur when handling sparse matrices in their packed formats. In this paper, we propose to use 2d-packed layouts and simple affine transformations to enable optimization of triangular and banded matrix operations. The benefit of our proposal is shown through an experimental study over a set of linear algebra benchmarks

    Static versus Dynamic Memory Allocation: a Comparison for Linear Algebra Kernels

    Get PDF
    International audienceThe polyhedral model permits to automatically improve data locality and enable parallelism of regular linear algebra kernels. In previous work we have proposed a new data structure, 2d-packed layout, to store only the non-zeros elements of regular sparse (triangular and banded) matrices dynamically allocated for different basic linear algebra operations, and used Pluto to parallelize and optimize them. To our surprise, there were huge discrepancies in our measures of these kernels execution times that were due to the allocation mode: as statically declared arrays or as dynamically allocated arrays of pointers.In this paper we compare the performance of various linear algebra kernels, including some linear algebra kernels from the PolyBench suite, using different array allocation modes. We present our detailed investigation of the possible reasons of the performance variation on two different architectures: a dual 12-cores AMD (Magny-Cours) and a dual 10-cores Intel Xeon (Haswell-EP).We conclude that static or dynamic memory allocation has an impact on performance in many cases, and that the processor architecture and the gcc compiler's decisions can provoke significant and sometimes surprising variations, in favor of one or the other allocation mode

    GPU accelerated RNA-RNA interaction program, A

    Get PDF
    2021 Spring.Includes bibliographical references.RNA-RNA interaction (RRI) is important in processes like gene regulation, and is known to play roles in diseases including cancer and Alzheimer's. Large RRI computations run for days, weeks or even months, because the algorithms have time and space complexity of, respectively, O(N3M3) and O(N2M2), for sequences length N and M, and there is a need for high-throughput RRI tools. GPU parallelization of such algorithms is a challenge. We first show that the most computationally expensive part of base pair maximization (BPM) algorithms comprises O(N3) instances of upper banded tropical matrix products. We develop the first GPU library for this attaining close to theoretical machine peak (TMP). We next optimize other (fifth degree polynomial) terms in the computation and develop the first GPU implementation of the complete BPMax algorithm. We attain 12% of GPU TMP, a significant speedup over the original parallel CPU implementation, which attains less than 1% of CPU TMP. We also perform a large scale study of three small viral RNAs, hypothesized to be relevant to COVID-19

    Motivo: Fast Motif Counting via Succinct Color Coding and Adaptive Sampling

    No full text
    The randomized technique of color coding is behind state-of-the-art algorithms for estimating graph motif counts. Those algorithms, however, are not yet capable of scaling well to very large graphs with billions of edges. In this paper we develop novel tools for the `motif counting via color coding' framework. As a result, our new algorithm, Motivo, is able to scale well to larger graphs while at the same time provide more accurate graphlet counts than ever before. This is achieved thanks to two types of improvements. First, we design new succinct data structures that support fast common color coding operations, and a biased coloring trick that trades accuracy versus running time and memory usage. These adaptations drastically reduce the time and memory requirements of color coding. Second, we develop an adaptive graphlet sampling strategy, based on a fractional set cover problem, that breaks the additive approximation barrier of standard sampling. This strategy gives multiplicative approximations for all graphlets at once, allowing us to count not only the most frequent graphlets but also extremely rare ones. To give an idea of the improvements, in 4040 minutes Motivo counts 77-nodes motifs on a graph with 6565M nodes and 1.81.8B edges; this is 3030 and 500500 times larger than the state of the art, respectively in terms of nodes and edges. On the accuracy side, in one hour Motivo produces accurate counts of  ⁣10.000\approx \! 10.000 distinct 88-node motifs on graphs where state-of-the-art algorithms fail even to find the second most frequent motif. Our method requires just a high-end desktop machine. These results show how color coding can bring motif mining to the realm of truly massive graphs using only ordinary hardware.Comment: 13 page

    A high-performance open-source framework for multiphysics simulation and adjoint-based shape and topology optimization

    Get PDF
    The first part of this thesis presents the advances made in the Open-Source software SU2, towards transforming it into a high-performance framework for design and optimization of multiphysics problems. Through this work, and in collaboration with other authors, a tenfold performance improvement was achieved for some problems. More importantly, problems that had previously been impossible to solve in SU2, can now be used in numerical optimization with shape or topology variables. Furthermore, it is now exponentially simpler to study new multiphysics applications, and to develop new numerical schemes taking advantage of modern high-performance-computing systems. In the second part of this thesis, these capabilities allowed the application of topology optimiza- tion to medium scale fluid-structure interaction problems, using high-fidelity models (nonlinear elasticity and Reynolds-averaged Navier-Stokes equations), which had not been done before in the literature. This showed that topology optimization can be used to target aerodynamic objectives, by tailoring the interaction between fluid and structure. However, it also made ev- ident the limitations of density-based methods for this type of problem, in particular, reliably converging to discrete solutions. This was overcome with new strategies to both guarantee and accelerate (i.e. reduce the overall computational cost) the convergence to discrete solutions in fluid-structure interaction problems.Open Acces

    Sur la conception de solveurs linéaires hybrides pour les architectures parallèles modernes

    Get PDF
    In the context of this thesis, our focus is on numerical linear algebra, more precisely on solution of large sparse systems of linear equations. We focus on designing efficient parallel implementations of MaPHyS, an hybrid linear solver based on domain decomposition techniques. First we investigate the MPI+threads approach. In MaPHyS, the first level of parallelism arises from the independent treatment of the various subdomains. The second level is exploited thanks to the use of multi-threaded dense and sparse linear algebra kernels involved at the subdomain level. Such an hybrid implementation of an hybrid linear solver suitably matches the hierarchical structure of modern supercomputers and enables a trade-off between the numerical and parallel performances of the solver. We demonstrate the flexibility of our parallel implementation on a set of test examples. Secondly, we follow a more disruptive approach where the algorithms are described as sets of tasks with data inter-dependencies that leads to a directed acyclic graph (DAG) representation. The tasks are handled by a runtime system. We illustrate how a first task-based parallel implementation can be obtained by composing task-based parallel libraries within MPI processes throught a preliminary prototype implementation of our hybrid solver. We then show how a task-based approach fully abstracting the hardware architecture can successfully exploit a wide range of modern hardware architectures. We implemented a full task-based Conjugate Gradient algorithm and showed that the proposed approach leads to very high performance on multi-GPU, multicore and heterogeneous architectures.Dans le contexte de cette thèse, nous nous focalisons sur des algorithmes pour l’algèbre linéaire numérique, plus précisément sur la résolution de grands systèmes linéaires creux. Nous mettons au point des méthodes de parallélisation pour le solveur linéaire hybride MaPHyS. Premièrement nous considerons l'aproche MPI+threads. Dans MaPHyS, le premier niveau de parallélisme consiste au traitement indépendant des sous-domaines. Le second niveau est exploité grâce à l’utilisation de noyaux multithreadés denses et creux au sein des sous-domaines. Une telle implémentation correspond bien à la structure hiérarchique des supercalculateurs modernes et permet un compromis entre les performances numériques et parallèles du solveur. Nous démontrons la flexibilité de notre implémentation parallèle sur un ensemble de cas tests. Deuxièmement nous considérons un approche plus innovante, où les algorithmes sont décrits comme des ensembles de tâches avec des inter-dépendances, i.e., un graphe de tâches orienté sans cycle (DAG). Nous illustrons d’abord comment une première parallélisation à base de tâches peut être obtenue en composant des librairies à base de tâches au sein des processus MPI illustrer par un prototype d’implémentation préliminaire de notre solveur hybride. Nous montrons ensuite comment une approche à base de tâches abstrayant entièrement le matériel peut exploiter avec succès une large gamme d’architectures matérielles. À cet effet, nous avons implanté une version à base de tâches de l’algorithme du Gradient Conjugué et nous montrons que l’approche proposée permet d’atteindre une très haute performance sur des architectures multi-GPU, multicoeur ainsi qu’hétérogène

    Density-Aware Linear Algebra in a Column-Oriented In-Memory Database System

    Get PDF
    Linear algebra operations appear in nearly every application in advanced analytics, machine learning, and of various science domains. Until today, many data analysts and scientists tend to use statistics software packages or hand-crafted solutions for their analysis. In the era of data deluge, however, the external statistics packages and custom analysis programs that often run on single-workstations are incapable to keep up with the vast increase in data volume and size. In particular, there is an increasing demand of scientists for large scale data manipulation, orchestration, and advanced data management capabilities. These are among the key features of a mature relational database management system (DBMS). With the rise of main memory database systems, it now has become feasible to also consider applications that built up on linear algebra. This thesis presents a deep integration of linear algebra functionality into an in-memory column-oriented database system. In particular, this work shows that it has become feasible to execute linear algebra queries on large data sets directly in a DBMS-integrated engine (LAPEG), without the need of transferring data and being restricted by hard disc latencies. From various application examples that are cited in this work, we deduce a number of requirements that are relevant for a database system that includes linear algebra functionality. Beside the deep integration of matrices and numerical algorithms, these include optimization of expressions, transparent matrix handling, scalability and data-parallelism, and data manipulation capabilities. These requirements are addressed by our linear algebra engine. In particular, the core contributions of this thesis are: firstly, we show that the columnar storage layer of an in-memory DBMS yields an easy adoption of efficient sparse matrix data types and algorithms. Furthermore, we show that the execution of linear algebra expressions significantly benefits from different techniques that are inspired from database technology. In a novel way, we implemented several of these optimization strategies in LAPEG’s optimizer (SpMachO), which uses an advanced density estimation method (SpProdest) to predict the matrix density of intermediate results. Moreover, we present an adaptive matrix data type AT Matrix to obviate the need of scientists for selecting appropriate matrix representations. The tiled substructure of AT Matrix is exploited by our matrix multiplication to saturate the different sockets of a multicore main-memory platform, reaching up to a speed-up of 6x compared to alternative approaches. Finally, a major part of this thesis is devoted to the topic of data manipulation; where we propose a matrix manipulation API and present different mutable matrix types to enable fast insertions and deletes. We finally conclude that our linear algebra engine is well-suited to process dynamic, large matrix workloads in an optimized way. In particular, the DBMS-integrated LAPEG is filling the linear algebra gap, and makes columnar in-memory DBMS attractive as efficient, scalable ad-hoc analysis platform for scientists

    On High Performance Computing in Geodesy : Applications in Global Gravity Field Determination

    Get PDF
    Autonomously working sensor platforms deliver an increasing amount of precise data sets, which are often usable in geodetic applications. Due to the volume and quality, models determined from the data can be parameterized more complex and in more detail. To derive model parameters from these observations, the solution of a high dimensional inverse data fitting problem is often required. To solve such high dimensional adjustment problems, this thesis proposes a systematical, end-to-end use of a massive parallel implementation of the geodetic data analysis, using standard concepts of massive parallel high performance computing. It is shown how these concepts can be integrated into a typical geodetic problem, which requires the solution of a high dimensional adjustment problem. Due to the proposed parallel use of the computing and memory resources of a compute cluster it is shown, how general Gauss-Markoff models become solvable, which were only solvable by means of computationally motivated simplifications and approximations before. A basic, easy-to-use framework is developed, which is able to perform all relevant operations needed to solve a typical geodetic least squares adjustment problem. It provides the interface to the standard concepts and libraries used. Examples, including different characteristics of the adjustment problem, show how the framework is used and can be adapted for specific applications. In a computational sense rigorous solutions become possible for hundreds of thousands to millions of unknown parameters, which have to be estimated from a huge number of observations. Three special problems with different characteristics, as they arise in global gravity field recovery, are chosen and massive parallel implementations of the solution processes are derived. The first application covers global gravity field determination from real data as collected by the GOCE satellite mission (comprising 440 million highly correlated observations, 80,000 parameters). Within the second application high dimensional global gravity field models are estimated from the combination of complementary data sets via the assembly and solution of full normal equations (scenarios with 520,000 parameters, 2 TB normal equations). The third application solves a comparable problem, but uses an iterative least squares solver, allowing for a parameter space of even higher dimension (now considering scenarios with two million parameters). This thesis forms the basis for a flexible massive parallel software package, which is extendable according to further current and future research topics studied in the department. Within this thesis, the main focus lies on the computational aspects.Autonom arbeitende Sensorplattformen liefern präzise geodätisch nutzbare Datensätze in größer werdendem Umfang. Deren Menge und Qualität führt dazu, dass Modelle die aus den Beobachtungen abgeleitet werden, immer komplexer und detailreicher angesetzt werden können. Zur Bestimmung von Modellparametern aus den Beobachtungen gilt es oftmals, ein hochdimensionales inverses Problem im Sinne der Ausgleichungsrechnung zu lösen. Innerhalb dieser Arbeit soll ein Beitrag dazu geleistet werden, Methoden und Konzepte aus dem Hochleistungsrechnen in der geodätischen Datenanalyse strukturiert, durchgängig und konsequent zu verwenden. Diese Arbeit zeigt, wie sich diese nutzen lassen, um geodätische Fragestellungen, die ein hochdimensionales Ausgleichungsproblem beinhalten, zu lösen. Durch die gemeinsame Nutzung der Rechen- und Speicherressourcen eines massiv parallelen Rechenclusters werden Gauss-Markoff Modelle lösbar, die ohne den Einsatz solcher Techniken vorher höchstens mit massiven Approximationen und Vereinfachungen lösbar waren. Ein entwickeltes Grundgerüst stellt die Schnittstelle zu den massiv parallelen Standards dar, die im Rahmen einer numerischen Lösung von typischen Ausgleichungsaufgaben benötigt werden. Konkrete Anwendungen mit unterschiedlichen Charakteristiken zeigen das detaillierte Vorgehen um das Grundgerüst zu verwenden und zu spezifizieren. Rechentechnisch strenge Lösungen sind so für Hunderttausende bis Millionen von unbekannten Parametern möglich, die aus einer Vielzahl von Beobachtungen geschätzt werden. Drei spezielle Anwendungen aus dem Bereich der globalen Bestimmung des Erdschwerefeldes werden vorgestellt und die Implementierungen für einen massiv parallelen Hochleistungsrechner abgeleitet. Die erste Anwendung beinhaltet die Bestimmung von Schwerefeldmodellen aus realen Beobachtungen der Satellitenmission GOCE (welche 440 Millionen korrelierte Beobachtungen umfasst, 80,000 Parameter). In der zweite Anwendung werden globale hochdimensionale Schwerefelder aus komplementären Daten über das Aufstellen und Lösen von vollen Normalgleichungen geschätzt (basierend auf Szenarien mit 520,000 Parametern, 2 TB Normalgleichungen). Die dritte Anwendung löst dasselbe Problem, jedoch über einen iterativen Löser, wodurch der Parameterraum noch einmal deutlich höher dimensional sein kann (betrachtet werden nun Szenarien mit 2 Millionen Parametern). Die Arbeit bildet die Grundlage für ein massiv paralleles Softwarepaket, welches schrittweise um Spezialisierungen, abhängig von aktuellen Forschungsprojekten in der Arbeitsgruppe, erweitert werden wird. Innerhalb dieser Arbeit liegt der Fokus rein auf den rechentechnischen Aspekten
    corecore