
    Hardware Parallel Architecture of a 3D Surface Reconstruction: Marching Cubes Algorithm

    International audience. In this paper we present a study of an algorithmic and architectural exploration methodology for parallelizing the 3D reconstruction algorithm Marching Cubes, and its optimized implementation on FPGA. We aim to define a parallel multiprocessor architecture that implements this algorithm optimally, together with an Elementary Processor (EP) architecture dedicated to it. We use the SynDEx tool, which implements the AAA (Algorithm Architecture Adequacy) methodology, to find a good compromise between computing power, the functionality of each EP, the optimization constraints (time, area), and parallelization efficiency. Finally, we describe a first implementation of the EP on FPGA.
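The per-cube independence that makes Marching Cubes a good fit for a multiprocessor or FPGA datapath shows up in its two core steps. The first step classifies a cube into one of 256 cases from its eight corner values. The second interpolates the surface crossing along an edge. Below is a minimal Python sketch of both. The corner ordering, the "corner below the isovalue sets the bit" convention and the function names are assumptions that follow a common textbook convention, not the paper's EP design. The 256-entry edge and triangle lookup tables are omitted.

```python
# Corner offsets (dx, dy, dz) in one common Marching Cubes ordering (an assumption).
CORNERS = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0),
           (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)]


def cube_index(corner_values, iso):
    """Return the 8-bit case index (0..255): bit i is set when corner i is below iso."""
    if len(corner_values) != 8:
        raise ValueError("a cube has exactly 8 corners")
    index = 0
    for i, value in enumerate(corner_values):
        if value < iso:
            index |= 1 << i
    return index


def interpolate_vertex(p1, p2, v1, v2, iso):
    """Linearly interpolate the point on edge p1-p2 where the field equals iso."""
    if v1 == v2:  # degenerate edge: avoid division by zero
        return tuple(p1)
    t = (iso - v1) / (v2 - v1)
    return tuple(a + t * (b - a) for a, b in zip(p1, p2))


def classify_grid(field, iso):
    """Classify every cube of a field[z][y][x] grid.

    Each cube reads only its own 8 corner samples, so this triple loop can be
    split across processing elements without communication between them."""
    nz, ny, nx = len(field), len(field[0]), len(field[0][0])
    cases = {}
    for z in range(nz - 1):
        for y in range(ny - 1):
            for x in range(nx - 1):
                values = [field[z + dz][y + dy][x + dx] for dx, dy, dz in CORNERS]
                cases[(x, y, z)] = cube_index(values, iso)
    return cases
```

Cases 0 and 255 (all corners on the same side of the isovalue) produce no triangles. Every other case index selects the intersected edges, on which `interpolate_vertex` places the mesh vertices.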

    Parallel algorithms for computational fluid dynamics on unstructured meshes

    Direct Numerical Simulation (DNS) of complex flows is currently a utopia for most industrial applications because the computational requirements are too high. For a given flow, the gap between the required and the available computing resources is covered by modeling or simplifying some terms of the original equations. On the other hand, the continuous growth in the computing power of modern supercomputers helps reduce this gap, and hence reduces the unresolved physics that must be approximated with models. This growth relies largely on parallel computing technologies. However, getting the expected performance from new, complex computing systems is becoming increasingly difficult, and therefore part of CFD research is focused on this goal. This thesis presents some contributions in this direction. The first objective was to contribute to the development of a general-purpose multi-physics CFD code referred to as TermoFluids (TF). TF is programmed following the object-oriented paradigm and designed to run on modern parallel computing systems. It is used intensively in many different projects, ranging from basic research to industrial applications. One of the strengths of TF is its good parallel performance, demonstrated on several supercomputers. In the context of this thesis, the work focused on the development of two of the most basic libraries that compose TF: i) the Basic Objects Library (BOL), a parallel unstructured CFD application programming interface on top of which the rest of the TF libraries are written, and ii) the Linear Solvers Library (LSL), which contains many different algorithms to solve the linear systems arising from the discretization of the equations.
The first chapter of this thesis presents the main ideas underlying the design and implementation of the BOL and LSL libraries, together with some examples and industrial applications. The following chapters describe in detail some application-specific linear solvers included in the LSL. The second chapter presents a parallel direct Poisson solver restricted to problems with one uniform periodic direction. When modeling incompressible flows, the Poisson equation is solved at least once per time step, which makes it one of the most time-consuming and hardest-to-parallelize parts of the code. The proposed solver combines a direct Schur-complement-based decomposition (DSD) with a Fourier diagonalization. The latter decomposes the original system into a set of mutually independent 2D subsystems, which are solved by means of the DSD algorithm. Since no restrictions are imposed on the non-periodic directions, the overall algorithm is well suited to problems discretized on extruded 2D unstructured meshes. The scalability of the solver has been successfully tested using up to 8192 CPU cores for meshes with up to 10⁹ grid points. The last chapter presents a solver for the Boltzmann Transport Equation (BTE), which can be used to simulate radiation phenomena interacting with flows. The solver is based on the Discrete Ordinates Method and can be applied to unstructured discretizations. The flux for each angular ordinate is swept across the computational grid within a source iteration loop that accounts for the coupling between the different ordinates. The sequential nature of the sweep makes parallelizing the overall algorithm its most challenging aspect. Several parallel sweep algorithms, representing different ways of interleaving communication and computation, are analyzed. One of the proposed heuristics consistently stands out as the best option in all the situations analyzed.
With this algorithm, good scalability has been achieved in both weak and strong speedup tests with up to 2560 CPUs.
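The sweep is sequential because each cell must wait for its upwind neighbours. For one ordinate direction, this creates a directed acyclic graph between cells, and its topological levels ("wavefronts") are the only parallelism available within that ordinate. Below is a generic Kahn-style sketch in Python. It assumes a face list with normals and an acyclic dependency graph. It is not the heuristic proposed in the thesis, and the function name is hypothetical.

```python
def sweep_wavefronts(num_cells, faces, direction):
    """Group cells into wavefronts for one angular ordinate.

    faces: list of (a, b, normal) with the face normal pointing from cell a to cell b.
    If direction . normal > 0, radiation flows a -> b, so b must wait for a.
    Cells in the same wavefront are independent and could be computed in parallel."""
    downstream = [[] for _ in range(num_cells)]
    waiting = [0] * num_cells  # upwind neighbours not yet computed
    for a, b, normal in faces:
        d = sum(u * n for u, n in zip(direction, normal))
        if d > 0:
            up, down = a, b
        elif d < 0:
            up, down = b, a
        else:
            continue  # face parallel to the ordinate: no upwind coupling
        downstream[up].append(down)
        waiting[down] += 1

    front = [c for c in range(num_cells) if waiting[c] == 0]  # inflow-boundary cells
    wavefronts, done = [], 0
    while front:
        wavefronts.append(front)
        done += len(front)
        nxt = []
        for c in front:  # the angular flux of cell c would be computed here
            for nbr in downstream[c]:
                waiting[nbr] -= 1
                if waiting[nbr] == 0:
                    nxt.append(nbr)
        front = nxt
    if done != num_cells:
        raise ValueError("cyclic dependencies: this mesh/ordinate pair needs cycle breaking")
    return wavefronts


# 2x2 grid: cells 0 1 (bottom row) and 2 3 (top row).
grid = [(0, 1, (1.0, 0.0)), (2, 3, (1.0, 0.0)), (0, 2, (0.0, 1.0)), (1, 3, (0.0, 1.0))]
print(sweep_wavefronts(4, grid, (1.0, 1.0)))  # [[0], [1, 2], [3]]
```

Each ordinate produces a different ordering. Distributing the wavefronts of many ordinates across processes, and deciding when to exchange boundary fluxes, is where parallel sweep strategies differ.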

    Book of Abstracts of the Sixth SIAM Workshop on Combinatorial Scientific Computing

    Book of Abstracts of CSC14, edited by Bora Uçar. International audience. The Sixth SIAM Workshop on Combinatorial Scientific Computing, CSC14, was held at the École Normale Supérieure de Lyon, France, from 21 to 23 July 2014. This two-and-a-half-day event was the sixth in a series that started ten years earlier in San Francisco, USA. The focus of CSC14 was on combinatorial mathematics and algorithms in high-performance computing, broadly interpreted. The workshop featured three invited talks, 27 contributed talks and eight poster presentations. The three invited talks focused on two fields of research: randomized algorithms for numerical linear algebra, and network analysis. The contributed talks and posters targeted modeling, analysis, bisection, clustering and partitioning of graphs, applied in the context of networks, sparse matrix factorizations, iterative solvers, fast multipole methods, automatic differentiation, high-performance computing and linear programming. The workshop was held at the premises of the LIP laboratory of ENS Lyon and was generously supported by the LABEX MILYON (ANR-10-LABX-0070, Université de Lyon, within the program "Investissements d'Avenir" ANR-11-IDEX-0007 operated by the French National Research Agency) and by SIAM.

    Precise Predictions for LHC Cross Sections and Phenomenology beyond NLO

    Vector boson pair production makes it possible to study the interaction among three electroweak gauge bosons. A deviation of this coupling from the Standard Model prediction can be described by anomalous couplings in the formalism of effective field theory. This work investigates additional jet radiation in WZ and WH production. For this purpose, the observable $x_{\text{jet}}$ is introduced to separate events dominated by jet radiation from those containing two high-energy vector bosons. With this observable, phase-space regions that are sensitive to anomalous couplings between gauge bosons can be identified. In addition, a dynamical jet veto is proposed to increase the sensitivity of searches for anomalous couplings. A traditional veto with a fixed scale leads to logarithmically growing terms, which a dynamical veto avoids. The dynamical veto also allows a larger region of phase space to be included, which improves the statistics and thus the sensitivity of searches for anomalous couplings. An accurate description of vector boson pair events at high transverse momenta requires higher-order corrections. In this work, the LoopSim method is used to compute corrections at $\bar{n}$NLO in the strong coupling. This is an approximation of the next-to-next-to-leading-order corrections and is particularly suitable for high transverse momenta. These analyses use the flexible Monte Carlo program VBFNLO in combination with LoopSim. This work also develops a parallelized implementation of VBFNLO that improves the numerical integration and run time, especially for complex processes, and makes more efficient use of modern computing clusters.
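A common way to parallelize Monte Carlo integration gives each worker an independent random stream and merges the per-worker means and variances afterwards. This is only a hedged illustration, not VBFNLO's actual scheme, which this abstract does not describe. The sketch uses plain, non-adaptive sampling on [0, 1]. A serial loop stands in for the workers, and the function names are hypothetical.

```python
import math
import random


def mc_batch(f, n, seed):
    """One worker's batch: plain Monte Carlo samples of f on [0, 1].

    Returns (mean, variance, count) of the sampled values."""
    rng = random.Random(seed)  # independent stream per worker (distinct seeds assumed)
    total = total_sq = 0.0
    for _ in range(n):
        y = f(rng.random())
        total += y
        total_sq += y * y
    mean = total / n
    return mean, total_sq / n - mean * mean, n


def combine(batches):
    """Merge (mean, variance, count) triples into one estimate and its standard error."""
    count = sum(n for _, _, n in batches)
    mean = sum(m * n for m, _, n in batches) / count
    # Pool the second moments, then recover the variance over all samples.
    second_moment = sum((v + m * m) * n for m, v, n in batches) / count
    variance = max(second_moment - mean * mean, 0.0)
    return mean, math.sqrt(variance / count)


# Four "workers" with different seeds, run serially here for illustration.
batches = [mc_batch(lambda x: x * x, 20_000, seed) for seed in range(4)]
estimate, std_error = combine(batches)
print(f"{estimate:.4f} +/- {std_error:.4f}")  # close to the exact value 1/3
```

Only three numbers per worker need to be communicated, so the merge step costs almost nothing compared with the sampling.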

    High performance computing for multiphase fluid flows

    Multiphase fluid flows are very common in engineering and science applications. Examples include air flow over a water surface, metallurgical flows and blood flow in the body. In these flows, fluids are separated by a sharp interface and form different phases, and the flow is characterized by the movement of this interface. Accurate modelling of the interface movement is a fundamental problem in the numerical simulation of such flows. The velocities that drive this movement come from the numerical solution of the Navier-Stokes (N-S) equations, which are discretized and converted into linear systems of equations. Solving these systems efficiently has been a main focus of many researchers in Computational Fluid Dynamics (CFD). A modified Volume of Fluid (VOF) method for modelling two-phase flows is implemented, using an analytic relation for its reconstruction step. The Finite Volume Method (FVM) on a staggered grid is used to discretize the two-dimensional (2-D) N-S equations. A preconditioned Krylov-subspace iterative method, the Bi-Conjugate Gradient Stabilized (Bi-CGSTAB) method, is employed to solve the linear systems. Solving the linear systems usually consumes most of the simulation time in multiphase flow problems. Novel algorithms for the Incomplete LU Threshold (ILUT) preconditioner, forward and backward substitution, and other matrix operations on penta-diagonal matrices are proposed here, based on a diagonal sparse matrix storage format. Compared with a dense format, the novel ILUT algorithm reduces the computational complexity from O(n³ − n²) to O(n). It also reduces the communication overhead, which facilitates parallelization. Parallel versions of these algorithms are developed using a new load-balancing scheme. The MPI C++ communication library is used to develop the parallel version.
The 2-D VOF code is applied to shape advection problems, and the results agree well with those available in the literature. For the translation of a square box, it gives more accurate results than other VOF methods. The VOF code and the parallel iterative solvers are integrated with a 2-D N-S code in C++. The whole code is then used to simulate several two-phase flow problems: dam breaking with and without an obstacle, a rising air bubble, and lid-driven cavity flows. Speedup data are reported for the parallel programs on these problems.
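A diagonal storage format suits penta-diagonal matrices well. The minimal sketch below shows a matrix-vector product with a matrix stored by diagonals, applied to the 5-point Laplacian on an nx × ny grid, whose nonzeros lie on offsets 0, ±1 and ±nx. The row-aligned layout (`diags[k][i]` holds `A[i][i + offsets[k]]`) and the Dirichlet test matrix are assumptions of this sketch. The thesis's ILUT and substitution kernels are not reproduced. Each diagonal is traversed with unit stride and no index arrays, so the product costs O(n) for a fixed number of diagonals and partitions naturally by rows.

```python
def dia_matvec(offsets, diags, x):
    """Return y = A x for A stored by diagonals.

    Row-aligned layout (an assumption): diags[k][i] holds A[i][i + offsets[k]];
    entries whose column would fall outside the matrix are never read."""
    n = len(x)
    y = [0.0] * n
    for off, d in zip(offsets, diags):
        for i in range(max(0, -off), min(n, n - off)):
            y[i] += d[i] * x[i + off]
    return y


def laplacian_5pt(nx, ny):
    """Penta-diagonal 5-point Laplacian with Dirichlet boundaries, in DIA form."""
    n = nx * ny
    main = [4.0] * n
    west = [0.0 if i % nx == 0 else -1.0 for i in range(n)]       # no wrap across grid rows
    east = [0.0 if i % nx == nx - 1 else -1.0 for i in range(n)]
    south = [-1.0] * n  # first nx entries are never read
    north = [-1.0] * n  # last nx entries are never read
    return [-nx, -1, 0, 1, nx], [south, west, main, east, north]


offsets, diags = laplacian_5pt(3, 3)
print(dia_matvec(offsets, diags, [1.0] * 9))
# corners have 2 neighbours, edge midpoints 3, the centre 4:
# [2.0, 1.0, 2.0, 1.0, 0.0, 1.0, 2.0, 1.0, 2.0]
```

A Krylov solver such as Bi-CGSTAB touches the matrix only through products like this one, which is why the storage format matters for both speed and communication.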

    Application of HPC in eddy current electromagnetic problem solution

    As engineering problems become more and more advanced, the size of an average model solved by partial differential equations is growing rapidly, and keeping simulation times within reasonable bounds requires both faster computers and more efficient software implementations. In the first part of this thesis, the full potential of simulation software is exploited through high-performance parallel computing techniques. In particular, induction heating processes are simulated within reasonable solution times by implementing different parallel direct solvers for large sparse linear systems within the solution process of a commercial software package. Performance on shared-memory systems was remarkably improved by implementing a multithreaded version of the MUMPS (MUltifrontal Massively Parallel Solver) library, which was tested on benchmark matrices arising from typical induction heating simulations. A new multithreading approach and a low-rank approximation technique have been developed and implemented by the MUMPS team in Lyon and Toulouse. Within a collaboration between the MUMPS team and DII-University of Padova, a preliminary version of these features was tested on induction heating benchmark problems, and a substantial reduction in computational cost and memory requirements was achieved. In the second part of this thesis, some examples of design methodology based on virtual prototyping are described. Complex multiphysics simulations involving electromagnetic, circuit, thermal and mechanical problems were performed using the parallel solvers developed in the first part. Finally, multiobjective stochastic optimization algorithms were applied to multiphysics 3D model simulations to search for a set of improved induction heating device configurations.

    Scalable and distributed constrained low rank approximations

    Low-rank approximation is the problem of finding two low-rank factors W and H such that rank(WH) ≪ rank(A) and A ≈ WH. These factors can be constrained to allow a meaningful physical interpretation, which is referred to as Constrained Low Rank Approximation (CLRA). Like most constrained optimization problems, CLRA can be computationally more expensive than its unconstrained counterpart. A widely used CLRA is Non-negative Matrix Factorization (NMF), which enforces non-negativity constraints on each of the low-rank factors W and H. In this thesis, I focus on scalable/distributed CLRA algorithms for constraints such as boundedness and non-negativity, for large real-world matrices that include text, High Definition (HD) video, social networks and recommender systems. First, I begin with Bounded Matrix Low Rank Approximation (BMA), which imposes a lower and an upper bound on every element of the low-rank matrix. BMA is more challenging than NMF because it imposes bounds on the product WH rather than on each of the low-rank factors W and H. For very large input matrices, we extend our BMA algorithm to Block BMA, which scales to a large number of processors. In applications such as HD video, where the input matrix to be factored is extremely large, distributed computation is inevitable and network communication becomes a major performance bottleneck. To address this, we propose a novel distributed Communication Avoiding NMF (CANMF) algorithm that communicates only the right low-rank factor to its neighboring machine. Finally, we propose a general distributed HPC-NMF framework that uses HPC techniques in communication-intensive NMF operations and is suitable for a broader class of NMF algorithms. Ph.D.
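For reference, here is a pure-Python sketch of the classic Lee-Seung multiplicative-update rules for NMF, which minimize ‖A − WH‖_F. This is the textbook baseline, not the BMA, CANMF or HPC-NMF algorithms of the thesis. The `eps` guard, the random initialization and the function names are choices of this sketch. Each update multiplies non-negative entries by non-negative ratios, so W and H stay non-negative throughout.

```python
import random


def matmul(A, B):
    """Dense product of matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(A, W, H):
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb)) ** 0.5


def nmf_multiplicative(A, k, iters=200, seed=0, eps=1e-12):
    """Lee-Seung multiplicative updates for min ||A - WH||_F with W, H >= 0.

    A must be non-negative; eps guards against division by zero."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() for _ in range(k)] for _ in range(m)]
    H = [[rng.random() for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)  # both k x n
        H = [[h * p / (q + eps) for h, p, q in zip(hr, pr, qr)]
             for hr, pr, qr in zip(H, num, den)]
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))  # both m x k
        W = [[w * p / (q + eps) for w, p, q in zip(wr, pr, qr)]
             for wr, pr, qr in zip(W, num, den)]
    return W, H


A = [[1.0, 1.0, 2.0], [2.0, 2.0, 4.0], [3.0, 3.0, 6.0]]  # exactly rank 1
W, H = nmf_multiplicative(A, 1)
print(frobenius_error(A, W, H))  # tiny: a rank-1 non-negative matrix is recovered
```

Both updates need the full products WᵀA and AHᵀ. In a distributed setting these are the communication-intensive operations that the abstract targets.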

    A Full Wave Electromagnetic Framework for Optimization and Uncertainty Quantification of Communication Systems in Underground Mine Environments

    Wireless communication, sensing and tracking systems in mine environments are essential for protecting miners' safety and supporting daily operations. The design, deployment and post-event reconfiguration of such systems benefit greatly from electromagnetic (EM) frameworks that can statistically analyze and optimize wireless systems in realistic mine environments. This thesis proposes such a framework by developing two fast and efficient full-wave EM simulators and coupling them with a modern optimization algorithm and an efficient uncertainty quantification (UQ) method to synthesize system configurations and produce statistical insights. The first simulator is a fast multipole method – fast Fourier transform (FMM-FFT) accelerated surface integral equation (SIE) simulator. It relies on Muller and combined-field SIEs to account for scattering from mine walls and conductors, respectively. During the iterative solution of the SIE system, the FMM-FFT scheme reduces the computational and memory costs, and memory is further reduced by compressing large data structures via singular value and Tucker decompositions. The second simulator is a domain decomposition (DD)-based SIE simulator. It first divides the physical domain of a mine tunnel or gallery into subdomains and characterizes EM wave propagation in each subdomain separately. It then assembles the subdomain solutions and solves an inter-domain system using an efficient subdomain-combining scheme. The DD-based SIE simulator is faster and more memory-efficient than the FMM-FFT accelerated SIE simulator when characterizing EM wave propagation in electrically large mine environments. However, it does not apply to certain scenarios that the FMM-FFT accelerated simulator can handle.
The optimization algorithm and the UQ method coupled with the EM simulators are the dividing rectangles (DIRECT) algorithm and the high dimensional model representation (HDMR)-enhanced multi-element probabilistic collocation (ME-PC) method, respectively. The DIRECT algorithm is a Lipschitzian optimization method that does not require knowledge of the Lipschitz constant. It performs a series of moves that explore the behavior of the objective function at a set of points in carefully chosen sub-regions of the search space. The HDMR-enhanced ME-PC method permits accurate and efficient construction of surrogate models for EM observables in high dimensions. The HDMR expansion expresses the observable as a finite sum of component functions that represent the independent and combined contributions of the random variables to the observable. Keeping only the most significant component functions therefore reduces the complexity of UQ and minimizes the computational cost of building the surrogate model. This research numerically validated and verified the two EM simulators and demonstrated the efficiency and applicability of the EM framework by applying it to optimization and UQ problems in large, realistic mine environments. PhD, Electrical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/146028/1/wtsheng_1.pd
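A first-order cut-HDMR surrogate illustrates the idea of keeping only the most significant HDMR component functions. Fix an anchor point, sample each input dimension separately while holding the others at the anchor, and add the resulting one-dimensional corrections. This is a simplified stand-in: it assumes piecewise-linear interpolation in place of the multi-element probabilistic collocation used in the thesis, and all names are hypothetical. For d inputs, building it costs 1 + d·len(nodes) evaluations of the expensive model instead of a full tensor grid.

```python
import bisect


def interp1(xs, ys, x):
    """Piecewise-linear interpolation on sorted nodes xs, clamped outside the range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    j = bisect.bisect_right(xs, x)
    t = (x - xs[j - 1]) / (xs[j] - xs[j - 1])
    return ys[j - 1] + t * (ys[j] - ys[j - 1])


def cut_hdmr_first_order(f, anchor, nodes):
    """First-order cut-HDMR surrogate: f(x) ~ f0 + sum_i f_i(x_i).

    f0 = f(anchor); f_i is tabulated on `nodes` by varying coordinate i only.
    Interaction terms f_ij, f_ijk, ... are dropped."""
    anchor = list(anchor)
    f0 = f(anchor)
    tables = []
    for i in range(len(anchor)):
        values = []
        for xi in nodes:
            point = anchor.copy()
            point[i] = xi
            values.append(f(point) - f0)  # one expensive model evaluation per node
        tables.append(values)

    def surrogate(x):
        return f0 + sum(interp1(nodes, tables[i], xi) for i, xi in enumerate(x))

    return surrogate


linear = lambda x: 1.0 + 2.0 * x[0] - x[1] + 0.5 * x[2]
s = cut_hdmr_first_order(linear, [0.5, 0.5, 0.5], [0.0, 0.5, 1.0])
print(s([0.2, 0.9, 0.4]), linear([0.2, 0.9, 0.4]))  # equal up to rounding (additive model)
```

For an additive model like `linear` the surrogate is exact. For f(x) = x0·x1 anchored at the origin it returns 0 everywhere, because that interaction lives in a second-order term which this truncation drops.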

    Dense and sparse parallel linear algebra algorithms on graphics processing units

    One line of development in supercomputing is the use of special-purpose processors to speed up certain types of computation. In this thesis we study the use of graphics processing units (GPUs) as computing accelerators and apply them to linear algebra. In particular, we work with the SLEPc library to solve large-scale eigenvalue problems and to apply matrix functions in scientific applications. SLEPc is a parallel library based on the MPI standard and is developed with the premise of being scalable, i.e., of allowing larger problems to be solved by increasing the number of processing units. We address the linear eigenvalue problem, Ax = λx in its standard form, using iterative techniques, in particular Krylov methods, with which we compute a small portion of the spectrum.
This type of algorithm generates a subspace of reduced size m onto which the large problem of dimension n is projected, with m ≪ n. Once the problem has been projected, it is solved by direct methods, which provide approximations to the eigenvalues of the original problem. The operations used to expand the subspace depend on whether the desired eigenvalues lie at the exterior or in the interior of the spectrum. For exterior eigenvalues, the expansion is done by matrix-vector multiplications. We perform these on the GPU, either by using libraries or by writing functions that exploit the structure of the matrix. For interior eigenvalues, the expansion requires solving linear systems of equations. In this thesis we implemented several GPU algorithms to solve linear systems for the specific case of matrices with a block-tridiagonal structure. In the computation of matrix functions, we must distinguish between the direct application of a matrix function, f(A), and the action of a matrix function on a vector, f(A)b. The first case involves a dense computation that limits the size of the problem. The second allows us to work with large sparse matrices, and to solve it we also use Krylov methods. The subspace is expanded by matrix-vector multiplication, and we use GPUs in the same way as when solving eigenvalue problems. In this case the projected problem initially has size m and grows by m at each restart of the method. The projected problem is solved by directly applying a matrix function.
We have implemented several algorithms to compute the matrix square root and the matrix exponential, in which the use of GPUs speeds up the computation.
Lamas Daviña, A. (2018). Dense and sparse parallel linear algebra algorithms on graphics processing units [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/112425
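Block-tridiagonal solves of this kind can be illustrated with block Gaussian elimination, also known as the block Thomas algorithm, with no pivoting between blocks. The sketch below is a small dense CPU version in pure Python, not the GPU implementations of the thesis. It assumes that every pivot block D_i encountered is nonsingular, as it is in the diagonally dominant test case. The names `sub`, `diag`, `sup` and the helper functions are hypothetical.

```python
def matmul(A, B):
    """Dense product of matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matsub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def solve(M, R):
    """Solve M X = R by Gaussian elimination with partial pivoting (R may have several columns)."""
    n = len(M)
    aug = [list(M[i]) + list(R[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if aug[p][c] == 0:
            raise ZeroDivisionError("singular pivot block")
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, len(aug[r])):
                aug[r][k] -= f * aug[c][k]
    m = len(R[0])
    X = [[0.0] * m for _ in range(n)]
    for i in reversed(range(n)):
        for j in range(m):
            s = aug[i][n + j] - sum(aug[i][k] * X[k][j] for k in range(i + 1, n))
            X[i][j] = s / aug[i][i]
    return X


def block_thomas(sub, diag, sup, rhs):
    """Solve a block-tridiagonal system.

    diag[i] is block (i, i), sup[i] is block (i, i+1), sub[i] is block (i+1, i),
    and rhs[i] is the right-hand-side segment of block row i (a flat list)."""
    N = len(diag)
    G, g = [], []  # G[i] = D_i^{-1} sup[i], g[i] = D_i^{-1} y_i
    for i in range(N):
        y = [[v] for v in rhs[i]]  # column vector
        D = diag[i]
        if i > 0:  # eliminate the sub-diagonal block of block row i
            D = matsub(D, matmul(sub[i - 1], G[i - 1]))
            y = matsub(y, matmul(sub[i - 1], g[i - 1]))
        g.append(solve(D, y))
        if i < N - 1:
            G.append(solve(D, sup[i]))  # a real code would reuse one factorization of D
    x = [None] * N
    x[N - 1] = g[N - 1]
    for i in range(N - 2, -1, -1):  # back substitution: x_i = g_i - G_i x_{i+1}
        x[i] = matsub(g[i], matmul(G[i], x[i + 1]))
    return [[row[0] for row in xi] for xi in x]
```

The forward sweep is inherently sequential across block rows, while the work inside each step consists of small dense products and solves.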

    New Sequential and Scalable Parallel Algorithms for Incomplete Factor Preconditioning

    The solution of large, sparse, linear systems of equations Ax = b is an important kernel, and the dominant term with regard to execution time, in many applications in scientific computing. The large size of the systems of equations being solved currently (millions of unknowns and equations) requires iterative solvers on parallel computers. Preconditioning, which is the process of translating a linear system into a related system that is easier to solve, is widely used to reduce solution time and is sometimes required to ensure convergence. Level-based preconditioning (ILU(ℓ)) has long been used in serial contexts and is widely recognized as robust and effective for a wide range of problems. However, the method has long been regarded as an inherently sequential technique. Parallelism, it has been thought, can be achieved primarily at the expense of increased iterations. We dispute these claims. The first half of this dissertation takes an in-depth look at structurally based ILU(ℓ) symbolic factorization. Two definitions of fill level, "sum" and "max", have been proposed. Hitherto, these definitions have been cast in matrix terminology. We develop a sequence of lemmas and theorems that provide graph theoretic characterizations of both definitions; these characterizations are based on the static graph of a matrix, G(A). Our Incomplete Fill Path Theorem characterizes fill levels per the sum definition; this is the definition that is used in most library implementations of the "classic" ILU(ℓ) factorization algorithm. Our theorem leads to several new graph-search algorithms that compute factors identical, or nearly identical, to those computed by the "classic" algorithm. Our analysis shows that the new algorithms have lower run time complexity than the previously existing algorithms for certain classes of matrices that are commonly encountered in scientific applications.
The second half of this dissertation presents a Parallel ILU algorithmic framework (PILU). This framework enables scalable parallel ILU preconditioning by combining concepts from domain decomposition and graph ordering. The framework can accommodate ILU(ℓ) factorization as well as threshold-based ILUT methods. A model implementation of the framework, the Euclid library, was developed as part of this dissertation. This library was used to obtain experimental results for Poisson's equation, the Convection-Diffusion equation, and a nonlinear Radiative Transfer problem. The experiments, which were conducted on a variety of platforms with up to 400 CPUs, demonstrate that our approach is highly scalable for arbitrary ILU(ℓ) fill levels.
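A small symbolic ILU(ℓ) sketch makes the two fill-level rules easy to compare. It follows the classic row-by-row formulation. When row k is used to eliminate entry (i, k), a fill entry (i, j) receives level lev(i,k) + lev(k,j) + 1 under the "sum" rule, or max(lev(i,k), lev(k,j)) + 1 under the "max" rule, and entries whose level exceeds ℓ are dropped. This is a dense-loop illustration written for this summary, not the graph-search algorithms or the Euclid implementation. A structurally nonzero diagonal is assumed.

```python
import math


def ilu_levels(pattern, n, ell, rule="sum"):
    """Symbolic ILU(ell): return {(i, j): level} for every entry kept in L + U.

    pattern: structural nonzeros (i, j) of A, all at level 0 (diagonal assumed present).
    rule: "sum" -> lev(i,k) + lev(k,j) + 1, "max" -> max(lev(i,k), lev(k,j)) + 1."""
    if rule not in ("sum", "max"):
        raise ValueError("rule must be 'sum' or 'max'")
    initial = [{i: 0} for i in range(n)]
    for i, j in pattern:
        initial[i][j] = 0
    kept = []  # kept[k]: {column: level} of row k after dropping levels > ell
    for i in range(n):
        lev = dict(initial[i])
        for k in range(i):  # increasing k: lev[k] is final when reached
            lik = lev.get(k, math.inf)
            if lik > ell:
                continue  # entry (i, k) was never created or was dropped
            for j, lkj in kept[k].items():
                if j <= k:
                    continue  # only the U part of row k updates row i
                new = lik + lkj + 1 if rule == "sum" else max(lik, lkj) + 1
                if new < lev.get(j, math.inf):
                    lev[j] = new
        kept.append({j: l for j, l in lev.items() if l <= ell})
    return {(i, j): l for i, row in enumerate(kept) for j, l in row.items()}


edges = [(0, 2), (0, 5), (1, 2), (1, 4)]
pattern = edges + [(j, i) for i, j in edges]  # symmetric structure
print(ilu_levels(pattern, 6, 2, "sum").get((5, 4)))  # None: level 3 > 2, dropped
print(ilu_levels(pattern, 6, 2, "max").get((5, 4)))  # 2: kept under the max rule
```

In this six-node example, entry (5, 4) can only be reached through two level-1 fill entries. It therefore has level 3 under the sum rule but level 2 under the max rule, and ILU(2) keeps it only under the max rule.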