    Parallelization of the ADI method exploring vector computing in GPUs

    Integrated master's dissertation in Informatics Engineering. The 2D convection-diffusion problem is well known in scientific simulation and is often solved with a direct method for a system of N linear equations, which requires N^3 operations. It can be solved with a more efficient computational method, the alternating direction implicit (ADI) method. It solves a system of N linear equations in 2N passes of N operations each, implemented in two steps: one solves row by row, the other column by column. Within each step, each set of N operations is fully independent, which opens the way to an embarrassingly parallel solution. By splitting each iteration in two, the method also exploits the way matrices are stored in computer memory, either in row-major or column-major order. The major bottleneck of the method is solving the systems of linear equations. These systems can be described as tridiagonal matrices, since their nonzero elements lie on the three main diagonals. Algorithms tailored for tridiagonal matrices can significantly improve performance. They can be sequential (e.g. the Thomas algorithm) or parallel (e.g. cyclic reduction, CR, and parallel cyclic reduction, PCR). Current vector extensions in conventional scalar processing units, such as x86-64 and ARM devices, require the vector elements to be in contiguous memory locations to avoid performance penalties. To overcome these limitations in dot products, several approaches are proposed and evaluated in this work, both on general-purpose processing units and on specific accelerators, namely NVidia GPUs. Profiling the code on a server based on x86-64 devices showed that the ADI method needs a combination of CPU computation power and memory transfer speed. This is best shown on a server based on the Intel manycore device, KNL, where the algorithm scales until the memory bandwidth is no longer enough to feed all 64 computing cores. A dual-socket server based on 16-core Xeon Skylakes, with AVX-512 vector support, proved to be a better choice: the algorithm executes in less time and scales better. Introducing GPU computing to further improve performance (together with other optimisation techniques, namely a different thread scheme and shared memory) gave better results for larger grid sizes (above 32Ki x 32Ki). The CUDA development environment also performed better than OpenCL in most cases. The largest difference appeared with a hybrid CR-PCR, where the OpenCL code showed a major performance improvement over CUDA. Even with this speedup, the best average time for the ADI method across all tested configurations on an NVidia GPU was obtained with CUDA on the most recent GPU available (with a Pascal architecture), using CR as the auxiliary method.
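The sequential tridiagonal solver named in the abstract, the Thomas algorithm, can be sketched as follows. This is a minimal illustration in plain Python, not the dissertation's implementation; the function name `thomas_solve` and the argument layout (three diagonals stored as equal-length lists) are assumptions made for this example, and no pivoting is performed.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system in O(N) operations.

    lower, diag, upper and rhs all have length N (hypothetical layout);
    lower[0] and upper[N-1] are ignored.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination: remove the lower diagonal row by row.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution from the last unknown upwards.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Example: the classic (-1, 2, -1) matrix with a right-hand side
# chosen so that the exact solution is [1, 2, 3, 4].
solution = thomas_solve([0, -1, -1, -1], [2, 2, 2, 2], [-1, -1, -1, 0], [0, 0, 0, 5])
```

In an ADI sweep, one such solve would be issued per grid row (and then per grid column). Those solves do not depend on each other, which is the independence the abstract describes as embarrassingly parallel.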

    Quantum Interference on the Kagomé Lattice

    We study quantum interference effects due to electron motion on the Kagomé lattice in a perpendicular magnetic field. These effects arise from the interference between phase factors associated with different electron closed-paths. From these we compute, analytically and numerically, the superconducting-normal phase boundary for Kagomé superconducting wire networks and Josephson junction arrays. We use an analytical approach to analyze the relationship between the interference and the complex structure present in the phase boundary, including the origin of the overall and fine structure. Our results are obtained by exactly summing over one thousand billion billions ($\sim 10^{21}$) closed paths, each one weighted by its corresponding phase factor representing the net flux enclosed by each path. We expect our computed mean-field phase diagrams to compare well with several proposed experiments. Comment: 9 pages, Revtex, 3 figures upon request

    Sparse stretching for solving sparse-dense linear least-squares problems

    Large-scale linear least-squares problems arise in a wide range of practical applications. In some cases, the system matrix contains a small number of dense rows. These make the problem significantly harder to solve because their presence limits the direct applicability of sparse matrix techniques. In particular, the normal matrix is (close to) dense, so that forming it is impractical. One way to help overcome the dense row problem is to employ matrix stretching. Stretching is a sparse matrix technique that improves sparsity by making the least-squares problem larger. We show that standard stretching can still result in the normal matrix for the stretched problem having an unacceptably large amount of fill. This motivates us to propose a new sparse stretching strategy that performs the stretching so as to limit the fill in the normal matrix and its Cholesky factor. Numerical examples from real problems are used to illustrate the potential gains.
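The claim that a few dense rows make the normal matrix (close to) dense can be checked structurally with a small sketch. It does not implement stretching itself. It only counts the structural nonzeros of A^T A, and the helper name `normal_matrix_nnz` and the row-pattern representation are assumptions made for this illustration.

```python
def normal_matrix_nnz(row_patterns):
    """Count structural nonzeros of A^T A.

    row_patterns holds, for each row of A, the list of column indices
    with a nonzero entry. Entry (i, j) of A^T A is structurally nonzero
    whenever some row of A has nonzeros in both column i and column j.
    """
    pattern = set()
    for cols in row_patterns:
        for i in cols:
            for j in cols:
                pattern.add((i, j))
    return len(pattern)


n = 6
sparse_rows = [[j] for j in range(n)]  # identity-like sparse block
dense_row = [list(range(n))]           # a single fully dense row

print(normal_matrix_nnz(sparse_rows))              # 6
print(normal_matrix_nnz(sparse_rows + dense_row))  # 36
```

One dense row is enough to take the normal matrix from n structural nonzeros to all n^2 of them, which is why forming it directly becomes impractical at scale.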

    Computer Aided Analysis of Periodically Switched Linear Networks

    Interest in analysing periodically switched linear networks has developed in response to the rapid development of sampled data communications systems. In particular, integrated circuit switched capacitor networks play an important part in modern analogue signal processing systems. This thesis addresses the problem of developing techniques for analysing periodically switched linear networks in the time and frequency domains that are suited to computer implementation and therefore facilitate the development of efficient computer aided analysis tools for these networks. Systems of large sparse complex linear equations arise in many network analysis problems and efficient techniques for solving these systems are crucial to the analysis methods developed in this thesis. By extending the concept of sparsity to include the type of the nonzero elements, very efficient solution and optimal ordering algorithms are developed. A new method for computing the time domain response of linear networks is presented. The method is based on numerical inversion of the Laplace transform and polynomial approximation of the excitations. This high accuracy method is well suited to solving large stiff systems and is extremely efficient. The method is extended to periodically switched linear networks and provides the basis for frequency domain analysis. A new frequency domain analysis method is presented that is orders of magnitude faster than existing techniques. This efficiency is achieved by developing a formulation such that AC analysis is not required, which allows the system to be solved as a discrete system. A special system compression reduces the solution of this discrete system to the solution of the network in one phase only. This solution step, which ordinarily requires O(N^3) operations, is made more efficient by reducing the system to upper Hessenberg form in a preprocessing step, which then reduces the solution cost to O(N^2) operations.

    Steady and Stable: Numerical Investigations of Nonlinear Partial Differential Equations

    Excerpt: Mathematics is a language which can describe patterns in everyday life as well as abstract concepts existing only in our minds. Patterns exist in data, functions, and sets constructed around a common theme, but the most tangible patterns are visual. Visual demonstrations can help undergraduate students connect to abstract concepts in advanced mathematical courses. The study of partial differential equations, in particular, benefits from numerical analysis and simulation

    A modified algorithm for Henrici's solution of y'' = f(x,y)

    The purpose of the present study is to discuss the boundary value problems leading to equations exemplified by the four equations previously mentioned and to write a program for an I.B.M. computer in Fortran II language as a contribution to the numerical solution of an important class of linear and nonlinear differential equations --Introduction, page 4

    Generalized orthogonal polynomials, discrete KP and Riemann-Hilbert problems

    Classically, a single weight on an interval of the real line leads to moments, orthogonal polynomials and tridiagonal matrices. Appropriately deforming this weight with times t=(t_1,t_2,...) leads to the standard Toda lattice and tau-functions, expressed as Hermitian matrix integrals. This paper is concerned with a sequence of t-perturbed weights, rather than one single weight. This sequence leads to moments, polynomials and a (fuller) matrix evolving according to the discrete KP-hierarchy. The associated tau-functions have integral, as well as vertex operator representations. Among the examples considered, we mention: nested Calogero-Moser systems, concatenated solitons and m-periodic sequences of weights. The latter lead to (2m+1)-band matrices and generalized orthogonal polynomials, also arising in the context of a Riemann-Hilbert problem. We show the Riemann-Hilbert factorization is tantamount to the factorization of the moment matrix into the product of a lower- times upper-triangular matrix. Comment: 40 pages

    The Kinetic Plasma Physics of Solar-Wind Electrons

    This thesis uses kinetic plasma physics to study the kinetic evolution of the electron velocity distribution function (VDF) in the solar wind. We propose an analytical model for resonant wave–particle instability in homogeneous plasma based on quasi-linear theory. By using this model, we confirm that the oblique fast-magnetosonic/whistler (FM/W) instability can scatter the electron strahl in the electron VDF. Following the study of the local scattering, we propose a global transport theory for the kinetic expansion of solar-wind electrons. We derive a gyro-averaged kinetic transport equation that accounts for the solar-wind expansion in the geometry of the Parker-spiral magnetic field. Our kinetic transport model shows the development of the core–strahl configuration in the electron VDF near the Sun. Applying fits to our numerical results, we compare them with data from Parker Solar Probe (PSP), and provide theoretical evidence that the electron strahl is not scattered by the oblique FM/W instability near the Sun. To confirm our theoretical results for strahl scattering, we analyse data from PSP and Helios. We compare the measured strahl properties with the analytical thresholds for the oblique FM/W instability in the low- and high-β∥c regimes, where β∥c is the ratio of the parallel core thermal pressure to the magnetic pressure, as functions of heliocentric distance. Our PSP and Helios data show that the electron strahl is stable against the oblique FM/W instability in the inner heliosphere. Our analysis suggests that this instability can only be excited sporadically, on short timescales. For the numerical evaluation of the kinetic equations in the research chapters, we develop a mathematical approach based on the Crank–Nicolson scheme. This approach numerically solves diffusion equations of any kind in any number of dimensions. We note that our mathematical approach is applicable to other complex diffusion equations.
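As a rough illustration of the Crank–Nicolson scheme mentioned at the end of the abstract, the sketch below advances the plain 1D diffusion equation u_t = D u_xx by one time step. It is far simpler than the thesis's kinetic equations and is not its method: the function name `crank_nicolson_step`, the zero Dirichlet boundaries, and the parameter r = D*dt/(2*dx^2) are all assumptions chosen for this example.

```python
def crank_nicolson_step(u, r):
    """Advance interior values u by one Crank-Nicolson step.

    Assumes zero Dirichlet boundaries just outside u and
    r = D * dt / (2 * dx**2). Solves (I - r*L) u_new = (I + r*L) u,
    where L is the (1, -2, 1) second-difference stencil.
    """
    n = len(u)

    # Explicit half: build the right-hand side (I + r*L) u.
    rhs = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        rhs.append((1 - 2 * r) * u[i] + r * (left + right))

    # Implicit half: tridiagonal solve with diagonals (-r, 1 + 2r, -r),
    # using forward elimination followed by back substitution.
    b = 1 + 2 * r
    c = [0.0] * n
    d = [0.0] * n
    c[0] = -r / b
    d[0] = rhs[0] / b
    for i in range(1, n):
        denom = b + r * c[i - 1]
        c[i] = -r / denom
        d[i] = (rhs[i] + r * d[i - 1]) / denom

    u_new = [0.0] * n
    u_new[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        u_new[i] = d[i] - c[i] * u_new[i + 1]
    return u_new


# One step applied to a single spike; the profile spreads symmetrically.
profile = crank_nicolson_step([0.0, 1.0, 0.0], r=0.5)
```

Averaging the explicit and implicit halves is what gives the scheme second-order accuracy in time. Each step costs one tridiagonal solve in 1D, and higher-dimensional variants typically need operator splitting or a larger sparse solve.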