6 research outputs found

    Task Scheduling in an Asynchronous Distributed Memory Multifrontal Solver

    No full text
    We describe the improvements to the task scheduling for MUMPS, an asynchronous distributed memory direct solver for sparse linear systems. In the new approach, we determine, during the analysis of the matrix, candidate processes for the tasks that will be dynamically scheduled during the subsequent factorization. This approach signi cantly improves the scalability of the solver in terms of execution time and storage. By comparison with the previous version of MUMPS, we demonstrate the eciency and the scalability of the new algorithm on up to 512 processors. Our test cases include matrices from regular 3D grids and irregular ones from real-life applications

    Computational Electromagnetic Methods for Transcranial Magnetic Stimulation.

    Full text link
    Transcranial magnetic stimulation (TMS) is a noninvasive technique used both as a research tool for cognitive neuroscience and as a FDA approved treatment for depression. During TMS, coils positioned near the scalp generate electric fields and activate targeted brain regions. In this thesis, several computational electromagnetics methods that improve the analysis, design, and uncertainty quantification of TMS systems were developed. Analysis: A new fast direct technique for solving the large and sparse linear system of equations (LSEs) arising from the finite difference (FD) discretization of Maxwell’s quasi-static equations was developed. Following a factorization step, the solver permits computation of TMS fields inside realistic brain models in seconds, allowing for patient-specific real-time usage during TMS. The solver is an alternative to iterative methods for solving FD LSEs, often requiring run-times of minutes. A new integral equation (IE) method for analyzing TMS fields was developed. The human head is highly-heterogeneous and characterized by high-relative permittivities (10^7). IE techniques for analyzing electromagnetic interactions with such media suffer from high-contrast and low-frequency breakdowns. The novel high-permittivity and low-frequency stable internally combined volume-surface IE method developed. The method not only applies to the analysis of high-permittivity objects, but it is also the first IE tool that is stable when analyzing highly-inhomogeneous negative permittivity plasmas. Design: TMS applications call for electric fields to be sharply focused on regions that lie deep inside the brain. Unfortunately, fields generated by present-day Figure-8 coils stimulate relatively large regions near the brain surface. An optimization method for designing single feed TMS coil-arrays capable of producing more localized and deeper stimulation was developed. Results show that the coil-arrays stimulate 2.4 cm into the head while stimulating 3.0 times less volume than Figure-8 coils. Uncertainty quantification (UQ): The location/volume/depth of the stimulated region during TMS is often strongly affected by variability in the position and orientation of TMS coils, as well as anatomical differences between patients. A surrogate model-assisted UQ framework was developed and used to statistically characterize TMS depression therapy. The framework identifies key parameters that strongly affect TMS fields, and partially explains variations in TMS treatment responses.PhDElectrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/111459/1/luisgo_1.pd

    High-performance direct solution of finite element problems on multi-core processors

    Get PDF
    A direct solution procedure is proposed and developed which exploits the parallelism that exists in current symmetric multiprocessing (SMP) multi-core processors. Several algorithms are proposed and developed to improve the performance of the direct solution of FE problems. A high-performance sparse direct solver is developed which allows experimentation with the newly developed and existing algorithms. The performance of the algorithms is investigated using a large set of FE problems. Furthermore, operation count estimations are developed to further assess various algorithms. An out-of-core version of the solver is developed to reduce the memory requirements for the solution. I/O is performed asynchronously without blocking the thread that makes the I/O request. Asynchronous I/O allows overlapping factorization and triangular solution computations with I/O. The performance of the developed solver is demonstrated on a large number of test problems. A problem with nearly 10 million degree of freedoms is solved on a low price desktop computer using the out-of-core version of the direct solver. Furthermore, the developed solver usually outperforms a commonly used shared memory solver.Ph.D.Committee Chair: Will, Kenneth; Committee Member: Emkin, Leroy; Committee Member: Kurc, Ozgur; Committee Member: Vuduc, Richard; Committee Member: White, Donal

    Memory and performance issues in parallel multifrontal factorizations and triangular solutions with sparse right-hand sides

    Get PDF
    Nous nous intéressons à la résolution de systèmes linéaires creux de très grande taille sur des machines parallèles. Dans ce contexte, la mémoire est un facteur qui limite voire empêche souvent l’utilisation de solveurs directs, notamment ceux basés sur la méthode multifrontale. Cette étude se concentre sur les problèmes de mémoire et de performance des deux phases des méthodes directes les plus coûteuses en mémoire et en temps : la factorisation numérique et la résolution triangulaire. Dans une première partie nous nous intéressons à la phase de résolution à seconds membres creux, puis, dans une seconde partie, nous nous intéressons à la scalabilité mémoire de la factorisation multifrontale. La première partie de cette étude se concentre sur la résolution triangulaire à seconds membres creux, qui apparaissent dans de nombreuses applications. En particulier, nous nous intéressons au calcul d’entrées de l’inverse d’une matrice creuse, où les seconds membres et les vecteurs solutions sont tous deux creux. Nous présentons d’abord plusieurs schémas de stockage qui permettent de réduire significativement l’espace mémoire utilisé lors de la résolution, dans le cadre d’exécutions séquentielles et parallèles. Nous montrons ensuite que la façon dont les seconds membres sont regroupés peut fortement influencer la performance et nous considérons deux cadres différents : le cas "hors-mémoire" (out-of-core) où le but est de réduire le nombre d’accès aux facteurs, qui sont stockés sur disque, et le cas "en mémoire" (in-core) où le but est de réduire le nombre d’opérations. Finalement, nous montrons comment améliorer le parallélisme. Dans la seconde partie, nous nous intéressons à la factorisation multifrontale parallèle. Nous montrons tout d’abord que contrôler la mémoire active spécifique à la méthode multifrontale est crucial, et que les technique de "répartition" (mapping) classiques ne peuvent fournir une bonne scalabilité mémoire : le coût mémoire de la factorisation augmente fortement avec le nombre de processeurs. Nous proposons une classe d’algorithmes de répartition et d’ordonnancement "conscients de la mémoire" (memory-aware) qui cherchent à maximiser la performance tout en respectant une contrainte mémoire fournie par l’utilisateur. Ces techniques ont révélé des problèmes de performances dans certains des noyaux parallèles denses utilisés à chaque étape de la factorisation, et nous avons proposé plusieurs améliorations algorithmiques. Les idées présentées tout au long de cette étude ont été implantées dans le solveur MUMPS (Solveur MUltifrontal Massivement Parallèle) et expérimentées sur des matrices de grande taille (plusieurs dizaines de millions d’inconnues) et sur des machines massivement parallèles (jusqu’à quelques milliers de coeurs). Elles ont permis d’améliorer les performances et la robustesse du code et seront disponibles dans une prochaine version. Certaines des idées présentées dans la première partie ont également été implantées dans le solveur PDSLin (solveur linéaire hybride basé sur une méthode de complément de Schur). ABSTRACT : We consider the solution of very large sparse systems of linear equations on parallel architectures. In this context, memory is often a bottleneck that prevents or limits the use of direct solvers, especially those based on the multifrontal method. This work focuses on memory and performance issues of the two memory and computationally intensive phases of direct methods, that is, the numerical factorization and the solution phase. In the first part we consider the solution phase with sparse right-hand sides, and in the second part we consider the memory scalability of the multifrontal factorization. In the first part, we focus on the triangular solution phase with multiple sparse right-hand sides, that appear in numerous applications. We especially emphasize the computation of entries of the inverse, where both the right-hand sides and the solution are sparse. We first present several storage schemes that enable a significant compression of the solution space, both in a sequential and a parallel context. We then show that the way the right-hand sides are partitioned into blocks strongly influences the performance and we consider two different settings: the out-of-core case, where the aim is to reduce the number of accesses to the factors, that are stored on disk, and the in-core case, where the aim is to reduce the computational cost. Finally, we show how to enhance the parallel efficiency. In the second part, we consider the parallel multifrontal factorization. We show that controlling the active memory specific to the multifrontal method is critical, and that commonly used mapping techniques usually fail to do so: they cannot achieve a high memory scalability, i.e. they dramatically increase the amount of memory needed by the factorization when the number of processors increases. We propose a class of "memory-aware" mapping and scheduling algorithms that aim at maximizing performance while enforcing a user-given memory constraint and provide robust memory estimates before the factorization. These techniques have raised performance issues in the parallel dense kernels used at each step of the factorization, and we have proposed some algorithmic improvements. The ideas presented throughout this study have been implemented within the MUMPS (MUltifrontal Massively Parallel Solver) solver and experimented on large matrices (up to a few tens of millions unknowns) and massively parallel architectures (up to a few thousand cores). They have demonstrated to improve the performance and the robustness of the code, and will be available in a future release. Some of the ideas presented in the first part have also been implemented within the PDSLin (Parallel Domain decomposition Schur complement based Linear solver) solver
    corecore