    QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

    Previous studies have reported that common dense linear algebra operations do not achieve speedup when using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant in high-performance computing, and the use of grids for speeding up large-scale scientific problems is limited to applications exhibiting parallelism at a higher level. We have identified two performance bottlenecks in the distributed-memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear algebra library. First, because ScaLAPACK assumes a homogeneous communication network, the implementations of ScaLAPACK algorithms lack locality in their communication pattern. Second, the number of messages sent in the ScaLAPACK algorithms is significantly greater than in other algorithms that trade flops for communication. In this paper, we present a new approach for computing a QR factorization -- one of the main dense linear algebra kernels -- of tall and skinny matrices in a grid computing environment that overcomes these two bottlenecks. Our contribution is to combine a recently proposed algorithm (Communication-Avoiding QR) with a topology-aware middleware (QCG-OMPI) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites. An experimental study conducted on the Grid'5000 platform shows that the resulting performance increases linearly with the number of geographical sites on large-scale problems (and is in particular consistently higher than ScaLAPACK's). Comment: Accepted at IPDPS'10 (IEEE International Parallel & Distributed Processing Symposium 2010, Atlanta, GA, USA).
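    The communication-avoiding idea underlying CAQR can be illustrated with a minimal TSQR sketch in Python/NumPy (a sketch only, not the paper's implementation; it assumes each row block has at least as many rows as the matrix has columns):

        import numpy as np

        def tsqr_r(A, num_blocks):
            # Local step: independent QR factorizations of row blocks; in the
            # grid setting, each block would live on one geographical site.
            blocks = np.array_split(A, num_blocks, axis=0)
            Rs = [np.linalg.qr(b, mode='r') for b in blocks]
            # Combine step: only the small n-by-n R factors cross site
            # boundaries, which is what confines the intensive communication.
            return np.linalg.qr(np.vstack(Rs), mode='r')

        A = np.random.rand(10000, 16)   # tall and skinny
        R = tsqr_r(A, num_blocks=4)
        # Agrees with a direct QR of A up to row signs:
        assert np.allclose(np.abs(R), np.abs(np.linalg.qr(A, mode='r')))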

    From Cluster to Grid: A Case Study in Scaling-Up a Molecular Electronics Simulation Code

    This paper describes an ongoing project whose goal is to significantly improve the performance and applicability of a molecular electronics simulation code. The specific goals are to (1) increase computational performance on the simulation problems currently being solved by our physics collaborators; (2) allow much larger problems to be solved in reasonable time; and (3) expand the set of resources available to the code, from a single homogeneous cluster to a campus-wide computational grid, while maintaining acceptable performance across this larger set of resources. We describe the sequential performance of the code, the performance of two parallel versions, and the benefits of problem-specific load-balancing strategies. The grid context motivates the need for runtime algorithm selection; we present a component-based software framework that makes this possible.
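    As a generic illustration of the load-balancing problem (not the paper's problem-specific strategies), a greedy longest-processing-time heuristic for workers of unequal speed might look like the following sketch; the task costs and worker speeds are hypothetical inputs:

        import heapq

        def lpt_schedule(task_costs, worker_speeds):
            # Take tasks in order of decreasing cost and give each one to the
            # worker that frees up earliest, charging cost/speed as the time
            # the task takes on that (possibly slower) worker.
            heap = [(0.0, w) for w in range(len(worker_speeds))]
            heapq.heapify(heap)
            assignment = {w: [] for w in range(len(worker_speeds))}
            for task, cost in sorted(enumerate(task_costs), key=lambda p: -p[1]):
                finish, w = heapq.heappop(heap)
                assignment[w].append(task)
                heapq.heappush(heap, (finish + cost / worker_speeds[w], w))
            return assignment

        # Example: six tasks on two fast workers and one slow worker.
        print(lpt_schedule([9, 7, 5, 4, 3, 1], [2.0, 2.0, 1.0]))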

    Revisiting Matrix Product on Master-Worker Platforms

    This paper is aimed at designing efficient parallel matrix-product algorithms for heterogeneous master-worker platforms. While the matrix product is well understood for homogeneous 2D arrays of processors (e.g., Cannon's algorithm and the ScaLAPACK outer-product algorithm), three key hypotheses render our work original and innovative:
    - Centralized data: we assume that all matrix files originate from, and must be returned to, the master.
    - Heterogeneous star-shaped platforms: we target fully heterogeneous platforms, whose computational resources have different computing powers.
    - Limited memory: because we investigate the parallelization of large problems, we cannot assume that full matrix panels can be stored in the worker memories and reused for subsequent updates (as in ScaLAPACK).
    We have devised efficient algorithms for resource selection (deciding which workers to enroll) and communication ordering (both for input and result messages), and we report a set of numerical experiments on various platforms at the École Normale Supérieure de Lyon and the University of Tennessee. We point out, however, that in this first version of the report, experiments are limited to homogeneous platforms.
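    For intuition, the outer-product formulation underlying this setting is sketched below (sequentialized Python/NumPy; the block size stands in for the limited worker memory, and each loop iteration corresponds to the data a single enrolled worker holds at one time):

        import numpy as np

        def outer_product_matmul(A, B, block):
            # C = sum over k of A[:, k:k+block] @ B[k:k+block, :]: each update
            # needs only one column block of A and one row block of B in
            # memory, plus the C panel being accumulated.
            C = np.zeros((A.shape[0], B.shape[1]))
            for k in range(0, A.shape[1], block):
                C += A[:, k:k+block] @ B[k:k+block, :]
            return C

        A, B = np.random.rand(8, 6), np.random.rand(6, 5)
        assert np.allclose(outer_product_matmul(A, B, block=2), A @ B)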

    Context adaptivity for selected computational kernels with applications in optoelectronics and in phylogenetics

    Computational kernels are the crucial part of computationally intensive software, where most of the computing time is spent; hence, their design and implementation have to be accomplished carefully. Two scientific application problems, from optoelectronics and from phylogenetics, and the corresponding computational kernels motivate this thesis. In the first application problem, components for the computational solution of complex symmetric eigenvalue problems (EVPs) are discussed, arising in the simulation of waveguides in optoelectronics. LAPACK and ScaLAPACK contain highly effective reference implementations for certain numerical problems in linear algebra; with respect to EVPs, however, only real symmetric and complex Hermitian variants are available, so efficient codes for complex symmetric (non-Hermitian) EVPs are highly desirable. In the second application problem, a parallel scientific workflow for computing phylogenies is designed, implemented, and evaluated. The reconstruction of phylogenetic trees is an NP-hard problem that demands large-scale computing capabilities, and therefore a parallel approach is necessary. One idea underlying this thesis is to investigate the interaction between the context of the kernels considered and their efficiency. The context of a computational kernel comprises model aspects (for instance, the structure of the input data), software aspects (for instance, the computational libraries used), hardware aspects (for instance, the available RAM and the supported precision), and further requirements or constraints. Constraints may exist with respect to runtime, memory usage, required accuracy, etc. The concept of context adaptivity is demonstrated for selected problems in computational science. The method proposed here is a meta-algorithm that uses aspects of the context to achieve optimal performance with respect to the applied metric. It is important to consider the context because requirements may be traded against each other, resulting in higher performance; for instance, when only low accuracy is required, a faster algorithmic approach may be favored over an established but slower method. With respect to EVPs, prototypical codes especially targeted at complex symmetric EVPs aim at trading accuracy for speed, and the innovation is evidenced by new algorithmic approaches that exploit the algebraic structure. Concerning the computation of phylogenetic trees, the mapping of a scientific workflow onto a campus grid system is demonstrated; the innovation consists in the adaptive implementation of the workflow, which runs concurrent instances of a computational kernel on a distributed system. Here, adaptivity refers to the ability of the workflow to vary the computational load in terms of available computing resources, available time, and the quality of the reconstructed phylogenetic trees. 
    Context adaptivity is discussed by means of computational problems from optoelectronics and from phylogenetics. For the field of optoelectronics, a family of implemented algorithms aims at solving generalized complex symmetric EVPs; our alternative approach exploits the structural symmetry and trades accuracy for runtime, so it is faster but (usually) less accurate than the conventional approach. In addition to a complete sequential solver, a parallel variant is discussed and partly evaluated on a cluster utilizing up to 1024 CPU cores. The achieved runtimes evidence the superiority of our approach; however, further investigations on improving the accuracy are suggested. For the field of phylogenetics, we show that phylogenetic tree reconstruction can be efficiently parallelized on a Condor-based campus grid infrastructure. This parallel scientific workflow features a low parallel overhead, resulting in excellent efficiency.
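    A minimal sketch of the meta-algorithm idea in Python (the threshold, names, and solver labels are illustrative stand-ins; the thesis's actual decision criteria are richer):

        def choose_evp_solver(required_accuracy, is_complex_symmetric):
            # Context-adaptive selection: trade accuracy for speed whenever the
            # context (matrix structure plus accuracy requirement) permits it.
            if is_complex_symmetric and required_accuracy >= 1e-6:
                return "structure-exploiting complex-symmetric solver (fast)"
            return "general non-Hermitian solver, e.g. LAPACK zgeev (robust)"

        print(choose_evp_solver(1e-4, True))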

    Real-Time, Dynamic Hardware Accelerators for BLAS Computation

    This paper presents an approach to increasing the capability of scientific computing through the use of real-time, partially reconfigurable hardware accelerators that implement the basic linear algebra subprograms (BLAS). The use of reconfigurable hardware accelerators for computing linear algebra functions has the potential to increase floating-point performance while at the same time providing an architecture that minimizes data-movement latency and increases power efficiency. While there has been significant work by the computing community to optimize BLAS routines at the software level, optimizing these routines in hardware using reconfigurable fabrics is in its infancy. The paper begins with a comprehensive overview of the history and evolution of the BLAS for use in scientific computing. It then reviews current successes in using reconfigurable computing architectures to achieve acceleration, and presents an investigation of an accelerator approach with a granularity at the logic-circuit level through real-time, partial reconfiguration of a programmable fabric with static accelerator cache memory to minimize data movement. Empirical data are presented for a study on a single-FPGA system.
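    For reference, the kind of routine such an accelerator targets is the level-3 BLAS GEMM, shown here through its standard software interface (SciPy's BLAS wrapper):

        import numpy as np
        from scipy.linalg.blas import dgemm

        # GEMM computes C <- alpha*A@B + beta*C and dominates the flop count of
        # dense linear algebra, which makes it the prime acceleration target.
        A = np.asfortranarray(np.random.rand(4, 3))
        B = np.asfortranarray(np.random.rand(3, 5))
        C = dgemm(alpha=1.0, a=A, b=B)
        assert np.allclose(C, A @ B)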

    Siesta: Recent developments and applications

    A review of the present status, recent enhancements, and applicability of the SIESTA program is presented. Since its debut in the mid-1990s, SIESTA's flexibility, efficiency, and free distribution have given advanced materials simulation capabilities to many groups worldwide. The core methodological scheme of SIESTA combines finite-support pseudo-atomic orbitals as basis sets, norm-conserving pseudopotentials, and a real-space grid for the representation of charge density and potentials and the computation of their associated matrix elements. Here, we describe the more recent implementations on top of that core scheme, which include full spin–orbit interaction, non-repeated and multiple-contact ballistic electron transport, density functional theory (DFT)+U and hybrid functionals, time-dependent DFT, novel reduced-scaling solvers, density-functional perturbation theory, efficient van der Waals non-local density functionals, and enhanced molecular-dynamics options. In addition, a substantial effort has been made in enhancing interoperability and interfacing with other codes and utilities, such as WANNIER90 and the second-principles modeling it can be used for, an AiiDA plugin for workflow automation, an interface to Lua for steering SIESTA runs, and various post-processing utilities. SIESTA has also been engaged in the Electronic Structure Library effort from its inception, which has allowed the sharing of various low-level libraries, as well as data standards and support for them, particularly the PSeudopotential Markup Language definition and library for transferable pseudopotentials, and the interface to the ELectronic Structure Infrastructure library of solvers. Code sharing is made easier by the new open-source licensing model of the program. This review also presents examples of application of the capabilities of the code, as well as a view of on-going and future developments. Published under license by AIP Publishing. SIESTA development was historically supported by different Spanish National Plan projects (Project Nos. MEC-DGES-PB95-0202, MCyT-BFM2000-1312, MEC-BFM2003-03372, FIS2006-12117, FIS2009-12721, FIS2012-37549, FIS2015-64886-P, and RTC-2016-5681-7), the latter one together with Simune Atomistics Ltd. We are thankful for financial support from the Spanish Ministry of Science, Innovation and Universities through Grant No. PGC2018-096955-B. We acknowledge the Severo Ochoa Center of Excellence Program [Grant Nos. SEV-2015-0496 (ICMAB) and SEV-2017-0706 (ICN2)], the GenCat (Grant No. 2017SGR1506), and the European Union MaX Center of Excellence (EU-H2020 Grant No. 824143). P.G.-F. acknowledges support from Ramón y Cajal (Grant No. RyC-2013-12515). J.I.C. acknowledges Grant No. RTI2018-097895-B-C41. R.C. acknowledges the European Union's Horizon 2020 Research and Innovation Program under Marie Skłodowska-Curie Grant Agreement No. 665919. D.S.P., P.K., and P.B. acknowledge Grant No. MAT2016-78293-C6, FET-Open No. 863098, and UPV-EHU Grant No. IT1246-19. V.W.Y. was supported by a MolSSI Fellowship (U.S. NSF Award No. 1547580), and V.B. and V.W.Y. were supported by the NSF-funded ELSI development (Award No. 1450280). We also acknowledge Honghui Shang and Xinming Qin for giving us access to the HONPAS code, where a preliminary version of the hybrid functional support described here was implemented. 
    We are indebted to other contributors to the SIESTA project whose names can be seen in the Docs/Contributors.txt file of the SIESTA distribution, and we thank those, too many to list, contributing fixes, comments, clarifications, and documentation for the code. Peer reviewed.


    High performance selected inversion methods for sparse matrices: direct and stochastic approaches to selected inversion

    The explicit evaluation of selected entries of the inverse of a given sparse matrix is an important process in various application fields and has been gaining visibility in recent years. While a standard inversion process would require the computation of the whole inverse, which is, in general, a dense matrix, state-of-the-art solvers perform a selected inversion process instead. Such an approach allows one to extract specific entries of the inverse, e.g., the diagonal, avoiding the standard inversion steps and therefore reducing time and memory requirements. Despite the complexity reduction already achieved, the natural direction for the development of selected inversion software is the parallelization and distribution of the computation, exploiting multinode implementations of the algorithms. In this work we introduce parallel, high-performance selected inversion algorithms suitable for both the computation and the estimation of the diagonal of the inverse of large, sparse matrices. The first approach is built on top of a sparse factorization method and a distributed computation of the Schur complement, and is specifically designed for the parallel treatment of large, dense matrices that include a sparse block. The second is based on the stochastic estimation of the matrix diagonal using a stencil-based, matrix-free Krylov subspace iteration. We implement the two solvers and demonstrate their excellent performance on Cray supercomputers, focusing on both multinode scalability and numerical accuracy. Finally, we integrate the solvers into two distinct frameworks designed for the solution of selected inversion problems in real-life applications. First, we present a parallel, scalable framework for log-likelihood maximization in genomic prediction problems including marker-by-environment effects. Then, we apply the matrix-free estimator to the treatment of large-scale three-dimensional nanoelectronic device simulations with open boundary conditions.
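    The stochastic approach can be illustrated with a standard Hutchinson-type estimator for the diagonal of the inverse (a generic sketch, not the paper's stencil-based implementation; it assumes a symmetric positive definite A so that CG applies):

        import numpy as np
        from scipy.sparse import diags
        from scipy.sparse.linalg import cg

        def estimate_inv_diagonal(A, num_samples=200, seed=0):
            # diag(A^{-1}) is estimated as the mean of v * (A^{-1} v) over
            # Rademacher probes v, since E[v_i v_j] = delta_ij; each sample
            # costs one matrix-free linear solve (CG needs only A @ v).
            rng = np.random.default_rng(seed)
            acc = np.zeros(A.shape[0])
            for _ in range(num_samples):
                v = rng.choice([-1.0, 1.0], size=A.shape[0])
                x, info = cg(A, v)
                acc += v * x
            return acc / num_samples

        # Example: 1D Laplacian; compare against the exact inverse diagonal.
        n = 100
        A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csr')
        err = np.abs(estimate_inv_diagonal(A) - np.diag(np.linalg.inv(A.toarray())))
        print(err.max())   # shrinks like O(1/sqrt(num_samples))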