A Sparse SCF algorithm and its parallel implementation: Application to DFTB

Author: Rapacioli Mathias
Renon Nicolas
Scemama Anthony
Publication venue: American Chemical Society (ACS)
Publication date: 01/01/2014

We present an algorithm and its parallel implementation for solving a self-consistent problem as encountered in Hartree-Fock or Density Functional Theory. The algorithm takes advantage of the sparsity of matrices through the use of local molecular orbitals. The implementation efficiently exploits modern symmetric multiprocessing (SMP) computer architectures. As a first application, the algorithm is used within the density functional based tight binding (DFTB) method, for which most of the computational time is spent in the linear algebra routines (diagonalization of the Fock/Kohn-Sham matrix). We show that with this algorithm (i) single-point calculations on very large systems (millions of atoms) can be performed on large SMP machines, (ii) calculations involving intermediate-size systems (1,000 to 100,000 atoms) are also strongly accelerated and can run efficiently on standard servers, and (iii) the error on the total energy due to the use of a cut-off in the molecular orbital coefficients can be controlled such that it remains smaller than the SCF convergence criterion. Comment: 13 pages, 11 figures
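The cut-off idea in point (iii) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy coefficients, and the use of the density matrix as an error proxy (instead of the total energy) are all assumptions made for illustration.

```python
def truncate(coeffs, cutoff):
    """Zero every orbital coefficient whose magnitude is below `cutoff`."""
    return [[c if abs(c) >= cutoff else 0.0 for c in row] for row in coeffs]


def sparsity(coeffs):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in coeffs)
    zeros = sum(1 for row in coeffs for c in row if c == 0.0)
    return zeros / total


def density(coeffs):
    """Closed-shell density matrix P = 2 * C^T C (rows of `coeffs` are occupied orbitals)."""
    n = len(coeffs[0])
    return [[2.0 * sum(orb[i] * orb[j] for orb in coeffs) for j in range(n)]
            for i in range(n)]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Hypothetical toy data: two localized orbitals over four basis functions,
# each with a tiny tail on the other fragment.
C = [[0.70, 0.70, 1e-6, 0.0],
     [0.0, 2e-6, 0.70, 0.70]]

C_sparse = truncate(C, cutoff=1e-5)
error = max_abs_diff(density(C), density(C_sparse))
print(f"zero fraction: {sparsity(C_sparse):.2f}, density error: {error:.1e}")
```

In this toy case the truncation doubles the fraction of zero entries while the density-matrix error stays below the cut-off itself, which is the kind of trade-off the abstract says can be kept under the SCF convergence criterion.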

arXiv.org e-Print Archive

HAL-INSA Toulouse

Efficient multitasking of Choleski matrix factorization on CRAY supercomputers

Author: Overman Andrea L.
Poole Eugene L.

A Choleski method is described and used to solve linear systems of equations that arise in large-scale structural analysis. The method uses a novel variable-band storage scheme and is structured to exploit fast local memory caches while minimizing data access delays between main memory and vector registers. Several parallel implementations of this method are described for the CRAY-2 and CRAY Y-MP computers, demonstrating the use of microtasking and autotasking directives. A portable parallel language, FORCE, is used for comparison with the microtasked and autotasked implementations. Results are presented comparing the matrix factorization times for three representative structural analysis problems from runs made in both dedicated and multi-user modes on both computers. CPU and wall clock timings are given for the parallel implementations and are compared to single processor timings of the same algorithm.
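As a rough illustration of variable-band storage, the sketch below factors a symmetric positive definite matrix while storing, for each row, only the entries from its first nonzero column up to the diagonal. It is a generic serial "skyline" Choleski factorization, not the report's scheme; the function names and the test matrix are assumptions, and no vectorization or multitasking is attempted.

```python
import math


def first_nonzero(A):
    """Column of the first nonzero entry in each row of the lower triangle."""
    return [next(j for j in range(i + 1) if A[i][j] != 0.0) for i in range(len(A))]


def skyline_cholesky(A):
    """Return (first, rows) with rows[i][j - first[i]] == L[i][j], where A = L L^T.

    Fill-in during factorization stays inside the row envelope, so each row
    needs storage only from its first nonzero column to the diagonal.
    """
    first = first_nonzero(A)
    rows = []
    for i in range(len(A)):
        row = []
        for j in range(first[i], i):
            lo = max(first[i], first[j])  # columns where both rows are stored
            s = A[i][j] - sum(row[k - first[i]] * rows[j][k - first[j]]
                              for k in range(lo, j))
            row.append(s / rows[j][-1])  # rows[j][-1] is the diagonal L[j][j]
        diag = A[i][i] - sum(v * v for v in row)
        row.append(math.sqrt(diag))
        rows.append(row)
    return first, rows


# Hypothetical tridiagonal test matrix (symmetric positive definite).
A = [[4.0, 2.0, 0.0],
     [2.0, 5.0, 2.0],
     [0.0, 2.0, 5.0]]

first, rows = skyline_cholesky(A)
print(first)  # first stored column of each row
print(rows)   # packed rows of the factor L
```

Row 2 stores two values instead of three because its leading zero lies outside the envelope; on large banded structural matrices that saving is what makes the storage scheme worthwhile.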

NASA Technical Reports Server

Research in the effective implementation of guidance computers with large scale arrays Interim report

Author: Burke J. A.
Disparte C. P.
Erwin F. D.
Mc Kevitt J. F.
Pariser J. J.
Schardin C. H.

Functional logic character implementation in breadboard design of NASA modular computer

NASA Technical Reports Server

Revisiting Actor Programming in C++

Author: Charousset Dominik
Hiesgen Raphael
Schmidt Thomas C.
Publication venue: Elsevier BV
Publication date: 01/01/2015

The actor model of computation has gained significant popularity over the last decade. Its high level of abstraction makes it appealing for concurrent applications in parallel and distributed systems. However, designing a real-world actor framework that subsumes full scalability, strong reliability, and high resource efficiency requires many conceptual and algorithmic additives to the original model. In this paper, we report on designing and building CAF, the "C++ Actor Framework". CAF aims to provide a concurrent and distributed native environment for scaling up to very large, high-performance applications, and equally well down to small constrained systems. We present the key specifications and design concepts (in particular a message-transparent architecture, type-safe message interfaces, and pattern matching facilities) that make native actors a viable approach for many robust, elastic, and highly distributed developments. We demonstrate the feasibility of CAF in three scenarios: first for elastic, upscaling environments, second for including heterogeneous hardware like GPGPUs, and third for distributed runtime systems. Extensive performance evaluations indicate ideal runtime behaviour for up to 64 cores at very low memory footprint, or in the presence of GPUs. In these tests, CAF continuously outperforms the competing actor environments Erlang, Charm++, SalsaLite, Scala, ActorFoundry, and even OpenMPI. Comment: 33 pages
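For readers unfamiliar with the actor model, a minimal sketch of its core idea follows: each actor owns private state and a mailbox, and other code interacts with it only by sending messages. This is not CAF's API (CAF is a C++ library); the `Actor` class, `make_adder`, and the stop sentinel are hypothetical names, and the sketch has none of CAF's type-safe interfaces, pattern matching, or distribution.

```python
import queue
import threading

_STOP = object()  # hypothetical sentinel asking an actor to shut down


class Actor:
    """A thread that processes messages from its mailbox one at a time."""

    def __init__(self, behavior):
        self._mailbox = queue.Queue()
        self._behavior = behavior
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """Asynchronously enqueue a message; the sender never blocks on processing."""
        self._mailbox.put(message)

    def stop(self):
        """Process everything already queued, then terminate the actor."""
        self._mailbox.put(_STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is _STOP:
                break
            self._behavior(message)


def make_adder():
    """Build a behavior whose state is touched only by the actor's own thread."""
    state = {"total": 0}

    def behavior(message):
        state["total"] += message

    return state, behavior


state, behavior = make_adder()
adder = Actor(behavior)
for n in range(1, 101):
    adder.send(n)
adder.stop()
total = state["total"]  # safe to read: the actor thread has been joined
print(total)
```

Because messages are handled sequentially by a single thread, the running total needs no lock; that isolation is the property that makes the model attractive for concurrent and distributed programs.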

arXiv.org e-Print Archive

REPOSIT

Simplified vector-thread architectures for flexible and efficient data-parallel accelerators

Author: Batten Christopher Francis
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2010

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student submitted PDF version of thesis. Includes bibliographical references (p. 165-170).

This thesis explores a new approach to building data-parallel accelerators that is based on simplifying the instruction set, microarchitecture, and programming methodology for a vector-thread architecture. The thesis begins by categorizing regular and irregular data-level parallelism (DLP), before presenting several architectural design patterns for data-parallel accelerators including the multiple-instruction multiple-data (MIMD) pattern, the vector single-instruction multiple-data (vector-SIMD) pattern, the single-instruction multiple-thread (SIMT) pattern, and the vector-thread (VT) pattern. Our recently proposed VT pattern includes many control threads that each manage their own array of microthreads. The control thread uses vector memory instructions to efficiently move data and vector fetch instructions to broadcast scalar instructions to all microthreads. These vector mechanisms are complemented by the ability for each microthread to direct its own control flow. In this thesis, I introduce various techniques for building simplified instances of the VT pattern. I propose unifying the VT control-thread and microthread scalar instruction sets to simplify the microarchitecture and programming methodology. I propose a new single-lane VT microarchitecture based on minimal changes to the vector-SIMD pattern. Single-lane cores are simpler to implement than multi-lane cores and can achieve similar energy efficiency. This new microarchitecture uses control processor embedding to mitigate the area overhead of single-lane cores, and uses vector fragments to more efficiently handle both regular and irregular DLP as compared to previous VT architectures. I also propose an explicitly data-parallel VT programming methodology that is based on a slightly modified scalar compiler. This methodology is easier to use than assembly programming, yet simpler to implement than an automatically vectorizing compiler. To evaluate these ideas, we have begun implementing the Maven data-parallel accelerator. This thesis compares a simplified Maven VT core to MIMD, vector-SIMD, and SIMT cores. We have implemented these cores with an ASIC methodology, and I use the resulting gate-level models to evaluate the area, performance, and energy of several compiled microbenchmarks. This work is the first detailed quantitative comparison of the VT pattern to other patterns. My results suggest that future data-parallel accelerators based on simplified VT architectures should be able to combine the energy efficiency of vector-SIMD accelerators with the flexibility of MIMD accelerators.

by Christopher Francis Batten. Ph.D.

CiteSeerX

DSpace@MIT