Book of Abstracts of the Sixth SIAM Workshop on Combinatorial Scientific Computing
Book of Abstracts of CSC14, edited by Bora Uçar. The Sixth SIAM Workshop on Combinatorial Scientific Computing, CSC14, was organized at the Ecole Normale Supérieure de Lyon, France, on 21st to 23rd July, 2014. This two and a half day event marked the sixth in a series that started ten years ago in San Francisco, USA. The CSC14 Workshop's focus was on combinatorial mathematics and algorithms in high performance computing, broadly interpreted. The workshop featured three invited talks, 27 contributed talks and eight poster presentations. All three invited talks focused on two fields of research: randomized algorithms for numerical linear algebra and network analysis. The contributed talks and the posters targeted modeling, analysis, bisection, clustering, and partitioning of graphs, applied in the context of networks, sparse matrix factorizations, iterative solvers, fast multipole methods, automatic differentiation, high-performance computing, and linear programming. The workshop was held at the premises of the LIP laboratory of ENS Lyon and was generously supported by the LABEX MILYON (ANR-10-LABX-0070, Université de Lyon, within the program "Investissements d'Avenir" ANR-11-IDEX-0007 operated by the French National Research Agency), and by SIAM.
GPU fast multipole method with lambda-dynamics features
The calculation of long-range electrostatic interactions is a significant and the most computationally demanding part of molecular dynamics simulations. Such interactions can be evaluated directly by the naïve pairwise summation algorithm, which is a ubiquitous showcase example for the compute power of graphics processing units (GPUs). However, the pairwise summation has O(N^2) computational complexity for N interacting particles; thus, an approximation method with better scaling is required. Today, the prevalent method for such approximation in the field is particle mesh Ewald (PME). PME takes advantage of fast Fourier transforms (FFTs) to approximate the solution efficiently. However, as the underlying FFTs require all-to-all communication between ranks, PME runs into a communication bottleneck. Such communication overhead is negligible only for moderate parallelization. With increased parallelization, as needed for high-performance applications, the use of PME becomes unprofitable. Another PME drawback is its inability to perform constant pH simulations efficiently. In such simulations, the protonation states of a protein are allowed to change dynamically during the simulation. The description of this process requires a separate evaluation of the energies for each protonation state. This cannot be calculated efficiently with PME, as the algorithm requires a repeated FFT for each state, which leads to a linear overhead with respect to the number of states. For a fast approximation of pairwise Coulombic interactions that does not suffer from the PME drawbacks, the Fast Multipole Method (FMM) has been implemented and fully parallelized with CUDA. To ensure optimal FMM performance for diverse MD systems, multiple parallelization strategies have been developed. The algorithm has been efficiently incorporated into GROMACS and subsequently tested to determine the optimal FMM parameter set for MD simulations.
Finally, the FMM has been incorporated into GROMACS to allow for out-of-the-box electrostatic calculations. The single-GPU FMM implementation, tested in GROMACS 2019, achieves about a third of the performance of the highly optimized CUDA PME when simulating systems with uniform particle distributions. However, the FMM is expected to outperform PME at high parallelization because the FMM global communication overhead is minimal compared to that of PME. Further, the FMM has been enhanced to provide the energies of an arbitrary number of titratable sites, as needed in the constant-pH method. The extension is not fully optimized yet, but the first results show the strength of the FMM for constant pH simulations. For a relatively large system with half a million particles and more than a hundred titratable sites, a straightforward approach to computing alternative energies requires repeating a simulation for each state of the sites. The FMM calculates all energy terms only a factor of 1.5 slower than a single simulation step. Further improvements of the GPU implementation are expected to yield even more speedup compared to the current implementation. 2021-11-1
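The O(N^2) cost of the direct pairwise summation mentioned in the abstract above can be made concrete with a minimal sketch. It is illustrative only and is not the thesis code: the function names are hypothetical, and the Coulomb prefactor 1/(4*pi*eps0) is assumed to be 1 (reduced units).

```python
import math


def pairwise_coulomb_energy(positions, charges):
    """Direct Coulomb energy summed over all unordered particle pairs.

    Assumption: reduced units, i.e. the prefactor 1/(4*pi*eps0) is set to 1.
    """
    n = len(positions)
    energy = 0.0
    for i in range(n):
        # The inner loop visits every j > i, so each pair is counted once.
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            energy += charges[i] * charges[j] / r
    return energy


def pair_count(n):
    """Number of pair interactions evaluated for n particles: n*(n-1)/2."""
    return n * (n - 1) // 2
```

Since pair_count grows quadratically, doubling the particle count roughly quadruples the work, which is the scaling that PME and the FMM are meant to avoid.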
DEANN: Speeding up Kernel-Density Estimation using Approximate Nearest Neighbor Search
Kernel Density Estimation (KDE) is a nonparametric method for estimating the
shape of a density function, given a set of samples from the distribution.
Recently, locality-sensitive hashing, originally proposed as a tool for nearest
neighbor search, has been shown to enable fast KDE data structures. However,
these approaches do not take advantage of the many other advances that have
been made in algorithms for nearest neighbor search. We present an
algorithm called Density Estimation from Approximate Nearest Neighbors (DEANN)
where we apply Approximate Nearest Neighbor (ANN) algorithms as a black box
subroutine to compute an unbiased KDE. The idea is to find points that have a
large contribution to the KDE using ANN, compute their contribution exactly,
and approximate the remainder with Random Sampling (RS). We present a
theoretical argument that supports the idea that an ANN subroutine can speed up
the evaluation. Furthermore, we provide a C++ implementation with a Python
interface that can make use of an arbitrary ANN implementation as a subroutine
for KDE evaluation. We show empirically that our implementation outperforms
state-of-the-art implementations on all high-dimensional datasets we
considered, and matches the performance of RS in cases where the ANN yields no
gains in performance. Comment: 24 pages, 1 figure. Submitted for revie
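One way to read the idea described in this abstract is sketched below: sum the kernel contributions of the nearest points exactly, and estimate the contribution of the remaining points by uniform random sampling. This is a sketch under stated assumptions, not the authors' implementation. The function names, the Gaussian kernel, and the parameters k and m are hypothetical, and a brute-force nearest neighbor search stands in for the ANN black box.

```python
import math
import random


def gaussian_kernel(q, x, bandwidth):
    """Gaussian kernel value between query q and data point x."""
    return math.exp(-math.dist(q, x) ** 2 / (2.0 * bandwidth ** 2))


def exact_kde(q, data, bandwidth):
    """Exact (unnormalized) KDE: mean kernel value over all data points."""
    return sum(gaussian_kernel(q, x, bandwidth) for x in data) / len(data)


def nn_plus_sampling_kde(q, data, bandwidth, k, m, rng):
    """Estimate the KDE from k exact neighbor terms plus m random samples.

    Assumption: brute-force search replaces the ANN subroutine here.
    """
    n = len(data)
    order = sorted(range(n), key=lambda i: math.dist(q, data[i]))
    near, far = order[:k], order[k:]
    # Exact contribution of the k nearest points.
    near_sum = sum(gaussian_kernel(q, data[i], bandwidth) for i in near)
    if not far:
        return near_sum / n
    # Uniform sampling (with replacement) from the remaining points gives an
    # unbiased estimate of their mean kernel value.
    sample = rng.choices(far, k=m)
    far_mean = sum(gaussian_kernel(q, data[i], bandwidth) for i in sample) / m
    return (near_sum + len(far) * far_mean) / n
```

With k equal to the dataset size the estimate reduces to the exact KDE, and with k = 0 it reduces to plain random sampling, which matches the abstract's remark that the method falls back to RS performance when the neighbor search brings no gain.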
X10 for high-performance scientific computing
High performance computing is a key technology that enables large-scale physical
simulation in modern science. While great advances have been made in methods and
algorithms for scientific computing, the most commonly used programming models
encourage a fragmented view of computation that maps poorly to the underlying
computer architecture.
Scientific applications typically manifest physical locality, which means that interactions
between entities or events that are nearby in space or time are stronger
than more distant interactions. Linear-scaling methods exploit physical locality by approximating
distant interactions, to reduce computational complexity so that cost is
proportional to system size. In these methods, the computation required for each
portion of the system is different depending on that portion’s contribution to the
overall result. To support productive development, application programmers need
programming models that cleanly map aspects of the physical system being simulated
to the underlying computer architecture while also supporting the irregular
workloads that arise from the fragmentation of a physical system.
X10 is a new programming language for high-performance computing that uses
the asynchronous partitioned global address space (APGAS) model, which combines
explicit representation of locality with asynchronous task parallelism. This thesis
argues that the X10 language is well suited to expressing the algorithmic properties
of locality and irregular parallelism that are common to many methods for physical
simulation.
The work reported in this thesis was part of a co-design effort involving researchers
at IBM and ANU in which two significant computational chemistry codes
were developed in X10, with an aim to improve the expressiveness and performance
of the language. The first is a Hartree–Fock electronic structure code, implemented
using the novel Resolution of the Coulomb Operator approach. The second evaluates
electrostatic interactions between point charges, using either the smooth particle
mesh Ewald method or the fast multipole method, with the latter used to simulate
ion interactions in a Fourier Transform Ion Cyclotron Resonance mass spectrometer.
We compare the performance of both X10 applications to state-of-the-art software
packages written in other languages.
This thesis presents improvements to the X10 language and runtime libraries for
managing and visualizing the data locality of parallel tasks, communication using
active messages, and efficient implementation of distributed arrays. We evaluate these improvements in the context of computational chemistry application examples.
This work demonstrates that X10 can achieve performance comparable to established
programming languages when running on a single core. More importantly,
X10 programs can achieve high parallel efficiency on a multithreaded architecture,
given a divide-and-conquer pattern of parallel tasks and appropriate use of worker-local
data. For distributed memory architectures, X10 supports the use of active messages
to construct local, asynchronous communication patterns which outperform global,
synchronous patterns. Although point-to-point active messages may be implemented
efficiently, productive application development also requires collective communications;
more work is required to integrate both forms of communication in the X10
language. The exploitation of locality is the key insight in both linear-scaling methods and
the APGAS programming model; their combination represents an attractive opportunity
for future co-design efforts.
Modeling EMI Resulting from a Signal Via Transition Through Power/Ground Layers
Signal transitions through layers on vias are very common in multi-layer printed circuit board (PCB) design. For a signal via transitioning through the internal power and ground planes, the return current must switch from one reference plane to another. The discontinuity of the return current at the via excites the power and ground planes and results in noise on the power bus that can lead to signal integrity as well as EMI problems. Numerical methods, such as the finite-difference time-domain (FDTD) method, the method of moments (MoM), and the partial element equivalent circuit (PEEC) method, were employed herein to study this problem. The modeled results are supported by measurements. In addition, a common EMI mitigation approach of adding a decoupling capacitor was investigated with the FDTD method.
MODULAR FAST DIRECT ANALYSIS USING NON-RADIATING LOCAL-GLOBAL SOLUTION MODES
This dissertation proposes a modular fast direct (MFD) analysis method for a class of problems involving a large fixed platform region and a smaller, variable design region. A modular solution algorithm is obtained by first decomposing the problem geometry into platform and design regions. The two regions are effectively detached from one another using basic equivalence concepts. Equivalence principles allow the total system model to be constructed in terms of independent interaction modules associated with the platform and design regions. These modules include interactions with the equivalent surface that bounds the design region. This dissertation discusses how to analyze (fill and factor) each of these modules separately and how to subsequently compose the solution to the original system using the separately analyzed modules.
The focus of this effort is on surface integral equation formulations of electromagnetic scattering from conductors and dielectrics. In order to treat large problems, it is necessary to work with sparse representations of the underlying system matrix and other, related matrices. Fortunately, a number of such representations are available. In the following, we will primarily use the adaptive cross approximation (ACA) to fill the multilevel simply sparse method (MLSSM) representation of the system matrix. The MLSSM provides a sparse representation that is similar to the multilevel fast multipole method.
The linear systems arising from the modular analysis strategies described above are solved using direct methods based on the local-global solution (LOGOS) method. In particular, the LOGOS factorization provides a data sparse factorization of the MLSSM representation of the system matrix. In addition, the LOGOS solver also provides an approximate sparse factorization of the inverse of the system matrix. The availability of the inverse eases the development of the MFD method. Because the behavior of the LOGOS factorization is critical to the development of the proposed MFD method, a significant part of this dissertation is devoted to providing additional analyses, improvements, and characterizations of LOGOS-based direct solution methods. These further developments of the LOGOS factorization algorithms and their application to the development of the MFD method comprise the most significant contributions of this dissertation.
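The adaptive cross approximation named in this abstract builds a low-rank approximation of a matrix block from a few of its rows and columns, without forming the whole block. The following is a minimal sketch of a partially pivoted variant and is not the dissertation's implementation: the function names are hypothetical, and the stopping rule (an absolute tolerance on the pivot) is a simplifying assumption.

```python
def aca(entry, n_rows, n_cols, tol=1e-10, max_rank=None):
    """Approximate A[i][j] = entry(i, j) by sum_k us[k][i] * vs[k][j].

    Assumption: a pivot smaller than tol (absolute) means the current
    residual row is already well approximated.
    """
    if max_rank is None:
        max_rank = min(n_rows, n_cols)
    us, vs = [], []
    used_rows = set()
    i = 0
    while len(us) < max_rank:
        used_rows.add(i)
        # Residual of row i: original row minus the current approximation.
        row = [entry(i, j) - sum(u[i] * v[j] for u, v in zip(us, vs))
               for j in range(n_cols)]
        j_piv = max(range(n_cols), key=lambda j: abs(row[j]))
        remaining = [r for r in range(n_rows) if r not in used_rows]
        if abs(row[j_piv]) < tol:
            # Nothing left to capture in this row; try another unused row.
            if not remaining:
                break
            i = remaining[0]
            continue
        v = [x / row[j_piv] for x in row]
        # Residual of the pivot column.
        u = [entry(r, j_piv) - sum(uk[r] * vk[j_piv] for uk, vk in zip(us, vs))
             for r in range(n_rows)]
        us.append(u)
        vs.append(v)
        if not remaining:
            break
        # Next pivot row: largest residual-column entry among unused rows.
        i = max(remaining, key=lambda r: abs(u[r]))
    return us, vs


def low_rank_entry(us, vs, i, j):
    """Evaluate the low-rank approximation at position (i, j)."""
    return sum(u[i] * v[j] for u, v in zip(us, vs))
```

For a block of exact rank r, this procedure stops after r accepted crosses, so only r rows and r columns of the block are ever evaluated in full, plus one row for each candidate that is rejected.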