
    Passively parallel regularized stokeslets

    Stokes flow, first analysed by G.G. Stokes in 1851, describes many microscopic biological flow phenomena, including cilia-driven transport and flagellar motility; the need to quantify and understand these flows has motivated decades of mathematical and computational research. Regularized stokeslet methods, which have been used and refined over the past twenty years, offer significant advantages in simplicity of implementation, and a recent modification based on nearest-neighbour interpolation provides substantial improvements in efficiency and accuracy. Moreover, the method can be implemented so that the majority of the computation takes place through built-in linear algebra, meaning that state-of-the-art hardware and software developments in the latter, in particular multicore and GPU computing, can be exploited through minimal modifications ('passive parallelism') to existing MATLAB code. Hence, with widely available GPU hardware, significant improvements in the efficiency of the regularized stokeslet method can be obtained. The approach is demonstrated through computational experiments on three model biological flows: undulatory propulsion of multiple C. elegans, progression and transport by multiple sperm in a geometrically confined region, and left-right symmetry-breaking particle transport in the ventral node of the mouse embryo. In general, an order-of-magnitude improvement in efficiency is observed. This development further widens the range of biological flow systems that are accessible without the need for extensive code development or specialist facilities. Comment: 21 pages, 7 figures, submitte
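    The 'passive parallelism' point, that the physics reduces to a dense matrix-vector product which a BLAS or GPU back end parallelises on its own, can be sketched in a few lines. The following is a minimal NumPy illustration using the standard Cortez regularized-stokeslet blob; the function names and parameter values are our own, not the paper's MATLAB code.

```python
import numpy as np

def stokeslet_matrix(X, eps, mu=1.0):
    """Assemble the dense regularized-stokeslet matrix for N points X (N, 3).

    Uses the classical Cortez blob: S = (r^2 + 2*eps^2)/R^3 * I + (x x^T)/R^3
    with R = sqrt(r^2 + eps^2), scaled by 1/(8 pi mu). Fully vectorised, so the
    physics becomes one matrix-vector product: the 'passive parallelism' idea.
    """
    d = X[:, None, :] - X[None, :, :]        # (N, N, 3) pairwise displacements
    r2 = np.einsum('ijk,ijk->ij', d, d)      # squared distances
    R3 = (r2 + eps**2) ** 1.5                # regularised denominator
    n = X.shape[0]
    A = np.zeros((3 * n, 3 * n))
    for a in range(3):                       # fill 3x3 component blocks
        for b in range(3):
            blk = d[:, :, a] * d[:, :, b] / R3
            if a == b:
                blk = blk + (r2 + 2 * eps**2) / R3
            A[a::3, b::3] = blk
    return A / (8 * np.pi * mu)

# forces -> velocities is now one BLAS call, which multicore/GPU linear
# algebra back ends accelerate without changes to this code
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
F = rng.standard_normal(150)
A = stokeslet_matrix(X, eps=0.1)
U = A @ F
```

Because all pairwise work happens inside `einsum` and the final matrix product, moving the computation to a GPU amounts to swapping the array back end, which is the modification the abstract describes.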

    Power Bounded Computing on Current & Emerging HPC Systems

    Power has become a critical constraint on the evolution of large-scale High Performance Computing (HPC) systems and commercial data centers. This constraint spans almost every level of computing technology, from IC chips all the way up to data centers, for physical, technical, and economic reasons. To cope with this reality, it is necessary to understand how available or permissible power impacts the design and performance of emerging computer systems. We therefore propose power bounded computing and corresponding technologies to optimize performance on HPC systems with limited power budgets. This dissertation pursues several research objectives, centered on the interaction between performance, power bounds, and a hierarchical power management strategy. First, we develop heuristics and application-aware power allocation methods to improve application performance on a single node. Second, we develop algorithms to coordinate power across nodes and components based on application characteristics and the power budget of a cluster. Third, we investigate performance interference induced by hardware and power contention, and propose contention-aware job scheduling to maximize system throughput under given power budgets on node-sharing systems. Fourth, we extend the work to GPU-accelerated systems and workloads and develop an online dynamic performance and power management approach that meets both performance requirements and power-efficiency goals. Power bounded computing improves performance scalability and power efficiency and decreases the operating costs of HPC systems and data centers. This dissertation opens up several new directions for research in power bounded computing to address the power challenges of HPC systems, and the proposed power and resource management techniques provide guidelines for green exascale computing and other computing systems.
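    The single-node, application-aware allocation idea in the first objective can be sketched as a greedy marginal-gain loop: repeatedly give the next increment of watts to whichever component's performance model benefits most, subject to per-component floors and caps. The interface below (component names, floors, caps, and performance-vs-watts models) is hypothetical and purely illustrative of the strategy, not the dissertation's implementation.

```python
import math

def allocate_power(budget, components, step=5):
    """Greedy power allocation under a node budget.

    `components` maps name -> (min_watts, max_watts, perf_fn), where perf_fn is
    a monotone model of performance vs. allocated watts (hypothetical
    interface). Each iteration grants `step` watts to the component with the
    largest marginal performance gain.
    """
    alloc = {n: lo for n, (lo, hi, f) in components.items()}
    budget -= sum(alloc.values())            # floors are granted up front
    assert budget >= 0, "budget below the sum of component floors"
    while budget >= step:
        best, gain = None, 0.0
        for n, (lo, hi, f) in components.items():
            if alloc[n] + step <= hi:        # respect the component cap
                g = f(alloc[n] + step) - f(alloc[n])
                if g > gain:
                    best, gain = n, g
        if best is None:                     # every component is capped
            break
        alloc[best] += step
        budget -= step
    return alloc

# toy application profile: compute-bound, so CPU watts help far more than DRAM
comps = {
    "cpu": (35, 120, lambda w: math.log(w)),
    "dram": (10, 40, lambda w: 0.2 * math.log(w)),
}
alloc = allocate_power(150, comps)
```

Changing the `perf_fn` curves is how application awareness enters: a memory-bound profile would steer the same loop toward DRAM watts instead.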

    Development and Application of Numerical Methods in Biomolecular Solvation

    This work addresses the development of fast summation methods for long-range particle interactions and their application to problems in biomolecular solvation, which describes the interaction of proteins or other biomolecules with their solvent environment. At the core of this work are treecodes, tree-based fast summation methods which, for N particles, reduce the cost of computing particle interactions from O(N^2) to O(N log N). Background on fast summation methods and treecodes in particular, as well as several treecode improvements developed in the early stages of this work, is presented. Building on treecodes, dual tree traversal (DTT) methods are another class of tree-based fast summation methods, which reduce the cost of computing particle interactions for N particles to O(N). The primary result of this work is the development of an O(N) dual tree traversal fast summation method based on barycentric Lagrange polynomial interpolation (BLDTT). This method is implemented to run across multiple GPU compute nodes in the software package BaryTree. Across different problem sizes, particle distributions, geometries, and interaction kernels, the BLDTT shows consistently better performance than the previously developed barycentric Lagrange treecode (BLTC). The first major biomolecular solvation application of fast summation methods presented is to the Poisson–Boltzmann implicit solvent model, and in particular the treecode-accelerated boundary integral Poisson–Boltzmann solver (TABI-PB). The work on TABI-PB consists of three primary projects and an application. The first project investigates the impact of various biomolecular surface meshing codes on TABI-PB and integrates the NanoShaper software into the package, resulting in significantly better performance. Second, a node patch method for discretizing the system of integral equations is introduced to replace the previous centroid collocation scheme, resulting in faster convergence of solvation energies. Third, a new version of TABI-PB with GPU acceleration based on the BLDTT is developed, further improving scalability. An application investigating the binding of biomolecular complexes is undertaken using the previous Taylor treecode-based version of TABI-PB. In addition to these projects, work performed over the course of this thesis integrated TABI-PB into the popular Adaptive Poisson–Boltzmann Solver (APBS) developed at Pacific Northwest National Laboratory. The second major application of fast summation methods is to the 3D reference interaction site model (3D-RISM), a statistical-mechanics-based continuum solvation model. This work applies cluster-particle Taylor expansion treecodes to treat the long-range asymptotic Coulomb-like potentials in 3D-RISM, yielding significant speedups and improved scalability for the 3D-RISM package implemented in AmberTools. Additionally, preliminary work on specialized GPU-accelerated treecodes based on BaryTree for the 3D-RISM long-range asymptotic functions is presented.
    PhD, Applied and Interdisciplinary Mathematics, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/168120/1/lwwilson_1.pd
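    The particle-cluster idea behind treecodes can be sketched in one dimension with the simplest possible cluster approximation, a monopole at the charge centroid. The toy below (all names and parameters our own) shows the O(N log N) structure: well-separated clusters are replaced by one term, near clusters are split or summed directly. Production codes such as BaryTree replace the monopole with high-order barycentric Lagrange interpolation, and the BLDTT additionally traverses source and target trees together to reach O(N).

```python
import numpy as np

def treecode_potential(x, q, theta=0.3, leaf=8):
    """Toy 1D particle-cluster treecode for phi_i = sum_{j != i} q_j / |x_i - x_j|.

    A cluster with radius r at distance d from the target is approximated by a
    single monopole at its charge centroid when r/d < theta; otherwise it is
    split (or, at a leaf, summed directly).
    """
    order = np.argsort(x)
    xs, qs = x[order], q[order]
    n = len(xs)
    phi = np.zeros(n)

    def cluster_sum(i, lo, hi):
        c = (qs[lo:hi] * xs[lo:hi]).sum() / qs[lo:hi].sum()  # charge centroid
        r = 0.5 * (xs[hi - 1] - xs[lo])                      # cluster radius
        d = abs(xs[i] - c)
        if d > 0 and r / d < theta:                          # far: monopole
            return qs[lo:hi].sum() / d
        if hi - lo <= leaf:                                  # near leaf: direct
            return sum(qs[j] / abs(xs[i] - xs[j])
                       for j in range(lo, hi) if j != i)
        mid = (lo + hi) // 2                                 # near: split
        return cluster_sum(i, lo, mid) + cluster_sum(i, mid, hi)

    for i in range(n):
        phi[i] = cluster_sum(i, 0, n)
    out = np.empty(n)
    out[order] = phi                                         # undo the sort
    return out

rng = np.random.default_rng(2)
x = rng.uniform(size=400)
q = rng.uniform(0.5, 1.5, size=400)    # positive charges, so the centroid
phi = treecode_potential(x, q)         # kills the dipole error term
```

Placing the monopole at the charge centroid makes the leading error quadratic in theta, which is why even this crude sketch tracks the direct sum closely.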

    Biomolecular electrostatics with continuum models: a boundary integral implementation and applications to biosensors

    The implicit-solvent model uses continuum electrostatic theory to represent the salt solution around dissolved biomolecules, leading to a coupled system of the Poisson-Boltzmann and Poisson equations. This thesis uses the implicit-solvent model to study the solvation, binding, and adsorption of proteins. We developed an implicit-solvent model solver, called PyGBe, that uses the boundary element method (BEM). The BEM numerically solves integral equations along the biomolecule-solvent interface only, and therefore does not need to discretize the entire domain. PyGBe accelerates the BEM with a treecode algorithm and runs on graphics processing units. We performed extensive verification and validation of the code, comparing it with experimental observations, analytical solutions, and other numerical tools. Our results suggest that a BEM approach is more appropriate than volumetric methods, like finite-difference or finite-element, for high-accuracy calculations. We also discuss the effect of features like solvent-filled cavities and Stern layers in the implicit-solvent model, and find that they become relevant in binding energy calculations. The application that drove this work was nano-scale biosensors: devices designed to detect biomolecules. Biosensors are built with a functionalized layer of ligand molecules, to which the target molecule binds when it is detected. With our code, we performed a study of the orientation of proteins near charged surfaces and investigated the ideal conditions for ligand molecule adsorption. Using immunoglobulin G as a test case, we found that low salt concentration in the solvent and high positive surface charge density lead to favorable orientations of the ligand molecule for biosensing applications. We also studied the plasmonic response of localized surface plasmon resonance (LSPR) biosensors. LSPR biosensors monitor the plasmon resonance frequency of metallic nanoparticles, which shifts when a target molecule binds to a ligand molecule. Electrostatics is a valid approximation to the LSPR biosensor optical phenomenon in the long-wavelength limit, and the BEM was able to reproduce the shift in the plasmon resonance frequency as proteins approach the nanoparticle.
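    The linearized Poisson-Boltzmann model that such BEM solvers discretize is built on the screened-Coulomb (Debye-Hückel) Green's function. A point-charge toy in reduced units, unrelated to PyGBe's actual surface formulation, shows how the salt-dependent screening parameter kappa damps electrostatic interactions, the effect behind the salt-concentration findings above.

```python
import numpy as np

def screened_coulomb_energy(X, q, kappa, eps_r=80.0):
    """Pairwise interaction energy under the linearised Poisson-Boltzmann
    (Debye-Hueckel) kernel exp(-kappa*r) / (eps_r * r).

    Point charges in reduced units; eps_r mimics the relative permittivity of
    water. Higher salt concentration means larger kappa and stronger damping.
    """
    E = 0.0
    for i in range(len(q)):
        for j in range(i + 1, len(q)):
            r = np.linalg.norm(X[i] - X[j])
            E += q[i] * q[j] * np.exp(-kappa * r) / (eps_r * r)
    return E

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 3))
q = np.ones(20)                                            # like charges
E_coulomb = screened_coulomb_energy(X, q, kappa=0.0)       # no salt
E_screened = screened_coulomb_energy(X, q, kappa=2.0)      # salty solvent
```

With kappa = 0 the kernel reduces to the plain Coulomb potential; any positive kappa strictly reduces the magnitude of every pair term.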

    Fast and Accurate Boundary Element Methods in Three Dimensions

    The Laplace and Helmholtz equations are two of the most important partial differential equations (PDEs) in science, governing problems in electromagnetism, acoustics, astrophysics, and aerodynamics. The boundary element method (BEM) is a powerful method for solving these PDEs: it reduces the dimensionality of the problem by one and treats complex boundary shapes and multi-domain problems well. The BEM also suffers from a few problems. The entries of the system matrices require computing boundary integrals, which can be difficult to do accurately, especially in the Galerkin formulation. These matrices are also dense, requiring O(N^2) memory to store and O(N^3) operations to solve using direct matrix decompositions, where N is the number of unknowns; this effectively restricts the size of tractable problems. Methods are presented for computing the boundary integrals that arise in the Galerkin formulation to any desired accuracy. Integrals involving geometrically separated triangles are non-singular, and are computed using a technique based on spherical harmonics and multipole expansions and translations. Integrals involving triangles that share vertices or edges, or that are coincident, are treated via scaling and symmetry arguments combined with recursive geometric decomposition of the integrals. The fast multipole method (FMM) is used to accelerate the BEM. The FMM is usually designed around point sources, not the integral expressions in the BEM; to apply the FMM to these expressions, the internal logic of the FMM must be changed, which can be difficult. The correction factor matrix method is presented, which approximates the integrals using a quadrature. The quadrature points are treated as point sources, which can be plugged directly into existing FMM codes, and any inaccuracies are corrected during a correction factor step. This method reduces the quadratic and cubic scalings of the BEM to linear. Software is developed for computing the solutions to acoustic scattering problems involving spheroids and disks. This software uses spheroidal wave functions to build analytical solutions to these problems, and is used to verify the accuracy of the BEM for the Helmholtz equation. The product of these contributions is a fast and accurate BEM solver for the Laplace and Helmholtz equations.
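    The correction-factor idea can be illustrated on a 1D toy problem: every panel integral is approximated by quadrature point sources, which an unmodified FMM could sum, and only the inaccurate near-field entries are then swapped for exact integrals. Everything below is an illustrative sketch with made-up geometry, not the dissertation's code.

```python
import numpy as np

def panel_quad(x, a, b, nq=2):
    # Gauss-Legendre quadrature of the panel integral  int_a^b dy / |x - y|
    t, w = np.polynomial.legendre.leggauss(nq)
    y = 0.5 * (b - a) * t + 0.5 * (a + b)
    return np.sum(0.5 * (b - a) * w / np.abs(x - y))

def panel_exact(x, a, b):
    # closed form of the same integral for a target x outside [a, b]
    return abs(np.log(abs(x - b)) - np.log(abs(x - a)))

# panels tiling [0, 1]; the target sits close to the last panel, where a
# low-order quadrature of the near-singular kernel is inaccurate
panels = [(0.1 * i, 0.1 * i + 0.1) for i in range(10)]
x = 1.03

quad_all = sum(panel_quad(x, a, b) for a, b in panels)    # what an FMM sums
exact_all = sum(panel_exact(x, a, b) for a, b in panels)

# correction factor step: for near panels only, subtract the quadrature value
# and add back the accurate integral; far panels keep their quadrature values
corrected = quad_all
for a, b in panels:
    if min(abs(x - a), abs(x - b)) < 0.1:
        corrected += panel_exact(x, a, b) - panel_quad(x, a, b)
```

Because the correction touches only O(1) near entries per target, the FMM's linear scaling is preserved while the near-field accuracy is restored.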

    Geometry-Oblivious FMM for Compressing Dense SPD Matrices

    We present GOFMM (geometry-oblivious FMM), a novel method that creates a hierarchical low-rank approximation, or "compression," of an arbitrary dense symmetric positive definite (SPD) matrix. For many applications, GOFMM enables an approximate matrix-vector multiplication in O(N log N) or even O(N) time, where N is the matrix size; compression requires O(N log N) storage and work. In general, our scheme belongs to the family of hierarchical matrix approximation methods. In particular, it generalizes the fast multipole method (FMM) to a purely algebraic setting by requiring only the ability to sample matrix entries: neither geometric information (i.e., point coordinates) nor knowledge of how the matrix entries have been generated is required, hence the term "geometry-oblivious." We also introduce a shared-memory parallel scheme for hierarchical matrix computations that reduces synchronization barriers. We present results on the Intel Knights Landing and Haswell architectures, and on the NVIDIA Pascal architecture, for a variety of matrices. Comment: 13 pages, accepted by SC'1
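    The claim that entry access alone suffices can be illustrated with a deliberately crude sampled-column factorisation of an off-diagonal block: no point coordinates are consulted, only matrix entries. GOFMM's actual skeletonisation is far more sophisticated; all names below are ours.

```python
import numpy as np

def sampled_lowrank(block, k, rng):
    """Low-rank factorisation of a matrix block using only sampled entries.

    Draw k columns at random, orthonormalise them, and project the block onto
    their span: block ~ Q @ (Q.T @ block). A geometry-free stand-in for the
    skeletonisation step of algebraic hierarchical compression.
    """
    cols = rng.choice(block.shape[1], size=k, replace=False)
    Q, _ = np.linalg.qr(block[:, cols])   # basis for the sampled column space
    return Q, Q.T @ block

# an SPD matrix whose generator we pretend not to know; for this 1D
# exponential kernel with sorted points the off-diagonal block is exactly
# rank one (exp(-|x - y|) = exp(x) * exp(-y) for x < y), so a handful of
# sampled columns capture it to machine precision
rng = np.random.default_rng(4)
x = np.sort(rng.uniform(size=200))
K = np.exp(-np.abs(x[:, None] - x[None, :]))   # dense SPD kernel matrix
B = K[:100, 100:]                               # off-diagonal block
Q, R = sampled_lowrank(B, k=4, rng=rng)
rel_err = np.linalg.norm(B - Q @ R) / np.linalg.norm(B)
```

Hierarchical methods apply this kind of compression recursively to all well-separated blocks, which is where the O(N log N) storage bound comes from.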

    GPU-based Private Information Retrieval for On-Device Machine Learning Inference

    On-device machine learning (ML) inference can enable the use of private user data on user devices without revealing it to remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables too large to be stored on-device. In particular, recommendation models typically use multiple embedding tables, each on the order of 1-10 GB, making them impractical to store on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information. As off-the-shelf PIR algorithms are usually too computationally intensive to use directly for latency-sensitive inference tasks, we 1) propose novel GPU-based acceleration of PIR, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than 20× over an optimized CPU PIR implementation, and our PIR-ML co-design provides an over 5× additional throughput improvement at fixed model quality. Together, for various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to 100,000 queries per second, a more than 100× throughput improvement over a CPU-based baseline, while maintaining model accuracy.
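    For readers unfamiliar with PIR, the classic two-server XOR construction shows how a table row can be fetched without revealing its index. This toy is only the textbook primitive, not the cryptographic scheme the paper accelerates, and the table here is a made-up byte matrix standing in for an embedding table.

```python
import secrets
import numpy as np

def pir_queries(index, n_rows):
    """Two-server XOR PIR queries.

    The client sends each server a random-looking bit mask; the two masks
    differ only at `index`. Each mask on its own is uniformly random, so
    neither server learns which row was requested, yet the XOR of the two
    answers is exactly that row.
    """
    mask_a = np.frombuffer(secrets.token_bytes(n_rows), dtype=np.uint8) & 1
    mask_b = mask_a.copy()
    mask_b[index] ^= 1                    # flip exactly one selection bit
    return mask_a, mask_b

def pir_answer(db, mask):
    # server side: XOR together the rows its mask selects (one linear scan,
    # which is the cost the paper's GPU acceleration targets)
    out = np.zeros(db.shape[1], dtype=np.uint8)
    for row in db[mask.astype(bool)]:
        out ^= row
    return out

# a tiny "embedding table" of 16 rows of 4 bytes; the client privately
# fetches row 7 by combining the two servers' answers
db = np.arange(64, dtype=np.uint8).reshape(16, 4)
qa, qb = pir_queries(7, 16)
row = pir_answer(db, qa) ^ pir_answer(db, qb)
```

Every row outside the symmetric difference of the two masks is XORed in either zero or two answers and cancels, leaving only the requested row.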