Parallel HOP: A Scalable Halo Finder for Massive Cosmological Data Sets
Modern N-body cosmological simulations contain billions of dark
matter particles. These simulations require hundreds to thousands of gigabytes
of memory, and employ hundreds to tens of thousands of processing cores on many
compute nodes. In order to study the distribution of dark matter in a
cosmological simulation, the dark matter halos must be identified using a halo
finder, which establishes the halo membership of every particle in the
simulation. The resources required for halo finding are similar to the
requirements for the simulation itself. In particular, simulations have become
too large for commonly employed serial halo finders, so the
computational work of identifying halos must now be spread across multiple
nodes and cores. Here we present a scalable parallel halo finding method called
Parallel HOP for large-scale cosmological simulation data. Based on the halo
finder HOP, it utilizes MPI and domain decomposition to distribute the halo
finding workload across multiple compute nodes, enabling analysis of much
larger datasets than is possible with the strictly serial or previous parallel
implementations of HOP. We provide a reference implementation of this method as
a part of the toolkit yt, an analysis toolkit for Adaptive Mesh Refinement
(AMR) data that includes complementary analysis modules. Additionally, we
discuss a suite of benchmarks demonstrating that this method scales well up
to several hundred tasks on very large datasets. The
Parallel HOP method and our implementation can be readily applied to any kind
of N-body simulation data and are therefore widely applicable.
Comment: 29 pages, 11 figures, 2 tables
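The grouping step at the heart of HOP can be sketched in a few lines. The function below is a hypothetical, simplified single-process kernel (the names and the path-compression detail are ours, not from the paper): each particle hops to its densest neighbor until it reaches a local density maximum, and all particles that reach the same maximum form one halo. Parallel HOP distributes this work over MPI ranks via domain decomposition, which the sketch omits.

```python
def hop_group(density, densest_neighbor):
    """Assign a halo ID to every particle.

    density[i]          -- estimated local density of particle i
    densest_neighbor[i] -- index of the densest particle among i's neighbors
                           (equal to i when i is a local density maximum)
    Returns halo[i] = index of the density peak that particle i hops to.
    """
    n = len(density)
    halo = [-1] * n
    for i in range(n):
        # Follow hops until reaching a peak or an already-labeled particle.
        path, j = [], i
        while halo[j] == -1 and densest_neighbor[j] != j:
            path.append(j)
            j = densest_neighbor[j]
        peak = halo[j] if halo[j] != -1 else j
        for p in path:
            halo[p] = peak   # path compression: label the whole chain at once
        halo[j] = peak
    return halo

# Two density peaks (particles 2 and 5) -> two halos.
density = [1.0, 2.0, 5.0, 1.5, 3.0, 6.0]
densest_neighbor = [1, 2, 2, 4, 5, 5]
print(hop_group(density, densest_neighbor))  # [2, 2, 2, 5, 5, 5]
```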
Geometry-Oblivious FMM for Compressing Dense SPD Matrices
We present GOFMM (geometry-oblivious FMM), a novel method that creates a
hierarchical low-rank approximation, "compression," of an arbitrary dense
symmetric positive definite (SPD) matrix. For many applications, GOFMM enables
an approximate matrix-vector multiplication in O(N log N) or even O(N) time,
where N is the matrix size. Compression requires O(N log N) storage and work.
In general, our scheme belongs to the family of hierarchical matrix
approximation methods. In particular, it generalizes the fast multipole method
(FMM) to a purely algebraic setting by only requiring the ability to sample
matrix entries. Neither geometric information (i.e., point coordinates) nor
knowledge of how the matrix entries have been generated is required, thus the
term "geometry-oblivious." Also, we introduce a shared-memory parallel scheme
for hierarchical matrix computations that reduces synchronization barriers. We
present results on the Intel Knights Landing and Haswell architectures, and on
the NVIDIA Pascal architecture for a variety of matrices.
Comment: 13 pages, accepted by SC'17
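As a toy illustration of the underlying idea (not GOFMM itself), the sketch below compresses a small SPD matrix at a single level: the diagonal blocks stay dense while the off-diagonal blocks are replaced by truncated-SVD factors, so the matrix-vector product applies each off-diagonal block in O(n * rank) work instead of O(n^2). GOFMM builds a full hierarchy and samples entries rather than forming blocks; all names here are invented.

```python
import numpy as np

def low_rank(M, rank):
    """Truncated SVD factorization M ~ U @ V, with U (m x rank), V (rank x n)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]

def compress_2x2(K, rank):
    """One-level 'hierarchical' form: dense diagonal, low-rank off-diagonal."""
    n = K.shape[0] // 2
    return (K[:n, :n], K[n:, n:],
            low_rank(K[:n, n:], rank), low_rank(K[n:, :n], rank))

def matvec(parts, x):
    A, D, (Ub, Vb), (Uc, Vc) = parts
    n = A.shape[0]
    x1, x2 = x[:n], x[n:]
    # Diagonal blocks applied densely; off-diagonal blocks applied through
    # their factors, costing O(n * rank) each.
    return np.concatenate([A @ x1 + Ub @ (Vb @ x2),
                           D @ x2 + Uc @ (Vc @ x1)])

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))
K = X @ X.T + 64 * np.eye(64)          # SPD test matrix
parts = compress_2x2(K, rank=8)        # off-diagonal blocks have rank <= 8
x = rng.standard_normal(64)
err = np.linalg.norm(matvec(parts, x) - K @ x) / np.linalg.norm(K @ x)
print(err)  # tiny: rank 8 captures the off-diagonal blocks exactly here
```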
Distributed Many-to-Many Protein Sequence Alignment using Sparse Matrices
Identifying similar protein sequences is a core step in many computational
biology pipelines such as detection of homologous protein sequences, generation
of similarity protein graphs for downstream analysis, functional annotation and
gene location. Performance and scalability of protein similarity searches have
proven to be a bottleneck in many bioinformatics pipelines due to the rapid growth of
cheap and abundant sequencing data. This work presents a new distributed-memory
software, PASTIS. PASTIS relies on sparse matrix computations for efficient
identification of possibly similar proteins. We use distributed sparse matrices
for scalability and show that the sparse matrix infrastructure is a great fit
for protein similarity searches when coupled with a fully-distributed
dictionary of sequences that allows remote sequence requests to be fulfilled.
Our algorithm incorporates the unique bias in amino acid sequence substitution
in searches without altering the basic sparse matrix model, and in turn,
achieves ideal scaling up to millions of protein sequences.
Comment: To appear in the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'20)
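A single-process analogue of this sparse-matrix formulation (hypothetical code, not PASTIS itself) can be sketched with a sequences-by-k-mers incidence structure: the product A @ A.T counts shared k-mers for every pair of sequences, selecting candidate pairs for alignment without an explicit all-vs-all comparison.

```python
from collections import defaultdict

def kmer_overlap_pairs(seqs, k=3, min_shared=2):
    """Return {(i, j): shared k-mer count} for candidate similar pairs."""
    # Inverted index: k-mer -> set of sequence ids containing it (sparse A^T).
    kmer_to_seqs = defaultdict(set)
    for sid, s in enumerate(seqs):
        for i in range(len(s) - k + 1):
            kmer_to_seqs[s[i:i + k]].add(sid)
    # SpGEMM A @ A.T, expanded column by column: every pair of sequences
    # sharing a k-mer contributes 1 to that pair's entry.
    shared = defaultdict(int)
    for sids in kmer_to_seqs.values():
        sids = sorted(sids)
        for a in range(len(sids)):
            for b in range(a + 1, len(sids)):
                shared[(sids[a], sids[b])] += 1
    return {pair: c for pair, c in shared.items() if c >= min_shared}

seqs = ["MKVLAARG", "MKVLATRG", "GGGGCCCC"]
print(kmer_overlap_pairs(seqs))  # {(0, 1): 3} -- only the first two are similar
```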
Applications on emerging paradigms in parallel computing
The area of computing is seeing parallelism incorporated at many levels: from the lowest levels of vector processing units following the Single Instruction Multiple Data (SIMD) model, Simultaneous Multi-Threading (SMT) architectures, and multi-/many-core processors with thread-level shared memory and SIMT parallelism, to the higher levels of distributed-memory parallelism as in supercomputers and clusters, and up to large distributed systems such as server farms and clouds.
Together these form a large hierarchy of parallelism. Developing high-performance parallel algorithms and efficient software tools that exploit the available parallelism is essential to harness the raw computational power these emerging systems offer. In the work presented in this thesis, we develop architecture-aware parallel techniques for such emerging paradigms in parallel computing, specifically the parallelism offered by emerging multi- and many-core architectures and by the emerging area of cloud computing, targeting large scientific applications.
First, we develop efficient parallel algorithms to compute optimal pairwise alignments of genomic sequences on heterogeneous multi-core processors, and demonstrate them on the IBM Cell Broadband Engine. Then, we develop parallel techniques for scheduling all-pairs computations on heterogeneous systems, including clusters of Cell processors, and NVIDIA graphics processors. We compare the performance of our strategies on Cell, GPU and Intel Nehalem multi-core processors. Further, we apply our algorithms to specific applications taken from the areas of systems biology, fluid dynamics and materials science: pairwise Mutual Information computations for reconstruction of gene regulatory networks; pairwise Lp-norm distance computations for coherent structures discovery in the design of flapping-wing Micro Air Vehicles, and construction of stochastic models for a set of properties of heterogeneous materials.
Lastly, in the area of cloud computing, we propose and develop an abstract framework to enable parallel computations on large tree structures, facilitating easy development of a class of tree-based scientific applications. Our framework, in the style of Google's MapReduce paradigm, is based on two generic user-defined functions through which a user writes an application. We implement our framework as a generic programming library for a large cluster of homogeneous multi-core processors, and demonstrate its applicability through two applications: all-k-nearest-neighbors computations, and Fast Multipole Method (FMM) based simulations.
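A minimal sketch of such a two-function tree framework (names invented for illustration; the real framework distributes the traversal across a cluster of multi-core nodes):

```python
def tree_compute(node, children, node_fn, combine_fn):
    """Bottom-up evaluation: combine_fn(node_fn(node), child results).

    The user supplies only node_fn and combine_fn, in the spirit of the
    two generic user-defined functions the framework is built around;
    the framework owns the traversal.
    """
    results = [tree_compute(c, children, node_fn, combine_fn)
               for c in children(node)]
    return combine_fn(node_fn(node), results)

# Example: sum the values stored in a nested-tuple tree (value, [subtrees]).
tree = (1, [(2, []), (3, [(4, [])])])
total = tree_compute(tree,
                     children=lambda n: n[1],
                     node_fn=lambda n: n[0],
                     combine_fn=lambda v, kids: v + sum(kids))
print(total)  # 10
```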
Quantification of 3D spatial correlations between state variables and distances to the grain boundary network in full-field crystal plasticity spectral method simulations
Deformation microstructure heterogeneities play a pivotal role during
dislocation patterning and interface network restructuring. Thus, they
indirectly affect whether and how an alloy recrystallizes. Given this relevance, it has
become common practice to study the evolution of deformation microstructure
heterogeneities with 3D experiments and full-field crystal plasticity computer
simulations including tools such as the spectral method.
Quantifying the distance from each material point to the closest grain or
phase boundary, though, is a practical challenge with spectral-method crystal
plasticity models, because these discretize the material volume rather than
explicitly mesh the grain and phase boundary interface network. This limitation calls for the development of
interface reconstruction algorithms which enable us to develop specific data
post-processing protocols to quantify spatial correlations between state
variable values at each material point and the points' corresponding distance
to the closest grain or phase boundary.
This work contributes to advance such post-processing routines. Specifically,
two grain reconstruction and three distancing methods are developed to solve
the above challenge. The individual strengths and limitations of these methods,
as well as the efficiency of their parallel implementation, are assessed with an
exemplary large-scale DAMASK crystal plasticity study. We apply the new tool to
assess the evolution of subtle stress and disorientation gradients towards
grain boundaries.
Comment: Manuscript submitted to Modelling and Simulation in Materials Science and Engineering
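One such distancing method could, in spirit, look like the following sketch (not the actual DAMASK post-processing tool): voxels whose neighbors carry a different grain ID are marked as boundary voxels, and a multi-source BFS then propagates a Manhattan distance to every material point. A small non-periodic 2D grid stands in for the periodic 3D spectral-method grid.

```python
from collections import deque

def boundary_distance(grain_ids):
    """Per-voxel Manhattan distance to the closest grain boundary.

    grain_ids: 2D list of grain labels on a regular grid.
    """
    ny, nx = len(grain_ids), len(grain_ids[0])
    dist = [[None] * nx for _ in range(ny)]
    q = deque()
    for y in range(ny):
        for x in range(nx):
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                yy, xx = y + dy, x + dx
                if 0 <= yy < ny and 0 <= xx < nx and grain_ids[yy][xx] != grain_ids[y][x]:
                    dist[y][x] = 0          # voxel touches a grain boundary
                    q.append((y, x))
                    break
    while q:                                 # multi-source BFS from boundaries
        y, x = q.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            yy, xx = y + dy, x + dx
            if 0 <= yy < ny and 0 <= xx < nx and dist[yy][xx] is None:
                dist[yy][xx] = dist[y][x] + 1
                q.append((yy, xx))
    return dist

grains = [[1, 1, 2, 2],
          [1, 1, 2, 2],
          [1, 1, 2, 2]]
# Columns straddling the 1|2 interface get distance 0, their neighbors 1.
print(boundary_distance(grains))
```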
Performance Modeling and Prediction for the Scalable Solution of Partial Differential Equations on Unstructured Grids
This dissertation studies the sources of poor performance in scientific computing codes based on partial differential equations (PDEs), which typically perform at a computational rate well below that of other scientific simulations (e.g., those with dense linear algebra or N-body kernels) on modern architectures with deep memory hierarchies. We identify the primary factors responsible for this relatively poor performance: insufficient available memory bandwidth, a low ratio of work to data size (a consequence of good algorithmic efficiency), and the nonscaling cost of synchronization and gather/scatter operations under fixed-problem-size scaling. This dissertation also illustrates how to reuse legacy scientific and engineering software within a library framework.
Specifically, a three-dimensional unstructured grid incompressible Euler code from NASA has been parallelized with the Portable Extensible Toolkit for Scientific Computing (PETSc) library for distributed memory architectures. Using this newly instrumented code (called PETSc-FUN3D) as an example of a typical PDE solver, we demonstrate some strategies that are effective in tolerating the latencies arising from the hierarchical memory system and the network. Even on a single processor from each of the major contemporary architectural families, the PETSc-FUN3D code runs from 2.5 to 7.5 times faster than the legacy code on a medium-sized data set (with approximately 10^5 degrees of freedom). The major source of performance improvement is the increased locality in data reference patterns achieved through blocking, interlacing, and edge reordering. To explain these performance gains, we provide simple performance models based on memory bandwidth and instruction issue rates.
Experimental evidence, in terms of translation lookaside buffer (TLB) and data cache miss rates, achieved memory bandwidth, and graduated floating point instructions per memory reference, is provided through accurate measurements with hardware counters. The performance models and experimental results motivate algorithmic and software practices that lead to improvements in both parallel scalability and per-node performance. We identify the bottlenecks to scalability (algorithmic as well as implementation) for a fixed-size problem when the number of processors grows to several thousands (the expected level of concurrency on terascale architectures). We also evaluate the hybrid programming model (mixed distributed/shared) from a performance standpoint.
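The flavor of such a bandwidth-based model can be shown with a roofline-style bound: a sparse, unstructured kernel runs at whichever rate is lower, its peak issue rate or its arithmetic intensity times the memory bandwidth. The numbers below are illustrative, not measurements from PETSc-FUN3D.

```python
def predicted_gflops(flops_per_byte, peak_gflops, bandwidth_gbs):
    """Roofline-style bound: attainable rate given arithmetic intensity."""
    return min(peak_gflops, flops_per_byte * bandwidth_gbs)

# A sparse matvec moves roughly 12 bytes per 2 flops (one double value plus
# one 4-byte index per nonzero, ignoring vector reuse), i.e. an arithmetic
# intensity of about 0.17 flop/byte.
rate = predicted_gflops(flops_per_byte=2 / 12,
                        peak_gflops=50.0,
                        bandwidth_gbs=20.0)
print(rate)  # ~3.3 GF/s: far below the 50 GF/s peak, i.e. bandwidth-bound
```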
Performance Evaluation of Pseudospectral Ultrasound Simulations on a Cluster of Xeon Phi Accelerators
The rapid development of novel procedures in medical ultrasonics, including treatment planning in therapeutic ultrasound and image reconstruction in photoacoustic tomography, leads to increasing demand for large-scale ultrasound simulations. However, routine execution of such simulations using traditional methods, e.g., finite difference time domain, is expensive and often considered intractable due to the computational and memory requirements. The k-space corrected pseudospectral time domain method used by the k-Wave toolbox allows for significant reductions in spatial and temporal grid resolution. These improvements are achieved at the cost of all-to-all communications, which are inherent to the multi-dimensional fast Fourier transforms. To improve data locality, reduce communication, and allow efficient use of accelerators, we recently implemented a domain decomposition technique based on a local Fourier basis.
In this paper, we investigate whether it is feasible to run the distributed k-Wave implementation on the Salomon cluster equipped with 864 Intel Xeon Phi (Knights Corner) accelerators. The results show the immaturity of the KNC platform, with issues ranging from limited support for InfiniBand and LustreFS in Intel MPI on this platform to poor performance of the 3D FFTs achieved by Intel MKL on the KNC architecture. Yet, we show that it is possible to achieve strong and weak scaling comparable to CPU-only platforms, albeit with a runtime 1.8× to 4.3× longer. However, the accounting policy for Salomon's accelerators is far more favorable, and thus their employment reduces the computational cost significantly.
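The building block that lets pseudospectral methods tolerate coarse grids is the FFT-based spatial derivative, which is exact for band-limited fields, whereas finite differences need many points per wavelength. A 1D NumPy sketch (without the k-space correction or any domain decomposition):

```python
import numpy as np

def spectral_derivative(f, dx):
    """Spatial derivative of a periodic, uniformly sampled field via FFT."""
    n = f.size
    k = 2 * np.pi * np.fft.fftfreq(n, d=dx)   # angular wavenumbers
    return np.real(np.fft.ifft(1j * k * np.fft.fft(f)))

n = 64
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
f = np.sin(3 * x)
df = spectral_derivative(f, dx=x[1] - x[0])
err = np.max(np.abs(df - 3 * np.cos(3 * x)))
print(err)  # near machine precision on only 64 points: spectral accuracy
```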
Integrated Development and Parallelization of Automated Dicentric Chromosome Identification Software to Expedite Biodosimetry Analysis
Manual cytogenetic biodosimetry lacks the ability to handle mass casualty events. We present an automated dicentric chromosome identification (ADCI) software utilizing parallel computing technology. A parallelization strategy combining data and task parallelism, as well as optimization of I/O operations, has been designed, implemented, and incorporated into ADCI. Experiments on an eight-core desktop show that our algorithm can expedite the process of ADCI at least fourfold. Experiments on Symmetric Computing, SHARCNET, and Blue Gene/Q multi-processor computers demonstrate the capability of parallelized ADCI to process thousands of samples for cytogenetic biodosimetry in a few hours. This increase in speed underscores the effectiveness of parallelization in accelerating ADCI. Our software will be an important tool to handle the magnitude of mass casualty ionizing radiation events by expediting accurate detection of dicentric chromosomes.
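The data-parallel layer described above can be sketched as a pool of worker processes scoring independent samples. Here count_dicentrics is a hypothetical placeholder for the per-image classifier, not the actual ADCI code; the real pipeline also overlaps I/O and applies task parallelism within each image.

```python
from concurrent.futures import ProcessPoolExecutor

def count_dicentrics(sample):
    # Placeholder analysis: pretend the sample id encodes the count.
    return sample, sample % 3

def analyze_samples(samples, workers=4):
    """Score independent samples in parallel worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(count_dicentrics, samples))

if __name__ == "__main__":
    print(analyze_samples(range(6)))  # {0: 0, 1: 1, 2: 2, 3: 0, 4: 1, 5: 2}
```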