General Purpose Flow Visualization at the Exascale
Exascale computing, i.e., supercomputers that can perform 10^18 math operations per second, provides a significant opportunity for the computational sciences. That said, these machines can be difficult to use efficiently due to their massive parallelism, their use of accelerators, and the diversity of accelerators in use. All areas of the computational science stack need to be reconsidered to address these problems. This dissertation considers flow visualization, which is critical for analyzing vector field data from simulations. We specifically consider flow visualization techniques that use particle advection, i.e., tracing particle trajectories, which presents performance and implementation challenges. The dissertation makes four primary contributions. First, it synthesizes previous work on particle advection performance and introduces a high-level analytical cost model. Second, it proposes an approach for performance portability across accelerators. Third, it studies expected speedups from using accelerators, including the importance of factors such as duration, particle count, and data set. Finally, it proposes an exascale-capable particle advection system that addresses diversity in many dimensions, including accelerator type, parallelism approach, analysis use case, underlying vector field, and more.
Efficient Parallel Particle Advection via Targeting Devices
Particle advection is a fundamental operation for a wide range of flow visualization algorithms. Particle advection execution times can vary based on many factors, including the number of particles, duration of advection, and the underlying architecture. In this study, we introduce a new algorithm for parallel particle advection which improves execution time by targeting devices, i.e., adapting to use the CPU or GPU based on the current work. This algorithm is motivated by the observation that CPUs can sometimes perform part of the overall computation faster, since they operate at a higher rate when the threads of a GPU cannot be fully utilized. To evaluate our algorithm, we ran 162 experiments and compared our algorithm to traditional GPU-only and CPU-only approaches. Our results show that our algorithm adapts to match the performance of the faster of the CPU-only and GPU-only approaches.
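The particle-advection kernel at the heart of these flow-visualization algorithms can be sketched in a few lines. The following is a minimal, illustrative sketch (not the authors' implementation) that traces one particle through an analytically defined steady 2D field with classic fourth-order Runge-Kutta; all names and the choice of field are hypothetical:

```python
import numpy as np

def velocity(p):
    # Illustrative steady 2D field (rigid rotation); any callable works.
    x, y = p
    return np.array([-y, x])

def advect_rk4(p0, h, n_steps, vel=velocity):
    """Trace one particle trajectory with classic RK4 integration."""
    traj = [np.asarray(p0, dtype=float)]
    for _ in range(n_steps):
        p = traj[-1]
        k1 = vel(p)
        k2 = vel(p + 0.5 * h * k1)
        k3 = vel(p + 0.5 * h * k2)
        k4 = vel(p + h * k3)
        traj.append(p + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(traj)

# Integrate for one full period of the rotation; the particle should
# return close to its starting point, staying on the unit circle.
traj = advect_rk4((1.0, 0.0), h=0.01, n_steps=int(2 * np.pi / 0.01))
```

In a real flow visualizer, `vel` would interpolate a discrete vector field, and many such traces would run in parallel per device, which is exactly where the CPU/GPU targeting question arises.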
FluTAS: A GPU-accelerated finite difference code for multiphase flows
We present the Fluid Transport Accelerated Solver, FluTAS, a scalable GPU
code for multiphase flows with thermal effects. The code solves the
incompressible Navier-Stokes equation for two-fluid systems, with a direct
FFT-based Poisson solver for the pressure equation. The interface between the
two fluids is represented with the Volume of Fluid (VoF) method, which is mass
conserving and well suited for complex flows thanks to its capacity of handling
topological changes. The energy equation is explicitly solved and coupled with
the momentum equation through the Boussinesq approximation. The code is
conceived in a modular fashion so that different numerical methods can be used
independently, the existing routines can be modified, and new ones can be
included in a straightforward and sustainable manner. FluTAS is written in
modern Fortran and parallelized using hybrid MPI/OpenMP in the CPU-only version
and accelerated with OpenACC directives in the GPU implementation. We present
different benchmarks to validate the code, and two large-scale simulations of
fundamental interest in turbulent multiphase flows: isothermal emulsions in HIT
and two-layer Rayleigh-Bénard convection. FluTAS is distributed under an MIT
license and arises from a collaborative effort of several scientists, aiming to
become a flexible tool to study complex multiphase flows.
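The direct FFT-based pressure solve mentioned above can be illustrated in miniature. This is a minimal 2D sketch assuming a fully periodic domain, not the FluTAS 3D solver: the Laplacian is inverted in spectral space, and the free additive constant is pinned by zeroing the mean mode.

```python
import numpy as np

def poisson_fft_2d(f, L=2 * np.pi):
    """Solve lap(p) = f with periodic BCs via FFT (zero-mean solution)."""
    n = f.shape[0]
    k = np.fft.fftfreq(n, d=L / n) * 2 * np.pi   # angular wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0                                # avoid divide-by-zero
    p_hat = -np.fft.fft2(f) / k2                  # invert the Laplacian
    p_hat[0, 0] = 0.0                             # fix the free constant
    return np.real(np.fft.ifft2(p_hat))

# Check against a known solution: p = sin(x)cos(y) => lap(p) = -2 p.
n = 64
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
p_exact = np.sin(X) * np.cos(Y)
p_num = poisson_fft_2d(-2 * p_exact)
```

Because the solve is a pointwise division in spectral space, its cost is dominated by the FFTs, which is what makes this class of solver attractive on GPUs.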
GAMER: a GPU-Accelerated Adaptive Mesh Refinement Code for Astrophysics
We present the newly developed code, GAMER (GPU-accelerated Adaptive MEsh
Refinement code), which has adopted a novel approach to improve the performance
of adaptive mesh refinement (AMR) astrophysical simulations by a large factor
with the use of the graphics processing unit (GPU). The AMR implementation is
based on a hierarchy of grid patches with an oct-tree data structure. We adopt
a three-dimensional relaxing TVD scheme for the hydrodynamic solver, and a
multi-level relaxation scheme for the Poisson solver. Both solvers have been
implemented on the GPU, by which hundreds of patches can be advanced in parallel.
The computational overhead associated with the data transfer between CPU and
GPU is carefully reduced by utilizing the GPU's capability for asynchronous memory
copies, and the time spent computing ghost-zone values for each patch
is hidden by overlapping it with the GPU computations. We demonstrate
the accuracy of the code by performing several standard test problems in
astrophysics. GAMER is a parallel code that can be run in a multi-GPU cluster
system. We measure the performance of the code by performing purely-baryonic
cosmological simulations in different hardware implementations, in which
detailed timing analyses provide comparison between the computations with and
without GPU(s) acceleration. Maximum speed-up factors of 12.19 and 10.47 are
demonstrated using 1 GPU with 4096^3 effective resolution and 16 GPUs with
8192^3 effective resolution, respectively.
Comment: 60 pages, 22 figures, 3 tables. More accuracy tests are included. Accepted for publication in ApJ.
Doctor of Philosophy dissertation
Visualizing surfaces is a fundamental technique in computer science and is frequently used across a wide range of fields such as computer graphics, biology, engineering, and scientific visualization. In many cases, visualizing an interface between boundaries can provide meaningful analysis or simplification of complex data. Some examples include physical simulation for animation, multimaterial mesh extraction in biophysiology, flow on airfoils in aeronautics, and integral surfaces. However, the quest for high-quality visualization, coupled with increasingly complex data, comes with a high computational cost. Therefore, new techniques are needed to solve surface visualization problems within a reasonable amount of time while also providing sophisticated visuals that are meaningful to scientists and engineers. In this dissertation, novel techniques are presented to facilitate surface visualization. First, a particle system for mesh extraction is parallelized on the graphics processing unit (GPU) with a red-black update scheme to achieve an order-of-magnitude speed-up over a central processing unit (CPU) implementation. Next, extending the red-black technique to multiple materials proved inefficient on the GPU. Therefore, we borrow the underlying data structure of the closest point method, the closest point embedding, and switch the particle system solver to a hierarchical octree-based approach on the GPU. Third, to demonstrate that the closest point embedding is a fast, flexible data structure for surface particles, it is adapted to unsteady surface flow visualization at near-interactive speeds. Finally, the closest point embedding is a dense three-dimensional structure that does not scale well. Therefore, we introduce a closest point sparse octree that allows the closest point embedding to scale to higher resolutions. Further, we demonstrate unsteady line integral convolution using the closest point method.
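The red-black update idea can be illustrated with its classic grid analogue: a checkerboard Gauss-Seidel sweep in which all cells of one color update simultaneously, because each stencil reads only the other color. This is an illustrative stand-in for the parallelization pattern, not the dissertation's particle-system code:

```python
import numpy as np

def red_black_sweep(u):
    """One red-black Gauss-Seidel sweep for the 2D Laplace equation.

    All same-colored cells are independent, so on a GPU each color
    could be updated by thousands of threads at once; here the two
    color passes are just serial loops for clarity.
    """
    for color in (0, 1):
        for i in range(1, u.shape[0] - 1):
            for j in range(1, u.shape[1] - 1):
                if (i + j) % 2 == color:
                    u[i, j] = 0.25 * (u[i - 1, j] + u[i + 1, j]
                                      + u[i, j - 1] + u[i, j + 1])
    return u

# Hot boundary (value 1) on the top edge, 0 elsewhere; the interior
# relaxes toward the discrete harmonic solution.
u = np.zeros((16, 16))
u[0, :] = 1.0
for _ in range(200):
    red_black_sweep(u)
```

The same two-color decoupling is what removes read/write conflicts when neighboring particles, rather than grid cells, are updated in parallel.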
Hybrid Analog-Digital Co-Processing for Scientific Computation
In the past 10 years computer architecture research has moved to more heterogeneity and less adherence to conventional abstractions. Scientists and engineers hold an unshakable belief that computing holds keys to unlocking humanity's Grand Challenges. Acting on that belief, they have looked deeper into computer architecture to find specialized support for their applications. Likewise, computer architects have looked deeper into circuits and devices in search of untapped performance and efficiency. The lines between computer architecture layers---applications, algorithms, architectures, microarchitectures, circuits and devices---have blurred. Against this backdrop, a menagerie of computer architectures is on the horizon: architectures that forgo basic assumptions about computer hardware and require new thinking about how such hardware supports problems and algorithms.
This thesis is about revisiting hybrid analog-digital computing in support of diverse modern workloads. Hybrid computing had extensive applications in early computing history, and has been revisited for small-scale applications in embedded systems. But architectural support for using hybrid computing in modern workloads, at scale and with high accuracy solutions, has been lacking.
I demonstrate solving a variety of scientific computing problems, including stochastic ODEs, partial differential equations, linear algebra, and nonlinear systems of equations, as case studies in hybrid computing. I solve these problems on a system of multiple prototype analog accelerator chips built by a team at Columbia University. On that team I made contributions toward programming the chips, building the digital interface, and validating the chips' functionality. The analog accelerator chip is intended for use in conjunction with a conventional digital host computer.
The appeal and motivation for using an analog accelerator is efficiency and performance, but it comes with limitations in accuracy and problem sizes that we have to work around.
The first problem is how to solve problems in this unconventional computation model. Scientific computing phrases problems as differential equations and algebraic equations. Differential equations are a continuous view of the world, while algebraic equations are a discrete one. Prior work in analog computing focused mostly on differential equations; algebraic equations played only a minor role. The key to using the analog accelerator to support modern workloads on conventional computers is that these two viewpoints are interchangeable: the algebraic equations that underlie most workloads can be solved as differential equations, and differential equations are naturally solvable in the analog accelerator chip. A hybrid analog-digital computer architecture can therefore focus on solving linear and nonlinear algebra problems to support many workloads.
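The interchangeability of the two viewpoints can be made concrete: a root of an algebraic equation f(x) = 0 is the steady state of the ODE dx/dt = -f(x), which an analog integrator can evolve in continuous time. In this sketch, forward Euler stands in for the analog dynamics; the equation and parameters are purely illustrative:

```python
def solve_algebraic_as_ode(f, x0, dt=0.01, t_end=20.0):
    """Find a root of f(x) = 0 by integrating dx/dt = -f(x) to steady state.

    The fixed point of the ODE is exactly a root of the algebraic
    equation, so an integrator (analog or, here, a digital stand-in)
    that reaches steady state has solved the algebraic problem.
    """
    x = float(x0)
    for _ in range(int(t_end / dt)):
        x -= dt * f(x)   # forward Euler step toward the fixed point
    return x

# Solve x**3 = 2 by driving the residual x**3 - 2 to zero.
root = solve_algebraic_as_ode(lambda x: x**3 - 2.0, x0=1.0)
```

This converges when the dynamics are stable near the root (here f is increasing for x > 0); choosing dynamics with that property is part of mapping an algebra problem onto the continuous model.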
The second problem is how to get accurate solutions using hybrid analog-digital computing. The analog computation model gives less accurate solutions because it gives up representing numbers as digital binary values, instead using the full range of analog voltage and current to represent real numbers. Prior work has established that encoding data in analog signals gives an energy efficiency advantage as long as the analog data precision is limited. While the analog accelerator alone may be useful for energy-constrained applications where inputs and outputs are imprecise, we are more interested in using analog in conjunction with digital for precise solutions. This thesis gives the novel insight that the trick is to solve nonlinear problems, where low-precision guesses are useful seeds for conventional digital algorithms.
The third problem is how to solve large problems using hybrid analog-digital computing. The analog computation model cannot handle large problems because it gives up step-by-step discrete-time operation, instead allowing variables to evolve smoothly in continuous time. To make that happen, the analog accelerator chains hardware for mathematical operations end-to-end: during computation, analog data flows through the hardware with no overhead from control logic or memory accesses. The downside is that the required hardware grows with problem size. While scientific computing researchers have long split large problems into smaller subproblems to fit digital computer constraints, this thesis is a first attempt to treat these divide-and-conquer algorithms as an essential tool for using the analog model of computation.
As we enter the post-Moore’s law era of computing, unconventional architectures will offer specialized models of computation that uniquely support specific problem types. Two prominent examples are deep neural networks and quantum computers. Recent trends in computer science research show these unconventional architectures will soon have broad adoption. In this thesis I show that analog accelerators are another such specialized, unconventional architecture, one that can solve problems in scientific computing. Computer architecture researchers will discover other important models of computation in the future. This thesis is an example of the discovery process, implementation, and evaluation of how an unconventional architecture supports specialized workloads.
A holistic scalable implementation approach of the lattice Boltzmann method for CPU/GPU heterogeneous clusters
This is the author accepted manuscript. The final version is available from MDPI via the DOI in this record.
Heterogeneous clusters are a widely utilized class of supercomputers assembled from
different types of computing devices, for instance CPUs and GPUs, providing a huge computational
potential. Programming them in a scalable way exploiting the maximal performance introduces
numerous challenges such as optimizations for different computing devices, dealing with multiple
levels of parallelism, the application of different programming models, work distribution, and hiding
of communication with computation. We utilize the lattice Boltzmann method for fluid flow as
a representative of a scientific computing application and develop a holistic implementation for
large-scale CPU/GPU heterogeneous clusters. We review and combine a set of best practices and
techniques ranging from optimizations for the particular computing devices to the orchestration
of tens of thousands of CPU cores and thousands of GPUs. The result is an
implementation that uses all available computational resources for the lattice Boltzmann
method operators. Our approach shows excellent scalability, making it future-proof for
heterogeneous clusters of upcoming architectures at the exaFLOPS scale. Parallel efficiencies of
more than 90% are achieved leading to 2,604.72 GLUPS utilizing 24,576 CPU cores and 2,048 GPUs of
the CPU/GPU heterogeneous cluster Piz Daint and computing more than 6.8 · 10^9 lattice cells.
This work was supported by the German Research Foundation (DFG) as part of the
Transregional Collaborative Research Centre “Invasive Computing” (SFB/TR 89). In addition, this work was
supported by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID d68. We further
thank the Max Planck Computing & Data Facility (MPCDF) and the Global Scientific Information and Computing
Center (GSIC) for providing computational resources.
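A single lattice Boltzmann update, the operator this work distributes across CPU cores and GPUs, can be sketched serially. This is a minimal D2Q9 BGK collide-and-stream step on a periodic grid, purely illustrative (the relaxation time and initial state are arbitrary):

```python
import numpy as np

# D2Q9 lattice: discrete velocities and their weights.
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
w = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

def equilibrium(rho, ux, uy):
    """BGK equilibrium distributions for D2Q9."""
    cu = c[:, 0, None, None] * ux + c[:, 1, None, None] * uy
    usq = ux**2 + uy**2
    return w[:, None, None] * rho * (1 + 3 * cu + 4.5 * cu**2 - 1.5 * usq)

def lbm_step(f, tau=0.8):
    """One BGK collision plus periodic streaming step (single device)."""
    rho = f.sum(axis=0)
    ux = (f * c[:, 0, None, None]).sum(axis=0) / rho
    uy = (f * c[:, 1, None, None]).sum(axis=0) / rho
    f += -(f - equilibrium(rho, ux, uy)) / tau            # collide
    for i in range(9):                                     # stream
        f[i] = np.roll(np.roll(f[i], c[i, 0], axis=0), c[i, 1], axis=1)
    return f

# Uniform density with a small x-velocity; run a few steps.
f = equilibrium(np.ones((32, 32)), 0.05 * np.ones((32, 32)), np.zeros((32, 32)))
mass0 = f.sum()
for _ in range(50):
    lbm_step(f)
```

The streaming step is the part that crosses subdomain boundaries on a cluster, which is why hiding its communication behind the collision computation matters for scalability.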
Towards Expressive and Versatile Visualization-as-a-Service (VaaS)
The rapid growth of data in scientific visualization has posed significant challenges to the scalability and availability of interactive visualization tools. These challenges can be largely attributed to the limitations of traditional monolithic applications in handling large datasets and accommodating multiple users or devices. To address these issues, the Visualization-as-a-Service (VaaS) architecture has emerged as a promising solution. VaaS leverages cloud-based visualization capabilities to provide on-demand and cost-effective interactive visualization. Existing VaaS designs have been simplistic, focusing on task parallelism with single-user-per-device tasks for predetermined visualizations. This dissertation aims to extend the capabilities of VaaS by exploring data-parallel visualization services with multi-device support and hypothesis-driven explorations. By incorporating stateful information and enabling dynamic computation, VaaS' performance and flexibility for various real-world applications are improved. This dissertation explores the history of monolithic and VaaS architectures, the design and implementation of three new VaaS applications, and a final exploration of the future of VaaS. This research contributes to the advancement of interactive scientific visualization, addressing the challenges posed by large datasets and remote collaboration scenarios.
Lagrangian coherent structures and trajectory similarity: two important tools for scientific visualization
This thesis studies the computation and visualization of Lagrangian coherent structures (LCS), an emerging technique for analyzing time-varying velocity fields (e.g., blood vessels and airflows), and the measurement of similarity for trajectories (e.g., hurricane paths). LCS surfaces and trajectory-based techniques (e.g., trajectory clustering) are complementary for visualization, while velocity fields and trajectories are two important types of scientific data that are increasingly accessible thanks to advances in both data collection and numerical simulation.
A key step for LCS computation is tracing the paths of collections of particles through a flow field. When a flow field is interpolated from the nodes of an unstructured mesh, the process of advecting a particle must first find which cell in the unstructured mesh contains the particle. Since the paths of nearby particles often diverge, the parallelization of particle advection quickly leads to incoherent memory accesses of the unstructured mesh. We have developed a new block advection GPU approach that reorganizes particles into spatially coherent bundles as they follow their advection paths, which greatly improves memory coherence and thus shared-memory GPU performance. This approach works best for flows that meet the CFL criterion on unstructured meshes of uniformly sized elements, small enough to fit at least two timesteps in GPU memory.
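The bundling idea can be sketched on a uniform grid: periodically sorting particles by the index of their containing cell keeps spatially nearby particles contiguous in memory, which is what makes warp-level memory accesses coherent. The real approach targets unstructured meshes; this uniform-grid version is only an illustration, with hypothetical names:

```python
import numpy as np

def bundle_particles(positions, cell_size):
    """Reorder particles so those in the same grid cell are adjacent.

    Returns the reordered positions and the permutation applied, so a
    caller could also reorder per-particle state (velocities, ids).
    """
    cells = np.floor(positions / cell_size).astype(int)
    nx = cells[:, 0].max() + 1
    keys = cells[:, 1] * nx + cells[:, 0]        # linearized cell index
    order = np.argsort(keys, kind="stable")      # stable: preserves ties
    return positions[order], order

# Four particles in two cells; after bundling, cell-mates are adjacent.
pos = np.array([[0.9, 0.1], [2.1, 0.1], [0.1, 0.1], [2.2, 0.2]])
bundled, order = bundle_particles(pos, cell_size=1.0)
```

On a GPU, re-bundling every few advection steps amortizes the sort against the memory-coherence gains of the steps in between.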
LCS surfaces provide insight into unsteady fluid flow, but their construction has posed many challenges. These structures can be characterized as ridges of a field, but their local definition utilizes an ambiguous eigenvector direction that can point in one of two directions, and its ambiguity can lead to noise and other problems. We overcome these issues with an application of a global ridge definition, applied using the hierarchical watershed transformation. We show results on a mathematical flow model and a simulated vascular flow dataset indicating the watershed method produces less noisy structures.
Trajectory similarity has been shown to be a powerful tool for visualizing and analyzing trajectories. In this paper we propose a novel measure of trajectory similarity using both spatial and directional information. The similarity is asymmetric, bounded within [0,1], affine-invariant, and efficiently computed. Asymmetric mappings between a pair of trajectories can be derived from this similarity. Experimental results demonstrate that the measure is better than existing measures in both similarity scores and trajectory mappings. The measure also inspires a simple similarity-based clustering method for effectively visualizing a large number of trajectories, which outperforms the state-of-the-art model-based clustering method (VFKM).
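For context, one classic existing trajectory-similarity measure, the discrete Fréchet distance, can be computed with a small dynamic program. This is not the thesis's proposed measure (which is asymmetric, bounded in [0,1], and affine-invariant); it is one of the standard baselines such a measure would be compared against:

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two polyline trajectories.

    ca[i, j] holds the minimal 'leash length' needed to walk P[:i+1]
    and Q[:j+1] monotonically from their starts.
    """
    n, m = len(P), len(Q)
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    ca = np.full((n, m), np.inf)
    ca[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(ca[i - 1, j] if i > 0 else np.inf,
                       ca[i, j - 1] if j > 0 else np.inf,
                       ca[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            ca[i, j] = max(prev, d[i, j])
    return ca[-1, -1]

# Two parallel straight trajectories offset by 1 in y.
P = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
Q = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
dist = discrete_frechet(P, Q)
```

Unlike the thesis's measure, this distance is symmetric and unbounded, and it uses only spatial information, with no directional term.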