The fast multipole method at exascale
This thesis presents a top-to-bottom analysis of designing and implementing fast algorithms for current and future systems. We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N-body problems. We target the FMM because it is broadly applicable to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena, among others. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. As such, it is among the most attractive solutions for scalable particle simulation on future extreme-scale systems.
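As context for what the FMM accelerates, the sketch below shows the naive O(N^2) direct summation for particle potentials, which the FMM approximates in optimal time with guaranteed accuracy. This is a generic illustration, not code from the thesis; the softening parameter `eps` is an assumption added for numerical stability.

```python
import numpy as np

def direct_potential(pos, mass, eps=1e-9):
    """Naive O(N^2) pairwise potential -- the baseline the FMM replaces.

    pos  : (N, 3) particle positions
    mass : (N,) particle masses
    eps  : softening to avoid division by zero (illustration only)
    """
    n = len(pos)
    phi = np.zeros(n)
    for i in range(n):
        # distance from particle i to every other particle
        r = np.linalg.norm(pos - pos[i], axis=1)
        r[i] = np.inf                  # skip the self-interaction
        phi[i] = -np.sum(mass / (r + eps))
    return phi

# The FMM computes the same sums to a guaranteed accuracy in O(N)
# by clustering far-away particles into multipole expansions.
phi = direct_potential(np.random.rand(1000, 3), np.ones(1000))
```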
We specifically address two key challenges. The first challenge is how to engineer fast code for today's platforms. We present the first in-depth study of multicore optimizations and tuning for the FMM, along with a systematic approach for transforming a conventionally parallelized FMM into a highly tuned one. We introduce novel optimizations that significantly improve the within-node scalability of the FMM, thereby enabling high performance on multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter-node communication costs. Using these models, we show how to choose the optimal algorithmic tuning parameter. This analysis also yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs, absent significant changes, could cause it to become communication-bound as early as the year 2015. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of high-level algorithm-architecture co-design.
To demonstrate the scientific significance of the FMM, we present two applications: direct simulation of blood, a multi-scale, multi-physics problem, and large-scale biomolecular electrostatics. MoBo (Moving Boundaries) is the infrastructure for the direct numerical simulation of blood. It comprises two key algorithmic components, of which the FMM is one. We were able to simulate blood flow using Stokesian dynamics on 200,000 cores of Jaguar, a petaflop system, and achieve a sustained performance of 0.7 Petaflop/s. The second application, which we propose as future work in this thesis, is biomolecular electrostatics, where we solve for the electrical potential using the boundary-integral formulation discretized with boundary element methods (BEM). The computational kernel in solving the large linear system is a dense matrix-vector multiply, which we propose to compute using our scalable FMM. We propose to begin with the two-dielectric problem, where the electrostatic field is calculated using two continuum dielectric media, the solvent and the molecule. This is only a first step toward solving biologically challenging problems, which have more than two dielectric media, ion-exclusion layers, and solvent-filled cavities.
Finally, given the difficulty of producing high-performance scalable code, productivity is a key concern. Recently, numerical algorithms have been redesigned to take advantage of the architectural features of emerging multicore processors. These new classes of algorithms express fine-grained asynchronous parallelism and hence reduce the cost of synchronization. We performed the first extensive performance study of a recently proposed parallel programming model, called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially ordered by semantic scheduling constraints. The CnC model is well suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art multicore systems. Our implementations in CnC were able to match, and in some cases even exceed, competing vendor-tuned and domain-specific library codes. We combine these two distinct research efforts by expressing the FMM in CnC; our approach tries to marry performance with productivity, which will be critical on future systems. Looking forward, we would like to extend this work to distributed-memory machines, specifically to implement the FMM in the new distributed CnC (distCnC) to express fine-grained parallelism that would require significant effort in alternative models.
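To make the partially-ordered-operations idea concrete, the sketch below shows the general pattern in plain Python: steps run asynchronously as soon as their inputs exist, with no global barrier. It illustrates the dataflow style the abstract describes, not the actual CnC API; the step names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def factor(block):
    # hypothetical step: factor a diagonal block
    return ("L", block)

def solve(l_future, block):
    # hypothetical step: runs once its input (the factored block) is
    # ready -- a dataflow dependence, not a global barrier
    l = l_future.result()
    return ("X", l, block)

with ThreadPoolExecutor() as pool:
    l0 = pool.submit(factor, 0)
    # The three solve steps are mutually independent and may run in any
    # order; only the factor -> solve ordering is semantically required.
    solves = [pool.submit(solve, l0, b) for b in range(1, 4)]
    print([s.result() for s in solves])
```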
adPerf: Characterizing the Performance of Third-party Ads
Monetizing websites and web apps through online advertising is widespread in
the web ecosystem. The online advertising ecosystem nowadays forces publishers
to integrate ads from third-party domains. On the one hand, this raises
several privacy and security concerns that have been actively studied in recent
years. On the other hand, given the ability of today's browsers to load dynamic
web pages with complex animations and JavaScript, online advertising has also
transformed and can have a significant impact on webpage performance. The
performance cost of online ads is critical since it eventually impacts user
satisfaction as well as users' Internet bills and device energy consumption.
In this paper, we present an in-depth, first-of-its-kind performance
evaluation of web ads. Unlike prior efforts that rely primarily on adblockers,
we perform a fine-grained analysis of the web browser's page loading process to
demystify the performance cost of web ads. We aim to characterize the cost of
every component of an ad, so that the publisher, ad syndicate, and advertiser can
improve the ad's performance with detailed guidance. For this purpose, we
develop an infrastructure, adPerf, for the Chrome browser that classifies page
loading workloads into ad-related and main-content at the granularity of
browser activities (such as JavaScript and Layout). Our evaluations show that
online advertising entails more than 15% of the browser's page loading workload,
and approximately 88% of that is spent on JavaScript. We also track the sources and
delivery chain of web ads and analyze performance considering the origin of the
ad contents. We observe that two well-known third-party ad domains
contribute 35% of the ads' performance cost and, surprisingly, top news
websites implicitly include unknown third-party ads which in some cases account
for more than 37% of the ads' performance cost.
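The core classification step the abstract describes can be pictured as attributing each traced browser activity to ad or main content by its origin and aggregating the time per activity type. The sketch below is a simplified illustration of that idea, not adPerf's actual implementation; the trace format and the ad-domain list are assumptions.

```python
from urllib.parse import urlparse

# Hypothetical trace: (activity type, origin URL, duration in ms),
# standing in for the fine-grained browser events adPerf collects.
trace = [
    ("JavaScript", "https://news.example.com/app.js", 120.0),
    ("JavaScript", "https://ads.tracker.example/ad.js", 310.0),
    ("Layout",     "https://ads.tracker.example/frame", 45.0),
]

AD_DOMAINS = {"ads.tracker.example"}   # assumed filter list

def attribute(trace):
    """Split per-activity time into ad-related vs. main-content buckets."""
    cost = {"ad": {}, "main": {}}
    for activity, url, ms in trace:
        bucket = "ad" if urlparse(url).hostname in AD_DOMAINS else "main"
        cost[bucket][activity] = cost[bucket].get(activity, 0.0) + ms
    return cost

print(attribute(trace))  # {'ad': {'JavaScript': 310.0, 'Layout': 45.0}, ...}
```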
Review: Artificial Intelligence for Liquid-Vapor Phase-Change Heat Transfer
Artificial intelligence (AI) is shifting the paradigm of two-phase heat
transfer research. Recent innovations in AI and machine learning uniquely offer
the potential for collecting new types of physically meaningful features that
have not been addressed in the past, for making their insights available to
other domains, and for solving for physical quantities based on first
principles for phase-change thermofluidic systems. This review outlines core
ideas of current AI technologies connected to thermal energy science to
illustrate how they can be used to push the boundaries of our knowledge
about boiling and condensation phenomena. AI technologies for meta-analysis,
data extraction, and data stream analysis are described with their potential
challenges, opportunities, and alternative approaches. Finally, we offer
outlooks and perspectives regarding physics-centered machine learning,
sustainable cyberinfrastructures, and multidisciplinary efforts that will help
foster the growing trend of AI for phase-change heat and mass transfer.
CFDNet: a deep learning-based accelerator for fluid simulations
Computational fluid dynamics (CFD) is widely used in physical system design and
optimization, where it is used to predict engineering quantities of interest, such
as the lift on a plane wing or the drag on a motor vehicle. However, many systems
of interest are prohibitively expensive for design optimization, due to the expense
of evaluating CFD simulations. To render the computation tractable, reduced-order
or surrogate models are used to accelerate simulations while respecting the
convergence constraints provided by the higher-fidelity solution. This paper
introduces CFDNet, a coupled physical simulation and deep learning framework
for accelerating the convergence of Reynolds-Averaged Navier-Stokes
simulations. CFDNet is designed to predict the primary physical properties of
the fluid, including velocity, pressure, and eddy viscosity, using a single
convolutional neural network at its core. We evaluate CFDNet on a variety of
use cases, both extrapolative and interpolative, where test geometries are
either observed or not observed during training. Our results show that CFDNet
meets the convergence constraints of the domain-specific physics solver while
outperforming it by 1.9-7.4x on both steady laminar and turbulent flows.
Moreover, we demonstrate the generalization capacity of CFDNet by testing its
predictions on new geometries unseen during training. In this case, the approach
meets the CFD convergence criterion while still providing significant speedups
over traditional domain-only models.
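The coupling pattern described above can be sketched as a warm-start loop: run the physics solver for a few iterations, let the network predict a near-converged field, then hand that prediction back to the solver until its own convergence criterion is met. The sketch below illustrates this pattern on a toy Jacobi solver; the surrogate here is a stand-in callable, not CFDNet's actual CNN.

```python
import numpy as np

def jacobi_step(u):
    """One Jacobi iteration for a 2D Laplace problem (toy physics solver)."""
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v

def surrogate(u):
    # Stand-in for the trained network: here it just applies extra
    # smoothing, mimicking a "jump" toward the converged solution.
    for _ in range(50):
        u = jacobi_step(u)
    return u

u = np.zeros((64, 64)); u[0, :] = 1.0   # fixed boundary condition
for _ in range(10):                     # warm-up solver iterations
    u = jacobi_step(u)
u = surrogate(u)                        # network prediction replaces the field
while True:                             # refine until the solver's own
    v = jacobi_step(u)                  # convergence criterion is met
    if np.max(np.abs(v - u)) < 1e-6:
        break
    u = v
```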
BubbleML: A Multi-Physics Dataset and Benchmarks for Machine Learning
In the field of phase change phenomena, the lack of accessible and diverse
datasets suitable for machine learning (ML) training poses a significant
challenge. Existing experimental datasets are often restricted, with limited
availability and sparse ground truth data, impeding our understanding of these
complex multiphysics phenomena. To bridge this gap, we present the BubbleML
Dataset (https://github.com/HPCForge/BubbleML), which
leverages physics-driven simulations to provide accurate ground truth
information for various boiling scenarios, encompassing nucleate pool boiling,
flow boiling, and sub-cooled boiling. This extensive dataset covers a wide
range of parameters, including varying gravity conditions, flow rates,
sub-cooling levels, and wall superheat, comprising 79 simulations. BubbleML is
validated against experimental observations and trends, establishing it as an
invaluable resource for ML research. Furthermore, we showcase its potential to
facilitate exploration of diverse downstream tasks by introducing two
benchmarks: (a) optical flow analysis to capture bubble dynamics, and (b)
operator networks for learning temperature dynamics. The BubbleML dataset and
its benchmarks serve as a catalyst for advancements in ML-driven research on
multiphysics phase change phenomena, enabling the development and comparison of
state-of-the-art techniques and models.
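To give a flavor of the second benchmark, the sketch below trains a model to advance a temperature field one step in time, the operator-learning task the abstract describes. The data here is random placeholder tensors and the small CNN is a stand-in; BubbleML's actual loaders, field layout, and the operator networks used in the benchmarks differ.

```python
import torch
import torch.nn as nn

# Placeholder batch: 8 samples of a 64x64 temperature field and its
# next-step target. Real training would read BubbleML simulation frames.
temp_t  = torch.randn(8, 1, 64, 64)
temp_t1 = torch.randn(8, 1, 64, 64)

# Small CNN standing in for an operator network.
model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.GELU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.GELU(),
    nn.Conv2d(32, 1, 3, padding=1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    opt.zero_grad()
    # Learn the temperature-advance operator: field at t -> field at t+1.
    loss = nn.functional.mse_loss(model(temp_t), temp_t1)
    loss.backward()
    opt.step()
```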
Towards Portable Online Prediction of Network Utilization using MPI-level Monitoring
Stealing network bandwidth helps a variety of HPC runtimes and services run additional operations in the background without negatively affecting the applications. A key ingredient to make this possible is an accurate prediction of future network utilization, enabling the runtime to plan the background operations in advance so as to avoid competing with the application for network bandwidth. In this paper, we propose a portable deep learning predictor that uses only the information available through MPI introspection to construct a recurrent sequence-to-sequence neural network capable of forecasting network utilization. We leverage the fact that most HPC applications exhibit periodic behaviors to enable predictions far into the future (at least the length of a period). Our online approach does not have an initial training phase; it continuously improves itself during application execution without incurring significant computational overhead. Experimental results show better accuracy and lower computational overhead compared with the state of the art on two representative applications.
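A minimal sketch of the online forecasting idea follows: a small recurrent model is updated continuously as new MPI-level utilization samples arrive, with no separate offline training phase. The window lengths and the single-layer LSTM are illustrative assumptions, not the paper's exact sequence-to-sequence architecture.

```python
import torch
import torch.nn as nn

HIST, HORIZON = 32, 8   # assumed window lengths: past samples -> forecast

class Forecaster(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, HORIZON)  # next HORIZON values

    def forward(self, x):                       # x: (batch, HIST, 1)
        _, (h, _) = self.rnn(x)
        return self.head(h[-1])

model = Forecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def on_new_sample(history):
    """Called as MPI monitoring produces samples; trains and predicts online.

    history: list of utilization samples, at least HIST + HORIZON long.
    """
    x = torch.tensor(history[-(HIST + HORIZON):-HORIZON]).view(1, HIST, 1)
    y = torch.tensor(history[-HORIZON:]).view(1, HORIZON)
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)  # learn from the latest window
    loss.backward()
    opt.step()
    with torch.no_grad():                       # forecast the next HORIZON steps
        return model(torch.tensor(history[-HIST:]).view(1, HIST, 1))
```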
Scalable Communication Endpoints for MPI+Threads Applications
Hybrid MPI+threads programming is gaining prominence as an alternative to the
traditional "MPI everywhere" model to better handle the disproportionate
increase in the number of cores compared with other on-node resources. Current
implementations of these two models represent the two extreme cases of
communication resource sharing in modern MPI implementations. In the
MPI-everywhere model, each MPI process has a dedicated set of communication
resources (also known as endpoints), which is ideal for performance but
wasteful of resources. With MPI+threads, current MPI implementations share a
single communication endpoint among all threads, which is ideal for resource
usage but hurts performance.
In this paper, we explore the tradeoff space between performance and
communication resource usage in MPI+threads environments. We first demonstrate
the two extreme cases, one where all threads share a single communication
endpoint and another where each thread gets its own dedicated communication
endpoint (similar to the MPI-everywhere model), and showcase the inefficiencies
of both. Next, we perform a thorough analysis of the different levels of
resource sharing in the context of Mellanox InfiniBand. Using the lessons
learned from this analysis, we design an improved resource-sharing model to
produce scalable communication endpoints that can achieve the same performance
as with dedicated communication resources per thread while using just a third
of the resources.
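One way to approximate the "dedicated endpoint per thread" extreme in practice is to give each thread its own communicator, which MPI libraries that support multiple network contexts can map to distinct communication resources. The mpi4py sketch below illustrates that setup under MPI_THREAD_MULTIPLE; whether duplicated communicators actually map to separate endpoints is implementation-dependent, and this is an illustration of the tradeoff, not the paper's mechanism.

```python
from threading import Thread
from mpi4py import MPI

# Requires an MPI library initialized with full thread support.
assert MPI.Query_thread() == MPI.THREAD_MULTIPLE

NTHREADS = 4
# One duplicated communicator per thread: some MPI implementations map
# distinct communicators to distinct network resources, approximating the
# dedicated-endpoint extreme. Sharing one communicator across all threads
# corresponds to the shared-endpoint extreme.
comms = [MPI.COMM_WORLD.Dup() for _ in range(NTHREADS)]

def worker(tid):
    comm = comms[tid]
    peer = (comm.rank + 1) % comm.size
    # Each thread communicates independently on its own communicator.
    comm.sendrecv(tid, dest=peer, source=MPI.ANY_SOURCE)

threads = [Thread(target=worker, args=(t,)) for t in range(NTHREADS)]
for t in threads: t.start()
for t in threads: t.join()
for c in comms: c.Free()
```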