5,872 research outputs found
86 PFLOPS Deep Potential Molecular Dynamics simulation of 100 million atoms with ab initio accuracy
We present the GPU version of DeePMD-kit, which, upon training a deep neural
network model using ab initio data, can drive extremely large-scale molecular
dynamics (MD) simulation with ab initio accuracy. Our tests show that the GPU
version is 7 times faster than the CPU version with the same power consumption.
The code can scale up to the entire Summit supercomputer. For a copper system
of 113, 246, 208 atoms, the code can perform one nanosecond MD simulation per
day, reaching a peak performance of 86 PFLOPS (43% of the peak). Such
unprecedented ability to perform MD simulation with ab initio accuracy opens up
the possibility of studying many important issues in materials and molecules,
such as heterogeneous catalysis, electrochemical cells, irradiation damage,
crack propagation, and biochemical reactions.Comment: 29 pages, 11 figure
When parallel speedups hit the memory wall
After Amdahl's trailblazing work, many other authors proposed analytical
speedup models but none have considered the limiting effect of the memory wall.
These models exploited aspects such as problem-size variation, memory size,
communication overhead, and synchronization overhead, but data-access delays
are assumed to be constant. Nevertheless, such delays can vary, for example,
according to the number of cores used and the ratio between processor and
memory frequencies. Given the large number of possible configurations of
operating frequency and number of cores that current architectures can offer,
suitable speedup models to describe such variations among these configurations
are quite desirable for off-line or on-line scheduling decisions. This work
proposes new parallel speedup models that account for variations of the average
data-access delay to describe the limiting effect of the memory wall on
parallel speedups. Analytical results indicate that the proposed modeling can
capture the desired behavior while experimental hardware results validate the
former. Additionally, we show that when accounting for parameters that reflect
the intrinsic characteristics of the applications, such as degree of
parallelism and susceptibility to the memory wall, our proposal has significant
advantages over machine-learning-based modeling. Moreover, besides being
black-box modeling, our experiments show that conventional machine-learning
modeling needs about one order of magnitude more measurements to reach the same
level of accuracy achieved in our modeling.Comment: 24 page
Investigation into scalable energy and performance models for many-core systems
PhD ThesisIt is likely that many-core processor systems will continue to penetrate
emerging embedded and high-performance applications. Scalable energy and
performance models are two critical aspects that provide insights into the
conflicting trade-offs between them with growing hardware and software
complexity. Traditional performance models, such as Amdahl’s Law,
Gustafson’s and Sun-Ni’s, have helped the research community and industry
to better understand the system performance bounds with given processing
resources, which is otherwise known as speedup. However, these models and
their existing extensions have limited applicability for energy and/or
performance-driven system optimization in practical systems. For instance,
these are typically based on software characteristics, assuming ideal and
homogeneous hardware platforms or limited forms of processor
heterogeneity. In addition, the measurement of speedup and parallelization
factors of an application running on a specific hardware platform require
instrumenting the original software codes. Indeed, practical speedup and
parallelizability models of application workloads running on modern
heterogeneous hardware are critical for energy and performance models, as
they can be used to inform design and control decisions with an aim to
improve system throughput and energy efficiency.
This thesis addresses the limitations by firstly developing novel and
scalable speedup and energy consumption models based on a more general
representation of heterogeneity, referred to as the normal form heterogeneity.
A method is developed whereby standard performance counters found in
modern many-core platforms can be used to derive speedup, and therefore
the parallelizability of the software, without instrumenting applications. This
extends the usability of the new models to scenarios where the
parallelizability of software is unknown, leading to potentially Run-Time
Management (RTM) speedup and/or energy efficiency optimization. The
models and optimization methods presented in this thesis are validated
through extensive experimentation, by running a number of different
applications in wide-ranging concurrency scenarios on a number of different
homogeneous and heterogeneous Multi/Many Core Processor (M/MCP)
systems. These include homogeneous and heterogeneous architectures and
viii
range from existing off-the-shelf platforms to potential future system
extensions. The practical use of these models and methods is demonstrated
through real examples such as studying the effectiveness of the system load
balancer.
The models and methodologies proposed in this thesis provide guidance to
a new opportunities for improving the energy efficiency of M/MCP systemsHigher Committee of Education Development
(HCED) in Ira
A GPU-accelerated package for simulation of flow in nanoporous source rocks with many-body dissipative particle dynamics
Mesoscopic simulations of hydrocarbon flow in source shales are challenging,
in part due to the heterogeneous shale pores with sizes ranging from a few
nanometers to a few micrometers. Additionally, the sub-continuum fluid-fluid
and fluid-solid interactions in nano- to micro-scale shale pores, which are
physically and chemically sophisticated, must be captured. To address those
challenges, we present a GPU-accelerated package for simulation of flow in
nano- to micro-pore networks with a many-body dissipative particle dynamics
(mDPD) mesoscale model. Based on a fully distributed parallel paradigm, the
code offloads all intensive workloads on GPUs. Other advancements, such as
smart particle packing and no-slip boundary condition in complex pore
geometries, are also implemented for the construction and the simulation of the
realistic shale pores from 3D nanometer-resolution stack images. Our code is
validated for accuracy and compared against the CPU counterpart for speedup. In
our benchmark tests, the code delivers nearly perfect strong scaling and weak
scaling (with up to 512 million particles) on up to 512 K20X GPUs on Oak Ridge
National Laboratory's (ORNL) Titan supercomputer. Moreover, a single-GPU
benchmark on ORNL's SummitDev and IBM's AC922 suggests that the host-to-device
NVLink can boost performance over PCIe by a remarkable 40\%. Lastly, we
demonstrate, through a flow simulation in realistic shale pores, that the CPU
counterpart requires 840 Power9 cores to rival the performance delivered by our
package with four V100 GPUs on ORNL's Summit architecture. This simulation
package enables quick-turnaround and high-throughput mesoscopic numerical
simulations for investigating complex flow phenomena in nano- to micro-porous
rocks with realistic pore geometries
Runtime-guided mitigation of manufacturing variability in power-constrained multi-socket NUMA nodes
This work has been supported by the Spanish Government (Severo Ochoa grants SEV2015-0493, SEV-2011-00067), by
the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253) and the European HiPEAC Network of Excellence. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund
programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243). This work was also partially performed
under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-689878).
Finally, the authors are grateful to the reviewers for their valuable comments, to the RoMoL team, to Xavier Teruel and Kallia Chronaki from the Programming Models group
of BSC and the Computation Department of LLNL for their technical support and useful feedback.Peer ReviewedPostprint (published version
- …