Simulation of the performance and scalability of message passing interface (MPI) communications of atmospheric models running on exascale supercomputers
In this study, we identify the key message passing interface (MPI) operations
required in atmospheric modelling; then, we use a skeleton program and a
simulation framework (based on the SST/macro simulation package) to simulate
these MPI operations (transposition, halo exchange, and
allreduce), with the perspective of future exascale machines in
mind. The experimental results show that the choice of the collective
algorithm has a great impact on the performance of communications; in
particular, we find that the generalized ring-k algorithm for the alltoallv
operation and the generalized recursive-k algorithm for the allreduce
operation perform the best. In addition, we observe that the impacts of
interconnect topologies and routing algorithms on the performance and
scalability of transpositions, halo exchange, and allreduce operations are
significant. However, the routing algorithm has a negligible impact on the
performance of the allreduce operation because of its small message sizes.
Hardware limitations make it impossible to grow bandwidth and reduce latency
indefinitely; thus, congestion may occur and limit the continued improvement
of the performance of communications. The experiments show that the
performance of communications can be improved when congestion is mitigated by
a proper configuration of the topology and routing algorithm that uniformly
distributes traffic over the interconnect network, avoiding hotspots
and bottlenecks. It is generally believed that the
transpositions seriously limit the scalability of the spectral models. The
experiments show that the communication time of the transposition is larger
than those of the wide halo exchange for the semi-Lagrangian method and the
allreduce in the generalized conjugate residual (GCR) iterative solver for
the semi-implicit method below 2×10^5 MPI processes. The
transposition, whose communication time decreases quickly with an increasing
number of MPI processes, demonstrates strong scalability in the case of very
large grids and moderate latencies. The halo exchange, whose communication
time decreases more slowly than that of the transposition with an increasing
number of MPI processes, scales less well. In contrast, the allreduce,
whose communication time increases with an increasing number of MPI processes,
does not scale well. From this point of view, the scalability of spectral
models could still be acceptable. Therefore, it seems premature to
conclude that the scalability of grid-point models is better than that of
spectral models at the exascale, unless innovative methods are exploited to
mitigate the scalability problems present in the grid-point models.
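The round structure behind the generalized recursive-k allreduce mentioned above can be illustrated with a small sketch. The function below is a hypothetical simulation of the reduction rounds (not the SST/macro skeleton program), assuming the number of ranks is a power of the radix k:

```python
# Hypothetical sketch of a radix-k recursive allreduce (sum), in the spirit
# of the generalized recursive-k algorithm named above; this simulates the
# communication rounds rather than using real MPI.
def recursive_k_allreduce(values, k):
    """Simulate an allreduce over len(values) ranks with radix k.

    Assumes the rank count is a power of k. Returns the per-rank results
    and the number of communication rounds each rank takes part in.
    """
    p = len(values)
    vals = list(values)
    rounds, span = 0, 1
    while span < p:
        new_vals = list(vals)
        for rank in range(p):
            # Each rank reduces over its k-member group at this round's stride.
            base = (rank // (span * k)) * (span * k) + rank % span
            peers = [base + i * span for i in range(k)]
            new_vals[rank] = sum(vals[peer] for peer in peers)
        vals = new_vals
        span *= k
        rounds += 1
    return vals, rounds
```

With radix k, the number of rounds drops from log2(p) to log_k(p) at the cost of k-1 partners per rank per round, which is the kind of trade-off the choice of collective algorithm exposes.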
Routing on the Channel Dependency Graph: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks
In the pursuit of ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing has started to scale out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on the total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable due to topologies that are either irregular by design or, more often, the result of hardware failures. Exchanging faulty network components potentially requires whole-system downtime, further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, in terms of both hardware and software management, are necessary to mitigate the negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables.
The fail-in-place strategy, a well-established method in storage systems of repairing only critical component failures, is a feasible approach for current and future HPC interconnects as well as other large-scale installations such as data-center networks. However, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while the system is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing algorithm, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property-preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to the routing-deadlock problem. Therefore, this thesis further advances the state of the art by introducing the novel concept of routing on the channel dependency graph, which allows the design of a universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, a common hardware limitation. This disruptive innovation enables implicit deadlock avoidance during path calculation, instead of solving both problems separately as all previous solutions do.
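To make the channel-dependency idea concrete, here is a small, hypothetical sketch (not the thesis's actual algorithm or data structures): routes are sequences of directed channels, consecutive channels on a route induce dependency edges, and a cycle in the resulting graph indicates a potential routing deadlock.

```python
# Illustrative sketch: building a channel dependency graph from routes and
# checking it for cycles. Channel names and route encoding are hypothetical.
def channel_dependencies(routes):
    """Build the channel dependency graph from a set of routes.

    Each route is a sequence of directed channels (link identifiers);
    consecutive channels on a route form a dependency edge.
    """
    deps = {}
    for route in routes:
        for a, b in zip(route, route[1:]):
            deps.setdefault(a, set()).add(b)
    return deps

def has_cycle(deps):
    """Detect a cycle in the dependency graph via DFS (deadlock possible)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in deps.get(node, ()):
            c = color.get(nxt, WHITE)
            if c == GRAY or (c == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False

    nodes = set(deps) | {b for succ in deps.values() for b in succ}
    return any(color.get(n, WHITE) == WHITE and visit(n) for n in nodes)
```

A set of routing tables is deadlock-free when this graph stays acyclic; routing directly on the channel dependency graph enforces that property during path calculation rather than checking it afterwards.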
Scalable and Accurate Memory System Simulation
Memory systems today possess more complexity than ever. On one hand, main memory technology has a much more diverse portfolio. Beyond mainstream DDR DRAMs, a variety of DRAM protocols have been proliferating in certain domains. Non-Volatile Memory (NVM) also finally has commodity main memory products, introducing more heterogeneity to the main memory media. On the other hand, the scale of computer systems, from personal computers and server computers to high-performance computing systems, has been growing in response to increasing computing demand. Memory systems have to keep scaling to avoid bottlenecking the whole system. However, current memory simulation tools cannot accurately or efficiently model these developments, making it hard for researchers and developers to evaluate or optimize memory system designs.
In this study, we attack these issues from multiple angles. First, we develop a fast, validated, cycle-accurate main memory simulator that can accurately model almost all existing DRAM protocols and some NVM protocols, and that can be easily extended to support upcoming protocols as well. We showcase this simulator by conducting a thorough characterization of existing DRAM protocols and provide insights on memory system designs.
Secondly, to efficiently simulate increasingly parallel memory systems, we propose a lax synchronization model that allows efficient parallel DRAM simulation. We build the first practical parallel DRAM simulator, which can speed up the simulation by up to a factor of three with a single-digit percentage loss in accuracy compared to cycle-accurate simulation. We also develop mitigation schemes that further improve the accuracy at no additional performance cost.
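The lax synchronization idea can be sketched as a toy model. In the sketch below (class and function names are illustrative, not the simulator's real interface), per-channel sub-simulators advance independently within an epoch and only synchronize at epoch boundaries, so the number of barrier synchronizations shrinks by a factor of the epoch length relative to cycle-by-cycle lock-step:

```python
# Illustrative toy model of lax synchronization; names are hypothetical.
class ChannelSim:
    """Toy per-channel sub-simulator that just tracks its local cycle count."""
    def __init__(self):
        self.cycle = 0

    def advance(self, n):
        self.cycle += n

def run_lax(simulators, total_cycles, epoch):
    """Advance sub-simulators epoch by epoch; return the number of barriers.

    Within an epoch, sub-simulators run independently (in a real parallel
    build, each on its own thread) and defer cross-channel events; state is
    exchanged only at epoch boundaries, which is where accuracy is traded
    for speed.
    """
    cycle, barriers = 0, 0
    while cycle < total_cycles:
        step = min(epoch, total_cycles - cycle)
        for sim in simulators:
            sim.advance(step)
        cycle += step
        barriers += 1  # barrier: exchange deferred cross-channel state here
    return barriers
```

Lock-step cycle-accurate simulation corresponds to an epoch of 1 (1000 barriers over 1000 cycles), while an epoch of 100 needs only 10, at the cost of events that cross sub-simulators inside an epoch being seen late.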
Moreover, we discuss the limitations of cycle-accurate models and explore alternative ways of modeling DRAM. We propose a novel approach that converts DRAM timing simulation into a classification problem. By doing so, we can predict the DRAM latency of each memory request on first sight, which makes the approach compatible with scalable architecture simulation frameworks. We develop prototypes based on various machine learning models, and they demonstrate excellent performance and accuracy, making them a promising alternative to cycle-accurate models.
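As an illustration of recasting DRAM timing as classification, the toy sketch below maps simple request features to a latency class that a learned classifier would normally predict. The features, class labels, and cycle counts are hypothetical stand-ins, not the study's trained models or real timing parameters:

```python
# Hypothetical sketch of DRAM latency prediction as classification; the
# features, class labels, and cycle counts are illustrative stand-ins.
LATENCY_CYCLES = {"row_hit": 20, "row_miss": 38, "bank_conflict": 56}

def extract_features(request, open_rows, busy_banks):
    """Derive simple features for a request from the current bank state."""
    bank, row = request["bank"], request["row"]
    return {
        "row_hit": open_rows.get(bank) == row,  # row buffer already open?
        "bank_busy": bank in busy_banks,        # another request in flight?
    }

def classify_latency(features):
    """Stand-in for a learned classifier mapping features to a latency class."""
    if features["bank_busy"]:
        return "bank_conflict"
    return "row_hit" if features["row_hit"] else "row_miss"
```

A trained model would replace `classify_latency`, letting the simulator assign a latency the moment it first sees each request instead of replaying cycle-level timing.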
Finally, for large-scale memory systems, where data movement is often the performance-limiting factor, we propose a set of interconnect topologies and implement them in a parallel discrete event simulation framework. We evaluate the proposed topologies through simulation and show that their scalability and performance exceed those of existing topologies with increasing system size or workloads.