Towards larger scale collective operations in the Message Passing Interface
Supercomputers continue to expand both in size and complexity as we reach the beginning of the exascale era. Networks have evolved, from simple mechanisms which
transport data to subsystems of computers which fulfil a significant fraction of the
workload that computers are tasked with. Inevitably with this change, assumptions
which were made at the beginning of the last major shift in computing are becoming
outdated.
We introduce a new latency-bandwidth model which captures the characteristics of
sending multiple small messages in quick succession on modern networks. Unlike
other models representing the same effects, the pipelining latency-bandwidth model
is simple and physically based. In addition, we develop a discrete-event simulation,
Fennel, to capture non-analytical effects of communication within models.
AllReduce operations with small messages are common throughout supercomputing, particularly in iterative methods. The performance of network operations is
crucial to an application's overall time-to-solution. The Message Passing Interface standard was introduced to abstract complex communications away from application-level development. The underlying algorithms that implementations use
to achieve the specified behaviour, such as the recursive doubling algorithm for AllReduce, have to evolve with the computers on which they are used.
We introduce the recursive multiplying algorithm as a generalisation of recursive
doubling. By utilising the pipelining nature of modern networks, we lower the latency
of AllReduce operations and enable greater choice of schedule. A heuristic is used to
quickly generate a near-optimal schedule, by using the pipelining latency-bandwidth
model.
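As a rough illustration of the idea, the sketch below simulates a sum-AllReduce in which each round uses a group size k instead of the fixed pairwise exchange of recursive doubling. This is one plausible reading of the generalisation, not the schedule or heuristic from the thesis: the function name `recursive_multiplying_allreduce`, the `factors` parameter, and the mixed-radix peer selection are all assumptions made for illustration, and no network cost model is included.

```python
from math import prod


def recursive_multiplying_allreduce(values, factors):
    """Simulate a sum-AllReduce over len(values) ranks (illustrative only).

    factors holds one group size per round; their product must equal the
    number of ranks. factors = [2, 2, ..., 2] reproduces recursive doubling.
    """
    p = len(values)
    if prod(factors) != p:
        raise ValueError("product of factors must equal the number of ranks")

    current = list(values)
    stride = 1
    for k in factors:
        nxt = []
        for rank in range(p):
            # Treat the rank as a mixed-radix number; peers in this round
            # differ from the rank only in the digit selected by `stride`.
            digit = (rank // stride) % k
            base = rank - digit * stride
            group = [base + j * stride for j in range(k)]
            # Each rank combines its own value with those of its k - 1 peers.
            nxt.append(sum(current[g] for g in group))
        current = nxt
        stride *= k
    return current


def schedule_cost(factors):
    """Return (rounds, messages sent per rank) for a given schedule."""
    return len(factors), sum(k - 1 for k in factors)
```

Under these assumptions, 64 ranks with `factors = [2] * 6` take six rounds of one message each, while `factors = [4, 4, 4]` takes three rounds of three messages each. The second schedule sends more messages in total but has fewer dependent rounds, which is only a win if the network can overlap the messages within a round.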
Alongside recursive multiplying, the endpoints of collective operations must be
able to handle larger numbers of incoming messages. Typically this is done by duplicating receive queues for remote peers, but this requires memory that grows linearly with the size of the application. We introduce a single-consumer, multiple-producer queue which is designed to be used with MPI as a protocol to insert messages
remotely, with minimal contention for shared receive queues.
Bridging a Gap Between Research and Production: Contributions to Scheduling and Simulation
Large-scale distributed computing infrastructures (e.g., data centers, grids, or clouds) are used by scientists from various domains to produce outstanding research results, such as the discovery of the Higgs boson in High Energy Physics. These infrastructures are also studied by Computer Scientists to produce their own set of scientific results. Ideally, a virtuous circle should exist between Domain and Computer Scientists, with the former raising challenges that the latter could address. Unfortunately, on many occasions, a gap exists that prevents such an ideal, mutually beneficial collaboration. This habilitation covers research work conducted in the fields of scheduling and simulation that contributes to filling this gap. It discusses the conditions necessary to achieve this goal and details concrete initiatives in this endeavor.