The all-to-all collective communication primitive is widely used in machine
learning (ML) and high-performance computing (HPC) workloads, and optimizing
its performance is of interest to both the ML and HPC communities. All-to-all is a
particularly challenging workload that can severely strain the underlying
interconnect bandwidth at scale. This is mainly because the number of messages
that must be serviced simultaneously scales quadratically with the number of
participants, and the messages themselves are large. This paper takes a
holistic approach to optimizing the
performance of all-to-all collective communications on supercomputer-scale
direct-connect interconnects. We address several algorithmic and practical
challenges in developing efficient and bandwidth-optimal all-to-all schedules
for any topology, lowering the schedules to various backends and fabrics that
may or may not expose additional forwarding bandwidth, establishing an upper
bound on all-to-all throughput, and exploring novel topologies that deliver
near-optimal all-to-all performance.