Load balancing plays a critical role in efficiently dispatching jobs in
parallel-server systems such as cloud networks and data centers. A fundamental
challenge in the design of load balancing algorithms is to achieve an optimal
trade-off between delay performance and implementation overhead (e.g.
communication or memory usage). This trade-off has primarily been studied so
far from the angle of the amount of overhead required to achieve asymptotically
optimal performance, particularly vanishing delay in large-scale systems. In
contrast, in the present paper, we focus on an arbitrarily sparse communication
budget, possibly well below the minimum requirement for vanishing delay,
referred to as the hyper-scalable operating region. Furthermore, jobs may only
be admitted when a specific limit on the queue position of the job can be
guaranteed.
The centerpiece of our analysis is a universal upper bound for the achievable
throughput of any dispatcher-driven algorithm for a given communication budget
and queue limit. We also propose a specific hyper-scalable scheme which can
operate at any given message rate and enforce any given queue limit, while
allowing the server states to be captured via a closed product-form network, in
which servers act as customers traversing various nodes. The product-form
distribution is leveraged to prove that the bound is tight and that the
proposed hyper-scalable scheme is throughput-optimal in a many-server regime
given the communication and queue limit constraints. Extensive simulation
experiments are conducted to illustrate the results