Burst-Aware Weighted Fair Queueing for Serverless Inference: Mitigating Noisy Neighbor Effects in Multi-Tenant Systems

Abstract

Multi-tenant serverless inference often devolves into noisy-neighbor scenarios in which a single tenant’s bursty LLM batch floods the fleet, pushing interactive calls beyond their latency budgets. We propose Burst-Aware Weighted Fair Queueing (BWFQ), a scheduler that requires only two counters per tenant (tokens earned, tokens spent) and a constant-time heap pop to pick the next invocation. BWFQ uses the classic token-bucket shaper: tokens accumulate at a tenant-specific base rate and are deducted on each dispatch. When a tenant exhausts its tokens, its requests are queued, giving quieter tenants a chance to run. Unlike techniques such as Dominant-Resource Fairness, BWFQ requires neither per-invocation resource profiling nor multi-dimensional share accounting, making it easy to integrate into existing Lambda-style dispatchers. To evaluate the algorithm, we built a prototype on AWS Lambda and observed that BWFQ reduces the P99 latency gap between interactive and batch tenants from 8.5 s to 2.1 s, a 4.0× improvement, while preserving 94% of the throughput achieved by First-Come-First-Serve. The algorithm adds only 35 µs of scheduling overhead per decision and fits in approximately 150 lines of Go code. These results demonstrate that simple token-bucket fair queueing is a practical, immediately usable step towards fairness in production serverless inference.
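
The mechanism described in the abstract can be illustrated with a minimal Go sketch. It is not the authors' implementation. It assumes details the abstract does not state: every dispatch has a caller-supplied token cost, unspent tokens are capped by a hypothetical Burst field, each tenant starts with a full bucket, and the heap is ordered by largest remaining balance (earned minus spent). All type, field, and function names (Tenant, Scheduler, Accrue, Next, and so on) are illustrative.

```go
package main

import "container/heap"

// Tenant keeps the two counters named in the abstract (Earned, Spent)
// plus a FIFO of pending request IDs. Burst is an assumed cap on
// unspent tokens; the abstract does not specify one.
type Tenant struct {
	Name    string
	Rate    float64  // base token rate, in tokens per second
	Burst   float64  // assumed maximum unspent balance
	Earned  float64  // tokens earned so far
	Spent   float64  // tokens spent so far
	Pending []string // queued request IDs, oldest first
	index   int      // heap position, or -1 when not in the heap
}

// Balance is the number of tokens the tenant may still spend.
func (t *Tenant) Balance() float64 { return t.Earned - t.Spent }

// tenantHeap is a max-heap on Balance (an assumed ordering).
type tenantHeap []*Tenant

func (h tenantHeap) Len() int           { return len(h) }
func (h tenantHeap) Less(i, j int) bool { return h[i].Balance() > h[j].Balance() }
func (h tenantHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].index, h[j].index = i, j
}
func (h *tenantHeap) Push(x interface{}) {
	t := x.(*Tenant)
	t.index = len(*h)
	*h = append(*h, t)
}
func (h *tenantHeap) Pop() interface{} {
	old := *h
	t := old[len(old)-1]
	old[len(old)-1] = nil
	t.index = -1
	*h = old[:len(old)-1]
	return t
}

// Scheduler tracks all tenants and a heap of those with pending work.
type Scheduler struct {
	tenants map[string]*Tenant
	ready   tenantHeap
}

func NewScheduler() *Scheduler {
	return &Scheduler{tenants: map[string]*Tenant{}}
}

// AddTenant registers a tenant that starts with a full bucket.
func (s *Scheduler) AddTenant(name string, rate, burst float64) {
	s.tenants[name] = &Tenant{Name: name, Rate: rate, Burst: burst, Earned: burst, index: -1}
}

// Submit queues a request; it reports false for an unknown tenant.
func (s *Scheduler) Submit(name, reqID string) bool {
	t, ok := s.tenants[name]
	if !ok {
		return false
	}
	t.Pending = append(t.Pending, reqID)
	if t.index < 0 {
		heap.Push(&s.ready, t)
	}
	return true
}

// Accrue credits every tenant for the elapsed seconds, capped at Burst
// unspent tokens, then restores heap order because balances changed.
func (s *Scheduler) Accrue(seconds float64) {
	for _, t := range s.tenants {
		t.Earned += t.Rate * seconds
		if limit := t.Spent + t.Burst; t.Earned > limit {
			t.Earned = limit
		}
	}
	heap.Init(&s.ready)
}

// Next dispatches the oldest request of the tenant with the largest
// balance. If even that tenant cannot afford cost, nothing is
// dispatched and all pending requests stay queued.
func (s *Scheduler) Next(cost float64) (tenant, reqID string, ok bool) {
	if len(s.ready) == 0 {
		return "", "", false
	}
	t := s.ready[0]
	if t.Balance() < cost {
		return "", "", false
	}
	reqID = t.Pending[0]
	t.Pending = t.Pending[1:]
	t.Spent += cost
	if len(t.Pending) == 0 {
		heap.Pop(&s.ready)
	} else {
		heap.Fix(&s.ready, 0)
	}
	return t.Name, reqID, true
}
```

In this sketch, a batch tenant that submits several requests ahead of an interactive tenant does not starve it: once the batch tenant's balance falls below the dispatch cost, its remaining requests wait until a later Accrue call credits more tokens. Re-initialising the heap on every accrual is a simplification chosen for brevity, not a claim about how BWFQ maintains its ordering.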

Journals of Universiti Tun Hussein Onn Malaysia (UTHM)

Last time updated on 11/02/2026

Licence: https://creativecommons.org/licenses/by-nc-sa/4.0