Multi-tenant serverless inference often devolves into noisy-neighbor scenarios in which a single tenant’s bursty LLM batch floods the fleet and pushes interactive calls beyond their latency budgets. We propose Burst-Aware Weighted Fair Queueing (BWFQ), a scheduler that requires only two counters per tenant (tokens earned and tokens spent) and a constant-time heap pop to pick the next invocation. BWFQ uses the classic token-bucket shaper: tokens accumulate at a tenant-specific base rate and are consumed on each dispatch. When a tenant exhausts its tokens, its requests are queued, giving quieter tenants a chance to run. Unlike approaches such as Dominant Resource Fairness, BWFQ requires neither per-invocation resource profiling nor multi-dimensional share accounting, making it easy to integrate into existing Lambda-style dispatchers. To evaluate the algorithm, we built a prototype on AWS Lambda and observed that BWFQ reduces the P99 latency gap between interactive and batch tenants from 8.5 s to 2.1 s, a 4.0× improvement, while preserving 94% of the throughput achieved by First-Come-First-Serve. The algorithm adds only 35 µs of scheduling overhead per decision and fits in approximately 150 lines of Go code. These results demonstrate that simple token-bucket fair queueing is a practical, immediately usable step toward fairness in production serverless inference.
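To make the mechanism concrete, the following is a minimal Go sketch of a token-bucket fair queue in the spirit of the description above. It is not the prototype's code. The two per-tenant counters (earned, spent) come from the abstract; everything else is an assumption made for illustration: the names `Scheduler`, `AddTenant`, `Enqueue`, `Tick`, and `Next`, a uniform cost of one token per dispatch, a burst cap on unspent tokens, and a heap ordered by unspent balance.

```go
package main

import "container/heap"

// tenant keeps the two counters named in the abstract plus a FIFO of pending requests.
type tenant struct {
	name    string
	rate    float64  // tokens earned per second (assumed unit)
	burst   float64  // cap on unspent tokens (assumption, not from the source)
	earned  float64  // tokens earned so far
	spent   float64  // tokens spent so far
	pending []string // queued request IDs
	index   int      // position in the heap, -1 when not queued
}

func (t *tenant) balance() float64 { return t.earned - t.spent }

// tenantHeap orders backlogged tenants by unspent balance, largest first (assumed key).
type tenantHeap []*tenant

func (h tenantHeap) Len() int           { return len(h) }
func (h tenantHeap) Less(i, j int) bool { return h[i].balance() > h[j].balance() }
func (h tenantHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].index, h[j].index = i, j
}
func (h *tenantHeap) Push(x interface{}) {
	t := x.(*tenant)
	t.index = len(*h)
	*h = append(*h, t)
}
func (h *tenantHeap) Pop() interface{} {
	old := *h
	t := old[len(old)-1]
	*h = old[:len(old)-1]
	t.index = -1
	return t
}

// Scheduler is a hypothetical wrapper around the per-tenant state.
type Scheduler struct {
	tenants map[string]*tenant
	ready   tenantHeap // tenants with at least one pending request
}

func NewScheduler() *Scheduler { return &Scheduler{tenants: map[string]*tenant{}} }

// AddTenant registers a tenant whose bucket starts full.
func (s *Scheduler) AddTenant(name string, rate, burst float64) {
	s.tenants[name] = &tenant{name: name, rate: rate, burst: burst, earned: burst, index: -1}
}

// Enqueue appends a request to the tenant's FIFO; unknown tenants are ignored.
func (s *Scheduler) Enqueue(name, reqID string) {
	t := s.tenants[name]
	if t == nil {
		return
	}
	t.pending = append(t.pending, reqID)
	if t.index < 0 {
		heap.Push(&s.ready, t)
	}
}

// Tick credits every tenant for dt seconds, capped at burst unspent tokens.
// It is O(n) for clarity; a real dispatcher might credit lazily instead.
func (s *Scheduler) Tick(dt float64) {
	for _, t := range s.tenants {
		t.earned += t.rate * dt
		if limit := t.spent + t.burst; t.earned > limit {
			t.earned = limit
		}
	}
	heap.Init(&s.ready) // balances changed, so restore heap order
}

// Next dispatches one request from the backlogged tenant with the most unspent
// tokens. If even that tenant has less than one token, nothing can run yet.
func (s *Scheduler) Next() (tenantName, reqID string, ok bool) {
	if s.ready.Len() == 0 || s.ready[0].balance() < 1 {
		return "", "", false
	}
	t := s.ready[0]
	t.spent++ // each dispatch costs one token (assumption)
	reqID = t.pending[0]
	t.pending = t.pending[1:]
	if len(t.pending) == 0 {
		heap.Pop(&s.ready)
	} else {
		heap.Fix(&s.ready, 0)
	}
	return t.name, reqID, true
}
```

In this sketch, a tenant that enqueues a large batch can dispatch only as many requests as its bucket holds. Its remaining requests then stay queued until `Tick` credits more tokens, while a tenant with unspent tokens is served in between.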