Addressing microarchitectural implications of serverless functions

Abstract

Serverless computing has emerged as a widely used paradigm for running services in the cloud. In this model, developers organize applications as a set of functions invoked on demand in response to events, such as HTTP requests. Developers are charged for CPU time and memory footprint during function execution, which incentivises them to reduce both runtime and memory consumption. Furthermore, to avoid long start-up delays, cloud providers keep recently triggered instances idle (or warm) for some time in anticipation of future invocations. Consequently, a single server may host thousands of warm instances of various functions, whose executions are interleaved according to incoming invocations. This thesis investigates the workload characteristics of serverless computing and observes that: (1) there is a high degree of interleaving among warm instances on a given server; (2) individual warm functions are invoked relatively infrequently, often at intervals of seconds or minutes; and (3) many function invocations complete within milliseconds. The interleaved execution of rarely invoked functions causes each function's microarchitectural state to be thrashed between invocations, while the short execution time of functions prevents the cost of warming up on-chip microarchitectural state from being amortized. As a result, when a memory-resident function is re-invoked, it commonly finds its on-chip microarchitectural state completely cold because other functions have thrashed it, a phenomenon we term lukewarm execution. Our analysis reveals that cold microarchitectural state severely degrades CPU performance, and that the main source of this degradation is the core front-end, comprising instruction delivery, branch identification via the branch target buffer (BTB), and conditional branch prediction. Based on this analysis, we propose two mechanisms to mitigate the performance degradation caused by lukewarm invocations.
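
To make the thrashing mechanism concrete, the Python sketch below models it under deliberately simple assumptions that are not taken from the thesis: a single fully associative LRU cache shared by all functions, each function touching a fixed set of distinct blocks per invocation, and one invocation of every other function arriving between two invocations of the same function. The names `LRUCache`, `invoke`, and `hits_on_reinvocation` are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative LRU cache with a fixed capacity in blocks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # least recently used block first

    def access(self, block):
        """Touch a block; return True on a hit, False on a miss (and fill it)."""
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return True
        self.blocks[block] = None
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return False


def invoke(cache, func_id, footprint):
    """One invocation touching `footprint` distinct blocks; return its hit count."""
    return sum(cache.access((func_id, b)) for b in range(footprint))


def hits_on_reinvocation(num_functions, footprint, capacity):
    """Invoke function 0, then each other function once, then function 0 again.
    Return the number of hits function 0 sees on its second invocation."""
    cache = LRUCache(capacity)
    invoke(cache, 0, footprint)
    for f in range(1, num_functions):
        invoke(cache, f, footprint)  # intervening (interleaved) invocations
    return invoke(cache, 0, footprint)
```

For example, `hits_on_reinvocation(4, 100, 512)` returns 100, since four 100-block footprints fit in a 512-block cache, whereas `hits_on_reinvocation(8, 100, 512)` returns 0 because the seven intervening functions evict every block of function 0. Under LRU, even six functions (600 blocks against a capacity of 512) leave function 0 with no hits, because its re-invocation evicts its own surviving blocks before it reaches them.
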
The first mechanism, Jukebox, is a record-and-replay instruction prefetcher designed specifically to mitigate the high cost of off-chip instruction misses. We demonstrate that Jukebox's simple design eliminates more than 95% of long-latency off-chip instruction misses. The second mechanism, Ignite, builds on Jukebox to offer a comprehensive solution for restoring front-end microarchitectural state, including instructions, BTB, and branch predictor state, via unified metadata. Ignite records an invocation's control flow graph in a compressed format and uses it to restore the state of the front-end structures the next time the function is invoked. Ignite significantly reduces instruction misses, BTB misses, and branch mispredictions, yielding an average performance improvement of 43%. In summary, this thesis demonstrates that serverless systems exhibit distinct workload characteristics that are poorly matched to traditional CPU designs, severely impacting performance. Two simple techniques can overcome these bottlenecks by preserving microarchitectural state across function invocations.
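
The record-and-replay idea behind Jukebox can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Jukebox's actual design: the cache is a toy LRU model, the 64-byte block size and 8-block spatial regions are assumed parameters, and the metadata format (a region base plus a bit vector per region) is simply one common way to compress such a log. The names `ICache`, `RecordReplayPrefetcher`, `compress`, and `decompress` are hypothetical. The sketch covers only instruction blocks; Ignite, as described above, additionally restores BTB and branch predictor state, which is not modeled here.

```python
from collections import OrderedDict

BLOCK_BYTES = 64    # assumed instruction cache-block size
REGION_BLOCKS = 8   # assumed number of blocks per spatial region in the metadata


class ICache:
    """Toy fully associative LRU instruction cache, indexed by block number."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def fill(self, blk):
        """Insert (or refresh) a block, evicting the LRU block if over capacity."""
        self.lines[blk] = None
        self.lines.move_to_end(blk)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)

    def access(self, blk):
        """Demand access: return True on a hit; on a miss, fill and return False."""
        if blk in self.lines:
            self.lines.move_to_end(blk)
            return True
        self.fill(blk)
        return False


def compress(blocks):
    """Encode block numbers as (region base, bit vector) pairs, ordered by each
    region's first occurrence."""
    regions = {}
    for blk in blocks:
        base = blk - blk % REGION_BLOCKS
        regions[base] = regions.get(base, 0) | (1 << (blk % REGION_BLOCKS))
    return list(regions.items())


def decompress(metadata):
    """Yield the block numbers encoded by compress()."""
    for base, bits in metadata:
        for i in range(REGION_BLOCKS):
            if (bits >> i) & 1:
                yield base + i


class RecordReplayPrefetcher:
    """Keeps one compressed instruction-block log per function."""

    def __init__(self):
        self.metadata = {}  # function name -> list of (region base, bit vector)

    def run(self, cache, func, pcs):
        """Replay func's previous log into the cache, execute the fetch addresses
        in `pcs` while recording a new log, and return the demand-miss count."""
        prefetched = set()
        for blk in decompress(self.metadata.get(func, [])):  # replay phase
            cache.fill(blk)
            prefetched.add(blk)
        log, misses = [], 0
        for pc in pcs:  # record phase
            blk = pc // BLOCK_BYTES
            if not cache.access(blk):
                misses += 1
                log.append(blk)
            elif blk in prefetched:
                log.append(blk)  # would have missed without the prefetch
            prefetched.discard(blk)
        self.metadata[func] = compress(log)
        return misses
```

For instance, with a 64-block cache and a 32-block function `f` alternating with a 64-block function `g`, `f` incurs 32 misses on its first invocation and none on later ones, even though `g` evicts all of its blocks in between. Logging blocks that hit only because they were prefetched, rather than demand misses alone, keeps the metadata stable: otherwise a fully successful replay would record an empty log and the following invocation would run cold again.
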

This paper was published in Edinburgh Research Archive.