Flow control for Latency-Critical RPCs
In today's datacenters, the time a request spends waiting in a server's queue is a major contributor to the end-to-end tail latency of µs-scale remote procedure calls. In traditional TCP, congestion control handles in-network congestion, while flow control was designed to avoid memory overruns in streaming scenarios. The latter is unfortunately oblivious to the load on the server when it processes short requests from multiple clients at very high rates. Recognizing flow control as the mechanism that controls queuing on the end-host, we propose a different flow control mechanism that depends on application-specific service-level objectives and controls the waiting time in the receiver's queue by adjusting the incoming load accordingly. We design this latency-aware flow control mechanism as part of TCP, maintaining a wire-compatible header format without introducing extra messages. We implement a proof-of-concept userspace TCP stack on top of DPDK and show that the new flow control mechanism prevents applications from violating service-level objectives in a single-server environment by throttling incoming requests. We demonstrate the full benefit of the approach in a replicated, multi-server scenario, where independent clients leverage the flow-control signal to avoid directing requests to an overloaded server.
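The mechanism lends itself to a compact sketch. Below is a minimal, illustrative Python model of an SLO-driven advertised window: the server bounds its queue depth so that the expected queueing delay (via Little's law) stays under the SLO and advertises only the remaining headroom to clients. The class and parameter names are ours, not the paper's.

```python
import time
from collections import deque

class SLOFlowControl:
    """Toy SLO-aware flow control: advertise only as much window as keeps
    the expected queueing delay under the service-level objective."""

    def __init__(self, slo_s, max_window, service_time_s):
        self.slo_s = slo_s                    # target bound on queueing delay
        self.max_window = max_window          # normal TCP-style window cap
        self.service_time_s = service_time_s  # mean per-request service time
        self.queue = deque()                  # pending (arrival_ts, request)

    def enqueue(self, request):
        self.queue.append((time.monotonic(), request))

    def advertised_window(self):
        # Little's law: queue depth d implies ~ d * service_time of waiting,
        # so the deepest SLO-compliant queue holds slo / service_time entries.
        budget = int(self.slo_s / self.service_time_s)
        headroom = budget - len(self.queue)
        return max(0, min(self.max_window, headroom))

fc = SLOFlowControl(slo_s=100e-6, max_window=64, service_time_s=10e-6)
print(fc.advertised_window())   # 10: room for ten more in-flight requests
```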
Automated Debugging for Arbitrarily Long Executions
One of the most energy-draining and frustrating parts of software development is playing detective with elusive bugs. In this paper we argue that automated post-mortem debugging of failures is feasible for real, in-production systems with no runtime recording. We propose reverse execution synthesis (RES), a technique that takes a coredump obtained after a failure and automatically computes the suffix of an execution that leads to that coredump. RES provides a way to then play back this suffix in a debugger deterministically, over and over again. We argue that the RES approach could be used to (1) automatically classify bug reports based on their root cause, (2) automatically identify coredumps for which hardware errors (e.g., bad memory), not software bugs, are to blame, and (3) ultimately help developers reproduce the root cause of the failure in order to debug it.
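RES is a program-analysis system, but the search it performs can be illustrated on a toy machine: given only the final state (standing in for the coredump), enumerate candidate instruction suffixes until one deterministically reproduces that state. This is a deliberately naive sketch of the idea, not the paper's synthesis algorithm.

```python
from itertools import product

# Toy machine: one register, three opcodes. RES in miniature: from a
# candidate earlier state, find a short instruction suffix that reproduces
# the "coredump" state, so the suffix can be replayed deterministically.
OPS = {
    "inc": lambda r: r + 1,
    "dbl": lambda r: r * 2,
    "neg": lambda r: -r,
}

def synthesize_suffix(start_reg, dump_reg, max_len=5):
    """Return one instruction suffix taking start_reg to dump_reg, if any."""
    for length in range(1, max_len + 1):
        for suffix in product(OPS, repeat=length):
            r = start_reg
            for op in suffix:
                r = OPS[op](r)
            if r == dump_reg:
                return list(suffix)
    return None

print(synthesize_suffix(3, -8))   # e.g. ['inc', 'dbl', 'neg']
```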
How to Measure the Killer Microsecond
Datacenter-networking research requires tools to both generate traffic and accurately measure latency and throughput. While hardware-based tools have long existed commercially, they are primarily used to validate ASICs and lack flexibility, e.g., to study new protocols. They are also too expensive for academics. The recent development of kernel-bypass networking and advanced NIC features such as hardware timestamping have created new opportunities for accurate latency measurements. This paper compares these two approaches, and in particular asks whether commodity servers and NICs, when properly configured, can measure latency distributions as precisely as specialized hardware. Our work shows that well-designed commodity solutions can capture subtle differences in the tail latency of stateless UDP traffic. We use hardware devices as the ground truth, both to measure latency and to forward traffic. We compare the ground truth with observations that combine five latency-measuring clients and five different port forwarding solutions and configurations. State-of-the-art software such as MoonGen, which uses NIC hardware timestamping, provides sufficient visibility into tail latencies to study the effect of subtle operating system configuration changes. We also observe that the kernel-bypass-based TRex software, which relies only on the CPU to timestamp traffic, can provide solid results when NIC timestamps are not available for a particular protocol or device.
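For the measurement side, the statistic of interest is the tail of the latency distribution. A minimal sketch of nearest-rank percentile extraction from per-packet latency samples (however they were timestamped) might look like this; the lognormal generator only stands in for real measurements.

```python
import math
import random

def tail_latencies(samples_us, percentiles=(50.0, 90.0, 99.0, 99.9)):
    """Nearest-rank percentiles over per-packet latencies, in microseconds."""
    ordered = sorted(samples_us)
    n = len(ordered)
    return {p: ordered[max(0, math.ceil(n * p / 100) - 1)] for p in percentiles}

# Stand-in for real measurements, e.g. NIC RX hardware timestamp minus the
# TX timestamp carried in the packet payload, one sample per packet.
samples = [random.lognormvariate(2.3, 0.4) for _ in range(100_000)]
for p, v in tail_latencies(samples).items():
    print(f"p{p}: {v:.1f} us")
```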
Measuring Latency: Am I doing it right?
This poster describes a basic methodology for conducting an accurate latency experiment.
Benchmarking, Analysis, and Optimization of Serverless Function Snapshots
Serverless computing has seen rapid adoption due to its high scalability and
flexible, pay-as-you-go billing model. In serverless, developers structure
their services as a collection of functions, sporadically invoked by various
events like clicks. High inter-arrival time variability of function invocations
motivates the providers to start new function instances upon each invocation,
leading to significant cold-start delays that degrade user experience. To
reduce cold-start latency, the industry has turned to snapshotting, whereby an
image of a fully-booted function is stored on disk, enabling a faster
invocation compared to booting a function from scratch.
This work introduces vHive, an open-source framework for serverless
experimentation with the goal of enabling researchers to study and innovate
across the entire serverless stack. Using vHive, we characterize a
state-of-the-art snapshot-based serverless infrastructure, based on the
industry-leading Containerd orchestration framework and Firecracker hypervisor
technologies. We find that the execution time of a function started from a
snapshot is 95% higher, on average, than when the same function is
memory-resident. We show that the high latency is attributable to frequent page
faults as the function's state is brought from disk into guest memory one page
at a time. Our analysis further reveals that functions access the same stable
working set of pages across different invocations of the same function. By
leveraging this insight, we build REAP, a light-weight software mechanism for
serverless hosts that records functions' stable working set of guest memory
pages and proactively prefetches it from disk into memory. Compared to baseline
snapshotting, REAP slashes the cold-start delays by 3.7x, on average.
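The record/prefetch loop at REAP's core can be sketched in a few lines. The snippet below is a simplified model, assuming a guest-memory snapshot file and a trace of page-fault offsets captured during a profiled invocation; the file layout and helper names are illustrative, not the paper's implementation.

```python
import os
import pickle

PAGE = 4096

def record_working_set(fault_offsets, ws_path):
    """Persist the stable set of page-aligned offsets touched by the function."""
    pages = sorted({off - (off % PAGE) for off in fault_offsets})
    with open(ws_path, "wb") as f:
        pickle.dump(pages, f)

def prefetch(snapshot_path, ws_path):
    """Eagerly read the recorded working set so later accesses hit memory
    instead of faulting to disk one page at a time."""
    with open(ws_path, "rb") as f:
        pages = pickle.load(f)
    fd = os.open(snapshot_path, os.O_RDONLY)
    try:
        for off in pages:              # one read per page for clarity;
            os.pread(fd, PAGE, off)    # a real system batches contiguous runs
    finally:
        os.close(fd)
```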
Lightweight Snapshots and System-level Backtracking
We propose a new system-level abstraction, the lightweight immutable execution snapshot, which combines the immutable characteristics of checkpoints with the direct integration into the virtual memory subsystem of standard mutable address spaces. The abstraction can give arbitrary x86 programs and libraries system-level support for backtracking (akin to logic programming) and the ability to manipulate an entire address space as an immutable data structure (akin to functional programming). Our proposed implementation leverages modern x86 hardware-virtualization support.
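While the proposed implementation relies on x86 hardware virtualization, the programming model can be approximated at process granularity with fork(), whose copy-on-write semantics give a cheap immutable snapshot to backtrack to. This is only a POSIX-level analogy, not the paper's mechanism.

```python
import os
import sys

def with_snapshot(attempt):
    """Run attempt() inside a fork()ed copy-on-write snapshot of the address
    space; if it fails, the parent resumes from the untouched original state."""
    pid = os.fork()
    if pid == 0:                       # child: mutate freely, then report
        sys.exit(0 if attempt() else 1)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status) == 0

state = {"x": 1}

def risky():
    state["x"] = 99                    # visible only inside the snapshot
    return False                       # signal failure -> backtrack

ok = with_snapshot(risky)
print(ok, state["x"])                  # False 1: parent state is unchanged
```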
Design Guidelines for High-Performance SCM Hierarchies
With emerging storage-class memory (SCM) nearing commercialization, there is
evidence that it will deliver the much-anticipated high density and access
latencies within only a few factors of DRAM. Nevertheless, the
latency-sensitive nature of memory-resident services makes seamless integration
of SCM in servers questionable. In this paper, we ask how best
to introduce SCM for such servers to improve overall performance/cost over
existing DRAM-only architectures. We first show that even with the most
optimistic latency projections for SCM, the higher memory access latency
results in prohibitive performance degradation. However, we find that
deployment of a modestly sized high-bandwidth 3D stacked DRAM cache makes the
performance of an SCM-mostly memory system competitive. The high degree of
spatial locality that memory-resident services exhibit not only simplifies the
DRAM cache's design as page-based, but also enables the amortization of
increased SCM access latencies and the mitigation of SCM's read/write latency
disparity.
We identify the set of memory hierarchy design parameters that plays a key
role in the performance and cost of a memory system combining an SCM technology
and a 3D stacked DRAM cache. We then introduce a methodology to drive
provisioning for each of these design parameters under a target
performance/cost goal. Finally, we use our methodology to derive concrete
results for specific SCM technologies. With PCM as a case study, we show that a
two bits/cell technology hits the performance/cost sweet spot, reducing the
memory subsystem cost by 40% while keeping performance within 3% of the best
performing DRAM-only system, whereas single-level and triple-level cell
organizations are impractical for use as memory replacements.
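The provisioning methodology amounts to sweeping design parameters under a performance/cost objective. A toy model of such a sweep is sketched below; every constant (latencies, $/GB, the hit-rate curve) is a made-up placeholder rather than a number from the paper.

```python
# Sweep DRAM-cache sizes and SCM cell densities, estimate average memory
# access time (AMAT) and total cost, and keep configurations that stay
# within a target slowdown relative to a DRAM-only system.

DRAM_LAT, DRAM_COST = 1.0, 1.0            # normalized latency and $/GB
SCM = {                                    # bits/cell -> (latency x, cost x)
    1: (2.0, 0.7),                         # SLC: fast, least dense
    2: (4.0, 0.4),                         # MLC: denser, slower
    3: (8.0, 0.3),                         # TLC: densest, slowest
}

def amat(hit_rate, scm_latency):
    return hit_rate * DRAM_LAT + (1 - hit_rate) * scm_latency

def sweep(capacity_gb, cache_sizes_gb, slowdown_limit=1.03):
    for bits, (lat, cost_per_gb) in SCM.items():
        for cache_gb in cache_sizes_gb:
            hit = min(0.99, 0.80 + 0.05 * cache_gb)   # placeholder hit curve
            total = cache_gb * DRAM_COST + capacity_gb * cost_per_gb
            if amat(hit, lat) <= DRAM_LAT * slowdown_limit:
                yield bits, cache_gb, total

# Among configurations that meet the performance target, pick the cheapest.
for bits, cache_gb, cost in sweep(capacity_gb=256, cache_sizes_gb=[1, 2, 4]):
    print(f"{bits} bits/cell, {cache_gb} GB cache, cost {cost:.1f}")
```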
Expedited Data Transfers for Serverless Clouds
Serverless computing has emerged as a popular cloud deployment paradigm. In
serverless, the developers implement their application as a set of chained
functions that form a workflow in which functions invoke each other. The cloud
providers are responsible for automatically scaling the number of instances for
each function on demand and forwarding the requests in a workflow to the
appropriate function instance. Problematically, today's serverless clouds lack
efficient support for cross-function data transfers in a workflow, preventing
the efficient execution of data-intensive serverless applications. In
production clouds, functions transmit intermediate, i.e., ephemeral, data to
other functions either as part of invocation HTTP requests (i.e., inline) or
via third-party services, such as AWS S3 storage or AWS ElastiCache in-memory
cache. The former approach is restricted to small transfer sizes, while the
latter supports arbitrary transfers but suffers from performance and cost
overheads. This work introduces Expedited Data Transfers (XDT), an
API-preserving high-performance data communication method for serverless that
enables direct function-to-function transfers. With XDT, a trusted component of
the sender function buffers the payload in its memory and sends a secure
reference to the receiver, which is picked by the load balancer and autoscaler
based on the current load. Using the reference, the receiver instance pulls the
transmitted data directly from the sender's memory. XDT is natively compatible
with existing autoscaling infrastructure, preserves function invocation
semantics, is secure, and avoids the cost and performance overheads of using an
intermediate service for data transfers. We prototype our system in
vHive/Knative deployed on a cluster of AWS EC2 nodes, showing that XDT improves
latency, bandwidth, and cost over AWS S3 and ElastiCache.
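The reference-passing core of XDT can be sketched compactly: the sender-side trusted component buffers the payload, emits a small authenticated reference, and serves a one-shot pull from whichever receiver instance the autoscaler selected. Names, the token format, and the in-process "network" below are illustrative only.

```python
import hashlib
import hmac
import os
import secrets

KEY = secrets.token_bytes(32)              # shared by trusted components

class SenderBuffer:
    def __init__(self):
        self.buffers = {}

    def put(self, payload: bytes):
        """Buffer the payload locally; return a compact secure reference."""
        buf_id = os.urandom(8).hex()
        self.buffers[buf_id] = payload
        tag = hmac.new(KEY, buf_id.encode(), hashlib.sha256).hexdigest()
        return {"sender": "10.0.0.7:9000", "buf": buf_id, "tag": tag}

    def pull(self, ref):
        """Called (in reality over the network) by the chosen receiver."""
        expect = hmac.new(KEY, ref["buf"].encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expect, ref["tag"]):
            raise PermissionError("invalid transfer reference")
        return self.buffers.pop(ref["buf"])  # one-shot: free after the pull

sender = SenderBuffer()
ref = sender.put(b"intermediate ephemeral data")  # travels in the invocation
print(sender.pull(ref))                           # receiver fetches directly
```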
- …