202 research outputs found
Measuring and Managing Answer Quality for Online Data-Intensive Services
Online data-intensive services parallelize query execution across distributed
software components. Interactive response time is a priority, so online query
executions return answers without waiting for slow-running components to
finish. However, data from these slow components could lead to better answers.
We propose Ubora, an approach to measure the effect of slow-running components
on the quality of answers. Ubora randomly samples online queries and executes
them twice. The first execution elides data from slow components and provides
fast online answers; the second, mature execution waits for all components to
complete. Ubora uses memoization to speed up mature executions by replaying network
messages exchanged between components. Our systems-level implementation works
for a wide range of platforms, including Hadoop/Yarn, Apache Lucene, the
EasyRec Recommendation Engine, and the OpenEphyra question answering system.
Ubora computes answer quality much faster than competing approaches that do not
use memoization. With Ubora, we show that answer quality can and should be used
to guide online admission control. Our adaptive controller processed 37% more
queries than a competing controller guided by the rate of timeouts.
Comment: Technical Report
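Ubora's sample-and-re-execute loop can be sketched in a few lines. The sketch below is a hypothetical illustration, not Ubora's actual API (the class, method, and metric names are invented): sampled queries run once online, eliding data from slow components, and once to completion by replaying memoized component messages, and the two answers are compared to score quality.

```python
import random

class QualityMeasurer:
    """Sketch of Ubora-style answer-quality measurement (hypothetical API)."""

    def __init__(self, sample_rate=0.05):
        self.sample_rate = sample_rate
        self.memo = {}  # query -> memoized component messages

    def should_sample(self):
        # Ubora randomly samples online queries for double execution.
        return random.random() < self.sample_rate

    def execute_online(self, query, components, finished_in_time):
        """Fast execution: elide data from components that miss the deadline."""
        answer = []
        for comp in components:
            msg = comp(query)
            # Record every message so the mature execution can replay it.
            self.memo.setdefault(query, []).append(msg)
            if finished_in_time(comp):
                answer.append(msg)
        return answer

    def execute_mature(self, query):
        """Mature execution: replay memoized messages, waiting for everything."""
        return list(self.memo.get(query, []))

def answer_quality(online, mature):
    """Fraction of the complete answer captured online (one possible metric)."""
    if not mature:
        return 1.0
    return len([x for x in online if x in mature]) / len(mature)
```

The memoization step is what makes the mature execution cheap: components are not re-run, their recorded network messages are simply replayed.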
Getafix: Workload-aware distributed interactive analytics
Distributed interactive analytics engines (Druid, Redshift, Pinot)
need to achieve low query latency while using the least storage
space. This paper presents a solution to the problem of replication
of data blocks and routing of queries. Our techniques decide
the replication level of individual data blocks (based on popularity,
access counts), as well as output optimal placement patterns for
such data blocks. For the static version of the problem (given set
of queries accessing some segments), our techniques are provably
optimal in both storage and query latency. For the dynamic version
of the problem, we build a system called Getafix that dynamically
tracks data block popularity, adjusts replication levels, dynamically
routes queries, and garbage collects less useful data blocks. We implemented
Getafix in Druid, the most popular open-source interactive
analytics engine. Our experiments use both synthetic traces
and production traces from Yahoo! Inc.’s production Druid cluster.
Compared to existing techniques, Getafix either reduces the storage space
used by up to 3.5x while achieving comparable query latency, or improves
query latency by up to 60% while using comparable storage.
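The static version of the allocation problem — spread a fixed replica budget across data blocks according to their access counts — can be illustrated with a simple proportional allocator. This is a hypothetical sketch, not Getafix's provably optimal algorithm: every block gets at least one replica, and the spare budget is split by popularity using largest-remainder rounding.

```python
def replication_levels(access_counts, total_replicas):
    """Assign replica counts to data blocks in proportion to access counts.

    Hypothetical popularity-based allocator (invented for illustration).
    access_counts: dict mapping block id -> number of accesses.
    total_replicas: total replica budget, must cover one copy per block.
    """
    blocks = list(access_counts)
    assert total_replicas >= len(blocks), "need at least one replica per block"
    levels = {b: 1 for b in blocks}            # every block stays available
    spare = total_replicas - len(blocks)
    total = sum(access_counts.values()) or 1
    # Fractional fair share of the spare budget for each block.
    shares = {b: spare * access_counts[b] / total for b in blocks}
    for b in blocks:
        levels[b] += int(shares[b])
    # Hand out the rounding leftovers to the largest remainders.
    leftover = spare - sum(int(shares[b]) for b in blocks)
    by_remainder = sorted(blocks, key=lambda b: shares[b] - int(shares[b]),
                          reverse=True)
    for b in by_remainder[:leftover]:
        levels[b] += 1
    return levels
```

A dynamic system in the spirit of Getafix would re-run an allocation like this as tracked popularities change, then adjust placement and garbage collect blocks whose level drops.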
NetClone: Fast, Scalable, and Dynamic Request Cloning for Microsecond-Scale RPCs
Spawning duplicate requests, called cloning, is a powerful technique to
reduce tail latency by masking service-time variability. However, traditional
client-based cloning is static and harmful to performance under high load,
while a recent coordinator-based approach is slow and not scalable. Both
approaches are insufficient to serve modern microsecond-scale Remote Procedure
Calls (RPCs). To this end, we present NetClone, a request cloning system that
makes cloning decisions dynamically within nanoseconds at scale. Rather than
the client or the coordinator, NetClone performs request cloning in the network
switch by leveraging the capability of programmable switch ASICs. Specifically,
NetClone replicates requests based on server states and blocks redundant
responses using request fingerprints in the switch data plane. To realize the
idea while satisfying the strict hardware constraints, we address several
technical challenges when designing a custom switch data plane. NetClone can be
integrated with emerging in-network request schedulers like RackSched. We
implement a NetClone prototype with an Intel Tofino switch and a cluster of
commodity servers. Our experimental results show that NetClone can improve the
tail latency of microsecond-scale RPCs for synthetic and real-world application
workloads and is robust to various system conditions.
Comment: 13 pages, ACM SIGCOMM 202
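The two switch-side mechanisms the abstract names — cloning a request to lightly loaded servers and blocking redundant responses by fingerprint — can be mimicked in host code. The following is a plain-Python stand-in for logic that NetClone actually implements in a programmable switch ASIC's data plane; the class and method names are invented for illustration.

```python
class CloneFilter:
    """Sketch of NetClone-style request cloning with response deduplication.

    Hypothetical host-side model of in-switch logic: each request carries a
    fingerprint; the first response for a fingerprint is delivered and any
    later (redundant) clone's response is dropped.
    """

    def __init__(self):
        self.pending = set()  # fingerprints awaiting a first response

    def clone(self, fingerprint, servers, loads):
        """Pick clone targets: here, simply the two least-loaded servers
        (a crude stand-in for the switch's server-state check)."""
        self.pending.add(fingerprint)
        ranked = sorted(servers, key=lambda s: loads[s])
        return ranked[:2]

    def on_response(self, fingerprint):
        """Deliver the first response; block redundant duplicates."""
        if fingerprint in self.pending:
            self.pending.discard(fingerprint)
            return True   # forward to the client
        return False      # redundant clone response: drop
```

In the real system both the server-state table and the fingerprint table live in the switch data plane, which is why decisions take nanoseconds rather than a round trip to a coordinator.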
The Design and Implementation of Low-Latency Prediction Serving Systems
Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and cost-efficient predictions under heavy query load. These applications employ a variety of machine learning frameworks and models, often composing several models within the same application. However, most machine learning frameworks and systems are optimized for model training rather than deployment.
In this thesis, I discuss three prediction serving systems designed to meet the needs of modern interactive machine learning applications. The key idea in this work is to use a decoupled, layered design that interposes serving systems on top of training frameworks to build low-latency, scalable serving systems. Velox introduced this decoupled architecture to enable fast online learning and model personalization in response to feedback. Clipper generalized this architecture to be framework-agnostic and introduced a set of optimizations to reduce and bound prediction latency and improve prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. And InferLine provisions and manages the individual stages of prediction pipelines to minimize cost while meeting end-to-end tail latency constraints.
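The decoupled, layered idea — a serving layer that interposes on any training framework through a narrow predict interface and enforces a latency bound itself — can be sketched as follows. This is a minimal illustration in the spirit of Clipper's design, with invented names; it is not the actual API of any of the three systems.

```python
import time

class ModelContainer:
    """Framework-agnostic model wrapper (hypothetical, Clipper-like in spirit).

    The serving layer only ever sees predict_fn(input) -> output, so any
    training framework can sit underneath without modification. To bound
    tail latency, inputs that cannot be served before the deadline get a
    default prediction instead of blocking the response.
    """

    def __init__(self, predict_fn, default_output):
        self.predict_fn = predict_fn
        self.default_output = default_output

    def predict(self, inputs, deadline_s):
        start = time.monotonic()
        outputs = []
        for x in inputs:
            if time.monotonic() - start > deadline_s:
                # Latency bound exceeded: degrade gracefully to a default.
                outputs.append(self.default_output)
            else:
                outputs.append(self.predict_fn(x))
        return outputs
```

The design choice this illustrates is the decoupling: latency policy lives in the serving layer, while the framework-specific model logic stays behind the one-function interface.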