Tempo: Robust and Self-Tuning Resource Management in Multi-tenant Parallel Databases
Multi-tenant database systems have a component called the Resource Manager,
or RM that is responsible for allocating resources to tenants. RMs today do not
provide direct support for performance objectives such as: "Average job
response time of tenant A must be less than two minutes", or "No more than 5%
of tenant B's jobs can miss the deadline of 1 hour." Thus, DBAs have to tinker
with the RM's low-level configuration settings to meet such objectives. We
propose a framework called Tempo that brings simplicity, self-tuning, and
robustness to existing RMs. Tempo provides a simple interface for DBAs to
specify performance objectives declaratively, and optimizes the RM
configuration settings to meet these objectives. Tempo has a solid theoretical
foundation which gives key robustness guarantees. We report experiments done on
Tempo using production traces of data-processing workloads from companies such
as Facebook and Cloudera. These experiments demonstrate significant
improvements in meeting desired performance objectives over RM configuration
settings specified by human experts. Comment: 14 pages, 12 figures, 2 tables
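The two example objectives quoted in the Tempo abstract can be written down as checkable predicates. The following is a minimal sketch and not Tempo's actual interface: the `Objective` class, the helper names, and the sample response times are all hypothetical, chosen only to illustrate what "declarative performance objectives" might look like.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

Check = Callable[[List[float]], bool]


@dataclass
class Objective:
    """Hypothetical declarative objective: a tenant plus a predicate over
    that tenant's job response times (in seconds)."""
    tenant: str
    description: str
    is_met: Check


def avg_response_below(limit_s: float) -> Check:
    # "Average job response time must be less than <limit>"
    return lambda times: mean(times) < limit_s


def miss_rate_at_most(deadline_s: float, max_fraction: float) -> Check:
    # "No more than <max_fraction> of jobs can miss the deadline"
    def check(times: List[float]) -> bool:
        missed = sum(1 for t in times if t > deadline_s)
        return missed / len(times) <= max_fraction
    return check


def evaluate(objectives: List[Objective],
             response_times: Dict[str, List[float]]) -> Dict[str, bool]:
    """Report, per objective, whether the observed response times satisfy it."""
    return {o.description: o.is_met(response_times[o.tenant]) for o in objectives}


objectives = [
    Objective("A", "A: average response time < 2 min", avg_response_below(120.0)),
    Objective("B", "B: at most 5% of jobs miss the 1 h deadline",
              miss_rate_at_most(3600.0, 0.05)),
]

# Made-up observations: tenant A averages about 113 s; 1 of tenant B's 20 jobs is late.
observed = {
    "A": [90.0, 100.0, 150.0],
    "B": [600.0] * 19 + [4000.0],
}
report = evaluate(objectives, observed)
print(report)
```

A self-tuning framework would search over RM configuration settings until a report like this one shows every objective as met; that search is not sketched here.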
FECBench: A Holistic Interference-aware Approach for Application Performance Modeling
Services hosted in multi-tenant cloud platforms often encounter performance
interference due to contention for non-partitionable resources, which in turn
causes unpredictable behavior and degradation in application performance. To
grapple with these problems and to define effective resource management
solutions for their services, providers often must expend significant efforts
and incur prohibitive costs in developing performance models of their services
under a variety of interference scenarios on different hardware. This is a hard
problem due to the wide range of possible co-located services and their
workloads, and the growing heterogeneity in the runtime platforms including the
use of fog and edge-based resources, not to mention the accidental complexity
in performing application profiling under a variety of scenarios. To address
these challenges, we present FECBench, a framework to guide providers in
building performance interference prediction models for their services without
incurring undue costs and efforts. The contributions of the paper are as
follows. First, we developed a technique to build resource stressors that can
stress multiple system resources all at once in a controlled manner to gain
insights about the interference on an application's performance. Second, to
overcome the need for exhaustive application profiling, FECBench intelligently
uses the design of experiments (DoE) approach to enable users to build
surrogate performance models of their services. Third, FECBench maintains an
extensible knowledge base of application combinations that create resource
stresses across the multi-dimensional resource design space. Empirical results
using real-world scenarios to validate the efficacy of FECBench show that the
predicted application performance has a median error of only 7.6% across all
test cases, with 5.4% in the best case and 13.5% in the worst case.
PerfEnforce: A Dynamic Scaling Engine for Analytics with Performance Guarantees
In this paper, we present PerfEnforce, a scaling engine designed to enable
cloud providers to sell performance levels for data analytics cloud services.
PerfEnforce scales a cluster of virtual machines allocated to a user in a way
that minimizes cost while probabilistically meeting the query runtime
guarantees offered by a service level agreement. With PerfEnforce, we show how
to scale a cluster in a way that minimally disrupts a user's query session. We
further show when to scale the cluster using one of three methods: feedback
control, reinforcement learning, or perceptron learning. We find that
perceptron learning outperforms the other two methods when making cluster
scaling decisions.
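Of the three scaling methods named in the PerfEnforce abstract, feedback control is the simplest to illustrate. The sketch below is a generic proportional controller, not PerfEnforce's own controller: the function name, the `gain` value, and the node limits are assumptions made for illustration.

```python
def next_cluster_size(current_nodes: int,
                      observed_runtime_s: float,
                      guaranteed_runtime_s: float,
                      gain: float = 0.5,
                      min_nodes: int = 1,
                      max_nodes: int = 32) -> int:
    """Proportional feedback step (illustrative assumption, not the paper's method).

    A positive relative error means queries run slower than the guarantee,
    so the cluster grows; a negative error lets it shrink to save cost.
    """
    error = (observed_runtime_s - guaranteed_runtime_s) / guaranteed_runtime_s
    proposed = current_nodes * (1.0 + gain * error)
    # Clamp to the allowed cluster sizes.
    return max(min_nodes, min(max_nodes, round(proposed)))


print(next_cluster_size(4, observed_runtime_s=20.0, guaranteed_runtime_s=10.0))
print(next_cluster_size(8, observed_runtime_s=5.0, guaranteed_runtime_s=10.0))
```

The abstract reports that perceptron learning outperformed this style of controller as well as reinforcement learning; the sketch only shows the baseline idea of reacting to the gap between observed and guaranteed runtime.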
Identifying the Major Sources of Variance in Transaction Latencies: Towards More Predictable Databases
Decades of research have sought to improve transaction processing performance
and scalability in database management systems (DBMSs). However, significantly
less attention has been dedicated to the predictability of performance: how
often individual transactions exhibit execution latency far from the mean.
Performance predictability is vital when transaction processing lies on the
critical path of complex enterprise software or an interactive web service,
as well as in emerging database-as-a-service markets where customers contract
for guaranteed levels of performance. In this paper, we take several steps
towards achieving more predictable database systems. First, we propose a
profiling framework called VProfiler that, given the source code of a DBMS, is
able to identify the dominant sources of variance in transaction latency.
VProfiler automatically instruments the DBMS source code to deconstruct the
overall variance of transaction latencies into variances and covariances of the
execution time of individual functions, which in turn provide insight into the
root causes of variance. Second, we use VProfiler to analyze MySQL and Postgres
- two of the most popular and complex open-source database systems. Our case
studies reveal that the primary causes of variance in MySQL and Postgres are
lock scheduling and centralized logging, respectively. Finally, based on
VProfiler's findings, we further focus on remedying the performance variance of
MySQL by (1) proposing a new lock scheduling algorithm, called Variance-Aware
Transaction Scheduling (VATS), (2) enhancing the buffer pool replacement
policy, and (3) identifying tuning parameters that can reduce variance
significantly. Our experimental results show that our schemes reduce overall
transaction latency variance by 37% on average (and up to 64%) without
compromising throughput or mean latency.
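The decomposition that the VProfiler abstract describes rests on a standard identity: if a transaction's latency is the sum of per-function times X_1, ..., X_n, then Var(X_1 + ... + X_n) equals the sum of the Var(X_i) plus twice the sum of Cov(X_i, X_j) over all pairs i < j. The sketch below checks that identity numerically. The function names and timings are made up for illustration, and it does not reproduce VProfiler's instrumentation of the DBMS source code.

```python
from itertools import combinations
from typing import Dict, List


def covariance(xs: List[float], ys: List[float]) -> float:
    """Population covariance; covariance(xs, xs) is the population variance."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def decompose_variance(times: Dict[str, List[float]]) -> Dict[str, float]:
    """Split the variance of the summed latency into per-function variance
    terms and pairwise covariance terms."""
    terms = {f"Var({name})": covariance(xs, xs) for name, xs in times.items()}
    for (a, xs), (b, ys) in combinations(times.items(), 2):
        terms[f"2*Cov({a},{b})"] = 2.0 * covariance(xs, ys)
    return terms


# Hypothetical per-transaction execution times (ms) of three functions.
times = {
    "lock_wait": [1.0, 9.0, 2.0, 8.0],
    "log_flush": [3.0, 3.5, 3.0, 3.5],
    "execute":   [5.0, 5.0, 6.0, 6.0],
}

terms = decompose_variance(times)
totals = [sum(vals) for vals in zip(*times.values())]  # latency per transaction
total_variance = covariance(totals, totals)

# Print the terms from largest to smallest contribution.
for name, value in sorted(terms.items(), key=lambda kv: -kv[1]):
    print(f"{name:28s} {value:8.4f}")
print(f"{'sum of terms':28s} {sum(terms.values()):8.4f}")
print(f"{'variance of total latency':28s} {total_variance:8.4f}")
```

Ranking the terms by size points at the functions that contribute most to latency variance; in this made-up data the `lock_wait` term dominates.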
Multiple Workflows Scheduling in Multi-tenant Distributed Systems: A Taxonomy and Future Directions
A workflow is a general notion representing automated processes along
with the flow of data. Automation ensures that the processes are executed in
order, a feature that attracts users from various backgrounds to
build workflows. However, the computational requirements are enormous, and
investing in dedicated infrastructure for these workflows is not always
feasible. To cater to these broader needs, multi-tenant platforms for executing
workflows have begun to be built. In this paper, we identify the problems and
challenges of scheduling multiple workflows on such platforms.
We present a detailed taxonomy of existing solutions covering scheduling and
resource provisioning, followed by a survey of relevant work in this
area. We lay out open problems and challenges to stimulate research on
multiple workflows scheduling in multi-tenant distributed systems. Comment: Several changes have been made based on reviewers' comments after
the first-round review. This is a pre-print of a paper (currently under
second-round review) submitted to ACM Computing Surveys
Serifos: Workload Consolidation and Load Balancing for SSD Based Cloud Storage Systems
Achieving high performance in virtualized data centers requires both
deploying high throughput storage clusters, i.e. based on Solid State Disks
(SSDs), as well as optimally consolidating the workloads across storage nodes.
Nowadays, the only practical solution for cloud storage providers to offer
guaranteed performance is to grossly over-provision the storage nodes. The
current workload scheduling mechanisms used in production do not have the
intelligence to optimally allocate block storage volumes based on the
performance of SSDs. In this paper, we introduce Serifos, an autonomous
performance modeling and load balancing system designed for SSD-based cloud
storage. Serifos takes into account the characteristics of the SSD storage
units and constructs hardware dependent workload consolidation models. Thus
Serifos is able to predict the latency caused by workload interference and the
average latency of concurrent workloads. Furthermore, Serifos leverages an I/O
load balancing algorithm to dynamically balance the volumes across the cluster.
Experimental results indicate that the Serifos consolidation model is able to
maintain a mean prediction error of around 10% on heterogeneous hardware. As
a result of Serifos load balancing, we found that the variance and the maximum
average latency are reduced by 82% and 52%, respectively. The supported Service
Level Objectives (SLOs) on the testbed improve by 43% for average latency, 32% for
maximum read latency, and 63% for maximum write latency. Comment: 12 pages
IOTune: A G-states Driver for Elastic Performance of Block Storage
Imagine a disk that provides baseline performance at a relatively low
price during low-load periods, but whose performance is automatically
promoted in situ and in real time when workloads demand more resources. In a
hardware era, this is hardly achievable. However, this imagined disk is
becoming reality due to the technical advances of software-defined storage,
which enable volume performance to be adjusted on the fly. We propose IOTune, a
resource management middleware which employs software-defined storage
primitives to implement G-states of virtual block devices. G-states enable
virtual block devices to serve at multiple performance gears, getting rid of
conflicts between immutable resource reservation and dynamic resource demands,
and always achieving resource right-provisioning for workloads. Accompanying
G-states, we also propose a new block storage pricing policy for cloud
providers. Our case study for applying G-states to cloud block storage verifies
the effectiveness of the IOTune framework. Trace-replay based evaluations
demonstrate that storage volumes with G-states adapt to workload fluctuations.
For tenants, G-states enable volumes to provide much better QoS at the same
cost of ownership, compared with static IOPS provisioning and the I/O credit
mechanism. G-states also reduce I/O tail latencies by one to two orders of
magnitude. From the standpoint of cloud providers, G-states increase storage
utilization, creating value and strengthening competitiveness. G-states supported
by IOTune provide a new paradigm for storage resource management and pricing in
multi-tenant clouds. Comment: 15 pages, 10 figures
Query2Vec: An Evaluation of NLP Techniques for Generalized Workload Analytics
We consider methods for learning vector representations of SQL queries to
support generalized workload analytics tasks, including workload summarization
for index selection and predicting queries that will trigger memory errors. We
consider vector representations of both raw SQL text and optimized query plans,
and evaluate these methods on synthetic and real SQL workloads. We find that
general algorithms based on vector representations can outperform existing
approaches that rely on specialized features. For index recommendation, we
cluster the vector representations to compress large workloads with no loss in
performance from the recommended index. For error prediction, we train a
classifier over learned vectors that can automatically relate subtle syntactic
patterns with specific errors raised during query execution. Surprisingly, we
also find that these methods enable transfer learning, where a model trained on
one SQL corpus can be applied to an unrelated corpus and still enable good
performance. We find that these general approaches, when trained on a large
corpus of SQL queries, provide a robust foundation for a variety of workload
analysis tasks and database features, without requiring application-specific
feature engineering.
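The workload-compression idea in the Query2Vec abstract can be illustrated with a much cruder representation than the learned embeddings it evaluates. The sketch below is a stand-in under loud assumptions: it uses plain keyword and identifier counts instead of learned vectors, and a greedy similarity threshold instead of a real clustering algorithm. The function names, the threshold, and the sample queries are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def embed(sql: str) -> Counter:
    """Toy vector for a query: counts of keywords and identifiers.
    Numeric literals are ignored, so queries that differ only in constants
    map to the same vector. (A stand-in for a learned embedding.)"""
    return Counter(re.findall(r"[a-z_]+", sql.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(workload: List[str], threshold: float = 0.9) -> List[str]:
    """Greedy compression: keep a query only if it is not similar
    (cosine >= threshold) to a representative that was already kept."""
    representatives: List[str] = []
    for sql in workload:
        vec = embed(sql)
        if all(cosine(vec, embed(rep)) < threshold for rep in representatives):
            representatives.append(sql)
    return representatives


workload = [
    "SELECT name FROM users WHERE id = 1",
    "SELECT name FROM users WHERE id = 42",
    "SELECT total FROM orders WHERE user_id = 7",
]
summary = summarize(workload)
print(summary)
```

An index advisor could then be run on `summary` instead of the full workload; whether the recommended index stays as good depends on the quality of the representation, which is the question the paper studies.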
Database-Agnostic Workload Management
We present a system to support generalized SQL workload analysis and
management for multi-tenant and multi-database platforms. Workload analysis
applications are becoming more sophisticated to support database
administration, model user behavior, audit security, and route queries, but the
methods rely on specialized feature engineering, and therefore must be
carefully implemented and reimplemented for each SQL dialect, database system,
and application. Meanwhile, the size and complexity of workloads are increasing
as systems centralize in the cloud. We model workload analysis and management
tasks as variations on query labeling, and propose a system design that can
support general query labeling routines across multiple applications and
database backends. The design relies on the use of learned vector embeddings
for SQL queries as a replacement for application-specific syntactic features,
reducing custom code and allowing the use of off-the-shelf machine learning
algorithms for labeling. The key hypothesis, for which we provide evidence in
this paper, is that these learned features can outperform conventional feature
engineering on representative machine learning tasks. We present the design of
a database-agnostic workload management and analytics service, describe
potential applications, and show that separating workload representation from
labeling tasks affords new capabilities and can outperform existing solutions
for representative tasks, including workload sampling for index recommendation
and user labeling for security audits.
Learning-based Dynamic Cache Management in a Cloud
Caches are an important component of modern computing systems given their
significant impact on performance. In particular, caches play a key role in the
cloud due to the nature of large-scale, data-intensive processing. One of the
key challenges for the cloud providers is how to share the caching capacity
among tenants, under the circumstance that each often requires a different
degree of quality of service (QoS) with respect to data access performance. The
invariant is that the individual tenants' QoS requirements should be satisfied
while the cache usage is optimized in a system-wide manner. In this paper, we
introduce a learning-based approach for dynamic cache management in a cloud,
which is based on estimating a tenant's data access pattern and
predicting cache performance for that access pattern. We consider
a variety of probability distributions to estimate the data access pattern, and
examine a set of learning-based regression techniques to predict the cache hit
rate for the access pattern. The predicted cache hit rate is then used to make
a decision whether reallocating cache space is needed to meet the QoS
requirement for the tenant. Our experimental results with an extensive set of
synthetic traces and the YCSB benchmark show that the proposed method
consistently optimizes the cache space while satisfying the QoS requirement.
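The final decision step described in this abstract, using a predicted hit rate to decide whether a tenant needs more cache, can be sketched with a simple expected-latency model. This is an illustrative assumption rather than the paper's method: the QoS requirement is taken to be a bound on mean access latency, the hit and miss latencies are made-up constants, and the predicted hit rate is passed in rather than produced by a regression model.

```python
def expected_latency_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Mean access latency under a simple two-level model (cache hit or miss)."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


def min_hit_rate_for(qos_latency_ms: float, hit_ms: float, miss_ms: float) -> float:
    """Smallest hit rate whose expected latency meets the QoS bound."""
    return (miss_ms - qos_latency_ms) / (miss_ms - hit_ms)


def needs_more_cache(predicted_hit_rate: float,
                     qos_latency_ms: float,
                     hit_ms: float = 0.1,    # hypothetical cache-hit latency
                     miss_ms: float = 5.0    # hypothetical backing-store latency
                     ) -> bool:
    """True if the predicted hit rate is too low to meet the tenant's QoS bound."""
    return expected_latency_ms(predicted_hit_rate, hit_ms, miss_ms) > qos_latency_ms


print(needs_more_cache(0.50, qos_latency_ms=1.0))
print(needs_more_cache(0.95, qos_latency_ms=1.0))
print(round(min_hit_rate_for(1.0, hit_ms=0.1, miss_ms=5.0), 4))
```

With these assumed latencies, a tenant with a 1 ms bound needs a hit rate of roughly 82% or higher; a predicted rate below that would trigger a reallocation, and a rate well above it suggests space could be released to other tenants.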
- …