Testing the Accuracy of Query Optimizers
The accuracy of a query optimizer is intricately connected with a database system's performance and its operational cost: the more accurate the optimizer's cost model, the better the resulting execution plans. Database application programmers and other practitioners have long provided anecdotal evidence that database systems differ widely with respect to the quality of their optimizers, yet, to date, no formal method is available to database users to assess or refute such claims. In this paper, we develop a framework to quantify an optimizer's accuracy for a given workload. We make use of the fact that optimizers expose switches or hints that let users influence the plan choice and generate plans other than the default plan. Using these mechanisms, we force the generation of multiple alternative plans for each test case, time the execution of all alternatives, and rank the plans by their effective costs. We compare this ranking with the ranking by estimated cost and compute a score for the accuracy of the optimizer. We present initial results of an anonymized comparison of several major commercial database systems, demonstrating that there are in fact substantial differences between systems. We also suggest ways to incorporate this knowledge into the commercial development process.
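The ranking comparison at the core of the framework lends itself to a compact illustration. The sketch below is only an assumed rendering of the idea: it scores one test case by the fraction of plan pairs that the estimated costs order the same way as the measured runtimes (a Kendall-tau-style statistic); the paper's actual scoring formula and plan-forcing mechanics are not reproduced here.

```python
# Minimal sketch: score an optimizer's accuracy for one test query by comparing
# the ranking induced by its cost estimates with the ranking of measured runtimes.
# The scoring rule (normalized count of concordant plan pairs) is an illustrative
# assumption, not the paper's metric.
from itertools import combinations

def accuracy_score(plans):
    """plans: list of (estimated_cost, measured_runtime) pairs for alternative
    plans forced via optimizer hints/switches. Returns a value in [0, 1]."""
    pairs = list(combinations(range(len(plans)), 2))
    if not pairs:
        return 1.0
    concordant = 0
    for i, j in pairs:
        est_order = plans[i][0] - plans[j][0]
        run_order = plans[i][1] - plans[j][1]
        if est_order * run_order > 0:        # estimates order this pair correctly
            concordant += 1
    return concordant / len(pairs)

# Example: three alternative plans for one query (estimated cost, observed seconds).
print(accuracy_score([(100, 1.2), (250, 3.0), (400, 2.1)]))  # 2 of 3 pairs ordered correctly
```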
Stubby: A Transformation-based Optimizer for MapReduce Workflows
There is a growing trend of performing analysis on large datasets using workflows composed of MapReduce jobs connected through producer-consumer relationships based on data. This trend has spurred the development of a number of interfaces--ranging from program-based to query-based interfaces--for generating MapReduce workflows. Studies have shown that the gap in performance can be quite large between optimized and unoptimized workflows. However, automatic cost-based optimization of MapReduce workflows remains a challenge due to the multitude of interfaces, large size of the execution plan space, and the frequent unavailability of all types of information needed for optimization. We introduce a comprehensive plan space for MapReduce workflows generated by popular workflow generators. We then propose Stubby, a cost-based optimizer that searches selectively through the subspace of the full plan space that can be enumerated correctly and costed based on the information available in any given setting. Stubby enumerates the plan space based on plan-to-plan transformations and an efficient search algorithm. Stubby is designed to be extensible to new interfaces and new types of optimizations, which is a desirable feature given how rapidly MapReduce systems are evolving. Stubby's efficiency and effectiveness have been evaluated using representative workflows from many domains.
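As a rough intuition for transformation-based plan search, the following sketch applies plan-to-plan transformations greedily under a cost model. The plan representation, the transformations, and the greedy hill-climbing loop are all illustrative assumptions; Stubby's actual enumeration and costing are considerably more sophisticated.

```python
# Illustrative sketch of transformation-based plan search in the spirit of Stubby.
# Plan representation, transformations, and cost model are all hypothetical.
def search(initial_plan, transformations, cost, max_iters=100):
    """Greedy hill-climbing over plan-to-plan transformations:
    repeatedly apply the transformation that most reduces estimated cost."""
    best = initial_plan
    for _ in range(max_iters):
        candidates = [t(best) for t in transformations]
        candidates = [p for p in candidates if p is not None]  # drop inapplicable ones
        if not candidates:
            break
        challenger = min(candidates, key=cost)
        if cost(challenger) >= cost(best):
            break                      # local optimum reached
        best = challenger
    return best

# Toy usage: plans are tuples of jobs; one transformation merges the first two
# jobs (a stand-in for "vertical packing"), which the toy cost model rewards.
toy_cost = lambda plan: len(plan) * 10
merge_first_two = lambda plan: (("merged",) + plan[2:]) if len(plan) >= 2 else None
print(search(("job1", "job2", "job3"), [merge_first_two], toy_cost))  # ('merged',)
```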
Design-time performance testing
Software designers make decisions between alternate approaches early in the development of a software application and these decisions can be difficult to change later. Designers make these decisions based on estimates of how alternatives affect software qualities. One software quality that can be difficult to predict is performance, that is, the efficient use of resources in the system. It is particularly challenging to estimate the performance of large, interconnected software systems composed of components. With the proliferation of class libraries, middle-ware systems, web services, and third party components, many software projects rely on third party services to meet their requirements. Often choosing between services involves considering both the functionality and performance of the services. To help software developers compare their designs and third-party services, I propose using performance prototypes of alternatives and test suites to estimate performance trade-offs early in the development cycle, a process called Design-Time Performance Testing (DTPT).
Providing software designers with performance evidence based on prototypes will allow them to make informed decisions regarding performance trade-offs. To show how DTPT can help inform real design decisions, this thesis contributes a process for DTPT, a framework implementation written in Java, and experiments to verify and validate the process and implementation. The implemented framework assists in designing, running, and documenting performance test suites, allowing designers to make accurate comparisons between alternate approaches. Performance metrics are captured by instrumenting and running prototypes.
This thesis describes the process and framework for gathering software performance estimates at design-time using prototypes and test suites.
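The core DTPT workflow, running the same test suite against prototype implementations of each design alternative and comparing measured timings, can be sketched in a few lines. The snippet below is a minimal illustration in Python (the thesis framework itself is written in Java), and the two "alternatives" are hypothetical stand-ins.

```python
# Sketch of the DTPT idea: run the same test suite against two prototypes of a
# design alternative and compare measured timings. Illustrative only; the real
# framework described in the thesis is a Java test-suite harness.
import statistics
import time

def time_prototype(prototype, test_suite, repetitions=5):
    """Return the median wall-clock time (seconds) of running the test suite."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        for test_input in test_suite:
            prototype(test_input)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Hypothetical alternatives: two candidate implementations of the same requirement.
test_suite = [list(range(n, 0, -1)) for n in (100, 1_000, 10_000)]
alt_a = lambda data: sorted(data)                          # built-in sort
alt_b = lambda data: sorted(data, key=lambda x: -x)[::-1]  # deliberately wasteful variant
print("Alternative A:", time_prototype(alt_a, test_suite))
print("Alternative B:", time_prototype(alt_b, test_suite))
```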
Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training
Training deep neural networks (DNNs) is becoming increasingly resource- and energy-intensive every year. Unfortunately, existing works primarily focus on optimizing DNN training for faster completion, often without considering the impact on energy efficiency.
In this paper, we observe that common practices to improve training performance can often lead to inefficient energy usage. More importantly, we demonstrate that there is a tradeoff between energy consumption and performance optimization. To this end, we propose Zeus, an optimization framework to navigate this tradeoff by automatically finding optimal job- and GPU-level configurations for recurring DNN training jobs. Zeus uses an online exploration-exploitation approach in conjunction with just-in-time energy profiling, averting the need for expensive offline measurements, while adapting to data drifts over time. Our evaluation shows that Zeus can improve the energy efficiency of DNN training by 15.3%-75.8% for diverse workloads. (NSDI 2023; homepage: https://ml.energy/zeu)
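A minimal sketch of the exploration-exploitation idea: across recurrences of a training job, try GPU power-limit configurations and prefer the one minimizing a weighted cost of measured energy and time. The epsilon-greedy selection rule and the cost definition below are simplifying assumptions, not Zeus's actual algorithm.

```python
# Simplified sketch of configuration tuning for recurring training jobs.
# The epsilon-greedy rule and the energy/time cost are illustrative assumptions.
import random

class ConfigTuner:
    def __init__(self, power_limits, eta=0.5, epsilon=0.1):
        self.power_limits = power_limits
        self.eta = eta                      # energy vs. time trade-off knob
        self.epsilon = epsilon
        self.observations = {p: [] for p in power_limits}

    def cost(self, energy_j, time_s):
        return self.eta * energy_j + (1 - self.eta) * time_s

    def choose(self):
        """Explore untried (or random) configs; otherwise exploit the best so far."""
        untried = [p for p in self.power_limits if not self.observations[p]]
        if untried or random.random() < self.epsilon:
            return random.choice(untried or self.power_limits)   # explore
        avg = lambda p: sum(self.observations[p]) / len(self.observations[p])
        return min(self.power_limits, key=avg)                   # exploit

    def record(self, power_limit, energy_j, time_s):
        self.observations[power_limit].append(self.cost(energy_j, time_s))

tuner = ConfigTuner(power_limits=[150, 200, 250, 300])   # watts, hypothetical limits
limit = tuner.choose()                                   # config for the next recurrence
tuner.record(limit, energy_j=5.2e5, time_s=1800)         # feed back measured energy/time
```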
Physical Plan Instrumentation in Databases: Mechanisms and Applications
Database management systems (DBMSs) are designed to compile SQL queries into physical plans that, when executed, produce the queries' results. Building on this functionality, an ever-increasing number of application domains (e.g., provenance management, online query optimization, physical database design, interactive data profiling, monitoring, and interactive data visualization) seek to operate on how queries are executed by the DBMS, for a wide variety of purposes ranging from debugging and data explanation to optimization and monitoring. Unfortunately, DBMSs provide little, if any, support to facilitate the development of this class of important applications. As a result, database application developers and database system architects either rewrite the database internals in ad hoc ways; work around the SQL interface, if possible, with inevitable performance penalties; or even build new databases from scratch only to express and optimize their domain-specific application logic over how queries are executed.
To address this problem in a principled manner in this dissertation, we introduce a prototype DBMS, namely, Smoke, that exposes instrumentation mechanisms in the form of a framework to allow external applications to manipulate physical plans. Intuitively, a physical plan is the underlying representation that DBMSs use to encode how a SQL query will be executed, and providing instrumentation mechanisms at this representation level allows applications to express and optimize their logic on how queries are executed.
Having such an instrumentation-enabled DBMS in place, we then consider how to express and optimize applications whose logic relies on how queries are executed. To best demonstrate the expressive and optimization power of instrumentation-enabled DBMSs, we express and optimize applications across several important domains, including provenance management, interactive data visualization, interactive data profiling, physical database design, online query optimization, and query discovery. Expressivity-wise, we show that Smoke can express known techniques, introduce novel semantics on known techniques, and introduce new techniques across domains. Performance-wise, we show case by case that Smoke is on par with, or up to several orders of magnitude faster than, state-of-the-art imperative and declarative implementations of important applications across domains.
As such, we believe our contributions provide evidence for, and form the basis of, a class of instrumentation-enabled DBMSs aimed at expressing and optimizing applications, across important domains, whose core logic operates over how queries are executed by the DBMS.
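To make the notion of physical plan instrumentation concrete, the toy sketch below wraps a filter operator so that, besides producing its output, it records which input rows contributed to each output row (lineage capture, one of the provenance use cases above). It is purely conceptual; Smoke's mechanisms live inside a DBMS's physical plans, not in Python iterators.

```python
# Toy illustration of instrumenting a physical operator: a filter that, besides
# producing its output, records which input rows produced it (lineage).
def instrumented_filter(rows, predicate, lineage):
    """Yield rows satisfying predicate while recording input->output lineage."""
    for in_pos, row in enumerate(rows):
        if predicate(row):
            lineage.append(in_pos)     # remember which input row produced this output
            yield row

lineage = []
employees = [("ann", 140_000), ("bob", 90_000), ("eve", 200_000)]
high_paid = list(instrumented_filter(employees, lambda r: r[1] > 100_000, lineage))
print(high_paid)   # [('ann', 140000), ('eve', 200000)]
print(lineage)     # [0, 2] -> provenance of each output row
```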
Use of a Fast Information Extraction Method as a Decision Support Tool
Ad-hoc extraction of information from documents can ensure the transparency of decisions made by an organization. Different Information Extraction methods have been applied to extract information from various domains. The most widely known methods use manually annotated training documents that require high development time, and automated training methods are not scalable to large application domains. We have developed a semi-automated knowledge-engineering method for building the knowledge base with minimal effort. Because our method reduces manual processing of the training data, the development process is very fast. We have developed a prototype application to extract information from the project reports of the American Recovery and Reinvestment Act (ARRA) of 2009. The fast development process of our system, its scalability to large application domains, and its high extraction effectiveness will help ensure the transparency of management decisions by extracting and mining relevant information.
A Survey on Automatic Parameter Tuning for Big Data Processing Systems
Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators grapple with understanding and tuning them to achieve good performance. We investigate existing approaches to parameter tuning for both batch and stream data processing systems and classify them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We summarize the pros and cons of each approach and raise some open research problems for automatic parameter tuning.
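As a flavor of the simplest category in this taxonomy, rule-based tuning derives settings from fixed heuristics over cluster resources. The sketch below uses Spark-style parameter names and rules of thumb that circulate as common community guidance; they are illustrative only and are not drawn from the surveyed papers.

```python
# Toy rule-based tuner: derive a few Spark-like settings from cluster resources
# using fixed heuristics. The rules are illustrative assumptions, not recommendations.
def rule_based_config(node_cores, node_mem_gb, num_nodes):
    cores_per_executor = 5 if node_cores >= 5 else node_cores            # cap cores per executor
    executors_per_node = max(1, (node_cores - 1) // cores_per_executor)  # leave a core for OS/daemons
    mem_per_executor = int((node_mem_gb - 1) / executors_per_node * 0.9) # reserve memory headroom
    return {
        "spark.executor.cores": cores_per_executor,
        "spark.executor.instances": executors_per_node * num_nodes,
        "spark.executor.memory": f"{mem_per_executor}g",
    }

print(rule_based_config(node_cores=16, node_mem_gb=64, num_nodes=10))
```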
Fast machine translation on parallel and massively parallel hardware
Parallel systems have been widely adopted in the field of machine translation, because the raw computational power they offer is well suited to this computationally intensive task. However, programming for parallel hardware is not trivial, as it requires redesign of the existing algorithms. In my thesis I design efficient algorithms for machine translation on parallel hardware. I identify memory accesses as the biggest bottleneck to processing speed and propose novel algorithms that minimize them. I present three distinct case studies in which minimizing memory access substantially improves speed: Starting with statistical machine translation, I design a phrase table that makes decoding ten times faster on a multi-threaded CPU. Next, I design a GPU-based n-gram language model that is twice as fast per £ as a highly optimized CPU implementation. Turning to neural machine translation, I design new stochastic gradient descent techniques that make end-to-end training twice as fast. The work in this thesis has been incorporated into two popular machine translation toolkits: Moses and Marian.
Evaluation of Sql Performance Tuning Features in Oracle Database Software
Timely access to data is one of the most important requirements of database management systems. Having access to data in acceptable time is crucial for efficient decision making. Tuning inefficient SQL is one of the most important elements of enhancing database performance. With growing repositories and the complexity of underlying data management systems, maintaining decent levels of performance through tuning has become a complicated task. DBMS providers acknowledge this tendency and have developed tools and features that simplify the process. DBAs and developers have to make use of these tools in the attempt to provide their companies with stable and efficient systems. Performance tuning functions differ from platform to platform. Oracle is one of the leading DBMS providers, and this study focuses on the tools provided across releases of its software. A thorough literature analysis is performed in order to gain an understanding of the functionality of each tool, and each tool is assessed. The study also provides insight into the actual utilization of these tools by gathering responses through an online survey and an analysis of the results.