Query Generation based on Generative Adversarial Networks
Many problems in database systems, such as cardinality estimation, database
testing and optimizer tuning, require a large query load as data. However, it
is often difficult to obtain a large number of real queries from users due to
user privacy restrictions or low frequency of database access. Query generation
is one approach to solving this problem. Existing query generation
methods, such as random generation and template-based generation, do not
consider the relationship between the generated queries and existing queries,
and may even produce semantically incorrect queries. In this paper, we propose a
query generation framework based on generative adversarial networks (GAN) to
generate a query load similar to a given one. In our framework,
we use a syntax parser to transform the query into a parse tree and traverse
the tree to obtain the sequence of production rules corresponding to the query.
The generator of GAN takes a fixed distribution prior as input and outputs the
query sequence, while the discriminator takes real queries and the fake queries
generated by the generator as input and provides a gradient signal that guides
the generator's learning. In addition, we add context-free grammar and semantic rules
to the generation process, which ensures that the generated queries are
syntactically and semantically correct. We conduct experiments on a real-world
dataset; the results show that our approach generates new query loads whose
distribution is similar to that of the given query load, and that the generated
queries are syntactically correct with no semantic errors. When the generated
query loads are used in a downstream task, models trained with the expanded
query loads show a significant improvement.
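As a concrete illustration of the parse-tree encoding described above, the following sketch shows how a query can be represented as a pre-order sequence of production-rule indices and reconstructed from it. The toy grammar, rule numbering, and example query are invented for illustration; they are not the paper's actual grammar:

```python
# Hypothetical sketch: a query as a sequence of grammar production rules.
# The grammar below is a tiny illustrative fragment, not the real one.

# rule index -> (non-terminal, right-hand side)
GRAMMAR = {
    0: ("query", ["SELECT", "col", "FROM", "tab", "where"]),
    1: ("where", ["WHERE", "col", "=", "val"]),
    2: ("where", []),                    # empty WHERE clause
    3: ("col",   ["age"]),
    4: ("col",   ["name"]),
    5: ("tab",   ["users"]),
    6: ("val",   ["42"]),
}
NON_TERMINALS = {lhs for lhs, _ in GRAMMAR.values()}

def expand(rule_sequence):
    """Rebuild the query text by replaying production rules in
    pre-order, consuming one rule per non-terminal encountered."""
    rules = iter(rule_sequence)

    def derive(symbol):
        if symbol not in NON_TERMINALS:
            return [symbol]              # terminal token, emit as-is
        lhs, rhs = GRAMMAR[next(rules)]
        assert lhs == symbol, "rule does not match expected non-terminal"
        tokens = []
        for s in rhs:
            tokens.extend(derive(s))
        return tokens

    return " ".join(derive("query"))

# Pre-order rule sequence for: SELECT age FROM users WHERE name = 42
print(expand([0, 3, 5, 1, 4, 6]))
```

A generator that is only allowed to emit rule indices whose left-hand side matches the currently expected non-terminal (the condition the `assert` checks at decode time) can, by construction, never produce a syntactically invalid query.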
06472 Abstracts Collection - XQuery Implementation Paradigms
From 19.11.2006 to 22.11.2006, the Dagstuhl Seminar 06472 ``XQuery Implementation Paradigms'' was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar, as well as abstracts of seminar results and ideas, are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided, if available.
Is Your Learned Query Optimizer Behaving As You Expect? A Machine Learning Perspective
The current boom of learned query optimizers (LQO) can be explained not only
by the general continuous improvement of deep learning (DL) methods but also by
the straightforward formulation of a query optimization problem (QOP) as a
machine learning (ML) one. The idea is often to replace dynamic programming
approaches, widespread for solving QOP, with more powerful methods such as
reinforcement learning. However, such a rapid "game change" in the field of QOP
has not come without consequences: the parts of the ML pipeline other than
predictive model development still have large improvement potential. For
instance, different LQOs introduce their own restrictions on generating
training data from queries, use arbitrary train/validation procedures, and
evaluate on self-chosen splits of benchmark queries.
In this paper, we attempt to standardize the ML pipeline for evaluating LQOs
by introducing a new end-to-end benchmarking framework. Additionally, we guide
the reader through each data science stage in the ML pipeline and provide novel
insights from the machine learning perspective, considering the specifics of
QOP. Finally, we perform a rigorous evaluation of existing LQOs, showing that
PostgreSQL outperforms these LQOs in almost all experiments across different
train/test splits.
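One ingredient of such a standardized pipeline, a deterministic and reproducible train/test split of benchmark queries, can be sketched as follows. The hash-based scheme and the query identifiers are illustrative assumptions, not the framework's actual design:

```python
# Illustrative sketch (not the paper's framework): a deterministic
# train/test split of benchmark queries, so that every learned optimizer
# is trained and evaluated on exactly the same query sets, independent
# of run order or random seeds.
import hashlib

def split(query_ids, test_fraction=0.2):
    """Assign each query to train or test by hashing its identifier.
    The assignment is stable across runs and implementations."""
    train, test = [], []
    for qid in query_ids:
        bucket = int(hashlib.sha256(qid.encode()).hexdigest(), 16) % 100
        (test if bucket < test_fraction * 100 else train).append(qid)
    return train, test

queries = [f"q{i}" for i in range(1, 114)]   # hypothetical query names
train_set, test_set = split(queries)
assert set(train_set).isdisjoint(test_set)
assert len(train_set) + len(test_set) == len(queries)
```

Hashing identifiers (rather than shuffling with a seed) makes the split independent of the language, library, and iteration order used by each competing system.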
Learning to Generate Posters of Scientific Papers
Researchers often summarize their work in the form of posters. Posters
provide a coherent and efficient way to convey core ideas from scientific
papers. Generating a good scientific poster, however, is a complex and
time-consuming cognitive task, since such posters need to be readable, informative,
and visually aesthetic. In this paper, for the first time, we study the
challenging problem of learning to generate posters from scientific papers. To
this end, a data-driven framework, that utilizes graphical models, is proposed.
Specifically, given content to display, the key elements of a good poster,
including panel layout and attributes of each panel, are learned and inferred
from data. Then, given inferred layout and attributes, composition of graphical
elements within each panel is synthesized. To learn and validate our model, we
collect and make public a Poster-Paper dataset, which consists of scientific
papers and corresponding posters with exhaustively labelled panels and
attributes. Qualitative and quantitative results indicate the effectiveness of
our approach.
Comment: in Proceedings of the 30th AAAI Conference on Artificial Intelligence
(AAAI'16), Phoenix, AZ, 201
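The idea that panel layout should be driven by the content to display can be shown with a deliberately simple sketch. The paper infers layouts and panel attributes with learned graphical models; here, panel heights are merely made proportional to invented section lengths:

```python
# Toy illustration only: the paper learns panel layouts with graphical
# models; this sketch just shows content-driven sizing, where a panel's
# share of the poster grows with its section's length. The section
# names and lengths below are made up.

def panel_heights(sections, poster_height=100.0):
    """Split a single-column poster's height among panels in
    proportion to each section's content length (in words)."""
    total = sum(length for _, length in sections)
    return {name: poster_height * length / total for name, length in sections}

sections = [("Intro", 120), ("Method", 360), ("Results", 240), ("Refs", 80)]
print(panel_heights(sections))
```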
Automated Software Testing of Relational Database Schemas
Relational databases are critical for many software systems, holding the most valuable data for organisations. Data engineers build relational databases using schemas to specify the structure of the data within a database and to define integrity constraints. These constraints protect the data's consistency and coherency, leading industry experts to recommend testing them.
Since manual schema testing is labour-intensive and error-prone, automated techniques enable the generation of test data. Although these generators are well-established and effective, they use default values and often produce many long, similar tests --- this decreases fault detection and increases regression testing time and testers' inspection effort. This raises the following questions: How effective is the optimised random generator at generating tests, and how does its fault detection compare to prior methods? What factors make tests understandable for testers? How can tests be reduced while maintaining effectiveness? How effectively do testers inspect differently reduced tests?
To answer these questions, the first contribution of this thesis is an empirical evaluation of a new optimised random generator against well-established methods. The second is a human study identifying the factors that make schema tests understandable. The third is an evaluation of a novel approach that reduces and merges tests against traditional reduction methods. Finally, a second human study examines testers' inspection effort with differently reduced tests.
The results show that the optimised random method efficiently generates effective tests compared to well-established methods. Testers reported that many NULLs and negative numbers are confusing, and that they prefer simple repetition of unimportant values and readable strings. Compared to traditional methods, the reduction technique with merging minimises tests most effectively while maintaining their fault-detection effectiveness. The merged tests showed an increase in inspection efficiency, with a slight decrease in accuracy, compared to only-reduced tests. Therefore, these techniques and investigations can help practitioners adopt these generators in practice.
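The test-reduction idea discussed above can be sketched as a greedy set cover over the integrity constraints each test exercises. The test and constraint names below are hypothetical, and the thesis's actual reduction-with-merging technique is more sophisticated than this sketch:

```python
# Hedged sketch of schema-test reduction: keep tests greedily until every
# integrity constraint is exercised by at least one kept test. The test
# suite and constraint names are invented for illustration.

def reduce_tests(tests):
    """tests: {test_name: set of constraints it exercises}.
    Greedy set cover: repeatedly keep the test covering the most
    still-uncovered constraints."""
    uncovered = set().union(*tests.values())
    kept = []
    while uncovered:
        best = max(tests, key=lambda t: len(tests[t] & uncovered))
        kept.append(best)
        uncovered -= tests[best]
    return kept

tests = {
    "t1": {"pk_users", "fk_orders"},
    "t2": {"pk_users"},
    "t3": {"check_age", "not_null_email"},
    "t4": {"fk_orders"},
}
print(reduce_tests(tests))   # a covering subset of the original suite
```

Greedy set cover keeps the reduced suite's constraint coverage identical to the original's, which is one way to "maintain effectiveness" while shrinking the number of tests a human must inspect.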
Duet: efficient and scalable hybriD neUral rElation undersTanding
Learned cardinality estimation methods have achieved high precision compared
to traditional methods. Among learned methods, query-driven approaches have
long faced the data and workload drift problem. Although both data-driven and
hybrid methods have been proposed to avoid this problem, even the state of the
art suffers from high training and estimation costs, limited scalability,
instability, and a long-tailed distribution problem on high-cardinality and
high-dimensional tables, which seriously affects the practical application of
learned cardinality estimators. In this paper, we prove that most of these
problems are directly caused by the widely used progressive sampling. We solve
this problem by introducing predicate information into the autoregressive
model and propose Duet, a stable, efficient, and scalable hybrid method that
estimates cardinality directly, without sampling or any non-differentiable
process. This not only reduces the inference complexity from O(n) to O(1)
compared to Naru and UAE but also achieves higher accuracy on high-cardinality
and high-dimensional tables. Experimental results show that Duet achieves all
the design goals above, is much more practical, and even has a lower inference
cost on CPU than most learned methods have on GPU.
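The sampling-free autoregressive factorization that such estimators build on can be illustrated on a toy table. Here exact empirical conditionals stand in for the learned model, and the table and predicates are invented; this is not Duet itself:

```python
# Toy illustration (not Duet): autoregressive cardinality estimation
# factorizes the selectivity of a conjunctive equality predicate as
#   P(country = v) * P(tier = w | country = v),
# evaluated in one pass per column with no progressive sampling. Exact
# empirical conditionals stand in for the learned model here.
from collections import Counter

rows = [("US", "gold"), ("US", "gold"), ("US", "silver"),
        ("DE", "gold"), ("DE", "bronze")]

def estimate(country, tier):
    # First factor: P(country = v)
    counts = Counter(r[0] for r in rows)
    p1 = counts[country] / len(rows)
    # Second factor: P(tier = w | country = v), conditioned on the
    # value already fixed for the previous column
    matching = [r for r in rows if r[0] == country]
    p2 = sum(r[1] == tier for r in matching) / len(matching)
    return p1 * p2 * len(rows)   # selectivity * table size

print(estimate("US", "gold"))    # ≈ 2.0, the true ("US", "gold") count
```

Because the conditionals here are exact, the estimate matches the true count; a learned model approximates these conditionals, and feeding the predicate into the model directly (rather than Monte-Carlo sampling values column by column) is what removes the sampling step.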