    Query Generation based on Generative Adversarial Networks

    Many problems in database systems, such as cardinality estimation, database testing, and optimizer tuning, require a large query load as data. However, it is often difficult to obtain a large number of real queries from users due to privacy restrictions or low frequency of database access. Query generation is one approach to solving this problem. Existing query generation methods, such as random generation and template-based generation, do not consider the relationship between the generated queries and existing queries, and may even generate semantically incorrect queries. In this paper, we propose a query generation framework based on generative adversarial networks (GAN) that generates query loads similar to a given query load. In our framework, we use a syntax parser to transform each query into a parse tree and traverse the tree to obtain the sequence of production rules corresponding to the query. The generator of the GAN takes a fixed prior distribution as input and outputs a query sequence, and the discriminator takes real queries and fake queries produced by the generator as input and outputs a gradient to guide the generator's learning. In addition, we add context-free grammar and semantic rules to the generation process, which ensures that the generated queries are syntactically and semantically correct. We conduct experiments on a real-world dataset, which show that our approach can generate new query loads with a distribution similar to that of a given query load, and that the generated queries are syntactically correct and free of semantic errors. The generated query loads are then used in a downstream task, and the results show a significant improvement in models trained with query loads expanded by our approach.
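
    As an illustration of the pipeline this abstract describes, the sketch below (not the authors' code; the toy grammar is invented purely for illustration) shows how a query corresponds to a preorder sequence of production rules under a context-free grammar, and how restricting each step to the grammar's alternatives guarantees that any emitted sequence decodes to a syntactically valid query. A learned generator would replace the uniform choice with a trained, grammar-masked distribution.

    import random

    # Toy context-free grammar: nonterminal -> list of alternative expansions.
    GRAMMAR = {
        "query": [["SELECT", "col", "FROM", "tab", "where"]],
        "where": [["WHERE", "col", "=", "val"], []],   # optional clause
        "col":   [["a"], ["b"]],
        "tab":   [["t1"], ["t2"]],
        "val":   [["0"], ["1"]],
    }

    def sample_rule_sequence(symbol="query", rng=random):
        """Expand `symbol` left to right, recording which production rule
        (nonterminal, alternative index) fires at each step. A real generator
        would replace rng.randrange with a learned, grammar-masked distribution."""
        rules, tokens = [], []
        def expand(sym):
            if sym not in GRAMMAR:          # terminal symbol: emit as-is
                tokens.append(sym)
                return
            idx = rng.randrange(len(GRAMMAR[sym]))  # grammar mask: only valid alternatives
            rules.append((sym, idx))
            for child in GRAMMAR[sym][idx]:
                expand(child)
        expand(symbol)
        return rules, " ".join(tokens)

    rules, sql = sample_rule_sequence()
    print(rules)   # e.g. [('query', 0), ('col', 1), ('tab', 0), ('where', 1)]
    print(sql)     # e.g. "SELECT b FROM t1"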

    06472 Abstracts Collection - XQuery Implementation Paradigms

    From 19.11.2006 to 22.11.2006, the Dagstuhl Seminar 06472 "XQuery Implementation Paradigms" was held at the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar, as well as abstracts of seminar results and ideas, are collected in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided where available.

    Is Your Learned Query Optimizer Behaving As You Expect? A Machine Learning Perspective

    The current boom of learned query optimizers (LQO) can be explained not only by the general, continuous improvement of deep learning (DL) methods but also by the straightforward formulation of query optimization (QOP) as a machine learning (ML) problem. The idea is often to replace dynamic programming approaches, widespread for solving QOP, with more powerful methods such as reinforcement learning. However, such a rapid "game change" in the field of QOP could not come without consequences: the parts of the ML pipeline other than predictive model development still have large improvement potential. For instance, different LQOs impose their own restrictions on how training data is generated from queries, use arbitrary train/validation approaches, and evaluate on their own choice of benchmark queries. In this paper, we attempt to standardize the ML pipeline for evaluating LQOs by introducing a new end-to-end benchmarking framework. Additionally, we guide the reader through each data science stage of the ML pipeline and provide novel insights from the ML perspective, considering the specifics of QOP. Finally, we perform a rigorous evaluation of existing LQOs, showing that PostgreSQL outperforms them in almost all experiments, depending on the train/test split.
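
    One concrete piece of the standardization this abstract argues for is a fixed, reproducible split of benchmark queries. The sketch below is a minimal example under stated assumptions (the 113-query count matches the Join Order Benchmark, used here only as an example; the split fractions are arbitrary): a seeded shuffle so that every LQO trains and evaluates on identical query sets.

    import random

    def split_workload(query_ids, seed=42, frac=(0.8, 0.1, 0.1)):
        """Deterministically shuffle and split query ids so every LQO is
        trained, validated, and tested on exactly the same queries."""
        assert abs(sum(frac) - 1.0) < 1e-9
        ids = sorted(query_ids)              # fixed order before shuffling
        random.Random(seed).shuffle(ids)     # seeded shuffle -> reproducible split
        n = len(ids)
        a, b = int(frac[0] * n), int((frac[0] + frac[1]) * n)
        return ids[:a], ids[a:b], ids[b:]

    train, val, test = split_workload([f"q{i}" for i in range(113)])
    print(len(train), len(val), len(test))   # 90 11 12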

    Learning to Generate Posters of Scientific Papers

    Researchers often summarize their work in the form of posters, which provide a coherent and efficient way to convey the core ideas of scientific papers. Generating a good scientific poster, however, is a complex and time-consuming cognitive task, since such posters need to be readable, informative, and visually aesthetic. In this paper, for the first time, we study the challenging problem of learning to generate posters from scientific papers. To this end, we propose a data-driven framework that utilizes graphical models. Specifically, given the content to display, the key elements of a good poster, including the panel layout and the attributes of each panel, are learned and inferred from data. Then, given the inferred layout and attributes, the composition of graphical elements within each panel is synthesized. To learn and validate our model, we collect and make public a Poster-Paper dataset, which consists of scientific papers and corresponding posters with exhaustively labelled panels and attributes. Qualitative and quantitative results indicate the effectiveness of our approach. (In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI'16), Phoenix, AZ, 2016.)
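
    The toy sketch below (hypothetical, and far simpler than the paper's graphical model) illustrates the kind of layout-attribute inference described above: each section receives a panel whose area is proportional to its content length, whereas the paper instead learns such layout attributes from the Poster-Paper dataset.

    def infer_panel_areas(sections, poster_area=1.0):
        """Assign each section a panel area proportional to its content length;
        a learned model would infer these attributes from data instead."""
        total = sum(words for _, words in sections)
        return {name: poster_area * words / total for name, words in sections}

    panels = infer_panel_areas([("Intro", 150), ("Method", 400),
                                ("Results", 300), ("References", 50)])
    print(panels)   # e.g. {'Intro': 0.1666..., 'Method': 0.4444..., ...}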

    Automated Software Testing of Relational Database Schemas

    Relational databases are critical for many software systems, holding an organisation's most valuable data. Data engineers build relational databases using schemas that specify the structure of the data within a database and define integrity constraints. These constraints protect the data's consistency and coherency, leading industry experts to recommend testing them. Since manual schema testing is labour-intensive and error-prone, automated techniques are used to generate test data. Although these generators are well established and effective, they use default values and often produce many long, similar tests, which decreases fault detection and increases regression testing time and testers' inspection effort. This raises the following questions: How effective is an optimised random generator at generating tests, and how does its fault detection compare to prior methods? What factors make tests understandable to testers? How can tests be reduced while maintaining effectiveness? How effectively do testers inspect differently reduced tests? To answer these questions, the first contribution of this thesis is an empirical evaluation of a new optimised random generator against well-established methods. The second is the identification of understandability factors of schema tests through a human study. The third is the evaluation of a novel approach that reduces and merges tests against traditional reduction methods. The final contribution studies testers' inspection effort on differently reduced tests through a human study. The results show that the optimised random method efficiently generates effective tests compared to well-established methods. Testers reported that many NULLs and negative numbers are confusing, and that they prefer simple repetition of unimportant values and readable strings. The reduction technique with merging is the most effective at minimising tests and producing efficient tests while maintaining effectiveness compared to traditional methods. The merged tests showed an increase in inspection efficiency with a slight decrease in accuracy compared to tests that were only reduced. These techniques and findings can help practitioners adopt such generators in practice.
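
    To make the object of study concrete, here is a minimal sketch of a schema test (hypothetical, using SQLite rather than the thesis tooling): INSERTs that should respectively satisfy and violate a schema's integrity constraints, with readable values like 'alice@example.com' reflecting the testers' reported preference over random strings.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY,"
                 " email TEXT NOT NULL UNIQUE)")

    conn.execute("INSERT INTO users VALUES (1, 'alice@example.com')")  # satisfies all constraints

    for bad in [(2, None),                     # violates NOT NULL
                (3, 'alice@example.com')]:     # violates UNIQUE
        try:
            conn.execute("INSERT INTO users VALUES (?, ?)", bad)
            print("FAIL: constraint not enforced for", bad)
        except sqlite3.IntegrityError as e:
            print("ok, rejected:", bad, "-", e)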

    Duet: efficient and scalable hybriD neUral rElation undersTanding

    Learned cardinality estimation methods have achieved high precision compared to traditional methods. Among learned methods, query-driven approaches have long faced the data and workload drift problem. Although both data-driven and hybrid methods have been proposed to avoid this problem, even the state of the art among them suffers from high training and estimation costs, limited scalability, instability, and a long-tailed distribution problem on high-cardinality and high-dimensional tables, which seriously affects the practical application of learned cardinality estimators. In this paper, we prove that most of these problems are directly caused by the widely used progressive sampling. We solve this problem by introducing predicate information into the autoregressive model and propose Duet, a stable, efficient, and scalable hybrid method that estimates cardinality directly, without sampling or any non-differentiable process. Duet not only reduces the inference complexity from O(n) to O(1) compared to Naru and UAE, but also achieves higher accuracy on high-cardinality and high-dimensional tables. Experimental results show that Duet achieves all of the design goals above, is much more practical, and even has a lower inference cost on CPU than most learned methods have on GPU.
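
    The toy sketch below (an assumed simplification, not Duet's implementation) contrasts the two estimation styles this abstract compares: Monte Carlo progressive sampling over an autoregressive factorization versus evaluating the factorization under the predicate directly. Here the "model" is an exact toy distribution over two binary columns so both paths are trivially checkable.

    import random

    # Toy two-column distribution: P(a) and P(b | a).
    P_A = {0: 0.5, 1: 0.5}
    P_B_GIVEN_A = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}

    def progressive_sampling(pred_a, pred_b, n=10_000, rng=random):
        """Sampling-style estimate: draw columns one at a time from the
        factorization and count samples that satisfy the predicate."""
        hits = 0
        for _ in range(n):
            a = 0 if rng.random() < P_A[0] else 1
            b = 0 if rng.random() < P_B_GIVEN_A[a][0] else 1
            hits += pred_a(a) and pred_b(b)
        return hits / n

    def direct(pred_a, pred_b):
        """Sampling-free estimate: marginalize the same factorization over
        exactly the assignments the predicate admits."""
        return sum(P_A[a] * P_B_GIVEN_A[a][b]
                   for a in P_A for b in (0, 1)
                   if pred_a(a) and pred_b(b))

    sel = direct(lambda a: a == 1, lambda b: b == 1)
    est = progressive_sampling(lambda a: a == 1, lambda b: b == 1)
    print(f"exact selectivity {sel:.3f}, sampled estimate {est:.3f}")  # ~0.400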