USING ORACLE QUERY PLAN FOR AUTOMATED ASSESSMENT OF SQL
In our second-year undergraduate database foundations module, students use the Oracle cloud environment to develop SQL (Structured Query Language) queries. Their first summative assessment for the module is a problem-solving exercise: they are expected to repair a partially developed SQL schema for a given scenario and then develop SQL queries based on that scenario. Marking this work can be laborious, error-prone and time-consuming, and it requires multiple sample solutions for each query to cater for all the solutions that students could produce. The aim of this paper is therefore to describe an early-stage prototype automated assessment system for marking the SQL queries developed by the students.
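The abstract does not say how the prototype uses query plans, so the following is only one possible reading, sketched under stated assumptions: a submission is accepted when the database compiles it to the same plan as a single reference query. SQLite's `EXPLAIN QUERY PLAN` stands in for Oracle's plan output here purely so the sketch runs without an Oracle instance; the `emp` table and the helper names `plan_of` and `same_plan` are hypothetical.

```python
import sqlite3

def plan_of(conn, query):
    """Return the plan steps SQLite reports for `query`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    # The fourth column of each row holds the textual description of the step.
    return [row[3] for row in rows]

def same_plan(conn, reference, submission):
    """Hypothetical marking check: do both queries compile to the same plan?"""
    try:
        return plan_of(conn, reference) == plan_of(conn, submission)
    except sqlite3.Error:
        # A query that does not compile cannot match the reference.
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL)"
)

reference = "SELECT name FROM emp WHERE dept = 'IT' AND salary > 10"
equivalent = "select name from emp where salary > 10 and dept = 'IT'"
different = "SELECT name FROM emp ORDER BY name"
broken = "SELEC name FROM emp"

for label, query in [("equivalent", equivalent), ("different", different), ("broken", broken)]:
    print(label, same_plan(conn, reference, query))
```

A plan comparison of this kind would let one reference query cover several textual variants of the same answer, which is the marking burden the abstract describes. Equal plans do not prove equal results in general, so a real marker would likely combine this with other checks.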
A Comprehensive Survey on Database Management System Fuzzing: Techniques, Taxonomy and Experimental Comparison
Database Management System (DBMS) fuzzing is an automated testing technique
aimed at detecting errors and vulnerabilities in DBMSs by generating, mutating,
and executing test cases. It not only reduces the time and cost of manual
testing but also enhances detection coverage, providing valuable assistance in
developing commercial DBMSs. Existing fuzzing surveys mainly focus on
general-purpose software. However, DBMSs are different from them in terms of
internal structure, input/output, and test objectives, requiring specialized
fuzzing strategies. Therefore, this paper focuses on DBMS fuzzing and provides
a comprehensive review and comparison of the methods in this field. We first
introduce the fundamental concepts. Then, we systematically define a general
fuzzing procedure and decompose and categorize existing methods. Furthermore,
we classify existing methods from the testing objective perspective, covering
various components in DBMSs. For representative works, more detailed
descriptions are provided to analyze their strengths and limitations. To
objectively evaluate the performance of each method, we present an open-source
DBMS fuzzing toolkit, OpenDBFuzz. Based on this toolkit, we conduct a detailed
experimental comparative analysis of existing methods and finally discuss
future research directions. Comment: 34 pages, 22 figures
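The generate, mutate and execute cycle mentioned in the abstract can be illustrated with a very small mutation-based loop. This is a generic sketch, not OpenDBFuzz or any surveyed method: the seed statements, the token list and the outcome categories are all assumptions, and an in-memory SQLite database is used only as a convenient target.

```python
import random
import sqlite3

# Hypothetical seed corpus and mutation vocabulary.
SEEDS = [
    "SELECT a, b FROM t WHERE a > 1",
    "SELECT count(*) FROM t GROUP BY b",
    "INSERT INTO t VALUES (3, 'z')",
]
TOKENS = ["SELECT", "FROM", "WHERE", "a", "b", "t", "1", "NULL", "*", ",", "(", ")", ">", "="]

def mutate(sql, rng):
    """Apply one random token-level mutation: replace, delete, or insert."""
    parts = sql.split()
    i = rng.randrange(len(parts))
    op = rng.choice(["replace", "delete", "insert"])
    if op == "replace":
        parts[i] = rng.choice(TOKENS)
    elif op == "delete" and len(parts) > 1:
        del parts[i]
    else:
        parts.insert(i, rng.choice(TOKENS))
    return " ".join(parts)

def fuzz(iterations, seed=0):
    """Run mutated statements against a fresh database and classify outcomes."""
    rng = random.Random(seed)
    outcomes = {"ok": 0, "rejected": 0, "suspicious": []}
    for _ in range(iterations):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
        conn.execute("INSERT INTO t VALUES (1, 'x'), (2, 'y')")
        case = mutate(rng.choice(SEEDS), rng)
        try:
            conn.execute(case).fetchall()
            outcomes["ok"] += 1
        except sqlite3.Error:
            # The engine rejected the input cleanly; this is expected behavior.
            outcomes["rejected"] += 1
        except Exception as exc:
            # Anything else is worth a closer look by a human.
            outcomes["suspicious"].append((case, repr(exc)))
        finally:
            conn.close()
    return outcomes

print(fuzz(200))
```

Real DBMS fuzzers differ mainly in how they generate syntactically and semantically valid inputs and in the test oracle they use; the loop above has only the weakest oracle, namely whether the engine fails in an unexpected way.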
Genie: A Generator of Natural Language Semantic Parsers for Virtual Assistant Commands
To understand diverse natural language commands, virtual assistants today are
trained with numerous labor-intensive, manually annotated sentences. This paper
presents a methodology and the Genie toolkit that can handle new compound
commands with significantly less manual effort. We advocate formalizing the
capability of virtual assistants with a Virtual Assistant Programming Language
(VAPL) and using a neural semantic parser to translate natural language into
VAPL code. Genie needs only a small realistic set of input sentences for
validating the neural model. Developers write templates to synthesize data;
Genie uses crowdsourced paraphrases and data augmentation, along with the
synthesized data, to train a semantic parser. We also propose design principles
that make VAPL languages amenable to natural language translation. We apply
these principles to revise ThingTalk, the language used by the Almond virtual
assistant. We use Genie to build the first semantic parser that can support
compound virtual assistant commands with unquoted free-form parameters. Genie
achieves a 62% accuracy on realistic user inputs. We demonstrate Genie's
generality by showing a 19% and 31% improvement over the previous state of the
art on a music skill, aggregate functions, and access control. Comment: To appear in PLDI 201
Automated Testing and Debugging for Big Data Analytics
The prevalence of big data analytics in almost every large-scale software system has generated a substantial push to build data-intensive scalable computing (DISC) frameworks such as Google MapReduce and Apache Spark that can fully harness the power of existing data centers. However, frameworks once used by domain experts are now being leveraged by data scientists, business analysts, and researchers. This shift in user demographics calls for immediate advancements in the development, debugging, and testing practices of big data applications, which are falling behind compared to the DISC framework design and implementation. In practice, big data applications often fail as users are unable to test all behaviors emerging from interleaving dataflow operators, user-defined functions, and framework's code. "Testing based on a random sample" rarely guarantees the reliability and "trial and error" and "print" debugging methods are expensive and time-consuming. Thus, the current practice of developing a big data application must be improved and the tools built to enhance the developer's productivity must adapt to the distinct characteristics of data-intensive scalable computing. By synthesizing ideas from software engineering and database systems, our hypothesis is that we can design effective and scalable testing and debugging algorithms for big data analytics without compromising the performance and efficiency of the underlying DISC framework. To design such techniques, we investigate how we can build interactive and responsive debugging primitives that significantly reduce the debugging time, yet do not pose much performance overhead on big data applications. Furthermore, we investigate how we can leverage data provenance techniques from databases and fault-isolation algorithms from software engineering to pinpoint the minimal subset of failure-inducing inputs efficiently. 
To improve the reliability of big data analytics, we investigate how we can abstract the semantics of dataflow operators and use them in tandem with the semantics of user-defined functions to generate a minimum set of synthetic test inputs capable of revealing more defects than the entire input dataset. To examine the first hypothesis, we introduce interactive, real-time debugging primitives for big data analytics through innovative and scalable debugging features such as simulated breakpoint, dynamic watchpoint, and crash culprit identification. Second, we design a new automated fault localization approach that combines insights from both the software engineering and database literature to bring delta debugging closer to reality in big data applications by leveraging data provenance and by constructing systems optimizations for debugging provenance queries. Lastly, we devise a new symbolic-execution-based white-box testing algorithm for big data applications that abstracts the implementation of dataflow operators using logical specifications instead of modeling their implementations and combines them with the semantics of any arbitrary user-defined function. We instantiate the idea of an interactive debugging algorithm as BigDebug, the idea of an automated debugging algorithm as BigSift, and the idea of symbolic execution-based testing as BigTest. Our investigation shows that the interactive debugging primitives can scale to terabytes: our record-level tracing incurs less than 25% overhead on average and provides up to 100% time saving compared to the baseline replay debugger. Second, we observe that by combining data provenance with delta debugging, we can identify the minimum faulty input in just under 30% of the original job execution time.
Lastly, we verify that by abstracting dataflow operators using logical specifications, we can efficiently generate the most concise test data suitable for local testing while revealing twice as many faults as prior approaches. Our investigations collectively demonstrate that developer productivity can be significantly improved through effective and scalable testing and debugging techniques for big data analytics, without impacting the DISC framework's performance. This dissertation affirms the feasibility of automated debugging and testing techniques for big data analytics, techniques that were previously considered infeasible for large-scale data processing.
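For readers unfamiliar with delta debugging, the sketch below shows the plain technique on a list of input records. It is not BigSift: the abstract describes combining delta debugging with data provenance and system-level optimizations, none of which appear here. The function name `ddmin`, the `job_fails` test and the sample records are illustrative assumptions.

```python
def ddmin(records, fails):
    """Shrink `records` to a smaller list on which `fails` still returns True."""
    assert fails(records), "the full input must reproduce the failure"
    n = 2  # current granularity: number of chunks to split into
    while len(records) >= 2:
        size = max(1, len(records) // n)
        chunks = [records[i:i + size] for i in range(0, len(records), size)]
        reduced = False
        for i, chunk in enumerate(chunks):
            complement = [r for j, c in enumerate(chunks) if j != i for r in c]
            if fails(chunk):
                # The failure is reproducible inside this chunk alone.
                records, n, reduced = chunk, 2, True
                break
            if fails(complement):
                # The failure survives removing this chunk.
                records, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(records):
                break  # already at single-record granularity
            n = min(len(records), n * 2)
    return records

def job_fails(records):
    """Hypothetical job: summing the records fails if any is not an integer."""
    try:
        sum(int(r) for r in records)
        return False
    except ValueError:
        return True

data = [str(i) for i in range(64)]
data[37] = "N/A"  # the single failure-inducing record
print(ddmin(data, job_fails))
```

Each call to `fails` corresponds to rerunning the job on a subset of the input, which is why reducing the number of reruns matters so much at big data scale.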
Modeling of query languages and applications in code refactoring and code optimization
The query containment problem is a very important computer science problem,
originally defined for relational queries. With the growing popularity of the SPARQL query
language, it became relevant and important in this new context, too. This thesis introduces
a new approach for solving this problem, based on a reduction to satisfiability in first order
logic. The approach covers containment under the RDF Schema entailment regime, and it can
deal with the subsumption relation, as a weaker form of containment. The thesis proves
soundness and completeness of the approach for a wide range of language constructs. It also
describes an implementation of the approach as an open source solver SPECS. The experimental
evaluation on relevant benchmarks shows that SPECS is efficient and, compared to
state-of-the-art solvers, gives more precise results in a shorter amount of time, while supporting
a larger fragment of SPARQL constructs. Query language modeling can also
be useful when refactoring database-driven applications, where simultaneous changes
that include both a query and a host language code are very common. These changes can
preserve the overall equivalence, without preserving equivalence of these two parts considered
separately. Because of the ability to guarantee the absence of differences in behavior between
two versions of the code, tools that automatically verify code equivalence have great benefits
for reliable software development. With this motivation, a custom first-order logic modeling
of SQL queries is developed and described in the thesis. It enables an automated approach
for reasoning about equivalence of C/C++ programs with embedded SQL. The approach is
implemented within a publicly available, open-source framework, SQLAV.
Learning to Map Natural Language to Executable Programs Over Databases
Natural language is a fundamental form of information and communication and is becoming the next frontier in computer interfaces. As the amount of data available online has increased exponentially, so has the need for Natural Language Interfaces (NLIs; in this thesis the abbreviation does not refer to natural language inference) that connect the data and the user through natural language, significantly improving the possibility and efficiency of information access for many users besides data experts. All consumer-facing software will one day have a dialogue interface, and this is the next vital leap in the evolution of search engines. Such intelligent dialogue systems should understand the meaning of language grounded in various contexts and generate effective language responses in different forms for information requests and human-computer communication. Developing these intelligent systems is challenging due to (1) limited benchmarks to drive advancements, (2) alignment mismatches between natural language and formal programs, (3) lack of trustworthiness and interpretability, (4) context dependencies in both human conversational interactions and the target programs, and (5) joint language understanding between dialog questions and NLI environments (e.g. databases and knowledge graphs). This dissertation presents several datasets, neural algorithms, and language models to address these challenges for developing deep learning technologies for conversational natural language interfaces (more specifically, NLIs to Databases or NLIDB). First, to drive advancements towards neural-based conversational NLIs, we design and propose several complex and cross-domain NLI benchmarks, along with introducing several datasets. These datasets enable training large, deep learning models. The evaluation is done on unseen databases (e.g., about course arrangement). Systems must generalize well to not only new SQL queries but also to unseen database schemas to perform well on these tasks.
Furthermore, in real-world applications, users often access information in a multi-turn interaction with the system by asking a sequence of related questions. The users may explicitly refer to or omit previously mentioned entities and constraints and may introduce refinements, additions, or substitutions to what has already been said. Therefore, some of these datasets require systems to model dialog dynamics and generate natural language explanations for user verification. The full dialogue interaction with the system's responses is also important as this supports clarifying ambiguous questions, verifying returned results, and notifying users of unanswerable or unrelated questions. A robust dialogue-based NLI system that can engage with users by forming its responses has thus become an increasingly necessary component for the query process. Moreover, this thesis presents the development of scalable algorithms designed to parse complex and sequential questions to formal programs (e.g., mapping questions to SQL queries that can execute against databases). We propose a novel neural model that utilizes type information from knowledge graphs to better understand rare entities and numbers in natural language questions. We also introduce a neural model based on syntax tree neural networks, which was the first methodology proposed for generating complex programs from language. Finally, language modeling creates contextualized vector representations of words by training a model to predict the next word given context words, which are the basis of deep learning for NLP. Recently, pre-trained language models such as BERT and RoBERTa have achieved tremendous success in many natural language processing tasks such as text understanding and reading comprehension. However, most language models are pre-trained only on free text such as Wikipedia articles and books.
Given that language in semantic parsing is usually related to some formal representations such as logic forms and SQL queries and has to be grounded in structural environments (e.g., databases), we propose better language models for NLIs by enforcing such compositional interpolation in them. To show they could better jointly understand dialog questions and NLI environments (e.g. databases and knowledge graphs), we show that these language models achieve new state-of-the-art results for seven representative tasks on semantic parsing, dialogue state tracking, and question answering. Also, our proposed pre-training method is much more effective than other prior work.
Democratizing Information Access through Low Overhead Systems
Despite its importance, accessing information in storage systems or raw data is challenging or impossible for most people due to the sheer amount and heterogeneity of data as well as the overheads and complexities of existing systems. In this thesis, we propose several approaches to improve on that and therefore democratize information access.
Data-driven and AI-based approaches make it possible to provide the necessary information access for many tasks at scale. Unfortunately, most existing approaches can only be built and used by IT experts and data scientists, and the current demand for data scientists is far from being met. Furthermore, applying them is expensive. To counter this, approaches with low overhead are needed, i.e., approaches that do not require large amounts of training data, manual annotation or extraction of information, or extensive computation. However, such systems still need to adapt to the special terminology of different domains and to the individual information needs of the users. Moreover, they should be usable without extensive training; we thus aim to create ready-to-use systems that provide intuitive or familiar ways for interaction, e.g., chatbot-like natural language input or graphical user interfaces.
In this thesis, we propose a number of contributions to three important subfields of data exploration and processing: Natural Language Interfaces for Data Access & Manipulation, Personalized Summarizations of Text Collections, and Information Extraction & Integration. These approaches allow data scientists, domain experts and end users to access and manipulate information in a quick and easy way.
First, we propose two natural language interfaces for data access and manipulation. Natural language is a useful alternative interface for relational databases, since it allows users to formulate complex questions without requiring knowledge of SQL. We propose an approach based on weak supervision that augments existing deep learning techniques in order to improve the performance of models for natural language to SQL translation. Moreover, we apply the idea to build a training pipeline for conversational agents (i.e., chatbot-like systems that allow users to interact with a database and perform actions like ticket booking). The pipeline uses weak supervision to generate the training data automatically from a relational database and its set of defined transactions. Our approach is data-aware, i.e., it leverages the data characteristics of the DB at runtime to optimize the dialogue flow and reduce necessary interactions.
Additionally, we complement this research by presenting a meta-study on the reproducibility and availability of natural language interfaces for databases (NLIDBs) for real-world applications, and a benchmark to evaluate the linguistic robustness of NLIDBs.
Second, we work on personalized summarization and its usage for data exploration. The central idea is to produce summaries that exactly cover the current information need of the users. By creating multiple summaries or shifting the focus during the interactive creation process, these summaries can be used to explore the contents of unknown text collections. We propose an approach to create such personalized summaries at interactive speed; this is achieved by carefully sampling from the inputs.
As part of our research on multi-document summarization, we noticed that there is a lack of diverse evaluation corpora for this task. We therefore present a framework that can be used to automatically create new summarization corpora, and we apply and validate it.
Third, we provide ways to democratize information extraction and integration. This becomes relevant when data is scattered across different sources and there is no tabular representation that already contains all information needed. Therefore, it might be necessary to integrate different structured sources, or to even extract the required information pieces from text collections first and then to organize them. To integrate existing structured data sources, we present and evaluate a novel end-to-end approach for schema matching based on neural embeddings.
Finally, we tackle the automatic creation of tables from text for situations where no suitable structured source to answer an information need is available. Our proposed approach can execute SQL-like queries on text collections in an ad-hoc manner, both to directly extract facts from text documents, and to produce aggregated tables stating information that is not explicitly mentioned in the documents. Our approach works by generalizing user feedback and therefore does not need domain-specific resources for the domain adaption. It runs at interactive speed even on commodity hardware.
Overall, our approaches can provide a quality level comparable to state-of-the-art approaches, but often at a fraction of the associated costs. In other fields, such as table extraction, we even provide functionality that is, to our knowledge, not covered by any generic tooling available to end users. There are still many interesting challenges to solve, and the recent rise of large language models has once more shifted what seems possible with regard to dealing with human language. Yet, we hope that our contributions provide a useful step towards democratization of information access.
Practical synthesis from real-world oracles
As software systems become increasingly heterogeneous, the ability of compilers to reason about an entire system has decreased. When components of a system are not implemented as traditional programs, but rather as specialised hardware, optimised architecture-specific libraries, or network services, the compiler is unable to cross these abstraction barriers and analyse the system as a whole.
If these components could be modelled or understood as programs, then the compiler would be able to reason about their behaviour without concern for their internal implementation details: a homogeneous view of the entire system would be afforded. However, such components rarely correspond to an original program. This means that to facilitate this homogeneous analysis, programmatic models of component behaviour must be learned or constructed automatically.
Constructing these models is an inductive program synthesis problem, albeit a challenging one that is largely beyond the ability of existing implementations. In order for the problem to be made tractable, information provided by the underlying context (i.e. the real component behaviour to be matched) must be integrated.
This thesis presents three program synthesis approaches that integrate contextual information to synthesise programmatic models for real, existing components. The first, Annote, exploits informally-encoded information about a component's interface (e.g. from documentation) by weaving that information into an extended type-and-attribute system for component interfaces. The second, Presyn, learns a pair of cooperating probabilistic models from prior syntheses, that aim to predict likely program structure based on a component's interface. Finally, Haze uses observations of common side-effects of component executions to bias the search for programs. These approaches are each evaluated against comparable synthesisers from the literature, on a set of benchmark problems derived from real components.
Learning models for component behaviour is only a partial solution; the compiler must also have some mechanism to use those models for program analysis and transformation. This thesis additionally proposes a novel mechanism for context-sensitive automatic API migration based on synthesised programmatic models, and evaluates the effectiveness of doing so on real application code.
In summary, this thesis proposes a new framing for program synthesis problems that target the behaviour of real components, and demonstrates three different potential approaches to synthesis in this spirit. The success of these approaches is evaluated against implementations from the literature, and their results are used to drive a novel API migration technique.