From natural language questions to SPARQL queries: a pattern-based approach
Linked Data knowledge bases are valuable sources of knowledge: they give insights, reveal facts about various relationships, and provide a large amount of metadata in well-structured form. Although the format of semantic information, namely RDF(S), is kept simple by representing each fact as a triple of subject, property and object, the knowledge is accessible only through SPARQL queries over the data. Question Answering (QA) systems therefore provide a user-friendly way to access any type of knowledge base and, especially for Linked Data sources, to gain insight into the semantic information. Since RDF(S) knowledge bases are usually structured in the same way and inherently provide semantic metadata about the contained information, we propose a novel approach that is independent of the underlying knowledge base. The main contribution of our approach is thus the simple replaceability of the underlying knowledge base. The algorithm is based on general question and query patterns and accesses the knowledge base only for the actual query generation and execution. This paper presents the proposed approach and an evaluation against state-of-the-art Linked Data approaches on common challenges of QA systems.
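The pattern-based idea in this abstract can be sketched in a few lines: a question pattern is matched against the input and its slots fill a SPARQL query template. The pattern, the regex, and the `dbr:`/`dbo:` prefixes below are illustrative assumptions, not the paper's actual pattern catalogue.

```python
import re
from typing import Optional

# Hypothetical pattern list: (question regex, SPARQL template).
# Double braces render as literal braces after str.format().
PATTERNS = [
    # "Who is the <property> of <entity>?" -> single-triple lookup
    (re.compile(r"who is the (?P<prop>\w+) of (?P<ent>[\w ]+)\??", re.I),
     "SELECT ?x WHERE {{ dbr:{ent} dbo:{prop} ?x }}"),
]

def question_to_sparql(question: str) -> Optional[str]:
    """Return a SPARQL query for the first matching pattern, else None."""
    for pattern, template in PATTERNS:
        m = pattern.match(question)
        if m:
            return template.format(prop=m.group("prop").lower(),
                                   ent=m.group("ent").strip().replace(" ", "_"))
    return None
```

Only the final `template.format` step would need to touch the knowledge base's vocabulary, which mirrors the abstract's claim that the knowledge base is accessed only for query generation and execution.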
Vamsa: Automated Provenance Tracking in Data Science Scripts
There has recently been a lot of research in the areas of fairness,
bias and explainability of machine learning (ML) models, due to the
self-evident or regulatory requirements of various ML applications. We make the following
observation: All of these approaches require a robust understanding of the
relationship between ML models and the data used to train them. In this work,
we introduce the ML provenance tracking problem: the fundamental idea is to
automatically track which columns in a dataset have been used to derive the
features/labels of an ML model. We discuss the challenges in capturing such
information in the context of Python, the most common language used by data
scientists. We then present Vamsa, a modular system that extracts provenance
from Python scripts without requiring any changes to the users' code. Using 26K
real data science scripts, we verify the effectiveness of Vamsa in terms of
coverage and performance. We also evaluate Vamsa's accuracy on a smaller
subset of manually labeled data. Our analysis shows that Vamsa's precision and
recall range from 90.4% to 99.1% and its latency is in the order of
milliseconds for average size scripts. Drawing from our experience in deploying
ML models in production, we also present an example in which Vamsa helps
automatically identify models that are affected by data corruption issues.
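The core task described above, tracking which dataset columns feed a model's features and labels, can be illustrated with a toy static analysis over a script's AST. This is not Vamsa's actual algorithm; the example script and the column-extraction rule are assumptions for illustration only.

```python
import ast

# Hypothetical data science script to analyze (Python 3.9+ AST layout).
SCRIPT = """
import pandas as pd
df = pd.read_csv("train.csv")
X = df[["age", "income"]]
y = df["defaulted"]
"""

def tracked_columns(source: str) -> set:
    """Collect column names used in df[...] subscript expressions."""
    cols = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Subscript):
            key = node.slice
            # df["col"] -> a single string constant
            if isinstance(key, ast.Constant) and isinstance(key.value, str):
                cols.add(key.value)
            # df[["col1", "col2"]] -> a list of string constants
            elif isinstance(key, ast.List):
                cols.update(e.value for e in key.elts
                            if isinstance(e, ast.Constant))
    return cols
```

A real system must additionally resolve which variables actually hold dataframes and which subscripts flow into the model's fit call, which is where the hard problems the abstract mentions arise.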
Explaining Queries over Web Tables to Non-Experts
Designing a reliable natural language (NL) interface for querying tables has
been a longtime goal of researchers in both the data management and natural
language processing (NLP) communities. Such an interface receives as input an
NL question, translates it into a formal query, executes the query and returns
the results. Errors in the translation process are not uncommon, and users
typically struggle to understand whether their query has been mapped correctly.
We address this problem by explaining the obtained formal queries to non-expert
users. Two methods for query explanations are presented: the first translates
queries into NL, while the second method provides a graphic representation of
the query cell-based provenance (in its execution on a given table). Our
solution augments a state-of-the-art NL interface over web tables, enhancing it
in both its training and deployment phase. Experiments, including a user study
conducted on Amazon Mechanical Turk, show that our solution improves both the
correctness and reliability of an NL interface.
Comment: Short paper version to appear in ICDE 201
Implementing Provenance Queries in Big Data Analytics Environments
The goal of this thesis is to adapt the techniques of why-, where- and how-provenance queries to environments that use not only simple queries such as selection, projection and join, but also OLAP operations and further machine learning algorithms. The exclusively extensional provenance answers are given by provenance polynomials as well as (minimal) witness bases. Extending the CHASE algorithm for databases with a BACKCHASE phase for evaluating provenance answers thus makes it possible to determine the CHASE inverse type (exact/relaxed/result-equivalent) of a given query.
Maximizing User Domain Expertise to Clarify Oblique Specifications of Relational Queries
While there is abundant access to data management technology today, working with data is still challenging for the average user. One common means of manipulating data is SQL on relational databases, but this requires knowledge of SQL as well as of the database's schema and contents. Consequently, previous work has proposed oblique query specification (OQS) methods such as natural language or programming-by-example to allow users to imprecisely specify their query intent. These methods, however, suffer from either low precision or low expressivity and, in addition, produce a list of candidate SQL queries, making it difficult for users to select their final target query.
My thesis is that OQS systems should maximize user domain expertise to triangulate the user's desired query. First, I demonstrate how to leverage previously issued SQL queries to improve the accuracy of natural language interfaces. Second, I propose a system allowing users to specify a query with both natural language and programming-by-example. Finally, I develop a system where users provide feedback on system-suggested tuples to select a SQL query from a set of candidate queries generated by an OQS system.
PhD thesis, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/155114/1/cjbaik_1.pd
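The tuple-feedback step in the final contribution can be sketched as candidate pruning: each user label on a suggested tuple eliminates the candidate queries that disagree with it. Modeling candidate SQL queries as Python predicates over rows is an assumption made here purely for illustration.

```python
# Toy table and hypothetical candidate queries produced by an OQS system,
# each modeled as a predicate deciding whether a row belongs in the result.
rows = [{"name": "ann", "age": 34}, {"name": "bob", "age": 19},
        {"name": "cid", "age": 42}]

candidates = {
    "age > 30":   lambda r: r["age"] > 30,
    "age > 18":   lambda r: r["age"] > 18,
    "name < 'c'": lambda r: r["name"] < "c",
}

def prune(cands, feedback):
    """Keep candidates that agree with every (row, wanted) user label."""
    return {name: q for name, q in cands.items()
            if all(q(row) == wanted for row, wanted in feedback)}
```

A single label can be highly discriminating: telling the system that bob (age 19) should NOT appear rules out both "age > 18" and "name < 'c'", leaving "age > 30" as the sole surviving candidate.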