Dataset Discovery in Data Lakes
Data analytics stands to benefit from the increasing availability of datasets
that are held without their conceptual relationships being explicitly known.
When collected, these datasets form a data lake from which, by processes like
data wrangling, specific target datasets can be constructed that enable
value-adding analytics. Given the potential vastness of such data lakes, the
issue arises of how to pull out of the lake those datasets that might
contribute to wrangling out a given target. We refer to this as the problem of
dataset discovery in data lakes, and this paper contributes an effective and
efficient solution to it. Our approach uses features of the values in a dataset
to construct hash-based indexes that map those features into a uniform distance
space. This makes it possible to define similarity distances between features
and to take those distances as measures of relatedness with respect to a
target table. Given the latter (and exemplar tuples), our approach returns the most
related tables in the lake. We provide a detailed description of the approach
and report on empirical results for two forms of relatedness (unionability and
joinability) comparing them with prior work, where pertinent, and showing
significant improvements in all of precision, recall, target coverage, and
indexing and discovery times.
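
The abstract does not spell out the hashing scheme, so the following is only a plausible sketch, assuming MinHash-style signatures over column values as the hash-based feature index; the function names and the toy lake are hypothetical, not the paper's code.

```python
import hashlib

NUM_HASHES = 64  # number of hash functions, i.e. signature length

def _hash(value: str, seed: int) -> int:
    """Deterministic seeded hash of a single column value."""
    digest = hashlib.md5(f"{seed}:{value}".encode()).hexdigest()
    return int(digest, 16)

def minhash_signature(values):
    """Fixed-length signature of a column: per seed, the minimum
    hash over all of the column's values."""
    return [min(_hash(v, seed) for v in values) for seed in range(NUM_HASHES)]

def estimated_similarity(sig_a, sig_b):
    """Fraction of agreeing signature slots; estimates the Jaccard
    similarity of the underlying value sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def most_related(target_values, lake, k=2):
    """Rank lake columns by estimated relatedness to a target column,
    given here as exemplar values."""
    target_sig = minhash_signature(target_values)
    scored = [(name, estimated_similarity(target_sig, minhash_signature(vals)))
              for name, vals in lake.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

if __name__ == "__main__":
    lake = {  # toy data lake: column name -> value set
        "t1.city":  {"London", "Paris", "Berlin", "Madrid"},
        "t2.name":  {"Alice", "Bob", "Carol"},
        "t3.place": {"London", "Berlin", "Rome", "Vienna"},
    }
    print(most_related({"London", "Paris", "Rome"}, lake))
```

A real index would bucket such signatures (e.g. via LSH banding) rather than scanning the lake linearly, which is what makes discovery efficient at scale.
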
Active entailment encoding for explanation tree construction using parsimonious generation of hard negatives
Entailment trees have been proposed to simulate the human reasoning process
of explanation generation in the context of open-domain textual question
answering. However, in practice, manually constructing these explanation trees
proves to be a laborious process that requires active human involvement. Given the
complexity of capturing the line of reasoning from a question to its answer, or
from a claim to its premises, the issue arises of how to assist the user in
efficiently constructing multi-level entailment trees given a large set of
available facts. In this paper, we frame the construction of entailment trees
as a sequence of active premise selection steps, i.e., for each intermediate
node in an explanation tree, the expert needs to annotate positive and negative
examples of premise facts from a large candidate list. We then iteratively
fine-tune pre-trained Transformer models with the resulting positive and
tightly controlled negative samples and aim to balance the encoding of semantic
relationships and explanatory entailment relationships. Experimental evaluation
confirms the measurable efficiency gains of the proposed active fine-tuning
method in facilitating entailment tree construction: up to 20% improvement in
explanatory premise selection when compared against several alternatives.
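
As a rough illustration of the active premise-selection loop, the sketch below substitutes a weighted lexical-overlap scorer for the fine-tuned Transformer encoder; the loop structure, the simulated annotator, and all names (select_candidates, update_weights) are hypothetical simplifications rather than the paper's method.

```python
def overlap_score(query, fact, weights):
    """Weighted token overlap; a crude stand-in for a fine-tuned
    Transformer entailment encoder."""
    shared = set(query.split()) & set(fact.split())
    return sum(weights.get(tok, 1.0) for tok in shared)

def select_candidates(query, facts, weights, k=3):
    """Rank candidate premises for one tree node; the top k are
    shown to the annotator."""
    return sorted(facts, key=lambda f: overlap_score(query, f, weights),
                  reverse=True)[:k]

def update_weights(weights, positives, hard_negatives, lr=0.5):
    """Toy analogue of fine-tuning: upweight tokens from accepted
    premises, downweight tokens from high-scoring rejects
    (the hard negatives)."""
    for fact in positives:
        for tok in fact.split():
            weights[tok] = weights.get(tok, 1.0) + lr
    for fact in hard_negatives:
        for tok in fact.split():
            weights[tok] = max(0.1, weights.get(tok, 1.0) - lr)
    return weights

if __name__ == "__main__":
    facts = ["plants need light", "plants make food from light",
             "light travels fast", "food gives energy"]
    query = "plants get energy from light"  # one intermediate tree node
    weights = {}
    for round_no in range(2):
        shown = select_candidates(query, facts, weights)
        positives = [f for f in shown if f.startswith("plants")]  # simulated annotator
        hard_negatives = [f for f in shown if f not in positives]
        weights = update_weights(weights, positives, hard_negatives)
        print(round_no, shown)
```

The key point the sketch preserves is that negatives are drawn from the top-ranked rejected candidates, so each fine-tuning round trains against the model's own hardest confusions.
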
Data context informed data wrangling
The process of preparing potentially large and complex data sets for further
analysis or manual examination is often called data wrangling. In classical
warehousing environments, the steps in such a process have been carried out
using Extract-Transform-Load platforms, with significant manual involvement in
specifying, configuring, or tuning many of them. Cost-effective data wrangling
processes need to ensure that data wrangling steps benefit from automation
wherever possible. In this paper, we define a methodology to fully automate an
end-to-end data wrangling process incorporating data context, which associates
portions of a target schema with potentially spurious extensional data of types
that are commonly available. Instance-based evidence together with data
profiling paves the way to inform automation in several steps within the
wrangling process, specifically, matching, mapping validation, value format
transformation, and data repair. The approach is evaluated with real estate
data, showing substantial improvements in the results of automated wrangling.
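
To make the instance-based matching step concrete, here is a minimal sketch, assuming the data context takes the form of reference value sets per target attribute and that matching evidence is value overlap; the names, the threshold, and the toy data are hypothetical.

```python
def context_match(source_columns, data_context, threshold=0.5):
    """Match each source column to the data-context attribute whose
    known values it overlaps most, if the overlap clears a threshold."""
    matches = {}
    for col_name, values in source_columns.items():
        best_attr, best_score = None, 0.0
        for attr, known_values in data_context.items():
            overlap = len(values & known_values) / max(len(values), 1)
            if overlap > best_score:
                best_attr, best_score = attr, overlap
        if best_score >= threshold:
            matches[col_name] = (best_attr, round(best_score, 2))
    return matches

if __name__ == "__main__":
    data_context = {  # hypothetical reference data, e.g. open address data
        "postcode": {"M1 1AA", "M2 2BB", "SW1A 1AA"},
        "city": {"Manchester", "London", "Leeds"},
    }
    source_columns = {  # unlabelled columns from a scraped source
        "colA": {"M1 1AA", "SW1A 1AA", "XX9 9XX"},
        "colB": {"London", "Leeds"},
    }
    print(context_match(source_columns, data_context))
```

The same overlap evidence that labels a column here is what, in the methodology, also supports downstream steps such as mapping validation and value format transformation.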