Feature Engineering for Knowledge Base Construction
Knowledge base construction (KBC) is the process of populating a knowledge
base, i.e., a relational database together with inference rules, with
information extracted from documents and structured sources. KBC blurs the
distinction between two traditional database problems, information extraction
and information integration. For the last several years, our group has been
building knowledge bases with scientific collaborators. Using our approach, we
have built knowledge bases that have comparable and sometimes better quality
than those constructed by human volunteers. In contrast to these knowledge
bases, which took experts a decade or more of human-years to construct, many of
our projects are constructed by a single graduate student.
Our approach to KBC is based on joint probabilistic inference and learning,
but we do not see inference as either a panacea or a magic bullet: inference is
a tool that allows us to be systematic in how we construct, debug, and improve
the quality of such systems. In addition, inference allows us to construct
these systems in a more loosely coupled way than traditional approaches. To
support this idea, we have built the DeepDive system, which has the design goal
of letting the user "think about features---not algorithms." We think of
DeepDive as declarative in that one specifies what they want but not how to get
it. We describe our approach with a focus on feature engineering, which we
argue is an understudied problem relative to its importance to end-to-end
quality.
Fonduer: Knowledge Base Construction from Richly Formatted Data
We focus on knowledge base construction (KBC) from richly formatted data. In
contrast to KBC from text or tabular data, KBC from richly formatted data aims
to extract relations conveyed jointly via textual, structural, tabular, and
visual expressions. We introduce Fonduer, a machine-learning-based KBC system
for richly formatted data. Fonduer presents a new data model that accounts for
three challenging characteristics of richly formatted data: (1) prevalent
document-level relations, (2) multimodality, and (3) data variety. Fonduer uses
a new deep-learning model to automatically capture the representation (i.e.,
features) needed to learn how to extract relations from richly formatted data.
Finally, Fonduer provides a new programming model that enables users to convert
domain expertise, based on multiple modalities of information, to meaningful
signals of supervision for training a KBC system. Fonduer-based KBC systems are
in production for a range of use cases, including at a major online retailer.
We compare Fonduer against state-of-the-art KBC approaches in four different
domains. We show that Fonduer achieves an average improvement of 41 F1 points
on the quality of the output knowledge base---and in some cases produces up to
1.87x the number of correct entries---compared to expert-curated public
knowledge bases. We also conduct a user study to assess the usability of
Fonduer's new programming model. We show that after using Fonduer for only 30
minutes, non-domain experts are able to design KBC systems that achieve on
average 23 F1 points higher quality than traditional machine-learning-based KBC
approaches.
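The idea of converting domain expertise from several modalities into supervision signals can be illustrated with a small sketch. This is not Fonduer's API: the candidate fields (`part_row`, `value_row`, `column_header`, `font_size`), the three rules, and the majority-vote combiner are all hypothetical, chosen only to show how textual, tabular, and visual cues might each cast a noisy vote on a candidate relation.

```python
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Each rule inspects one modality of a candidate (a plain dict with
# hypothetical field names) and votes POSITIVE, NEGATIVE, or ABSTAIN.

def lf_same_table_row(candidate):
    # Tabular cue: the two mentions sit in the same table row.
    return POSITIVE if candidate["part_row"] == candidate["value_row"] else ABSTAIN

def lf_header_mentions_current(candidate):
    # Textual cue: the column header names the quantity of interest.
    return POSITIVE if "current" in candidate["column_header"].lower() else ABSTAIN

def lf_small_font(candidate):
    # Visual cue: very small text (e.g. a footnote) is treated as unlikely.
    return NEGATIVE if candidate["font_size"] < 7 else ABSTAIN

LABELING_FUNCTIONS = [lf_same_table_row, lf_header_mentions_current, lf_small_font]

def majority_label(candidate):
    """Combine the non-abstaining votes by simple majority; ties abstain."""
    votes = [lf(candidate) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    pos, neg = counts[POSITIVE], counts[NEGATIVE]
    if pos == neg:
        return ABSTAIN
    return POSITIVE if pos > neg else NEGATIVE

example = {
    "part_row": 3,
    "value_row": 3,
    "column_header": "Collector Current (mA)",
    "font_size": 9,
}
label = majority_label(example)
```

Majority vote is the simplest possible combiner and is used here only for brevity; a real system would more plausibly learn how much to trust each rule.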
Web Table Extraction, Retrieval and Augmentation: A Survey
Tables are a powerful and popular tool for organizing and manipulating data.
A vast number of tables can be found on the Web, which represents a valuable
knowledge resource. The objective of this survey is to synthesize and present
two decades of research on web tables. In particular, we organize existing
literature into six main categories of information access tasks: table
extraction, table interpretation, table search, question answering, knowledge
base augmentation, and table augmentation. For each of these tasks, we identify
and describe seminal approaches, present relevant resources, and point out
interdependencies among the different tasks.
Comment: ACM Transactions on Intelligent Systems and Technology. 11(2):
Article 13, January 202
Understanding Tables in Context Using Standard NLP Toolkits
Tabular information in text documents contains a wealth of information, and so tables are a natural candidate for information extraction. There are many cues buried in both a table and its surrounding text that allow us to understand the meaning of the data in a table. We study how natural-language tools, such as part-of-speech tagging, dependency paths, and named-entity recognition, can be used to improve the quality of relation extraction from tables. In three domains we show that (1) a model that performs joint probabilistic inference across tabular and natural language features achieves an F1 score that is twice as high as either a pure-table or pure-text system, and (2) using only shallower features or non-joint inference results in lower quality.
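For readers unfamiliar with the F1 metric cited in these abstracts, the following is a minimal sketch of how it is commonly computed for extracted relations: the harmonic mean of precision and recall over sets of tuples. The function name and the toy tuples are illustrative assumptions and are not taken from any of the papers listed.

```python
def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over two collections of tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        # Also covers empty inputs, avoiding division by zero.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy data (made up): four gold relations, three predictions, two correct.
gold_relations = {("A", "1"), ("B", "2"), ("C", "3"), ("D", "4")}
predicted_relations = {("A", "1"), ("B", "2"), ("X", "9")}
score = f1_score(predicted_relations, gold_relations)
```

Here precision is 2/3 and recall is 1/2, so the score is 4/7, roughly 0.57 (57 F1 points on the 0 to 100 scale the abstracts use).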