Continually improving grounded natural language understanding through human-robot dialog
As robots become ubiquitous in homes and workplaces such as hospitals and factories, they must be able to communicate with humans. Several kinds of knowledge are required to understand and respond to a human's natural language commands and questions. If a person asks an assistant robot to "take me to Alice's office," the robot must know that Alice is a person who owns some unique office, and that "take me" means it should navigate there. Similarly, if a person requests "bring me the heavy, green mug," the robot must have accurate mental models of the physical concepts "heavy," "green," and "mug." To avoid forcing humans to use key phrases or words robots already know, this thesis focuses on helping robots understand new language constructs through interactions with humans and with the world around them. To understand a command in natural language, a robot must first convert that command to an internal representation that it can reason with. Semantic parsing is a method for performing this conversion, and the target representation is often a semantic form expressed in predicate logic with lambda calculus. Traditional semantic parsing relies on hand-crafted resources from a human expert: an ontology of concepts, a lexicon connecting language to those concepts, and training examples of language paired with abstract meanings. One thrust of this thesis is to perform semantic parsing with sparse initial data. We use conversations between a robot and human users to induce pairs of natural language utterances and the target semantic forms a robot discovers through its questions, reducing the annotation effort of creating training examples for parsing. We use this data to build more dialog-capable robots in new domains with much less expert human effort (Thomason et al., 2015; Padmakumar et al., 2017). Meanings of many language concepts are bound to the physical world.
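The idea of mapping a command to a lambda-calculus-style semantic form can be illustrated with a deliberately tiny sketch. The lexicon entries, predicate names, and the greedy longest-match strategy below are all invented for illustration; a real semantic parser (e.g. a CCG-based one, as in the thesis) induces its lexicon from data rather than hard-coding it.

```python
# Toy lexicon mapping phrases to semantic fragments. Entries and
# predicate names (navigate_to, office, owns) are illustrative only.
LEXICON = {
    "take me to": lambda arg: f"navigate_to({arg})",
    "alice's office": "the(y, office(y) and owns(alice, y))",
}

def parse(command: str) -> str:
    """Greedy phrase lookup, longest lexicon match first; unknown
    input falls through unchanged."""
    command = command.lower().strip()
    for phrase in sorted(LEXICON, key=len, reverse=True):
        if command.startswith(phrase):
            meaning = LEXICON[phrase]
            rest = command[len(phrase):].strip()
            return meaning(parse(rest)) if callable(meaning) else meaning
    return command

print(parse("Take me to Alice's office"))
# navigate_to(the(y, office(y) and owns(alice, y)))
```

The point is only the shape of the problem: surface language on one side, a compositional logical form the robot can reason with on the other.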
Understanding object properties and categories such as "heavy," "green," and "mug" requires interacting with and perceiving the physical world. Embodied robots can use manipulation capabilities such as pushing, picking up, and dropping objects to gather sensory data about them. This data can be used to understand non-visual concepts like "heavy" and "empty" (e.g., get the empty carton of milk from the fridge), and to assist with concepts that have both visual and non-visual expression (e.g., tall things look big and also exert force sooner than short things when pressed down on). A second thrust of this thesis focuses on strategies for learning these concepts using multi-modal sensory information. We use human-in-the-loop learning to gather labels connecting concept words to actual objects in the environment (Thomason et al., 2016, 2017). We also explore ways to tease apart polysemy and synonymy in concept words (Thomason and Mooney, 2017), such as "light," which can refer to a weight or a color, the latter sense being synonymous with "pale." Additionally, because pushing, picking up, and dropping objects to gather sensory information is prohibitively time-consuming, we investigate strategies for using linguistic information and human input to expedite exploration when learning a new concept (Thomason et al., 2018). Finally, we build an integrated agent with both parsing and perception capabilities that learns from conversations with users to improve both components over time. We demonstrate that parser learning from conversations (Thomason et al., 2015) can be combined with multi-modal perception (Thomason et al., 2016), using predicate-object labels gathered through opportunistic active learning (Thomason et al., 2017) during those conversations, to improve performance when understanding natural language commands from humans. Human users also qualitatively rate this integrated learning agent as more usable after it has improved through conversation-based learning.
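Learning a grounded classifier per concept word from human labels can be sketched in a few lines. This is not the thesis's implementation: the feature values, the choice of a nearest-centroid classifier, and the behavior-to-feature mapping are all invented for illustration; real features come from the robot's exploratory behaviors (lift yields haptics, look yields color, and so on).

```python
import statistics

def train_concept(labeled):
    """labeled: list of (feature_vector, is_positive) pairs for one
    concept word. Returns a nearest-centroid binary classifier."""
    pos = [f for f, y in labeled if y]
    neg = [f for f, y in labeled if not y]
    centroid = lambda vs: [statistics.mean(dim) for dim in zip(*vs)]
    cp, cn = centroid(pos), centroid(neg)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return lambda f: dist(f, cp) < dist(f, cn)

# "heavy": toy feature = [haptic force from lifting, visual height]
heavy = train_concept([
    ([9.0, 0.3], True), ([8.5, 0.2], True),   # heavy objects
    ([1.0, 0.4], False), ([1.5, 0.1], False), # light objects
])
print(heavy([8.0, 0.25]))  # True
```

The multi-modal angle is visible in the feature vector itself: one dimension is haptic and one is visual, so a non-visual concept like "heavy" can still be learned from the combination.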
MultiSubs: A Large-scale Multimodal and Multilingual Dataset
This paper introduces a large-scale multimodal and multilingual dataset that
aims to facilitate research on grounding words to images in their contextual
usage in language. The dataset consists of images selected to unambiguously
illustrate concepts expressed in sentences from movie subtitles. The dataset is
a valuable resource as (i) the images are aligned to text fragments rather than
whole sentences; (ii) multiple images are possible for a text fragment and a
sentence; (iii) the sentences are free-form and real-world like; (iv) the
parallel texts are multilingual. We set up a fill-in-the-blank game for humans
to evaluate the quality of the automatic image selection process of our
dataset. We show the utility of the dataset on two automatic tasks: (i)
fill-in-the-blank; (ii) lexical translation. Results of the human evaluation
and automatic models demonstrate that images can be a useful complement to the
textual context. The dataset will benefit research on visual grounding of words
especially in the context of free-form sentences, and can be obtained from
https://doi.org/10.5281/zenodo.5034604 under a Creative Commons licence.
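A minimal baseline for the fill-in-the-blank task can be sketched as a bigram model that predicts the blank from the word preceding it. The training sentences below are invented toy data, and this is only one simple baseline shape, not the paper's models.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count, for each word, which words follow it in training text."""
    following = defaultdict(Counter)
    for s in sentences:
        toks = s.lower().split()
        for a, b in zip(toks, toks[1:]):
            following[a][b] += 1
    return following

def fill_blank(model, sentence, blank="___"):
    """Predict the blank as the most frequent successor of the
    preceding word; None if that word was never seen."""
    toks = sentence.lower().split()
    prev = toks[toks.index(blank) - 1]
    if model[prev]:
        return model[prev].most_common(1)[0][0]
    return None

model = train_bigram(["pour the milk", "drink the milk", "read the book"])
print(fill_blank(model, "bring the ___"))  # milk
```

A purely textual baseline like this is exactly what the paper's image-augmented models are compared against: the images supply context the preceding word alone cannot.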
Advanced Semantics for Commonsense Knowledge Extraction
Commonsense knowledge (CSK) about concepts and their properties is useful for AI applications such as robust chatbots. Prior works like ConceptNet, TupleKB and others compiled large CSK collections, but are restricted in their expressiveness to subject-predicate-object (SPO) triples with simple concepts for S and monolithic strings for P and O. Also, these projects have either prioritized precision or recall, but hardly reconcile these complementary goals. This paper presents a methodology, called Ascent, to automatically build a large-scale knowledge base (KB) of CSK assertions, with advanced expressiveness and both better precision and recall than prior works. Ascent goes beyond triples by capturing composite concepts with subgroups and aspects, and by refining assertions with semantic facets. The latter are important to express temporal and spatial validity of assertions and further qualifiers. Ascent combines open information extraction with judicious cleaning using language models. Intrinsic evaluation shows the superior size and quality of the Ascent KB, and an extrinsic evaluation for QA-support tasks underlines the benefits of Ascent.
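What "beyond SPO triples" means structurally can be sketched with a small data model. This is not Ascent's actual schema; the field names, subgroup example, and facet keys are assumptions made purely to show the idea of refining a triple with subgroups and semantic facets.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Assertion:
    """Illustrative CSK assertion richer than a plain SPO triple."""
    subject: str                 # primary concept, e.g. "elephant"
    subgroup: Optional[str]      # refined subject, e.g. a subspecies
    predicate: str
    obj: str
    facets: dict = field(default_factory=dict)  # temporal/spatial/etc.

a = Assertion("elephant", "african elephant", "live in", "savannas",
              facets={"location": "Africa"})
print(a.predicate, a.facets["location"])
```

The facet dictionary is where qualifiers that a monolithic O string would swallow (where, when, under what condition) get their own addressable slots.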
English WordNet Taxonomic Random Walk Pseudo-Corpora
This is a resource description paper that describes the creation and properties of a set of pseudo-corpora generated artificially from a random walk over the English WordNet taxonomy. Our WordNet taxonomic random walk implementation allows the exploration of different random walk hyperparameters and the generation of a variety of different pseudo-corpora. We find that different combinations of the walk’s hyperparameters result in varying statistical properties of the generated pseudo-corpora. We have published a total of 81 pseudo-corpora that we have used in our previous research, but have not exhausted all possible combinations of hyperparameters, which is why we have also published a codebase that allows the generation of additional WordNet taxonomic pseudo-corpora as needed. Ultimately, such pseudo-corpora can be used to train taxonomic word embeddings, as a way of transferring taxonomic knowledge into a word embedding space.
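The core generation step can be sketched with a tiny hand-coded hypernym fragment standing in for the full English WordNet, which the paper's published codebase actually walks. The taxonomy below is invented, and this variant only ascends hypernym edges; the real implementation exposes more walk hyperparameters than the two shown (number of walks and walk length).

```python
import random

TAXONOMY = {  # child -> hypernym edges; invented toy fragment
    "dog": "canine", "cat": "feline",
    "canine": "carnivore", "feline": "carnivore",
    "carnivore": "mammal", "mammal": "animal",
}

def random_walks(n_walks, walk_length, rng):
    """Emit pseudo-sentences: each is a bounded upward walk from a
    randomly chosen node toward the taxonomy root."""
    corpus = []
    for _ in range(n_walks):
        node = rng.choice(list(TAXONOMY))
        walk = [node]
        for _ in range(walk_length - 1):
            if node not in TAXONOMY:   # reached the root
                break
            node = TAXONOMY[node]      # step to the hypernym
            walk.append(node)
        corpus.append(" ".join(walk))  # one pseudo-sentence
    return corpus

print(random_walks(2, 4, random.Random(0)))
```

Feeding such pseudo-sentences to an ordinary word-embedding trainer is what lets co-occurrence statistics encode taxonomic proximity instead of textual proximity.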
Uni- and Multimodal and Structured Representations for Modeling Frame Semantics
Language is the most complex kind of shared knowledge evolved by humankind and it is the foundation of communication between humans.
At the same time, one of the most challenging problems in Artificial Intelligence is to grasp the meaning conveyed by language.
Humans use language to communicate knowledge and information about the world and to exchange their thoughts.
In order to understand the meaning of words in a sentence, single words are interpreted in the context of the sentence and of the situation together with a large background of commonsense knowledge and experience in the world.
The research field of Natural Language Processing aims at automatically understanding language as humans do naturally.
In this thesis, the overall challenge of understanding meaning in language by capturing world knowledge is examined from the two branches of
(a) knowledge about situations and actions as expressed in texts and
(b) structured relational knowledge as stored in knowledge bases.
Both branches can be studied with different kinds of vector representations, so-called embeddings, for operationalizing different aspects of knowledge:
textual, structured, and visual or multimodal embeddings.
This poses the challenge of determining the suitability of different embeddings for automatic language understanding with respect to the two branches.
To approach these challenges, we choose to closely rely upon the lexical-semantic knowledge base FrameNet.
It addresses both branches of capturing world knowledge while taking into account the linguistic theory of frame semantics, which is oriented toward human language understanding.
FrameNet provides frames, which are categories for knowledge of meaning, and frame-to-frame relations, which are structured meta-knowledge of interactions between frames.
These frames and relations are central to the tasks of Frame Identification and Frame-to-Frame Relation Prediction.
Concerning branch (a), the task of Frame Identification was introduced to advance the understanding of context knowledge about situations, actions and participants.
The task is to label predicates with frames in order to identify the meaning of the predicate in the context of the sentence.
We use textual embeddings to model the semantics of words in the sentential context and develop a state-of-the-art system for Frame Identification.
Our Frame Identification system can be used to automatically annotate frames on English or German texts.
Furthermore, in our multimodal approach to Frame Identification, we combine textual embeddings for words with visual embeddings for entities depicted in images.
We find that visual information is especially useful in difficult settings with rare frames.
To further advance the performance of the multimodal approach, we suggest developing verb-specific embeddings that incorporate multimodal information.
Concerning branch (b), we introduce the task of Frame-to-Frame Relation Prediction to advance the understanding of relational knowledge of interactions between frames.
The task is to label connections between frames with relations in order to complete the meta-knowledge stored in FrameNet.
We train textual and structured embeddings for frames and explore the limitations of textual frame embeddings with respect to recovering relations between frames.
Moreover, we contrast textual frame embeddings with structured frame embeddings and develop the first system for Frame-to-Frame Relation Prediction.
We find that textual and structured frame embeddings differ with respect to predicting relations;
thus when applied as features in the context of further tasks, they can provide different kinds of frame knowledge.
Our structured prediction system can be used to generate recommendations for annotations with relations.
To further advance the performance of Frame-to-Frame Relation Prediction, and also the induction of new frames and relations, we suggest developing approaches that incorporate visual information.
The two kinds of frame knowledge from both branches, our Frame Identification system and our pre-trained frame embeddings, are combined in an extrinsic evaluation in the context of higher-level applications.
Across these applications, we see a trend that frame knowledge is particularly beneficial in ambiguous and short sentences.
Taken together, in this thesis, we approach semantic language understanding from the two branches of knowledge about situations and actions and structured relational knowledge, and investigate different embeddings for textual, structured, and multimodal language understanding.
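One common way to score relations between structured embeddings, sketched here as illustration, is a TransE-style translation score: a relation plausibly holds between frames f1 and f2 when emb(f1) + emb(rel) lands near emb(f2). This is one of several embedding methods, not necessarily the thesis's exact model, and the 2-d vectors below are toy values (FrameNet does contain the frames Commerce_buy and Getting and an Inheritance relation, but these embeddings are invented).

```python
def score(e_f1, e_rel, e_f2):
    """Negative squared translation error; higher = more plausible."""
    return -sum((a + r - b) ** 2 for a, r, b in zip(e_f1, e_rel, e_f2))

emb = {  # invented 2-d embeddings, chosen so the true edge scores 0
    "Commerce_buy": [0.0, 1.0],
    "Getting": [1.0, 1.0],
    "inherits_from": [1.0, 0.0],
}
s_true = score(emb["Commerce_buy"], emb["inherits_from"], emb["Getting"])
s_false = score(emb["Getting"], emb["inherits_from"], emb["Commerce_buy"])
print(s_true > s_false)  # True
```

Ranking candidate relations by such a score is what turns pre-trained frame embeddings into recommendations for relation annotations.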