Simple Learning and Compositional Application of Perceptually Grounded Word Meanings for Incremental Reference Resolution
Kennington C, Schlangen D. Simple Learning and Compositional Application of Perceptually Grounded Word Meanings for Incremental Reference Resolution. In: Proceedings of the Conference for the Association for Computational Linguistics (ACL). Association for Computational Linguistics; 2015: 292-301
Training an adaptive dialogue policy for interactive learning of visually grounded word meanings
We present a multi-modal dialogue system for interactive learning of
perceptually grounded word meanings from a human tutor. The system integrates
an incremental, semantic parsing/generation framework - Dynamic Syntax and Type
Theory with Records (DS-TTR) - with a set of visual classifiers that are
learned throughout the interaction and which ground the meaning representations
that it produces. We use this system in interaction with a simulated human
tutor to study the effects of different dialogue policies and capabilities on
the accuracy of learned meanings, learning rates, and efforts/costs to the
tutor. We show that the overall performance of the learning agent is affected
by (1) who takes initiative in the dialogues; (2) the ability to express/use
their confidence level about visual attributes; and (3) the ability to process
elliptical and incrementally constructed dialogue turns. Ultimately, we train
an adaptive dialogue policy which optimises the trade-off between classifier
accuracy and tutoring costs. Comment: 11 pages, SIGDIAL 2016 Conference
The BURCHAK corpus: a Challenge Data Set for Interactive Learning of Visually Grounded Word Meanings
We motivate and describe a new freely available human-human dialogue dataset
for interactive learning of visually grounded word meanings through ostensive
definition by a tutor to a learner. The data has been collected using a novel,
character-by-character variant of the DiET chat tool (Healey et al., 2003;
Mills and Healey, submitted) with a novel task, where a Learner needs to learn
invented visual attribute words (such as "burchak" for square) from a tutor.
As such, the text-based interactions closely resemble face-to-face conversation
and thus contain many of the linguistic phenomena encountered in natural,
spontaneous dialogue. These include self- and other-correction, mid-sentence
continuations, interruptions, overlaps, fillers, and hedges. We also present a
generic n-gram framework for building user (i.e. tutor) simulations from this
type of incremental data, which is freely available to researchers. We show
that the simulations produce outputs that are similar to the original data
(e.g. 78% turn match similarity). Finally, we train and evaluate a
Reinforcement Learning dialogue control agent for learning visually grounded
word meanings, trained from the BURCHAK corpus. The learned policy shows
comparable performance to a rule-based system built previously. Comment: 10 pages, THE 6TH WORKSHOP ON VISION AND LANGUAGE (VL'17)
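The abstract above mentions a generic n-gram framework for building tutor simulations but does not give its details. As a rough illustration only, the sketch below trains a bigram model on a few made-up tutor turns (not taken from the BURCHAK corpus) and samples a new turn from it. The names `train_bigrams` and `sample_turn` are hypothetical, and the real framework works on incremental, character-by-character data rather than whole tokenised turns.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"

def train_bigrams(turns):
    """Count how often each token follows another (hypothetical helper)."""
    counts = defaultdict(lambda: defaultdict(int))
    for turn in turns:
        tokens = [START] + turn.split() + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def sample_turn(counts, rng, max_len=20):
    """Sample one simulated turn, token by token, from the bigram counts."""
    token, out = START, []
    while len(out) < max_len:
        followers = counts[token]
        token = rng.choices(list(followers), weights=list(followers.values()))[0]
        if token == END:
            break
        out.append(token)
    return out

# Invented example turns, for illustration only.
toy_turns = ["this is a burchak", "what is this", "no this is a square"]
counts = train_bigrams(toy_turns)
simulated = sample_turn(counts, random.Random(0))
print(" ".join(simulated))
```

Because every sampled token is drawn from the followers observed in training, each adjacent pair in a simulated turn has been seen in the data; higher-order n-grams would trade variety for closer resemblance to the original turns.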
Incrementally resolving references in order to identify visually present objects in a situated dialogue setting
Kennington C. Incrementally resolving references in order to identify visually present objects in a situated dialogue setting. Bielefeld: Universität Bielefeld; 2016. The primary concern of this thesis is to model the resolution of spoken referring expressions
made in order to identify objects; in particular, everyday objects that can be perceived visually
and distinctly from other objects. The practical goal of such a model is for it to be implemented
as a component for use in a live, interactive, autonomous spoken dialogue system. The requirement of interaction imposes an added complication; one that has been ignored in previous
models and approaches to automatic reference resolution: the model must attempt to resolve
the reference incrementally as it unfolds, rather than waiting until the end of the referring expression to
begin the resolution process.
Beyond components in dialogue systems, reference has been a major player in the philosophy of meaning for more than a century. For example, Gottlob Frege (1892) distinguished
between Sinn (sense) and Bedeutung (reference), and discussed how they are related and how they
relate to the meaning of words and expressions. It has furthermore been argued (e.g., Dahlgren
(1976)) that reference to entities in the actual world is not just a fundamental notion of semantic theory, but the fundamental notion; for an individual acquiring a language, understanding
the meaning of many words and concepts is done via the task of reference, beginning in early
childhood. In this thesis, we pursue an account of word meaning that is based on perception of
objects; for example, the meaning of the word red is based on visual features that are selected
as distinguishing red objects from non-red ones.
This thesis proposes two statistical models of incremental reference resolution. Given
examples of referring expressions and visual aspects of the objects to which those expressions
referred, both model components learn a functional mapping between the words of the
referring expressions and the visual aspects. A generative model, the simple incremental update
model, presented in Chapter 5, uses a mediating variable to learn the mapping, whereas a
discriminative model, the words-as-classifiers model, presented in Chapter 6, learns the mapping
directly and improves over the generative model. Both models have been evaluated in various
reference resolution tasks to objects in virtual scenes as well as real, tangible objects. This
thesis shows that both models work robustly and are able to resolve referring expressions made
in reference to visually present objects despite realistic, noisy conditions of speech and object
recognition. A theoretical and practical comparison is also provided.
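To make the idea of a discriminative words-as-classifiers model more concrete, here is a minimal sketch under stated assumptions: each word gets its own logistic-regression classifier over an object's visual features, and a referring expression is resolved incrementally by multiplying the per-word probabilities for every candidate object and renormalising after each word. The feature layout, the toy training data, the combination rule, and all names (`WordClassifier`, `train`, `resolve_incrementally`) are illustrative assumptions, not the thesis implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class WordClassifier:
    """Binary logistic-regression classifier for one word (hypothetical helper)."""
    def __init__(self, n_features, lr=0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def prob(self, x):
        """Probability that the word applies to an object with features x."""
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def update(self, x, label):
        """One stochastic gradient step on the log-likelihood."""
        err = label - self.prob(x)
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b += self.lr * err

def train(examples, n_features, epochs=200):
    """examples: (word, feature_vector, label) triples; label 1 means the word fits."""
    models = {}
    for _ in range(epochs):
        for word, x, label in examples:
            models.setdefault(word, WordClassifier(n_features)).update(x, label)
    return models

def resolve_incrementally(models, words, scene):
    """Yield a normalised distribution over scene objects after each word."""
    scores = {name: 1.0 for name in scene}
    for word in words:
        if word in models:  # unknown words leave the distribution unchanged
            for name, x in scene.items():
                scores[name] *= models[word].prob(x)
        total = sum(scores.values())
        yield {name: s / total for name, s in scores.items()}

# Toy features (assumed): [redness, squareness].
examples = [
    ("red", [1, 0], 1), ("red", [1, 1], 1), ("red", [0, 1], 0), ("red", [0, 0], 0),
    ("square", [0, 1], 1), ("square", [1, 1], 1), ("square", [1, 0], 0), ("square", [0, 0], 0),
]
models = train(examples, n_features=2)
scene = {"a": [1, 0], "b": [1, 1], "c": [0, 1]}
dists = list(resolve_incrementally(models, ["red", "square"], scene))
best = max(dists[-1], key=dists[-1].get)
print(best)
```

Yielding a distribution after every word is what makes the resolution incremental: a dialogue system could act on an intermediate distribution before the referring expression has ended.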
Special emphasis is given to the discriminative model in this thesis because of its simplicity
and ability to represent word meanings. The learning and application of this model
give credence to the above claim that reference is the fundamental notion for semantic theory
and that the meanings of (visual) words are learned through experiencing referring expressions made
to objects that are visually perceivable.
Real-Time Understanding of Complex Discriminative Scene Descriptions
Manuvinakurike R, Kennington C, DeVault D, Schlangen D. Real-Time Understanding of Complex Discriminative Scene Descriptions. In: Proceedings of the 17th Annual SIGdial Meeting on Discourse and Dialogue. 2016
Learning to Interpret and Apply Multimodal Descriptions
Han T. Learning to Interpret and Apply Multimodal Descriptions. Bielefeld: Universität Bielefeld; 2018. Enabling computers to understand natural human communication is a goal researchers in artificial intelligence have long aspired to. Since the concept demonstration of “Put-That-There” in the 1980s, significant achievements have been made in developing multimodal interfaces that can process human communication such as speech, eye gaze, facial emotion, co-verbal hand gestures and pen input. State-of-the-art multimodal interfaces are able to process pointing gestures, symbolic gestures with conventional meanings, as well as gesture commands with pre-defined meanings (e.g., circling for “select”). However, in natural communication, co-verbal gestures/pen input rarely convey meanings via conventions or pre-defined rules, but embody meanings relatable to the accompanying speech.
For example, in route-giving tasks, people often describe landmarks verbally (e.g., two buildings), while demonstrating the relative position with two hands facing each other in the space. Interestingly, when the same gesture is accompanied by the utterance a ball, it may indicate the size of the ball. Hence, the interpretation of such co-verbal hand gestures largely depends on the accompanying verbal content. Similarly, when describing objects, while verbal utterances are most convenient for describing colour and category (e.g., a brown elephant), hand-drawn sketches are often deployed to convey iconic information such as the exact shape of the elephant’s trunk, which is typically difficult to encode in language.
This dissertation concerns the task of learning to interpret multimodal descriptions composed of verbal utterances and hand gestures/sketches, and to apply the corresponding interpretations to tasks such as image retrieval. Specifically, we aim to address the following research questions: 1) For co-verbal gestures that embody meanings relatable to the accompanying verbal content, how can we use natural language information to interpret the semantics of such co-verbal gestures, e.g., does a gesture indicate relative position or size? 2) As an integral system of communication, speech and gestures bear not only close semantic relations but also close temporal relations. To what degree and on which dimensions can hand gestures benefit the task of interpreting multimodal descriptions? 3) While it’s obvious that iconic information in hand-drawn sketches enriches verbal content in object descriptions, how can we model the joint contributions of such multimodal descriptions, and to what degree can verbal descriptions compensate for reduced iconic details in hand-drawn sketches?
To address the above questions, we first introduce three multimodal description corpora: a spatial description corpus composed of natural language and placing gestures (also referred to as abstract deictics), a multimodal object description corpus composed of natural language and hand-drawn sketches, and an existing corpus, the Bielefeld Speech and Gesture Alignment Corpus (SAGA).
We frame the problem of learning gesture semantics as a multi-label classification task using natural language information and hand gesture features. We conducted an experiment with the SAGA corpus. The results show that natural language is informative for the interpretation of hand gestures.
Furthermore, we describe a system that models the interpretation and application of spatial descriptions, and explore three variants of representing the verbal content. When representing the verbal content in the descriptions with a set of automatically learned symbols, the system’s performance is on par with representations with manually defined symbols (e.g., pre-defined object properties). We show that abstract deictic gestures not only lead to better understanding of spatial descriptions, but also result in earlier correct decisions of the system, which can be used to trigger immediate reactions in dialogue systems.
Finally, we investigate the interplay of semantics between symbolic (natural language) and iconic (sketches) modes in multimodal object descriptions, where natural language and sketches jointly contribute to the communication. We model the meaning of natural language and sketches with two existing models and combine the meanings from both modalities with a late fusion approach. The results show that even adding reduced sketches (30% of full sketches) can help in the retrieval task. Moreover, in the current setup, natural language descriptions can compensate for around 30% of reduced sketches.
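The abstract says the two modalities are combined with a late fusion approach but does not state the exact combination rule. As a hedged illustration, the sketch below assumes each modality has already produced a score per candidate image and fuses them with a weighted sum of normalised scores. The function names, the weight parameter, and the scores are all made up for the example.

```python
def normalise(scores):
    """Scale scores so that they sum to one."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}

def late_fusion(language_scores, sketch_scores, language_weight=0.5):
    """Combine per-candidate scores from two separately trained models.

    Assumption: a weighted sum of normalised scores; the source does not
    specify the actual fusion rule.
    """
    lang = normalise(language_scores)
    sketch = normalise(sketch_scores)
    return {
        k: language_weight * lang[k] + (1.0 - language_weight) * sketch[k]
        for k in lang
    }

# Invented scores for three candidate images.
language_scores = {"img1": 0.6, "img2": 0.3, "img3": 0.1}
sketch_scores = {"img1": 0.2, "img2": 0.7, "img3": 0.1}

fused = late_fusion(language_scores, sketch_scores)
best = max(fused, key=fused.get)
print(best)
```

In late fusion each modality is modelled independently and only the outputs are merged, so a degraded input in one modality (for example, a reduced sketch) lowers that modality's contribution without requiring the other model to be retrained.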
- …