197 research outputs found
A Discriminative Model for Perceptually-Grounded Incremental Reference Resolution
Kennington C, Dia L, Schlangen D. A Discriminative Model for Perceptually-Grounded Incremental Reference Resolution. In: Proceedings of the 11th International Conference on Computational Semantics (IWCS) 2015. 2015: 195-205.A large part of human communication involves referring to entities in the world, and often these entities are objects that are visually present for the interlocutors. A computer system that aims to resolve such references needs to tackle a complex task: objects and their visual features must be determined, the referring expressions must be recognised, extra-linguistic information such as eye gaze or pointing gestures must be incorporated --- and the intended connection between words and world must be reconstructed. In this paper, we introduce a discriminative model of reference resolution that processes incrementally (i.e., word for word), is perceptually-grounded, and improves when interpolated with information from gaze and pointing gestures. We evaluated our model and found that it performed robustly in a realistic reference resolution task, when compared to a generative model
Training an adaptive dialogue policy for interactive learning of visually grounded word meanings
We present a multi-modal dialogue system for interactive learning of
perceptually grounded word meanings from a human tutor. The system integrates
an incremental, semantic parsing/generation framework - Dynamic Syntax and Type
Theory with Records (DS-TTR) - with a set of visual classifiers that are
learned throughout the interaction and which ground the meaning representations
that it produces. We use this system in interaction with a simulated human
tutor to study the effects of different dialogue policies and capabilities on
the accuracy of learned meanings, learning rates, and efforts/costs to the
tutor. We show that the overall performance of the learning agent is affected
by (1) who takes initiative in the dialogues; (2) the ability to express/use
their confidence level about visual attributes; and (3) the ability to process
elliptical and incrementally constructed dialogue turns. Ultimately, we train
an adaptive dialogue policy which optimises the trade-off between classifier
accuracy and tutoring costs.Comment: 11 pages, SIGDIAL 2016 Conferenc
Towards an Indexical Model of Situated Language Comprehension for Cognitive Agents in Physical Worlds
We propose a computational model of situated language comprehension based on
the Indexical Hypothesis that generates meaning representations by translating
amodal linguistic symbols to modal representations of beliefs, knowledge, and
experience external to the linguistic system. This Indexical Model incorporates
multiple information sources, including perceptions, domain knowledge, and
short-term and long-term experiences during comprehension. We show that
exploiting diverse information sources can alleviate ambiguities that arise
from contextual use of underspecific referring expressions and unexpressed
argument alternations of verbs. The model is being used to support linguistic
interactions in Rosie, an agent implemented in Soar that learns from
instruction.Comment: Advances in Cognitive Systems 3 (2014
Resolving References in Visually-Grounded Dialogue via Text Generation
Vision-language models (VLMs) have shown to be effective at image retrieval
based on simple text queries, but text-image retrieval based on conversational
input remains a challenge. Consequently, if we want to use VLMs for reference
resolution in visually-grounded dialogue, the discourse processing capabilities
of these models need to be augmented. To address this issue, we propose
fine-tuning a causal large language model (LLM) to generate definite
descriptions that summarize coreferential information found in the linguistic
context of references. We then use a pretrained VLM to identify referents based
on the generated descriptions, zero-shot. We evaluate our approach on a
manually annotated dataset of visually-grounded dialogues and achieve results
that, on average, exceed the performance of the baselines we compare against.
Furthermore, we find that using referent descriptions based on larger context
windows has the potential to yield higher returns.Comment: Published at SIGDIAL 202
Simple Learning and Compositional Application of Perceptually Grounded Word Meanings for Incremental Reference Resolution
Kennington C, Schlangen D. Simple Learning and Compositional Application of Perceptually Grounded Word Meanings for Incremental Reference Resolution. In: Proceedings of the Conference for the Association for Computational Linguistics (ACL). Association for Computational Linguistics; 2015: 292-301
Incrementally resolving references in order to identify visually present objects in a situated dialogue setting
Kennington C. Incrementally resolving references in order to identify visually present objects in a situated dialogue setting. Bielefeld: Universität Bielefeld; 2016.The primary concern of this thesis is to model the resolution of spoken referring expressions
made in order to identify objects; in particular, everyday objects that can be perceived visually
and distinctly from other objects. The practical goal of such a model is for it to be implemented
as a component for use in a live, interactive, autonomous spoken dialogue system. The requirement of interaction imposes an added complication; one that has been ignored in previous
models and approaches to automatic reference resolution: the model must attempt to resolve
the reference incrementally as it unfolds–not wait until the end of the referring expression to
begin the resolution process.
Beyond components in dialogue systems, reference has been a major player in the philosophy of meaning for longer than a century. For example, Gottlob Frege (1892) has distinguished
between Sinn (sense) and Bedeutung (reference), discussed how they are related and how they
relate to the meaning of words and expressions. It has furthermore been argued (e.g., Dahlgren
(1976)) that reference to entities in the actual world is not just a fundamental notion of semantic theory, but the fundamental notion; for an individual acquiring a language, understanding
the meaning of many words and concepts is done via the task of reference, beginning in early
childhood. In this thesis, we pursue an account of word meaning that is based on perception of
objects; for example, the meaning of the word red is based on visual features that are selected
as distinguishing red objects from non-red ones.
This thesis proposes two statistical models of incremental reference resolution. Given ex-
amples of referring expressions and visual aspects of the objects to which those expressions
referred, both model components learn a functional mapping between the words of the refer-
ring expressions and the visual aspects. A generative model, the simple incremental update
model, presented in Chapter 5, uses a mediating variable to learn the mapping, whereas a dis-
criminative model, the words-as-classifiers model, presented in Chapter 6, learns the mapping
directly and improves over the generative model. Both models have been evaluated in various
reference resolution tasks to objects in virtual scenes as well as real, tangible objects. This
thesis shows that both models work robustly and are able to resolve referring expressions made
in reference to visually present objects despite realistic, noisy conditions of speech and object
recognition. A theoretical and practical comparison is also provided.
Special emphasis is given to the discriminative model in this thesis because of its simplicity
and ability to represent word meanings. It is in the learning and application of this model that
gives credence to the above claim that reference is the fundamental notion for semantic theory
and that meanings of (visual) words is done through experiencing referring expressions made
to objects that are visually perceivable
Learning to Interpret and Apply Multimodal Descriptions
Han T. Learning to Interpret and Apply Multimodal Descriptions. Bielefeld: Universität Bielefeld; 2018.Enabling computers to understand natural human communication is a goal researchers have been long aspired to in artificial intelligence. Since the concept demonstration of “Put-That- There” in 1980s, significant achievements have been made in developing multimodal interfaces that can process human communication such as speech, eye gaze, facial emotion, co-verbal hand gestures and pen input. State-of-the-art multimodal interfaces are able to process pointing gestures, symbolic gestures with conventional meanings, as well as gesture commands with pre-defined meanings (e.g., circling for “select”). However, in natural communication, co- verbal gestures/pen input rarely convey meanings via conventions or pre-defined rules, but embody meanings relatable to the accompanying speech.
For example, in route given tasks, people often describe landmarks verbally (e.g., two buildings), while demonstrating the relative position with two hands facing each other in the space. Interestingly, when the same gesture is accompanied by the utterance a ball, it may indicate the size of the ball. Hence, the interpretation of such co-verbal hand gestures largely depends on the accompanied verbal content. Similarly, when describing objects, while verbal utterances are most convenient for describing colour and category (e.g., a brown elephant), hand-drawn sketches are often deployed to convey iconic information such as the exact shape of the elephant’s trunk, which is typically difficult to encode in language.
This dissertation concerns the task of learning to interpret multimodal descriptions com- posed of verbal utterances and hand gestures/sketches, and apply corresponding interpretations to tasks such as image retrieval. Specifically, we aim to address following research questions: 1) For co-verbal gestures that embody meanings relatable to accompanied verbal content, how can we use natural language information to interpret the semantics of such co-verbal gestures, e.g., does a gesture indicate relative position or size? 2) As an integral system of commu- nication, speech and gestures not only bear close semantic relations, but also close temporal relations. To what degree and on which dimensions can hand gestures benefit the task of inter- preting multimodal descriptions? 3) While it’s obvious that iconic information in hand-drawn sketches enriches verbal content in object descriptions, how to model the joint contributions of such multimodal descriptions and to what degree can verbal descriptions compensate reduced iconic details in hand-drawn sketches?
To address the above questions, we first introduce three multimodal description corpora: a spatial description corpus composed of natural language and placing gestures (also referred as abstract deictics), a multimodal object description corpus composed of natural language and hand-drawn sketches, and an existing corpus - the Bielefeld Speech and Gesture Alignment Corpus (SAGA).
3
4
We frame the problem of learning gesture semantics as a multi-label classification task us- ing natural language information and hand gesture features. We conducted an experiment with the SAGA corpus. The results show that natural language is informative for the interpretation of hand gestures.
Further more, we describe a system that models the interpretation and application of spatial descriptions and explored three variants of representation methods of the verbal content. When representing the verbal content in the descriptions with a set of automatically learned symbols, the system’s performance is on par with representations with manually defined symbols (e.g., pre-defined object properties). We show that abstract deictic gestures not only lead to better understanding of spatial descriptions, but also result in earlier correct decisions of the system, which can be used to trigger immediate reactions in dialogue systems.
Finally, we investigate the interplay of semantics between symbolic (natural language) and iconic (sketches) modes in multimodal object descriptions, where natural language and sketches jointly contribute to the communications. We model the meaning of natural language and sketches two existing models and combine the meanings from both modalities with a late fusion approach. The results show that even adding reduced sketches (30% of full sketches) can help in the retrieval task. Moreover, in current setup, natural language descriptions can compensate around 30% of reduced sketches
A Discriminative Model for Polyphonic Piano Transcription
We present a discriminative model for polyphonic piano transcription. Support vector machines trained on spectral features are used to classify frame-level note instances. The classifier outputs are temporally constrained via hidden Markov models, and the proposed system is used to transcribe both synthesized and real piano recordings. A frame-level transcription accuracy of 68% was achieved on a newly generated test set, and direct comparisons to previous approaches are provided
- …