    Using parameterised semantics for speech-gesture integration

    Klein U, Rieser H, Hahn F, Lawler I. Using parameterised semantics for speech-gesture integration. Presented at the Investigating Semantics - Empirical and Philosophical Approaches, Bochum

    Cognitive Principles in Robust Multimodal Interpretation

    Multimodal conversational interfaces provide a natural means for users to communicate with computer systems through multiple modalities such as speech and gesture. To build effective multimodal interfaces, automated interpretation of user multimodal inputs is important. Inspired by the previous investigation on cognitive status in multimodal human machine interaction, we have developed a greedy algorithm for interpreting user referring expressions (i.e., multimodal reference resolution). This algorithm incorporates the cognitive principles of Conversational Implicature and Givenness Hierarchy and applies constraints from various sources (e.g., temporal, semantic, and contextual) to resolve references. Our empirical results have shown the advantage of this algorithm in efficiently resolving a variety of user references. Because of its simplicity and generality, this approach has the potential to improve the robustness of multimodal input interpretation
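The greedy resolution step described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the `Expression` and `Candidate` types, the scoring weights, and the `in_focus` flag (a loose stand-in for cognitive status) are all assumptions made for illustration. The sketch scores every expression/object pair against semantic, temporal, and contextual constraints, then commits greedily to the best remaining pair.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Expression:
    text: str
    sem_type: str   # semantic type the expression requires (hypothetical field)
    time: float     # time in seconds at which the expression was spoken


@dataclass(frozen=True)
class Candidate:
    obj_id: str
    sem_type: str
    gesture_time: Optional[float] = None  # None if the object was not gestured at
    in_focus: bool = False                # assumed proxy for cognitive status


def score(expr: Expression, cand: Candidate, max_gap: float = 2.0) -> float:
    """Combine semantic, temporal, and contextual constraints into one score."""
    if expr.sem_type != cand.sem_type:
        return 0.0  # semantic constraint treated as hard in this sketch
    s = 1.0
    if cand.gesture_time is not None:
        gap = abs(expr.time - cand.gesture_time)
        s += max(0.0, 1.0 - gap / max_gap)  # closer in time scores higher
    if cand.in_focus:
        s += 0.5  # contextual bonus (arbitrary illustrative weight)
    return s


def resolve(expressions, candidates):
    """Greedily assign each expression to at most one distinct candidate."""
    pairs = sorted(
        ((score(e, c), i, j)
         for i, e in enumerate(expressions)
         for j, c in enumerate(candidates)),
        key=lambda t: -t[0],
    )
    assigned, used = {}, set()
    for s, i, j in pairs:
        if s <= 0 or i in assigned or j in used:
            continue
        assigned[i] = j
        used.add(j)
    return {expressions[i].text: candidates[j].obj_id for i, j in assigned.items()}


expressions = [Expression("this house", "house", 1.0),
               Expression("the red house", "house", 3.0)]
candidates = [Candidate("h1", "house", gesture_time=1.1),
              Candidate("h2", "house", in_focus=True),
              Candidate("c1", "car", gesture_time=3.0)]
result = resolve(expressions, candidates)
```

A greedy pass like this is fast but not guaranteed to find the globally best assignment.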

    Unsupervised methods in multilingual and multimodal semantic modeling

    In the first part of this project, independent component analysis has been applied to extract word clusters from two Farsi corpora. Both word-document and word-context matrices have been considered to extract such clusters. The application of ICA on the word-document matrices extracted from these two corpora led to the detection of syntagmatic word clusters, while the utilization of word-context matrix resulted in the extraction of both syntagmatic and paradigmatic word clusters. Furthermore, we have discussed some potential benefits of this automatically extracted thesaurus. In such a thesaurus, a word is defined by some other words without being connected to the outer physical objects. In order to fill such a gap, symbol grounding has been proposed by philosophers as a mechanism which might connect words to their physical referents. From their point of view, if words are properly connected to their referents, their meaning might be realized. Once this objective is achieved, a new promising horizon would open in the realm of artificial intelligence. In the second part of the project, we have offered a simple but novel method for grounding words based on the features coming from the visual modality. Firstly, indexical grounding is implemented. In this naïve symbol grounding method, a word is characterized using video indexes as its context. Secondly, such indexical word vectors have been normalized according to the features calculated for motion videos. This multimodal fusion has been referred to as the pattern grounding. In addition, the indexical word vectors have been normalized using some randomly generated data instead of the original motion features. This third case was called randomized grounding. These three cases of symbol grounding have been compared in terms of the performance of translation.
Besides that, word clusters have been extracted by comparing the vector distances and from the dendrograms generated using an agglomerative hierarchical clustering method. We have observed that pattern grounding outperformed indexical grounding in the translation of the motion-annotated words, while randomized grounding degraded the translation significantly. Moreover, pattern grounding led to the formation of clusters in which each word fits semantically with the other members, whereas with indexical grounding some of the closely related words were dispersed into arbitrary clusters.
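As an illustration of the clustering step mentioned here, the following is a minimal pure-Python sketch of agglomerative hierarchical clustering with average linkage over word vectors. The choice of Euclidean distance and average linkage, and the toy vectors, are assumptions; the abstract does not state which distance or linkage criterion was used.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(vectors[a], vectors[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerate(vectors, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[word] for word in vectors]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], vectors)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]


# Toy word vectors (invented for illustration, not data from the thesis).
toy_vectors = {
    "run": [1.0, 0.1],
    "walk": [0.9, 0.2],
    "jump": [0.1, 1.0],
    "hop": [0.2, 0.9],
}
word_clusters = agglomerate(toy_vectors, 2)
```

Recording the merge order and distances instead of stopping at `n_clusters` would give the information needed to draw a dendrogram.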

    Proceedings of the Eindhoven FASTAR Days 2004 : Eindhoven, The Netherlands, September 3-4, 2004

    The Eindhoven FASTAR Days (EFD) 2004 were organized by the Software Construction group of the Department of Mathematics and Computer Science at the Technische Universiteit Eindhoven. On September 3rd and 4th 2004, over thirty participants, hailing from the Czech Republic, Finland, France, The Netherlands, Poland and South Africa, gathered at the Department to attend the EFD. The EFD were organized in connection with the research on finite automata by the FASTAR Research Group, which is centered in Eindhoven and at the University of Pretoria, South Africa. FASTAR (Finite Automata Systems: Theoretical and Applied Research) is an international research group that aims to lead in all areas related to finite state systems. The work in FASTAR includes both core and applied parts of this field. The EFD therefore focused on the field of finite automata, with an emphasis on practical aspects and applications. Eighteen presentations, mostly on subjects within this field, were given by researchers as well as students from participating universities and industrial research facilities. This report contains the proceedings of the conference, in the form of papers for twelve of the presentations at the EFD. Most of them were initially reviewed and distributed as handouts during the EFD. After the EFD took place, the papers were revised for publication in these proceedings. We would like to thank the participants for their attendance and presentations, making the EFD 2004 as successful as they were. Based on this success, it is our intention to make the EFD into a recurring event. Eindhoven, December 2004 Loek Cleophas Bruce W. Watso

    An intelligent multimodal interface for in-car communication systems

    In-car communication systems (ICCS) are becoming more frequently used by drivers. ICCS are used in order to minimise the driving distraction due to using a mobile phone while driving. Several usability studies of ICCS utilising speech user interfaces (SUIs) have identified usability issues that can affect the workload, performance, satisfaction and user experience of the driver. This is due to current speech technologies which can be a source of errors that may frustrate the driver and negatively affect the user experience. The aim of this research was to design a new multimodal interface that will manage the interaction between an ICCS and the driver. Unlike the current ICCS, it should make more voice input available, so as to support tasks (e.g. sending text messages; browsing the phone book, etc), which still require a cognitive workload from the driver. An adaptive multimodal interface was proposed in order to address current ICCS issues. The multimodal interface used both speech and manual input; however only the speech channel is used as output. This was done in order to minimise the visual distraction that graphical user interfaces or haptics devices can cause with current ICCS. The adaptive interface was designed to minimise the cognitive distraction of the driver. The adaptive interface ensures that whenever the distraction level of the driver is high, any information communication is postponed. After the design and the implementation of the first version of the prototype interface, called MIMI, a usability evaluation was conducted in order to identify any possible usability issues. Although voice dialling was found to be problematic, the results were encouraging in terms of performance, workload and user satisfaction. The suggestions received from the participants to improve the system usability were incorporated in the next implementation of MIMI. 
The adaptive module was then implemented to reduce driver distraction based on the driver's current context. The proposed architecture showed encouraging results in terms of usability and safety. The adaptive behaviour of MIMI significantly contributed to the reduction of cognitive distraction, because drivers received less information during difficult driving situations.
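The postponement behaviour described in this abstract can be sketched as a small queue that holds messages while the driver's distraction level is high. This is not MIMI's implementation: the class name `AdaptiveNotifier`, the numeric distraction scale from 0 to 1, and the 0.7 threshold are all assumptions for illustration.

```python
from collections import deque


class AdaptiveNotifier:
    """Hold outgoing messages while the estimated distraction level is high."""

    def __init__(self, threshold=0.7):
        self.threshold = threshold  # hypothetical cut-off on a 0..1 scale
        self.pending = deque()

    def submit(self, message, distraction):
        """Queue a message, then deliver whatever the current context allows."""
        self.pending.append(message)
        return self.flush(distraction)

    def flush(self, distraction):
        """Return all queued messages if distraction is below the threshold."""
        if distraction >= self.threshold:
            return []  # postpone: the driver is too busy
        delivered = list(self.pending)
        self.pending.clear()
        return delivered


notifier = AdaptiveNotifier(threshold=0.7)
during_overtake = notifier.submit("SMS from Ann", distraction=0.9)
still_busy = notifier.submit("Missed call", distraction=0.8)
on_quiet_road = notifier.flush(distraction=0.3)
```

Messages are released in arrival order once the level drops, so nothing is lost by postponing.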

    Toward Understanding Human Expression in Human-Robot Interaction

    Intelligent devices are quickly becoming necessities to support our activities during both work and play. We are already bound in a symbiotic relationship with these devices. An unfortunate effect of the pervasiveness of intelligent devices is the substantial investment of our time and effort to communicate intent. Even though our increasing reliance on these intelligent devices is inevitable, the limits of conventional methods for devices to perceive human expression hinder communication efficiency. These constraints restrict the usefulness of intelligent devices to support our activities. Our communication time and effort must be minimized to leverage the benefits of intelligent devices and seamlessly integrate them into society. Minimizing the time and effort needed to communicate our intent will allow us to concentrate on tasks in which we excel, including creative thought and problem solving. An intuitive method to minimize human communication effort with intelligent devices is to take advantage of our existing interpersonal communication experience. Recent advances in speech, hand gesture, and facial expression recognition provide alternate viable modes of communication that are more natural than conventional tactile interfaces. Use of natural human communication eliminates the need to adapt and invest time and effort using less intuitive techniques required for traditional keyboard and mouse based interfaces. Although the state of the art in natural but isolated modes of communication achieves impressive results, significant hurdles must be conquered before communication with devices in our daily lives will feel natural and effortless. Research has shown that combining information between multiple noise-prone modalities improves accuracy. Leveraging this complementary and redundant content will improve communication robustness and relax current unimodal limitations.
This research presents and evaluates a novel multimodal framework to help reduce the total human effort and time required to communicate with intelligent devices. This reduction is realized by determining human intent using a knowledge-based architecture that combines and leverages conflicting information available across multiple natural communication modes and modalities. The effectiveness of this approach is demonstrated using dynamic hand gestures and simple facial expressions characterizing basic emotions. It is important to note that the framework is not restricted to these two forms of communication. The framework presented in this research provides the flexibility necessary to include additional or alternate modalities and channels of information in future research, including improving the robustness of speech understanding. The primary contributions of this research include the leveraging of conflicts in a closed-loop multimodal framework, explicit use of uncertainty in knowledge representation and reasoning across multiple modalities, and a flexible approach for leveraging domain specific knowledge to help understand multimodal human expression. Experiments using a manually defined knowledge base demonstrate an improved average accuracy of individual concepts and an improved average accuracy of overall intents when leveraging conflicts as compared to an open-loop approach

    Meeting decision detection: multimodal information fusion for multi-party dialogue understanding

    Modern advances in multimedia and storage technologies have led to huge archives of human conversations in widely ranging areas. These archives offer a wealth of information in the organization contexts. However, retrieving and managing information in these archives is a time-consuming and labor-intensive task. Previous research applied keyword and computer vision-based methods to do this. However, spontaneous conversations, complex in the use of multimodal cues and intricate in the interactions between multiple speakers, have posed new challenges to these methods. We need new techniques that can leverage the information hidden in multiple communication modalities – including not just “what” the speakers say but also “how” they express themselves and interact with others. In responding to this need, the thesis inquires into the multimodal nature of meeting dialogues and computational means to retrieve and manage the recorded meeting information. In particular, this thesis develops the Meeting Decision Detector (MDD) to detect and track decisions, one of the most important outcomes of the meetings. The MDD involves not only the generation of extractive summaries pertaining to the decisions (“decision detection”), but also the organization of a continuous stream of meeting speech into locally coherent segments (“discourse segmentation”). This inquiry starts with a corpus analysis which constitutes a comprehensive empirical study of the decision-indicative and segment-signalling cues in the meeting corpora. These cues are uncovered from a variety of communication modalities, including the words spoken, gesture and head movements, pitch and energy level, rate of speech, pauses, and use of subjective terms. While some of the cues match the previous findings of speech segmentation, some others have not been studied before. The analysis also provides empirical grounding for computing features and integrating them into a computational model. 
To handle the high-dimensional multimodal feature space in the meeting domain, this thesis compares empirically feature discriminability and feature pattern finding criteria. As the different knowledge sources are expected to capture different types of features, the thesis also experiments with methods that can harness synergy between the multiple knowledge sources. The problem formalization and the modeling algorithm so far correspond to an optimal setting: an off-line, post-meeting analysis scenario. However, ultimately the MDD is expected to be operated online – right after a meeting, or when a meeting is still in progress. Thus this thesis also explores techniques that help relax the optimal setting, especially those using only features that can be generated with a higher degree of automation. Empirically motivated experiments are designed to handle the corresponding performance degradation. Finally, with the users in mind, this thesis evaluates the use of query-focused summaries in a decision debriefing task, which is common in the organization context. The decision-focused extracts (which represent compressions of 1%) are compared against the general-purpose extractive summaries (which represent compressions of 10-40%). To examine the effect of model automation on the debriefing task, this evaluation experiments with three versions of decision-focused extracts, each relaxing one manual annotation constraint. Task performance is measured in actual task effectiveness, user-generated report quality, and user-perceived success. The users’ clicking behaviors are also recorded and analyzed to understand how the users leverage the different versions of extractive summaries to produce abstractive summaries. The analysis framework and computational means developed in this work are expected to be useful for the creation of other dialogue understanding applications, especially those that need to uncover the implicit semantics of meeting dialogues.

    Alignment of speech and co-speech gesture in a constraint-based grammar

    This thesis concerns the form-meaning mapping of multimodal communicative actions consisting of speech signals and improvised co-speech gestures, produced spontaneously with the hand. The interaction between speech and speech-accompanying gestures has been standardly addressed from a cognitive perspective to establish the underlying cognitive mechanisms for the synchronous speech and gesture production, and also from a computational perspective to build computer systems that communicate through multiple modalities. Based on the findings of this previous research, we advance a new theory in which the mapping from the form of the combined speech-and-gesture signal to its meaning is analysed in a constraint-based multimodal grammar. We propose several construction rules about multimodal well-formedness that we motivate empirically from an extensive and detailed corpus study. In particular, the construction rules use the prosody, syntax and semantics of speech, the form and meaning of the gesture signal, as well as the temporal performance of the speech relative to the temporal performance of the gesture to constrain the derivation of a single multimodal syntax tree which in turn determines a meaning representation via standard mechanisms for semantic composition. Gestural form often underspecifies its meaning, and so the output of our grammar is underspecified logical formulae that support the range of possible interpretations of the multimodal act in its final context-of-use, given the current models of the semantics/pragmatics interface. It is standardly held in the gesture community that the co-expressivity of speech and gesture is determined on the basis of their temporal co-occurrence: that is, a gesture signal is semantically related to the speech signal that happened at the same time as the gesture.
Whereas this is usually taken for granted, we propose a methodology of establishing in a systematic and domain-independent way which spoken element(s) gesture can be semantically related to, based on their form, so as to yield a meaning representation that supports the intended interpretation(s) in context. The ‘semantic’ alignment of speech and gesture is thus driven not from the temporal co-occurrence alone, but also from the linguistic properties of the speech signal gesture overlaps with. In so doing, we contribute a fine-grained system for articulating the form-meaning mapping of multimodal actions that uses standard methods from linguistics. We show that just as language exhibits ambiguity in both form and meaning, so do multimodal actions: for instance, the integration of gesture is not restricted to a unique speech phrase but rather speech and gesture can be aligned in multiple multimodal syntax trees thus yielding distinct meaning representations. These multiple mappings stem from the fact that the meaning as derived from gesture form is highly incomplete even in context. An overall challenge is thus to account for the range of possible interpretations of the multimodal action in context using standard methods from linguistics for syntactic derivation and semantic composition

    Finite-state Multimodal Parsing and Understanding

    Multimodal interfaces require effective parsing and understanding of utterances whose content is distributed across multiple input modes. Johnston 1998 presents an approach in which strategies for multimodal integration are stated declaratively using a unification-based grammar that is used by a multidimensional chart parser to compose inputs. This approach is highly expressive and supports a broad class of interfaces, but offers only limited potential for mutual compensation among the input modes, is subject to significant concerns in terms of computational complexity, and complicates selection among alternative multimodal interpretations of the input. In this paper, we present an alternative approach in which multimodal parsing and understanding are achieved using a weighted finite-state device which takes speech and gesture streams as inputs and outputs their joint interpretation. This approach is significantly more efficient, enables tight-coupling of multimodal understanding with ..
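The idea of a weighted finite-state device that reads speech and gesture streams and emits a joint interpretation can be illustrated with a toy sketch. It is not the authors' construction: the transition table, the symbol names (`G_person`, `G_org`), the weights, and the use of Dijkstra's algorithm to find the cheapest accepting path are all assumptions made for illustration.

```python
import heapq

EPS = None  # epsilon: the arc consumes nothing from that tape


def best_interpretation(transitions, start, finals, speech, gesture):
    """Cheapest path that consumes both input tapes and ends in a final state.

    transitions maps (state, speech_symbol, gesture_symbol) to a list of
    (next_state, output_symbol_or_None, non_negative_weight) arcs.
    """
    heap = [(0.0, start, 0, 0, ())]
    seen = set()
    while heap:
        cost, state, i, j, out = heapq.heappop(heap)
        if (state, i, j) in seen:
            continue
        seen.add((state, i, j))
        if state in finals and i == len(speech) and j == len(gesture):
            return cost, list(out)
        for (src, s_sym, g_sym), arcs in transitions.items():
            if src != state:
                continue
            if s_sym is not EPS and (i >= len(speech) or speech[i] != s_sym):
                continue
            if g_sym is not EPS and (j >= len(gesture) or gesture[j] != g_sym):
                continue
            ni = i + (1 if s_sym is not EPS else 0)
            nj = j + (1 if g_sym is not EPS else 0)
            for nxt, output, weight in arcs:
                new_out = out + ((output,) if output is not None else ())
                heapq.heappush(heap, (cost + weight, nxt, ni, nj, new_out))
    return None  # no joint interpretation accepted


# Hypothetical grammar: "email this person/organization" plus one pointing gesture.
toy_transitions = {
    (0, "email", EPS): [(1, "email", 0.0)],
    (1, "this", EPS): [(2, None, 0.0)],
    (2, "person", "G_person"): [(3, "person", 0.1)],
    (2, "organization", "G_org"): [(3, "org", 0.1)],
}
parsed = best_interpretation(
    toy_transitions, 0, {3}, ["email", "this", "person"], ["G_person"]
)
```

In this toy grammar, "person" can only be consumed together with a `G_person` gesture, so each mode constrains how the other is interpreted.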