Hey, vitrivr! - A Multimodal UI for Video Retrieval
In this paper, we present a multimodal web-based user interface for the vitrivr system. vitrivr is a modern, open-source video retrieval system for searching large video collections using a great variety of query modes, including query-by-sketch, query-by-example, and query-by-motion. With the multimodal user interface, prospective users benefit from being able to interact naturally with the vitrivr system using spoken commands, as well as multimodal commands that combine spoken instructions with manual pointing. While the main strength of the UI is the seamless combination of speech-based and sketch-based interaction for multimedia similarity search, the speech modality has proven very effective for retrieval on its own. In particular, it helps overcome accessibility barriers and offers retrieval functionality to users with disabilities. Finally, for a holistic natural experience with the vitrivr system, we have integrated a speech synthesis engine that returns spoken answers to the user.
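The combination of a spoken instruction with manual pointing described above can be illustrated with a minimal sketch. This is not vitrivr's actual API; the item layout, function names, and the nearest-item resolution strategy are all assumptions made for illustration:

```python
# Illustrative sketch (not vitrivr's implementation) of fusing a spoken
# command with a pointing gesture: a deictic word such as "this" is
# resolved to the on-screen item nearest the pointer position at the
# time of the utterance.

ITEMS = {"clip_a": (100, 100), "clip_b": (400, 300)}  # hypothetical screen layout

def resolve_deictic(command: str, pointer_xy: tuple) -> str:
    """Replace 'this' in a spoken command with the item closest to the pointer."""
    nearest = min(
        ITEMS,
        key=lambda k: (ITEMS[k][0] - pointer_xy[0]) ** 2
                    + (ITEMS[k][1] - pointer_xy[1]) ** 2,
    )
    return command.replace("this", nearest)

print(resolve_deictic("find videos similar to this", (390, 310)))
# -> find videos similar to clip_b
```

The design point is that neither modality alone suffices: the speech carries the intent ("find similar videos"), while the gesture supplies the referent the words leave open.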
Listening while Speaking and Visualizing: Improving ASR through Multimodal Chain
Previously, a machine speech chain, based on sequence-to-sequence deep learning, was proposed to mimic speech perception and production behavior. The chain processed listening and speaking separately through automatic speech recognition (ASR) and text-to-speech synthesis (TTS), while enabling the two to teach each other in semi-supervised learning when they received unpaired data. Unfortunately, that speech chain study was limited to the speech and textual modalities, whereas natural communication is actually multimodal and involves both the auditory and visual sensory systems. Although the speech chain reduces the requirement of having a full amount of paired data, it still needs a large amount of unpaired data. In this research, we take a further step and construct a multimodal chain, designing a closely knit architecture that combines ASR, TTS, image captioning, and image production models into a single framework. The framework allows each component to be trained without requiring a large amount of parallel multimodal data. Our experimental results also show that an ASR can be further trained without speech and text data, and that cross-modal data augmentation remains possible through our proposed chain, which improves ASR performance.

Comment: Accepted in IEEE ASRU 201
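The core speech-chain idea, where ASR and TTS supervise each other on unpaired data, can be sketched in a toy form. Here "speech" is a list of integers, "text" a string, and the two models are inverse lookup tables, so the chain closes exactly; the function names and loss are illustrative assumptions, not the paper's implementation:

```python
# Toy sketch of the speech-chain semi-supervised step (illustrative only).
# On an unpaired speech sample, ASR produces a pseudo-transcript, TTS tries
# to reconstruct the original speech from it, and the mismatch between the
# input and the reconstruction serves as the training signal.

ASR_TABLE = {1: "a", 2: "b", 3: "c"}
TTS_TABLE = {v: k for k, v in ASR_TABLE.items()}

def asr(speech):   # listening: speech -> text
    return "".join(ASR_TABLE[s] for s in speech)

def tts(text):     # speaking: text -> speech
    return [TTS_TABLE[ch] for ch in text]

def reconstruction_loss(a, b):
    """Hamming distance between two sequences of equal length."""
    return sum(x != y for x, y in zip(a, b))

def chain_step(speech):
    """One unpaired-data step: close the listen->speak loop and score it."""
    text_hat = asr(speech)          # pseudo-label from the listener
    speech_hat = tts(text_hat)      # reconstruction from the speaker
    return text_hat, reconstruction_loss(speech, speech_hat)

text_hat, loss = chain_step([1, 2, 3])
print(text_hat, loss)  # -> abc 0
```

The multimodal chain in the paper extends this loop with image captioning and image production models, so that a component can also receive a training signal routed through the visual modality.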
Generation of multi-modal dialogue for a net environment
In this paper, an architecture and a special-purpose markup language for simulated affective face-to-face communication are presented. In systems based on this architecture, users will be able to watch embodied conversational agents interact with each other in virtual locations on the internet. The markup language, the Rich Representation Language (RRL), has been designed to provide an integrated representation of speech, gesture, posture, and facial animation.
Online backchannel synthesis evaluation with the switching Wizard of Oz
In this paper, we evaluate a backchannel synthesis algorithm in an online conversation between a human speaker and a virtual listener. We adopt the Switching Wizard of Oz (SWOZ) approach to assess behavior synthesis algorithms online. A human speaker watches a virtual listener that is controlled either by a human listener or by an algorithm, with the source switching at random intervals. Speakers indicate when they feel they are no longer talking to a human listener. Analysis of these responses reveals patterns of inappropriate behavior in terms of the quantity and timing of backchannels.
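The random-interval source switching at the heart of the SWOZ setup can be sketched as a simple schedule generator. The interval bounds and source labels below are hypothetical placeholders, not the study's actual parameters:

```python
# Minimal sketch of a Switching Wizard of Oz timeline: the listener source
# alternates between a human wizard and the synthesis algorithm at random
# intervals until the session duration is covered.
import random

def swoz_timeline(duration_s, rng, min_len=5, max_len=15):
    """Return a list of (start_s, end_s, source) segments covering the session."""
    schedule, t = [], 0
    source = rng.choice(["human", "algorithm"])
    while t < duration_s:
        length = rng.randint(min_len, max_len)
        schedule.append((t, min(t + length, duration_s), source))
        t += length
        source = "algorithm" if source == "human" else "human"
    return schedule

rng = random.Random(0)  # fixed seed so the schedule is reproducible
for start, end, source in swoz_timeline(60, rng):
    print(f"{start:3d}-{end:3d}s: {source}")
```

Because the speaker never knows which source is active, the moments at which they press the "not human" button can be aligned against this schedule to locate the synthesized behavior that gave the algorithm away.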