338 research outputs found
Realistic multi-microphone data simulation for distant speech recognition
The availability of realistic simulated corpora is of key importance for the
future progress of distant speech recognition technology. The reliability,
flexibility and low computational cost of a data simulation process may
ultimately allow researchers to train, tune and test different techniques in a
variety of acoustic scenarios, avoiding the laborious effort of directly
recording real data from the targeted environment.
In the last decade, several simulated corpora have been released to the
research community, including the data-sets distributed in the context of
projects and international challenges, such as CHiME and REVERB. These efforts
were extremely useful to derive baselines and common evaluation frameworks for
comparison purposes. At the same time, in many cases they highlighted the need
of a better coherence between real and simulated conditions.
In this paper, we examine this issue and we describe our approach to the
generation of realistic corpora in a domestic context. Experimental validation,
conducted in a multi-microphone scenario, shows that a comparable performance
trend can be observed with both real and simulated data across different
recognition frameworks, acoustic models, as well as multi-microphone processing
techniques.Comment: Proc. of Interspeech 201
A multilingual corpus for rich audio-visual scene description in a meeting-room environment
In this paper, we present a multilingual database specifically designed to develop technologies for rich audio-visual scene
description in meeting-room environments. Part of that database includes the already existing CHIL audio-visual recordings, whose annotations have been extended. A relevant objective in the new recorded sessions was to include situations in which the semantic content can not be extracted from a single modality. The presented database, that includes five hours of rather spontaneously generated scientific presentations, was manually annotated using standard or previously reported annotation schemes, and will be publicly available for the research purposes.Peer ReviewedPostprint (author’s final draft
HIFI-AV: An Audio-visual Corpus for Spoken Language Human-Machine Dialogue Research in Spanish
In this paper, we describe a new multi-purpose audio-visual database on the context of speech interfaces for controlling household electronic devices. The database comprises speech and video recordings of 19 speakers interacting with a HIFI audio box by means of a spoken dialogue system. Dialogue management is based on Bayesian Networks and the system is provided with contextual information handling strategies. Each speaker was requested to fulfil different sets of specific goals following predefined scenarios, according to both different complexity levels and degrees of freedom or initiative allowed to the user. Due to a careful design and its size, the recorded database allows comprehensive studies on speech recognition, speech understanding, dialogue modeling and management, microphone array based speech processing, and both speech and video-based acoustic source localisation. The database has been labelled for quality and efficiency studies on dialogue performance. The whole database has been validated through both objective and subjective tests
A categorization of robust speech processing datasets
Speech and audio signal processing research is a tale of data collection efforts and evaluation campaigns. While large datasets for automatic speech recognition (ASR) in clean environments with various speaking styles are available, the landscape is not as picture- perfect when it comes to robust ASR in realistic environments, much less so for evaluation of source separation and speech enhancement methods. Many data collection efforts have been conducted, moving along towards more and more realistic conditions, each mak- ing different compromises between mostly antagonistic factors: financial and human cost; amount of collected data; availability and quality of annotations and ground truth; natural- ness of mixing conditions; naturalness of speech content and speaking style; naturalness of the background noise; etc. In order to better understand what directions need to be explored to build datasets that best support the development and evaluation of algorithms for recognition, separation or localization that can be used in real-world applications, we present here a study of existing datasets in terms of their key attributes
The CLEAR 2007 Evaluation
Abstract. This paper is a summary of the 2007 CLEAR Evaluation on the Classification of Events, Activities, and Relationships which took place in early 2007 and culminated with a two-day workshop held in May 2007. CLEAR is an international effort to evaluate systems for the perception of people, their activities, and interactions. In its second year, CLEAR has developed a following from the computer vision and speech communities, spawning a more multimodal perspective of research eval-uation. This paper describes the evaluation tasks, including metrics and databases used, and discusses the results achieved. The CLEAR 2007 tasks comprise person, face, and vehicle tracking, head pose estimation, as well as acoustic scene analysis. These include subtasks performed in the visual, acoustic and audio-visual domains for meeting room and surveillance data.
Audio‐Visual Speaker Tracking
Target motion tracking found its application in interdisciplinary fields, including but not limited to surveillance and security, forensic science, intelligent transportation system, driving assistance, monitoring prohibited area, medical science, robotics, action and expression recognition, individual speaker discrimination in multi‐speaker environments and video conferencing in the fields of computer vision and signal processing. Among these applications, speaker tracking in enclosed spaces has been gaining relevance due to the widespread advances of devices and technologies and the necessity for seamless solutions in real‐time tracking and localization of speakers. However, speaker tracking is a challenging task in real‐life scenarios as several distinctive issues influence the tracking process, such as occlusions and an unknown number of speakers. One approach to overcome these issues is to use multi‐modal information, as it conveys complementary information about the state of the speakers compared to single‐modal tracking. To use multi‐modal information, several approaches have been proposed which can be classified into two categories, namely deterministic and stochastic. This chapter aims at providing multimedia researchers with a state‐of‐the‐art overview of tracking methods, which are used for combining multiple modalities to accomplish various multimedia analysis tasks, classifying them into different categories and listing new and future trends in this field
- …