338 research outputs found

Realistic multi-microphone data simulation for distant speech recognition

Author: Omologo Maurizio
Ravanelli Mirco
Svaizer Piergiorgio
Publication date: 01/01/2016

The availability of realistic simulated corpora is of key importance for the future progress of distant speech recognition technology. The reliability, flexibility and low computational cost of a data simulation process may ultimately allow researchers to train, tune and test different techniques in a variety of acoustic scenarios, avoiding the laborious effort of directly recording real data from the targeted environment. In the last decade, several simulated corpora have been released to the research community, including the datasets distributed in the context of projects and international challenges, such as CHiME and REVERB. These efforts were extremely useful for deriving baselines and common evaluation frameworks for comparison purposes. At the same time, in many cases they highlighted the need for better coherence between real and simulated conditions. In this paper, we examine this issue and describe our approach to the generation of realistic corpora in a domestic context. Experimental validation, conducted in a multi-microphone scenario, shows that a comparable performance trend can be observed with both real and simulated data across different recognition frameworks, acoustic models, and multi-microphone processing techniques. Comment: Proc. of Interspeech 201

arXiv.org e-Print Archive

Archivio della ricerca - Fondazione Bruno Kessler

Tracking Identities and Attention in Smart Environments - Contributions and Progress in the CHIL Project

Author: Bernardin K.
Ekenel H.
Stiefelhagen Rainer
Voit M.
Publication venue: Institute of Electrical and Electronics Engineers
Publication date: 01/01/2008

KITopen

A multilingual corpus for rich audio-visual scene description in a meeting-room environment

Author: Butko Taras
Moreno Bilbao M. Asunción
Nadeu Camprubí Climent
Publication venue: ACM Press. Association for Computing Machinery
Publication date: 01/01/2011

In this paper, we present a multilingual database specifically designed to develop technologies for rich audio-visual scene description in meeting-room environments. Part of that database consists of the existing CHIL audio-visual recordings, whose annotations have been extended. A key objective of the newly recorded sessions was to include situations in which the semantic content cannot be extracted from a single modality. The presented database, which includes five hours of rather spontaneously generated scientific presentations, was manually annotated using standard or previously reported annotation schemes, and will be publicly available for research purposes. Peer Reviewed. Postprint (author’s final draft)

UPCommons. Portal del coneixement obert de la UPC

HIFI-AV: An Audio-visual Corpus for Spoken Language Human-Machine Dialogue Research in Spanish

Author: Barra Chicote Roberto
Fernández Martínez Fernando
Ferreiros López Javier
Lucas Cuesta Juan Manuel
Macías Guarasa Javier
Publication venue: E.T.S.I. Telecomunicación (UPM)
Publication date: 01/01/2010

In this paper, we describe a new multi-purpose audio-visual database in the context of speech interfaces for controlling household electronic devices. The database comprises speech and video recordings of 19 speakers interacting with a HIFI audio box by means of a spoken dialogue system. Dialogue management is based on Bayesian Networks, and the system is provided with contextual information handling strategies. Each speaker was asked to fulfil different sets of specific goals following predefined scenarios, which varied in both complexity level and the degree of freedom or initiative allowed to the user. Owing to its careful design and its size, the recorded database allows comprehensive studies on speech recognition, speech understanding, dialogue modeling and management, microphone-array-based speech processing, and both speech- and video-based acoustic source localisation. The database has been labelled for quality and efficiency studies on dialogue performance. The whole database has been validated through both objective and subjective tests.

Archivo Digital UPM

Interactive Multimodal Information Management: Shaping the Vision

Author: Bourlard Hervé
Popescu-Belis Andrei
Publication venue: Lausanne, EPFL Press
Publication date: 19/12/2013

Infoscience - École polytechnique fédérale de Lausanne

A categorization of robust speech processing datasets

Author: Le Roux Jonathan
Vincent Emmanuel
Publication venue: HAL CCSD
Publication date: 05/09/2014

Speech and audio signal processing research is a tale of data collection efforts and evaluation campaigns. While large datasets for automatic speech recognition (ASR) in clean environments with various speaking styles are available, the landscape is not as picture-perfect when it comes to robust ASR in realistic environments, much less so for evaluation of source separation and speech enhancement methods. Many data collection efforts have been conducted, moving along towards more and more realistic conditions, each making different compromises between mostly antagonistic factors: financial and human cost; amount of collected data; availability and quality of annotations and ground truth; naturalness of mixing conditions; naturalness of speech content and speaking style; naturalness of the background noise; etc. In order to better understand what directions need to be explored to build datasets that best support the development and evaluation of algorithms for recognition, separation or localization that can be used in real-world applications, we present here a study of existing datasets in terms of their key attributes.

INRIA a CCSD electronic archive server

HAL-Rennes 1

The CLEAR 2007 Evaluation

Author: Bernardin Keni
Bowers Rachel
Garofolo John
Michel Martial
Rose R. Travis
Stiefelhagen Rainer
Publication venue: Springer Verlag
Publication date: 01/01/2007

This paper is a summary of the 2007 CLEAR Evaluation on the Classification of Events, Activities, and Relationships which took place in early 2007 and culminated with a two-day workshop held in May 2007. CLEAR is an international effort to evaluate systems for the perception of people, their activities, and interactions. In its second year, CLEAR has developed a following from the computer vision and speech communities, spawning a more multimodal perspective of research evaluation. This paper describes the evaluation tasks, including metrics and databases used, and discusses the results achieved. The CLEAR 2007 tasks comprise person, face, and vehicle tracking, head pose estimation, as well as acoustic scene analysis. These include subtasks performed in the visual, acoustic and audio-visual domains for meeting room and surveillance data.

CiteSeerX

Crossref

KITopen

Audio‐Visual Speaker Tracking

Author: Kılıç Volkan
Wang Wenwu
Publication venue: IntechOpen
Publication date: 12/07/2017

Target motion tracking has found applications across interdisciplinary fields in computer vision and signal processing, including but not limited to surveillance and security, forensic science, intelligent transportation systems, driving assistance, monitoring of prohibited areas, medical science, robotics, action and expression recognition, individual speaker discrimination in multi‐speaker environments, and video conferencing. Among these applications, speaker tracking in enclosed spaces has been gaining relevance due to the widespread advances of devices and technologies and the necessity for seamless solutions in real‐time tracking and localization of speakers. However, speaker tracking is a challenging task in real‐life scenarios, as several distinctive issues influence the tracking process, such as occlusions and an unknown number of speakers. One approach to overcoming these issues is to use multi‐modal information, as it conveys complementary information about the state of the speakers compared to single‐modal tracking. To use multi‐modal information, several approaches have been proposed, which can be classified into two categories, namely deterministic and stochastic. This chapter aims to provide multimedia researchers with a state‐of‐the‐art overview of tracking methods that combine multiple modalities to accomplish various multimedia analysis tasks, classifying them into different categories and listing new and future trends in this field.

IntechOpen

Crossref