CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

Author: Arora Ashish
Barker Jon
Boeddeker Christoph
Chang Xuankai
Fujita Yusuke
Horiguchi Shota
Kanda Naoyuki
Khudanpur Sanjeev
Mandel Michael
Manohar Vimal
Ni Zhaoheng
Povey Daniel
Raj Desh
Ryant Neville
Snyder David
Subramanian Aswin Shanmugam
Trmal Jan
Vincent Emmanuel
Watanabe Shinji
Yair Bar Ben
Yoshioka Takuya
Publication date: 02/05/2020

Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges, we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous CHiME-5 recordings except for accurate array synchronization. The material was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech. This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition (Track 1) and unsegmented multispeaker speech recognition (Track 2). Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open source baselines providing speech enhancement, speaker diarization, and speech recognition modules.

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

HAL-Rennes 1

The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines

Author: Barker Jon
Trmal Jan
Vincent Emmanuel
Watanabe Shinji
Publication venue: HAL CCSD
Publication date: 02/09/2018

The CHiME challenge series aims to advance robust automatic speech recognition (ASR) technology by promoting research at the interface of speech and language processing, signal processing, and machine learning. This paper introduces the 5th CHiME Challenge, which considers the task of distant multi-microphone conversational ASR in real home environments. Speech material was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech and recorded by 6 Kinect microphone arrays and 4 binaural microphone pairs. The challenge features a single-array track and a multiple-array track and, for each track, distinct rankings will be produced for systems focusing on robustness with respect to distant-microphone capture vs. systems attempting to address all aspects of the task including conversational language modeling. We discuss the rationale for the challenge and provide a detailed description of the data collection procedure, the task, and the baseline systems for array synchronization, speech enhancement, and conventional and end-to-end ASR.

INRIA a CCSD electronic archive server

RGB-D datasets using microsoft kinect or similar sensors: a survey

Author: Galili
Guan
Hu
Kolner
Mulvad
Nakazawa
Palushani
Palushani
Publication venue: Springer
Publication date: 01/01/2015

RGB-D data has turned out to be a very useful representation of an indoor scene for solving fundamental computer vision problems. It combines the advantages of the color image, which provides appearance information about an object, with those of the depth image, which is immune to variations in color, illumination, rotation angle and scale. With the invention of the low-cost Microsoft Kinect sensor, which was initially used for gaming and later became a popular device for computer vision, high quality RGB-D data can be acquired easily. In recent years, more and more RGB-D image/video datasets dedicated to various applications have become available, which are of great importance for benchmarking the state of the art. In this paper, we systematically survey popular RGB-D datasets for different applications including object recognition, scene classification, hand gesture recognition, 3D simultaneous localization and mapping, and pose estimation. We provide insights into the characteristics of each important dataset, and compare the popularity and the difficulty of those datasets. Overall, the main goal of this survey is to give a comprehensive description of the available RGB-D datasets and thus to guide researchers in the selection of suitable datasets for evaluating their algorithms.

Northumbria University Research Portal

Crossref

Springer - Publisher Connector

Online Research Database In Technology

CHiME-6 Challenge: Tackling multispeaker speech recognition for unsegmented recordings

Author: Arora Ashish
Barker Jon
Boeddeker Christoph
Chang Xuankai
Fujita Yusuke
Horiguchi Shota
Kanda Naoyuki
Khudanpur Sanjeev
Mandel Michael
Manohar Vimal
Ni Zhaoheng
Povey Daniel
Raj Desh
Ryant Neville
Snyder David
Subramanian Aswin
Trmal Jan
Vincent Emmanuel
Watanabe Shinji
Yair Bar
Yoshioka Takuya
Publication venue: HAL CCSD
Publication date: 04/05/2020

Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges, we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous CHiME-5 recordings except for accurate array synchronization. The material was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech. This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition (Track 1) and unsegmented multispeaker speech recognition (Track 2). Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open source baselines providing speech enhancement, speaker diarization, and speech recognition modules.

INRIA a CCSD electronic archive server

Capturing Synchronous Collaborative Design Activities: A State-Of-The-Art Technology Review

Author: Bermell-Garcia P.
Hall M.
Johansson A.
McMahon C. A.
Ravindranath Ranjitun
Publication venue: Faculty of Mechanical Engineering and Naval Architecture, Univ. of Zagreb
Publication date: 01/01/2018

Crossref

Online Research Database In Technology

Social signal processing for studying parent–infant interaction

Author: Catherine Achard
Chloé Leclere
David Cohen
Marie Avril
Miri Keren
Mohamed Chetouani
Stéphane Michelet
Sylvain Missonnier
Sylvie Viaux-Savelon
Publication venue: Frontiers Media SA
Publication date: 01/01/2014

Studying early interactions is a core issue of infant development and psychopathology. Automatic social signal processing theoretically offers the possibility to extract and analyze communication by taking an integrative perspective, considering the multimodal nature and dynamics of behaviors (including synchrony). This paper proposes an explorative method to acquire and extract relevant social signals from a naturalistic early parent–infant interaction. An experimental setup is proposed based on both clinical and technical requirements. We extracted various cues from body postures and speech productions of partners using the IMI2S (Interaction, Multimodal Integration, and Social Signal) Framework. Preliminary clinical and computational results are reported for two dyads (one pathological in a situation of severe emotional neglect and one normal control) as an illustration of our cross-disciplinary protocol. The results from both clinical and computational analyses highlight similar differences: the pathological dyad shows dyssynchronic interaction led by the infant whereas the control dyad shows synchronic interaction and a smooth interactive dialog. The results suggest that the current method might be promising for future studies.

Directory of Open Access Journals

HAL Descartes

Frontiers - Publisher Connector

PubMed Central

Simulating realistic multiparty speech data: for the development of distant microphone ASR systems

Author: Deadman Jack
Publication date: 01/07/2023

Automatic speech recognition has become a ubiquitous technology integrated into our daily lives. However, the problem remains challenging when the speaker is far away from the microphone. In such scenarios, the speech is degraded both by reverberation and by the presence of additive noise. This situation is particularly challenging when there are competing speakers present (i.e. multi-party scenarios). Acoustic scene simulation has been a major tool for training and developing distant microphone speech recognition systems, and is now being used to develop solutions for multi-party scenarios. It has been used both in training, as it allows cheap generation of limitless amounts of data, and for evaluation, because it can provide easy access to a ground truth (i.e. a noise-free target signal). However, whilst much work has been conducted to produce realistic artificial scene simulators, the signals produced from such simulators are only as good as the 'metadata' being used to define the setups, i.e., the data describing, for example, the number of speakers and their distribution relative to the microphones. This thesis looks at how realistic metadata can be derived by analysing how speakers behave in real domestic environments. In particular, it examines how to produce scenes that provide a realistic distribution for various factors that are known to influence the 'difficulty' of the scene, including the separation angle between speakers, the absolute and relative distances of speakers to microphones, and the pattern of temporal overlap of speech. Using an existing audio-visual multi-party conversational dataset, CHiME-5, each of these aspects has been studied in turn. First, producing a realistic angular separation between speakers allows for algorithms which enhance signals based on the direction of arrival to be fairly evaluated, reducing the mismatch between real and simulated data. This was estimated using automatic people detection techniques in video recordings from CHiME-5. Results show that commonly used datasets of simulated signals do not follow a realistic distribution, and when a realistic distribution is enforced, a significant drop in performance is observed. Second, by using multiple cameras it has been possible to estimate the 2-D positions of people inside each scene. This has allowed the estimation of realistic distributions for the absolute distance to the microphone and relative distance to the competing speaker. The results show grouping behaviour among participants when located in a room, and the impact this has on performance depends on the room size considered. Finally, the amount of overlap and points in the mixture which contain overlap were explored using finite-state models. These models allowed for mixtures to be generated, which approached the overlap patterns observed in the real data. Features derived from these models were also shown to be a predictor of the difficulty of the mixture. At each stage of the project, simulated datasets derived using the realistic metadata distributions have been compared to existing standard datasets that use naive or uninformed metadata distributions, and implications for speech recognition performance are observed and discussed. This work has demonstrated how unrealistic approaches can produce over-promising results, and can bias research towards techniques that might not work well in practice. Results will also be valuable in informing the design of future simulated datasets.

White Rose E-theses Online