Search CORE

3,039 research outputs found

Robust Speech Recognition with Small Microphone Arrays using the Missing Data Approach

Author: Bourlard Hervé
McCowan Iain A.
Morris Andrew
Publication venue: Martigny, Switzerland, IDIAP
Publication date: 10/03/2006
Field of study

Traditional microphone array speech recognition systems simply recognise the enhanced output of the array. As the level of signal enhancement depends on the number of microphones, such systems do not achieve acceptable speech recognition performance for arrays having only a few microphones. For small microphone arrays, we instead propose using the enhanced output to estimate a reliability mask, which is then used in missing data speech recognition. In missing data speech recognition, the decoded sequence depends on the reliability of each input feature. This reliability is usually based on the signal to noise ratio in each frequency band. In this paper, we use the energy difference between the noisy input and the enhanced output of a small microphone array to determine the frequency band reliability. Recognition experiments with a small array demonstrate the effectiveness of the technique, compared to both traditional microphone array enhancement and a baseline missing data system

Towards End-to-End Acoustic Localization using Deep Learning: from Audio Signal to Source Position Coordinates

Author: Macias-Guarasa Javier
Pizarro Daniel
Vera-Diaz Juan Manuel
Publication venue: 'MDPI AG'
Publication date: 29/07/2018
Field of study

This paper presents a novel approach for indoor acoustic source localization using microphone arrays and based on a Convolutional Neural Network (CNN). The proposed solution is, to the best of our knowledge, the first published work in which the CNN is designed to directly estimate the three dimensional position of an acoustic source, using the raw audio signal as the input information avoiding the use of hand crafted audio features. Given the limited amount of available localization data, we propose in this paper a training strategy based on two steps. We first train our network using semi-synthetic data, generated from close talk speech recordings, and where we simulate the time delays and distortion suffered in the signal that propagates from the source to the array of microphones. We then fine tune this network using a small amount of real data. Our experimental results show that this strategy is able to produce networks that significantly improve existing localization methods based on \textit{SRP-PHAT} strategies. In addition, our experiments show that our CNN method exhibits better resistance against varying gender of the speaker and different window sizes compared with the other methods.Comment: 18 pages, 3 figures, 8 table

arXiv.org e-Print Archive

Directory of Open Access Journals