3-D Feature and Acoustic Modeling for Far-Field Speech Recognition
Automatic speech recognition in multi-channel reverberant conditions is a
challenging task. The conventional way of suppressing the reverberation
artifacts involves a beamforming based enhancement of the multi-channel speech
signal, which is used to extract spectrogram based features for a neural
network acoustic model. In this paper, we propose to extract features directly
from the multi-channel speech signal using a multivariate autoregressive (MAR)
modeling approach, where the correlations among all the three dimensions of
time, frequency and channel are exploited. The MAR features are fed to a
convolutional neural network (CNN) architecture which performs the joint
acoustic modeling on the three dimensions. The 3-D CNN architecture combines
the multi-channel features so as to optimize the speech recognition cost,
unlike traditional beamforming models, which focus on the enhancement
task. Experiments are conducted on the CHiME-3 and REVERB Challenge datasets
using multi-channel reverberant speech. In these experiments, the proposed 3-D
feature and acoustic modeling approach provides significant improvements over
an ASR system trained with beamformed audio (average relative improvements of
10% and 9% in word error rates for the CHiME-3 and REVERB Challenge datasets,
respectively).
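
The relative improvements quoted above are presumably computed in the usual way, as the reduction in word error rate (WER) divided by the baseline WER. A minimal sketch of that arithmetic follows; the function name and the WER values are hypothetical and chosen only for illustration, not taken from the paper.

```python
def relative_wer_improvement(baseline_wer, proposed_wer):
    """Return the relative WER reduction, in percent, over a baseline system.

    Both arguments are word error rates in percent. The standard definition
    100 * (baseline - proposed) / baseline is assumed here.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


# Hypothetical WERs (illustrative only, not results from the paper):
# a beamformed-audio baseline at 20.0% and a proposed system at 18.0%.
improvement = relative_wer_improvement(20.0, 18.0)
print(f"relative improvement: {improvement:.1f}%")  # prints "relative improvement: 10.0%"
```

Under this definition, a 10% relative improvement means that one tenth of the baseline's errors are removed, not that the WER drops by 10 percentage points.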