Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition
Far-field speech recognition in noisy and reverberant conditions remains a
challenging problem despite recent deep learning breakthroughs. This problem is
commonly addressed by acquiring a speech signal from multiple microphones and
performing beamforming over them. In this paper, we propose to use a recurrent
neural network with long short-term memory (LSTM) architecture to adaptively
estimate real-time beamforming filter coefficients to cope with non-stationary
environmental noise and the dynamic nature of source and microphone positions,
resulting in a set of time-varying room impulse responses. The LSTM adaptive
beamformer is jointly trained with a deep LSTM acoustic model to predict senone
labels. Further, we use hidden units in the deep LSTM acoustic model to assist
in predicting the beamforming filter coefficients. The proposed system achieves
a 7.97% absolute gain over baseline systems with no beamforming on the CHiME-3
real evaluation set.
Comment: in 2017 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP).
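To make the architecture concrete, below is a minimal PyTorch sketch, not the authors' exact model: an LSTM predicts per-frame filter-and-sum beamforming coefficients, and a deep LSTM acoustic model maps the beamformed frames to senone logits. All layer sizes, the filter length, and the senone count are illustrative assumptions, and the paper's feedback path from acoustic-model hidden units into the beamformer is omitted for brevity.

import torch
import torch.nn as nn

class AdaptiveBeamformerASR(nn.Module):
    # Sketch only: sizes are assumptions, not the paper's configuration.
    def __init__(self, n_channels=6, n_taps=25, n_senones=2000):
        super().__init__()
        # LSTM that adapts the beamforming filter coefficients frame by frame
        self.bf_lstm = nn.LSTM(input_size=n_channels * n_taps,
                               hidden_size=256, batch_first=True)
        self.bf_proj = nn.Linear(256, n_channels * n_taps)
        # Deep LSTM acoustic model predicting senone labels
        self.am_lstm = nn.LSTM(input_size=n_taps, hidden_size=512,
                               num_layers=3, batch_first=True)
        self.am_out = nn.Linear(512, n_senones)

    def forward(self, x):
        # x: (batch, frames, channels, taps) -- windowed multichannel signal
        B, T, C, K = x.shape
        h, _ = self.bf_lstm(x.reshape(B, T, C * K))
        # Per-frame, per-channel filter coefficients
        w = self.bf_proj(h).reshape(B, T, C, K)
        # Filter-and-sum: weight each channel's window, then sum over channels
        beamformed = (w * x).sum(dim=2)          # (B, T, K)
        a, _ = self.am_lstm(beamformed)
        return self.am_out(a)                    # senone logits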
A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model
Far-field speech recognition is a challenging task that conventionally uses
signal-processing beamforming to attack the noise and interference problem, but
performance is usually limited by a heavy reliance on environmental
assumptions. In this paper, we propose a unified multichannel
far-field speech recognition system that combines the neural beamforming and
transformer-based Listen, Attend and Spell (LAS) speech recognition system, which
further extends the end-to-end speech recognition system to include speech
enhancement. This framework is then jointly trained to optimize the final
objective of interest. Specifically, factored complex linear projection (fCLP)
has been adopted to form the neural beamforming. Several pooling strategies to
combine look directions are then compared in order to find the optimal
approach. Moreover, information of the source direction is also integrated in
the beamforming to explore the usefulness of source direction as a prior, which
is usually available, especially in multi-modality scenarios. Experiments on
different microphone array geometries are conducted to evaluate robustness
against variation in microphone array spacing. Large in-house databases are
used to evaluate the effectiveness of the proposed framework, and the proposed
method achieves a 19.26% improvement over a strong baseline.
Deep Audio Zooming: Beamwidth-Controllable Neural Beamformer
Audio zooming, a signal processing technique, enables selective focusing and
enhancement of sound signals from a specified region, attenuating others. While
traditional beamforming and neural beamforming techniques, centered on creating
a directional array, necessitate the designation of a singular target
direction, they often overlook the concept of a field of view (FOV), which
defines an angular area. In this paper, we propose a simple yet effective FOV
feature that amalgamates all directional attributes within the user-defined
field. In conjunction, we introduce a counter FOV feature capturing directional
aspects outside the desired field. Such advancements ensure refined sound
capture, particularly emphasizing the FOV's boundaries, and guarantee the
enhanced capture of all desired sound sources inside the user-defined field.
Experimental results demonstrate the efficacy of the introduced angular FOV
feature and its seamless incorporation into a low-power subband model suited
for real-time applications.
Comment: 6 pages, 5 figures.
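A rough illustration of the FOV idea follows, under assumed definitions (a grid of fixed beams over azimuth and max-pooling as the aggregation; the paper's actual feature construction may differ): directions inside the user-defined field are amalgamated into one FOV feature, and the remaining directions into the counter FOV feature.

import torch

def fov_features(dir_resp, angles_deg, fov_lo, fov_hi):
    """Sketch under assumptions. dir_resp: (batch, frames, freq, n_angles)
    magnitudes of fixed beams steered to angles_deg; fov_lo/fov_hi: the
    user-defined field of view in degrees."""
    inside = (angles_deg >= fov_lo) & (angles_deg <= fov_hi)  # (n_angles,)
    # Amalgamate all directional attributes within the field...
    fov = dir_resp[..., inside].max(dim=-1).values
    # ...and capture directions outside the desired field as a counter feature
    counter = dir_resp[..., ~inside].max(dim=-1).values
    return torch.stack([fov, counter], dim=-1)   # extra input channels

# Example: 36 beams over 0..350 degrees, FOV spanning 60..120 degrees
angles = torch.arange(0, 360, 10)
resp = torch.rand(1, 100, 257, 36)
feats = fov_features(resp, angles, 60, 120)      # (1, 100, 257, 2)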
Multichannel Voice Trigger Detection Based on Transform-average-concatenate
Voice triggering (VT) enables users to activate their devices by just
speaking a trigger phrase. A front-end system is typically used to perform
speech enhancement and/or separation, and produces multiple enhanced and/or
separated signals. Since conventional VT systems take only single-channel audio
as input, channel selection is performed. A drawback of this approach is that
unselected channels are discarded, even if the discarded channels could contain
useful information for VT. In this work, we propose multichannel acoustic
models for VT, where the multichannel output from the front-end is fed directly
into a VT model. We adopt a transform-average-concatenate (TAC) block and
modify the TAC block by incorporating the channel from the conventional channel
selection so that the model can attend to a target speaker when multiple
speakers are present. The proposed approach achieves up to a 30% reduction in
the false rejection rate compared to the baseline channel selection approach.
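For intuition, here is a hedged PyTorch sketch of a transform-average-concatenate block with the modification described above: alongside the usual cross-channel average, the embedding of the conventionally selected channel is concatenated to every channel so the model can attend to the target speaker. Layer sizes and the residual connection are illustrative assumptions, not the paper's specification.

import torch
import torch.nn as nn

class ModifiedTAC(nn.Module):
    # Sketch only: dim and activations are assumptions.
    def __init__(self, dim=256):
        super().__init__()
        self.transform = nn.Sequential(nn.Linear(dim, dim), nn.PReLU())
        self.average = nn.Sequential(nn.Linear(dim, dim), nn.PReLU())
        # Concat of: each channel, the cross-channel average, and the
        # conventionally selected channel
        self.concat = nn.Sequential(nn.Linear(3 * dim, dim), nn.PReLU())

    def forward(self, x, selected):
        # x: (batch, channels, frames, dim); selected: index of the channel
        # picked by conventional channel selection
        z = self.transform(x)
        avg = self.average(z.mean(dim=1, keepdim=True))   # (B, 1, T, D)
        sel = z[:, selected:selected + 1]                 # (B, 1, T, D)
        C = x.shape[1]
        cat = torch.cat([z,
                         avg.expand(-1, C, -1, -1),
                         sel.expand(-1, C, -1, -1)], dim=-1)
        return x + self.concat(cat)                       # residual update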