Time-domain speaker extraction network
Speaker extraction aims to extract a target speaker's voice from multi-talker
speech, mimicking the human cocktail-party effect, or selective listening
ability. Prior work mostly performs speaker extraction in the frequency domain
and then reconstructs the signal with some phase approximation. The inaccuracy
of phase estimation is inherent to frequency-domain processing and degrades the
quality of signal reconstruction. In this paper, we propose a time-domain
speaker extraction network (TseNet) that does not decompose the speech signal
into magnitude and phase spectra and therefore does not require phase
estimation. TseNet consists of a stack of dilated depthwise separable
convolutional networks, which capture the long-range dependency of the speech
signal with a manageable number of parameters. It is also conditioned on a
reference voice from the target speaker, characterized by a speaker i-vector,
to perform selective listening to the target speaker. Experiments show that the
proposed TseNet achieves 16.3% and 7.0% relative improvements over the baseline
in terms of signal-to-distortion ratio (SDR) and perceptual evaluation of
speech quality (PESQ) under the open evaluation condition.
Comment: Published in ASRU 2019. arXiv admin note: text overlap with
arXiv:2004.0832
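The dilated depthwise separable convolution at the heart of TseNet can be sketched in NumPy. This is a toy illustration, not the authors' implementation: the kernel shapes, the valid-only (no padding) convolution, and the plain loops are assumptions made for clarity.

```python
import numpy as np

def depthwise_separable_conv1d(x, dw_kernels, pw_weights, dilation=1):
    """Dilated depthwise separable 1-D convolution (valid padding only).

    x          : (channels, time) input signal
    dw_kernels : (channels, k) one kernel per channel (depthwise step)
    pw_weights : (out_channels, channels) 1x1 mixing (pointwise step)
    """
    c, t = x.shape
    k = dw_kernels.shape[1]
    span = (k - 1) * dilation          # receptive field minus one
    out_t = t - span
    dw = np.zeros((c, out_t))
    for ch in range(c):                # depthwise: filter each channel alone
        for i in range(out_t):
            taps = x[ch, i : i + span + 1 : dilation]
            dw[ch, i] = np.dot(taps, dw_kernels[ch])
    return pw_weights @ dw             # pointwise: mix channels with 1x1 conv

# With dilation 2 and kernel length 3, each output sample sees a span of 5
# input samples, which is how stacked dilated layers reach long-range context
# with few parameters.
x = np.arange(12, dtype=float).reshape(2, 6)
pick_last = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])  # selects tap i+4
y = depthwise_separable_conv1d(x, pick_last, np.eye(2), dilation=2)
```

Stacking such layers with exponentially growing dilation widens the receptive field linearly in parameters but exponentially in time, which matches the "long-range dependency with a manageable number of parameters" claim.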
Improving the Robustness of Speaker Recognition in Noise and Multi-Speaker Conditions Using Deep Neural Networks
In speaker recognition, deep neural networks deliver state-of-the-art performance
thanks to their large capacity and powerful feature extraction abilities. However,
this performance can be severely degraded by interference from background noise and
other speakers. This thesis focuses on new neural network architectures designed to
overcome such interference and thereby improve the robustness of the speaker
recognition system.
To improve the noise robustness of the speaker recognition model, two novel
network architectures are proposed. The first is the hierarchical attention
network, which captures both local and global features to improve the robustness
of the network. Experimental results show it delivers performance comparable to
published state-of-the-art methods, reaching a 4.28% equal error rate using the
VoxCeleb1 training and test sets. The second is a joint speech enhancement and
speaker recognition system consisting of two networks: the first integrates speech
enhancement and speaker recognition into one framework to better filter out noise,
while the second feeds speaker embeddings into the speech enhancement network,
providing prior knowledge that improves its performance. The results show that a
joint system with a speaker-dependent speech enhancement model delivers performance
comparable to published state-of-the-art methods, reaching a 4.15% equal error
rate using the VoxCeleb1 training and test sets.
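The idea of feeding a speaker embedding into the enhancement network as prior knowledge can be illustrated with a FiLM-style conditioning step. This is a hypothetical sketch: the thesis abstract does not specify the fusion mechanism, and the projection matrices here stand in for learned parameters.

```python
import numpy as np

def condition_on_speaker(features, spk_emb, W_scale, W_shift):
    """Condition enhancement-net activations on a target-speaker embedding.

    One possible realization (feature-wise linear modulation): project the
    speaker embedding to a per-channel scale and shift, then apply both to
    the intermediate features so the network knows whose voice to keep.

    features : (channels, frames) intermediate activations
    spk_emb  : (emb_dim,) embedding of the target speaker
    W_scale, W_shift : (channels, emb_dim) learned projections (random here)
    """
    scale = W_scale @ spk_emb              # (channels,) per-channel gain
    shift = W_shift @ spk_emb              # (channels,) per-channel bias
    return features * scale[:, None] + shift[:, None]

# Toy usage: with identity scale projection and zero shift, each feature
# channel is simply gated by the matching embedding dimension.
feats = np.ones((2, 3))
emb = np.array([1.0, 2.0])
conditioned = condition_on_speaker(feats, emb, np.eye(2), np.zeros((2, 2)))
```

The design choice this illustrates is the "prior knowledge" point in the abstract: the enhancement network no longer has to infer the target speaker from the noisy mixture alone.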
To overcome interfering speakers, two novel approaches are proposed. The first,
referred to as embedding de-mixing, separates the speaker and content properties
of a two-speaker signal in an embedding space rather than in the signal space.
The results show that the de-mixed embeddings are close to the clean embeddings
in quality, and the back-end speaker recognition model can use them to reach
96.9% speaker identification accuracy, compared to 98.5% with clean embeddings,
on the TIMIT dataset. The second is the first end-to-end weakly supervised
speaker identification approach, based on a novel hierarchical transformer
network architecture. The results show that the proposed model can capture
speaker properties from two speakers in one input utterance, and the
hierarchical transformer network achieves more than 3% relative improvement
over the baselines in all test conditions.
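The embedding de-mixing approach above implies comparing de-mixed outputs against clean reference embeddings, where the assignment of output slots to speakers is ambiguous. A minimal permutation-invariant comparison, assuming a mean-squared-error criterion (an illustrative choice, not necessarily the thesis's actual training objective):

```python
import numpy as np
from itertools import permutations

def pit_embedding_loss(est, ref):
    """Permutation-invariant distance between de-mixed and clean embeddings.

    est, ref : (num_speakers, emb_dim). Since the model cannot know which
    output slot corresponds to which speaker, score every pairing and keep
    the best one (standard permutation-invariant training trick).
    """
    n = est.shape[0]
    best = float("inf")
    for perm in permutations(range(n)):
        loss = float(np.mean((est[list(perm)] - ref) ** 2))
        best = min(best, loss)
    return best

# Usage: perfectly de-mixed embeddings score zero even when the two
# speakers come out in swapped order.
ref = np.array([[1.0, 0.0], [0.0, 1.0]])
swapped = np.array([[0.0, 1.0], [1.0, 0.0]])
loss = pit_embedding_loss(swapped, ref)
```

Working in embedding space keeps the comparison cheap: the permutation search is over speakers (here 2! = 2 pairings), not over signal samples.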