Robust Speaker Recognition Using Speech Enhancement And Attention Model
In this paper, a novel architecture for speaker recognition is proposed by
cascading speech enhancement and speaker processing. Its aim is to improve
speaker recognition performance when speech signals are corrupted by noise.
Instead of individually processing speech enhancement and speaker recognition,
the two modules are integrated into one framework by a joint optimisation using
deep neural networks. Furthermore, to increase robustness against noise, a
multi-stage attention mechanism is employed to highlight the speaker-related
features learned from context information in the time and frequency domains. To
evaluate speaker identification and verification performance of the proposed
approach, we test it on VoxCeleb1, one of the most widely used benchmark
datasets. Moreover, the robustness of our proposed approach is also tested on
VoxCeleb1 data corrupted by three types of interference: general noise, music,
and babble, at different signal-to-noise ratio (SNR) levels. The
obtained results show that the proposed approach using speech enhancement and
multi-stage attention models outperforms two strong baselines without them in
most of the acoustic conditions in our experiments. Comment: Accepted by Odyssey 202
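
Below is a minimal PyTorch sketch of the cascaded design this abstract describes: an enhancement front-end and an attentive speaker classifier trained jointly with a combined loss. All layer sizes, the single attention stage (the paper uses a multi-stage mechanism), the equal loss weighting, and the speaker count are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class Enhancer(nn.Module):
    """Predicts a time-frequency mask that is applied to the noisy spectrogram."""
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy):                      # noisy: (batch, time, freq)
        h, _ = self.rnn(noisy)
        return self.mask(h) * noisy                # enhanced spectrogram

class AttentiveSpeakerNet(nn.Module):
    """Attention over time pools frame features into a speaker embedding."""
    def __init__(self, n_freq=257, hidden=256, n_speakers=1251):
        super().__init__()
        self.frame = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
        self.score = nn.Linear(hidden, 1)          # one attention stage (simplified)
        self.cls = nn.Linear(hidden, n_speakers)

    def forward(self, spec):
        h = self.frame(spec)
        w = torch.softmax(self.score(h), dim=1)    # attention weights over time
        emb = (w * h).sum(dim=1)                   # weighted pooling -> embedding
        return self.cls(emb)

enh, spk = Enhancer(), AttentiveSpeakerNet()
opt = torch.optim.Adam(list(enh.parameters()) + list(spk.parameters()), lr=1e-3)

noisy = torch.rand(4, 100, 257)                    # dummy noisy spectrograms
clean = torch.rand(4, 100, 257)                    # paired clean targets
labels = torch.randint(0, 1251, (4,))              # dummy speaker labels

# Joint optimisation: enhancement (MSE) and recognition (CE) in a single loss.
enhanced = enh(noisy)
loss = (nn.functional.mse_loss(enhanced, clean)
        + nn.functional.cross_entropy(spk(enhanced), labels))
opt.zero_grad()
loss.backward()
opt.step()
```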
Speaker Re-identification with Speaker Dependent Speech Enhancement
While the use of deep neural networks has significantly boosted speaker
recognition performance, it is still challenging to separate speakers in poor
acoustic environments. Here speech enhancement methods have traditionally
allowed improved performance. Recent work has shown that adapting speech
enhancement can lead to further gains. This paper introduces a novel approach
that cascades speech enhancement and speaker recognition. In the first step, a
speaker embedding vector is generated, which is used in the second step to
enhance the speech quality and re-identify the speakers. Models are trained in
an integrated framework with joint optimisation. The proposed approach is
evaluated using the VoxCeleb1 dataset, which aims to assess speaker recognition
in real-world situations. In addition, three types of noise at different
signal-to-noise ratios were added for this work. The obtained results show that
the proposed approach using speaker dependent speech enhancement can yield
better speaker recognition and speech enhancement performances than two
baselines in various noise conditions. Comment: Accepted for presentation at Interspeech202
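
A hedged sketch of the two-step pipeline this abstract describes: step one extracts a speaker embedding from the noisy input; step two conditions the enhancer on that embedding and re-identifies the speaker from the enhanced output. The module shapes and the concatenation-based conditioning are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Maps a spectrogram to a fixed-size speaker embedding (mean-pooled)."""
    def __init__(self, n_freq=257, emb_dim=128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(n_freq, emb_dim), nn.ReLU())

    def forward(self, spec):                       # (batch, time, freq) -> (batch, emb)
        return self.proj(spec).mean(dim=1)

class ConditionedEnhancer(nn.Module):
    """Masks the noisy spectrogram, conditioned frame-wise on a speaker embedding."""
    def __init__(self, n_freq=257, emb_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq + emb_dim, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy, emb):
        e = emb.unsqueeze(1).expand(-1, noisy.size(1), -1)  # broadcast over time
        h, _ = self.rnn(torch.cat([noisy, e], dim=-1))
        return self.mask(h) * noisy

embed_net = EmbeddingNet()
enhancer = ConditionedEnhancer()
classifier = nn.Linear(128, 1251)                  # hypothetical re-identification head

noisy = torch.rand(4, 100, 257)
emb = embed_net(noisy)                             # step 1: speaker embedding
enhanced = enhancer(noisy, emb)                    # step 2a: speaker-dependent enhancement
logits = classifier(embed_net(enhanced))           # step 2b: re-identify the speaker
```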
Contextual Joint Factor Acoustic Embeddings
Embedding acoustic information into fixed-length representations is of
interest for a whole range of applications in speech and audio technology. Two
novel unsupervised approaches to generate acoustic embeddings by modelling
acoustic context are proposed. The first approach is a contextual joint factor
synthesis encoder, where the encoder in an encoder/decoder framework is trained
to extract joint factors from surrounding audio frames to best generate the
target output. The second approach is a contextual joint factor analysis
encoder, where the encoder is trained to analyse joint factors from the source
signal that correlate best with the neighbouring audio. To evaluate the
effectiveness of our approaches compared to prior work, two tasks are conducted
-- phone classification and speaker recognition -- and tested on different TIMIT
data sets. Experimental results show that one of the proposed approaches
outperforms phone classification baselines, yielding a classification accuracy
of 74.1%. When using additional out-of-domain data for training, a further 3%
improvement can be obtained, for both the phone classification and speaker
recognition tasks. Comment: Published at SLT202
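
A minimal sketch contrasting the two training objectives described above, under assumed feature and embedding sizes: the synthesis variant encodes the surrounding frames and decodes the target frame, while the analysis variant encodes the source frame and is trained to predict its neighbours. The shared encoder/decoder and MSE losses are simplifying assumptions.

```python
import torch
import torch.nn as nn

F_DIM, EMB = 40, 64                       # feature and embedding dims (assumed)

enc = nn.Sequential(nn.Linear(F_DIM, 128), nn.ReLU(), nn.Linear(128, EMB))
dec = nn.Sequential(nn.Linear(EMB, 128), nn.ReLU(), nn.Linear(128, F_DIM))

left, centre, right = (torch.rand(8, F_DIM) for _ in range(3))  # dummy frames

# Contextual joint factor *synthesis*: embed the context, generate the target.
z_syn = enc(left) + enc(right)            # pool joint factors from context
loss_syn = nn.functional.mse_loss(dec(z_syn), centre)

# Contextual joint factor *analysis*: embed the source, predict its neighbours.
z_ana = enc(centre)
loss_ana = (nn.functional.mse_loss(dec(z_ana), left)
            + nn.functional.mse_loss(dec(z_ana), right))
```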
Unsupervised Acoustic Unit Representation Learning for Voice Conversion using WaveNet Auto-encoders
Unsupervised representation learning of speech has been of keen interest in
recent years, as is evident, for example, in the wide interest in the
ZeroSpeech challenges. This work presents a new method for learning frame-level
representations based on WaveNet auto-encoders. Of particular interest in the
ZeroSpeech Challenge 2019 were models with discrete latent variables, such as
the Vector Quantized Variational Auto-Encoder (VQVAE). However, these models
generate speech with relatively poor quality. In this work we aim to address
this with two approaches: first, WaveNet is used as the decoder to generate
waveform data directly from the latent representation; second, the quality of
the low-complexity latent representations is improved with two alternative
disentanglement learning methods, namely instance normalization and sliced
vector quantization. The method was developed and tested in the context of the
recent ZeroSpeech challenge 2020. The system output submitted to the challenge
obtained the top position for naturalness (Mean Opinion Score 4.06), top
position for intelligibility (Character Error Rate 0.15), and third position
for the quality of the representation (ABX test score 12.5). These results and
further analysis in this paper illustrate that the quality of the converted
speech and of the acoustic unit representation can be well balanced. Comment: To be presented at Interspeech 202
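
A hedged sketch of the two disentanglement operations named in this abstract, applied to frame-level latents of shape (batch, time, dim). The codebook size, slice count, and straight-through quantisation details are assumptions; the actual system decodes these units with a WaveNet, which is omitted here.

```python
import torch
import torch.nn as nn

def instance_norm(z, eps=1e-5):
    """Normalise each utterance over time, stripping global (speaker-like) stats."""
    mu = z.mean(dim=1, keepdim=True)
    sigma = z.std(dim=1, keepdim=True)
    return (z - mu) / (sigma + eps)

class SlicedVQ(nn.Module):
    """Split each latent vector into slices and quantise each slice separately."""
    def __init__(self, dim=64, n_slices=4, codebook_size=32):
        super().__init__()
        assert dim % n_slices == 0
        self.n_slices = n_slices
        self.codebook = nn.Parameter(
            torch.randn(n_slices, codebook_size, dim // n_slices))

    def forward(self, z):                              # z: (batch, time, dim)
        out = []
        for i, s in enumerate(z.chunk(self.n_slices, dim=-1)):
            book = self.codebook[i].unsqueeze(0).expand(s.size(0), -1, -1)
            d = torch.cdist(s, book)                   # distance to each codeword
            q = self.codebook[i][d.argmin(dim=-1)]     # nearest codeword per frame
            out.append(s + (q - s).detach())           # straight-through estimator
        return torch.cat(out, dim=-1)

z = torch.rand(2, 50, 64)                              # dummy frame-level latents
z_q = SlicedVQ()(instance_norm(z))                     # speaker-normalised, discretised units
```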
Supervised Speaker Embedding De-Mixing in Two-Speaker Environment
Separating different speaker properties from a multi-speaker environment is
challenging. Instead of separating a two-speaker signal in signal space like
speech source separation, a speaker embedding de-mixing approach is proposed.
The proposed approach separates different speaker properties from a two-speaker
signal in embedding space. The proposed approach contains two steps. In step
one, the clean speaker embeddings are learned and collected by a residual
TDNN-based network. In step two, the two-speaker signal and the embedding of one of
the speakers are both input to a speaker embedding de-mixing network. The
de-mixing network is trained to generate the embedding of the other speaker by
reconstruction loss. Speaker identification accuracy and the cosine similarity
score between the clean embeddings and the de-mixed embeddings are used to
evaluate the quality of the obtained embeddings. Experiments are conducted on
two kinds of data: artificially augmented two-speaker data (TIMIT) and
real-world recordings of two-speaker data (MC-WSJ). Six different speaker
embedding de-mixing architectures are investigated. Compared with the
performance on the clean speaker embeddings, the obtained results show that one
of the proposed architectures achieves close performance, reaching 96.9%
identification accuracy and a cosine similarity of 0.89. Comment: Published at SLT202
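
A minimal sketch of the de-mixing step under assumed embedding sizes: given an embedding of the two-speaker mixture and the known clean embedding of speaker A, a network is trained with a reconstruction loss to produce speaker B's clean embedding. The simple feed-forward de-mixer stands in for the six architectures investigated in the paper.

```python
import torch
import torch.nn as nn

EMB = 128                                          # embedding size (assumed)

demix = nn.Sequential(                             # one possible de-mixing network
    nn.Linear(2 * EMB, 256), nn.ReLU(),
    nn.Linear(256, EMB),
)

mix_emb = torch.rand(16, EMB)                      # embedding of two-speaker signal
emb_a = torch.rand(16, EMB)                        # known clean embedding (speaker A)
emb_b = torch.rand(16, EMB)                        # target clean embedding (speaker B)

pred_b = demix(torch.cat([mix_emb, emb_a], dim=-1))
loss = nn.functional.mse_loss(pred_b, emb_b)       # reconstruction loss

# Evaluation as in the abstract: cosine similarity to the clean embedding.
cos = nn.functional.cosine_similarity(pred_b, emb_b, dim=-1).mean()
```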
Experimental demonstration of composite stimulated Raman adiabatic passage
We experimentally demonstrate composite stimulated Raman adiabatic passage
(CSTIRAP), which combines the concepts of composite pulse sequences and
adiabatic passage. The technique is applied for population transfer in a
rare-earth doped solid. We compare the performance of CSTIRAP with conventional
single and repeated STIRAP, either in the resonant or the highly detuned
regime. In the latter case, CSTIRAP improves the peak transfer efficiency and
robustness, boosting the transfer efficiency substantially compared to repeated
STIRAP. We also propose and demonstrate a universal version of CSTIRAP, which
shows improved performance compared to the originally proposed composite
version. Our findings pave the way towards new STIRAP applications, which
require repeated excitation cycles, e.g., for momentum transfer in atom optics,
or dynamical decoupling to invert arbitrary superposition states in quantum
memories. Comment: 11 pages, 5 figures
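
As textbook background for the abstract above (not a result of the paper): in the rotating-wave approximation, the three-level system driven by pump and Stokes pulses with Rabi frequencies Omega_P(t), Omega_S(t) and single-photon detuning Delta has the Hamiltonian below; population transfer follows the dark state when the Stokes pulse adiabatically precedes the pump. Composite STIRAP repeats such transfers with tailored relative phases between pulse pairs to cancel accumulated errors.

```latex
% Standard three-level STIRAP Hamiltonian (RWA) and its dark state.
H(t) = \frac{\hbar}{2}
\begin{pmatrix}
  0           & \Omega_P(t) & 0 \\
  \Omega_P(t) & 2\Delta     & \Omega_S(t) \\
  0           & \Omega_S(t) & 0
\end{pmatrix},
\qquad
\lvert d(t)\rangle = \cos\theta(t)\,\lvert 1\rangle - \sin\theta(t)\,\lvert 3\rangle,
\qquad
\tan\theta(t) = \frac{\Omega_P(t)}{\Omega_S(t)}.
```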
MetricGAN+/-: Increasing Robustness of Noise Reduction on Unseen Data
Training of speech enhancement systems often does not incorporate knowledge
of human perception and thus can lead to unnatural sounding results.
Incorporating psychoacoustically motivated speech perception metrics as part of
model training via a predictor network has recently gained interest. However,
the performance of such predictors is limited by the distribution of metric
scores that appear in the training data. In this work, we propose MetricGAN+/-
(an extension of MetricGAN+, one such metric-motivated system), which
introduces an additional "de-generator" network that attempts to improve the
robustness of the prediction network (and by extension of the generator) by
ensuring observation of a wider range of metric scores in training.
Experimental results on the VoiceBank-DEMAND dataset show relative improvement
in PESQ score of 3.8% (3.05 vs. 3.22), as well as better generalisation to
unseen noise and speech. Comment: 5 pages, 4 figures, Submitted to EUSIPCO 202
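
A hedged PyTorch sketch of the MetricGAN+/- idea as described above: a metric-prediction network is trained to match an oracle perceptual score, while the extra de-generator deliberately degrades speech so the predictor also observes low scores, widening the metric-score distribution seen in training. The shapes, losses, and the stand-in metric function are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

n_freq = 257                                       # assumed spectral dimension
gen = nn.Sequential(nn.Linear(n_freq, 256), nn.ReLU(),
                    nn.Linear(256, n_freq), nn.Sigmoid())    # masking generator
degen = nn.Sequential(nn.Linear(n_freq, 256), nn.ReLU(),
                      nn.Linear(256, n_freq), nn.Sigmoid())  # "de-generator"
metric_d = nn.Sequential(nn.Linear(2 * n_freq, 256), nn.ReLU(),
                         nn.Linear(256, 1), nn.Sigmoid())    # metric predictor

def true_metric(x, ref):
    """Stand-in for a normalised perceptual score (e.g. PESQ) of x against ref."""
    return 1.0 - (x - ref).abs().mean(dim=-1, keepdim=True).clamp(max=1.0)

noisy, clean = torch.rand(8, n_freq), torch.rand(8, n_freq)

# Predictor step: learn the metric on enhanced, deliberately degraded, and clean
# speech; the degraded samples widen the range of scores the predictor observes.
enhanced = (gen(noisy) * noisy).detach()
degraded = (degen(clean) * clean).detach()
d_loss = sum(
    nn.functional.mse_loss(metric_d(torch.cat([x, clean], -1)), true_metric(x, clean))
    for x in (enhanced, degraded, clean))

# Generator step: push the predicted metric of its output towards the maximum (1.0).
g_score = metric_d(torch.cat([gen(noisy) * noisy, clean], -1))
g_loss = nn.functional.mse_loss(g_score, torch.ones_like(g_score))

# De-generator step: push its output's predicted metric towards the minimum (0.0).
de_score = metric_d(torch.cat([degen(clean) * clean, clean], -1))
de_loss = nn.functional.mse_loss(de_score, torch.zeros_like(de_score))
```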