Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances
Currently, the most widely used approach to speaker verification is deep
speaker embedding learning. In this approach, we obtain a speaker embedding
vector by pooling single-scale features that are extracted from the last layer
of a speaker feature extractor. Multi-scale aggregation (MSA), which utilizes
multi-scale features from different layers of the feature extractor, has
recently been introduced and shows superior performance for variable-duration
utterances. To increase robustness to utterances of arbitrary duration, this
paper improves MSA by using a feature pyramid module. The
module enhances speaker-discriminative information of features from multiple
layers via a top-down pathway and lateral connections. We extract speaker
embeddings using the enhanced features that contain rich speaker information
with different time scales. Experiments on the VoxCeleb dataset show that the
proposed module improves on previous MSA methods while using fewer
parameters, and it achieves better performance than state-of-the-art
approaches for both short and long utterances.
Comment: Accepted to Interspeech 2020
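The top-down enhancement this abstract describes can be sketched compactly. Below is a minimal, illustrative PyTorch sketch of feature-pyramid-style multi-scale aggregation; the layer count, channel sizes, mean pooling, and the name FeaturePyramidMSA are assumptions made for illustration, not the authors' exact architecture.

```python
# Illustrative sketch: lateral 1x1 convolutions plus a top-down pathway
# enhance multi-scale features before aggregation. Shapes and names are
# assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramidMSA(nn.Module):
    def __init__(self, in_channels=(64, 128, 256), pyramid_dim=128, emb_dim=256):
        super().__init__()
        # 1x1 lateral convolutions project each layer's features to a common
        # channel dimension so the maps can be merged top-down.
        self.laterals = nn.ModuleList(
            [nn.Conv1d(c, pyramid_dim, kernel_size=1) for c in in_channels]
        )
        self.embed = nn.Linear(pyramid_dim * len(in_channels), emb_dim)

    def forward(self, feats):
        # feats: list of (batch, channels, time) tensors from shallow to deep,
        # with time resolution shrinking at deeper layers.
        maps = [l(f) for l, f in zip(self.laterals, feats)]
        # Top-down pathway: upsample each deeper map and add it to the lateral
        # connection below it, enriching shallow features with deep information.
        for i in range(len(maps) - 1, 0, -1):
            up = F.interpolate(maps[i], size=maps[i - 1].shape[-1], mode="nearest")
            maps[i - 1] = maps[i - 1] + up
        # Aggregate each enhanced map over time (mean pooling here, for
        # simplicity) and concatenate into one utterance-level vector.
        pooled = torch.cat([m.mean(dim=-1) for m in maps], dim=-1)
        return self.embed(pooled)

feats = [torch.randn(8, 64, 200), torch.randn(8, 128, 100), torch.randn(8, 256, 50)]
emb = FeaturePyramidMSA()(feats)   # (8, 256) speaker embeddings
```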
Self-Adaptive Soft Voice Activity Detection using Deep Neural Networks for Robust Speaker Verification
Voice activity detection (VAD), which classifies frames as speech or
non-speech, is an important module in many speech applications including
speaker verification. In this paper, we propose a novel method, called
self-adaptive soft VAD, to incorporate a deep neural network (DNN)-based VAD
into a deep speaker embedding system. The proposed method is a combination of
the following two approaches. The first approach is soft VAD, which performs a
soft selection of frame-level features extracted from a speaker feature
extractor. The frame-level features are weighted by their corresponding speech
posteriors estimated from the DNN-based VAD, and then aggregated to generate a
speaker embedding. The second approach is self-adaptive VAD, which fine-tunes
the pre-trained VAD on the speaker verification data to reduce the domain
mismatch. Here, we introduce two unsupervised domain adaptation (DA) schemes,
namely speech posterior-based DA (SP-DA) and joint learning-based DA (JL-DA).
Experiments on a Korean speech database demonstrate that the verification
performance is improved significantly in real-world environments by using
self-adaptive soft VAD.
Comment: Accepted at 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019)
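The soft-selection step described above amounts to posterior-weighted pooling of frame-level features. Here is a minimal sketch of that pooling step alone, assuming the DNN-based VAD and the speaker feature extractor already exist; the function name soft_vad_pooling and the tensor shapes are illustrative.

```python
# Illustrative sketch of soft VAD pooling: frame-level features are weighted
# by per-frame speech posteriors and averaged into one embedding, so no hard
# speech/non-speech decision is ever made.
import torch

def soft_vad_pooling(frame_feats, speech_posteriors, eps=1e-8):
    """frame_feats: (batch, time, dim) features from the speaker extractor.
    speech_posteriors: (batch, time) P(speech | frame) from the DNN VAD."""
    w = speech_posteriors.unsqueeze(-1)            # (batch, time, 1)
    # Posterior-weighted average: frames with low speech posterior
    # contribute little to the resulting speaker embedding.
    return (w * frame_feats).sum(dim=1) / (w.sum(dim=1) + eps)

feats = torch.randn(4, 300, 512)      # 4 utterances, 300 frames, 512-dim
post = torch.rand(4, 300)             # per-frame speech posteriors in [0, 1]
embedding = soft_vad_pooling(feats, post)   # (4, 512)
```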
Additional Shared Decoder on Siamese Multi-view Encoders for Learning Acoustic Word Embeddings
Acoustic word embeddings --- fixed-dimensional vector representations of
arbitrary-length words --- have attracted increasing interest in
query-by-example spoken term detection. Recently, based on the fact that the
orthography of text labels partly reflects the phonetic similarity between
words' pronunciations, a multi-view approach has been introduced that jointly
learns acoustic and text embeddings. It showed that discriminative embeddings
can be learned by designing an objective that takes both text labels and word
segments as input. In this paper, we propose a network architecture that
expands the multi-view approach by combining the Siamese multi-view encoders
with a shared decoder network to maximize the effect of the relationship
between acoustic and text embeddings in embedding space. Discriminatively
trained with multi-view triplet loss and decoding loss, our proposed approach
achieves better performance on the acoustic word discrimination task with the
WSJ dataset, yielding an 11.1% relative improvement in average precision. We also
present experimental results on cross-view word discrimination and word-level
speech recognition tasks.
Comment: Accepted at 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019)
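A cross-view triplet objective of the kind described here can be sketched as follows; the margin value, the use of cosine similarity, and the within-batch negative sampling are illustrative assumptions rather than the paper's exact training setup, and the decoding loss is omitted.

```python
# Illustrative sketch of a multi-view (cross-view) triplet loss between
# acoustic and text embeddings, assuming both encoders map into a shared
# embedding space. Margin and similarity choices are assumptions.
import torch
import torch.nn.functional as F

def multiview_triplet_loss(acoustic, text, text_neg, margin=0.4):
    """acoustic: (batch, dim) embeddings of spoken word segments.
    text:     (batch, dim) embeddings of the matching text labels.
    text_neg: (batch, dim) embeddings of mismatched text labels."""
    pos = F.cosine_similarity(acoustic, text, dim=-1)
    neg = F.cosine_similarity(acoustic, text_neg, dim=-1)
    # Push each acoustic embedding closer to its own label than to a
    # mismatched label by at least the margin.
    return F.relu(margin + neg - pos).mean()

a = F.normalize(torch.randn(16, 256), dim=-1)
t = F.normalize(torch.randn(16, 256), dim=-1)
tn = t.roll(1, dims=0)                # simple within-batch negatives
loss = multiview_triplet_loss(a, t, tn)
```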