Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances
Currently, the most widely used approach to speaker verification is deep
speaker embedding learning. In this approach, we obtain a speaker embedding
vector by pooling single-scale features that are extracted from the last layer
of a speaker feature extractor. Multi-scale aggregation (MSA), which utilizes
multi-scale features from different layers of the feature extractor, has
recently been introduced and shows superior performance for variable-duration
utterances. To increase robustness to utterances of arbitrary duration, this
paper improves MSA by using a feature pyramid module. The
module enhances speaker-discriminative information of features from multiple
layers via a top-down pathway and lateral connections. We extract speaker
embeddings using the enhanced features that contain rich speaker information
with different time scales. Experiments on the VoxCeleb dataset show that the
proposed module improves on previous MSA methods while using fewer
parameters, and it achieves better performance than state-of-the-art
approaches for both short and long utterances.
Comment: Accepted to Interspeech 2020
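The top-down enhancement this abstract describes can be sketched compactly. Below is a minimal, illustrative PyTorch sketch of feature-pyramid-style multi-scale aggregation; the layer count, channel sizes, mean pooling, and the name FeaturePyramidMSA are assumptions made for illustration, not the authors' exact architecture.

```python
# Illustrative sketch: lateral 1x1 convolutions plus a top-down pathway
# enhance multi-scale features before aggregation. Shapes and names are
# assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramidMSA(nn.Module):
    def __init__(self, in_channels=(64, 128, 256), pyramid_dim=128, emb_dim=256):
        super().__init__()
        # 1x1 lateral convolutions project each layer's features to a common
        # channel dimension so the maps can be merged top-down.
        self.laterals = nn.ModuleList(
            [nn.Conv1d(c, pyramid_dim, kernel_size=1) for c in in_channels]
        )
        self.embed = nn.Linear(pyramid_dim * len(in_channels), emb_dim)

    def forward(self, feats):
        # feats: list of (batch, channels, time) tensors from shallow to deep,
        # with time resolution shrinking at deeper layers.
        maps = [l(f) for l, f in zip(self.laterals, feats)]
        # Top-down pathway: upsample each deeper map and add it to the lateral
        # connection below it, enriching shallow features with deep information.
        for i in range(len(maps) - 1, 0, -1):
            up = F.interpolate(maps[i], size=maps[i - 1].shape[-1], mode="nearest")
            maps[i - 1] = maps[i - 1] + up
        # Aggregate each enhanced map over time (mean pooling here, for
        # simplicity) and concatenate into one utterance-level vector.
        pooled = torch.cat([m.mean(dim=-1) for m in maps], dim=-1)
        return self.embed(pooled)

feats = [torch.randn(8, 64, 200), torch.randn(8, 128, 100), torch.randn(8, 256, 50)]
emb = FeaturePyramidMSA()(feats)   # (8, 256) speaker embeddings
```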
Self-Adaptive Soft Voice Activity Detection using Deep Neural Networks for Robust Speaker Verification
Voice activity detection (VAD), which classifies frames as speech or
non-speech, is an important module in many speech applications including
speaker verification. In this paper, we propose a novel method, called
self-adaptive soft VAD, to incorporate a deep neural network (DNN)-based VAD
into a deep speaker embedding system. The proposed method is a combination of
the following two approaches. The first approach is soft VAD, which performs a
soft selection of frame-level features extracted from a speaker feature
extractor. The frame-level features are weighted by their corresponding speech
posteriors estimated from the DNN-based VAD, and then aggregated to generate a
speaker embedding. The second approach is self-adaptive VAD, which fine-tunes
the pre-trained VAD on the speaker verification data to reduce the domain
mismatch. Here, we introduce two unsupervised domain adaptation (DA) schemes,
namely speech posterior-based DA (SP-DA) and joint learning-based DA (JL-DA).
Experiments on a Korean speech database demonstrate that the verification
performance is improved significantly in real-world environments by using
self-adaptive soft VAD.
Comment: Accepted at 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019)
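The soft-selection step described above amounts to posterior-weighted pooling of frame-level features. Here is a minimal sketch of that pooling step alone, assuming the DNN-based VAD and the speaker feature extractor already exist; the function name soft_vad_pooling and the tensor shapes are illustrative.

```python
# Illustrative sketch of soft VAD pooling: frame-level features are weighted
# by per-frame speech posteriors and averaged into one embedding, so no hard
# speech/non-speech decision is ever made.
import torch

def soft_vad_pooling(frame_feats, speech_posteriors, eps=1e-8):
    """frame_feats: (batch, time, dim) features from the speaker extractor.
    speech_posteriors: (batch, time) P(speech | frame) from the DNN VAD."""
    w = speech_posteriors.unsqueeze(-1)            # (batch, time, 1)
    # Posterior-weighted average: frames with low speech posterior
    # contribute little to the resulting speaker embedding.
    return (w * frame_feats).sum(dim=1) / (w.sum(dim=1) + eps)

feats = torch.randn(4, 300, 512)      # 4 utterances, 300 frames, 512-dim
post = torch.rand(4, 300)             # per-frame speech posteriors in [0, 1]
embedding = soft_vad_pooling(feats, post)   # (4, 512)
```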
Additional Shared Decoder on Siamese Multi-view Encoders for Learning Acoustic Word Embeddings
Acoustic word embeddings --- fixed-dimensional vector representations of
arbitrary-length words --- have attracted increasing interest in
query-by-example spoken term detection. Recently, based on the fact that the
orthography of text labels partly reflects the phonetic similarity between
words' pronunciations, a multi-view approach has been introduced that jointly
learns acoustic and text embeddings. It showed that discriminative embeddings
can be learned by designing an objective that takes both text labels and word
segments as input. In this paper, we propose a network architecture that
expands the multi-view approach by combining the Siamese multi-view encoders
with a shared decoder network to maximize the effect of the relationship
between acoustic and text embeddings in embedding space. Discriminatively
trained with multi-view triplet loss and decoding loss, our proposed approach
achieves better performance on the acoustic word discrimination task with the
WSJ dataset, yielding an 11.1% relative improvement in average precision. We also
present experimental results on cross-view word discrimination and word-level
speech recognition tasks.
Comment: Accepted at 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019)
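A cross-view triplet objective of the kind described here can be sketched as follows; the margin value, the use of cosine similarity, and the within-batch negative sampling are illustrative assumptions rather than the paper's exact training setup, and the decoding loss is omitted.

```python
# Illustrative sketch of a multi-view (cross-view) triplet loss between
# acoustic and text embeddings, assuming both encoders map into a shared
# embedding space. Margin and similarity choices are assumptions.
import torch
import torch.nn.functional as F

def multiview_triplet_loss(acoustic, text, text_neg, margin=0.4):
    """acoustic: (batch, dim) embeddings of spoken word segments.
    text:     (batch, dim) embeddings of the matching text labels.
    text_neg: (batch, dim) embeddings of mismatched text labels."""
    pos = F.cosine_similarity(acoustic, text, dim=-1)
    neg = F.cosine_similarity(acoustic, text_neg, dim=-1)
    # Push each acoustic embedding closer to its own label than to a
    # mismatched label by at least the margin.
    return F.relu(margin + neg - pos).mean()

a = F.normalize(torch.randn(16, 256), dim=-1)
t = F.normalize(torch.randn(16, 256), dim=-1)
tn = t.roll(1, dims=0)                # simple within-batch negatives
loss = multiview_triplet_loss(a, t, tn)
```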