Utterance-level Aggregation For Speaker Recognition In The Wild
The objective of this paper is speaker recognition "in the wild", where
utterances may be of variable length and may also contain irrelevant signals.
Crucial elements in the design of deep networks for this task are the type of
trunk (frame level) network, and the method of temporal aggregation. We propose
a powerful speaker recognition deep network, using a "thin-ResNet" trunk
architecture, and a dictionary-based NetVLAD or GhostVLAD layer to aggregate
features across time, that can be trained end-to-end. We show that our network
achieves state of the art performance by a significant margin on the VoxCeleb1
test set for speaker recognition, whilst requiring fewer parameters than
previous methods. We also investigate the effect of utterance length on
performance, and conclude that for "in the wild" data, a longer length is
beneficial.
Comment: To appear in: International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), 2019. (Oral Presentation)
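The dictionary-based aggregation the abstract describes can be illustrated with a minimal numpy sketch of NetVLAD-style pooling: frame-level features are soft-assigned to a set of cluster centers, residuals are accumulated per cluster, and the result is normalized into a fixed-length utterance embedding. This is not the paper's implementation; the centers and assignment weights below are random stand-ins for parameters that would be learned end-to-end, and GhostVLAD additionally adds "ghost" clusters whose accumulators are discarded so that irrelevant frames are absorbed.

```python
import numpy as np

def netvlad(features, centers, alpha=10.0):
    """NetVLAD-style aggregation (sketch): soft-assign T frame-level
    features (T x D) to K cluster centers (K x D), accumulate residuals
    per cluster, then normalize to a fixed K*D embedding."""
    # Soft assignment: softmax over per-frame cluster similarities.
    logits = alpha * features @ centers.T                # (T, K)
    logits -= logits.max(axis=1, keepdims=True)
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)          # (T, K)
    # vlad[k] = sum_t assign[t, k] * (features[t] - centers[k])
    vlad = assign.T @ features - assign.sum(axis=0)[:, None] * centers
    # Intra-normalize each cluster row, then L2-normalize the flat vector.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    v = vlad.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(0)
T, D, K = 50, 8, 4          # frames, feature dim, clusters (illustrative)
emb = netvlad(rng.normal(size=(T, D)), rng.normal(size=(K, D)))
print(emb.shape)            # fixed K*D length regardless of utterance length T
```

The key property for variable-length input is that the output dimension depends only on K and D, never on the number of frames T.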
Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances
Currently, the most widely used approach for speaker verification is deep
speaker embedding learning. In this approach, we obtain a speaker embedding
vector by pooling single-scale features that are extracted from the last layer
of a speaker feature extractor. Multi-scale aggregation (MSA), which utilizes
multi-scale features from different layers of the feature extractor, has
recently been introduced and shows superior performance for variable-duration
utterances. To increase robustness to utterances of arbitrary duration, this
paper improves MSA by using a feature pyramid module. The
module enhances speaker-discriminative information of features from multiple
layers via a top-down pathway and lateral connections. We extract speaker
embeddings using the enhanced features that contain rich speaker information
with different time scales. Experiments on the VoxCeleb dataset show that the
proposed module improves previous MSA methods with a smaller number of
parameters. It also achieves better performance than state-of-the-art
approaches for both short and long utterances.
Comment: Accepted to Interspeech 202
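The top-down pathway with lateral connections that the abstract describes can be sketched as follows, assuming frame-level feature maps at several time resolutions. This is a shape-level illustration only: the 1x1 lateral projections are random stand-ins for learned layers, and upsampling is nearest-neighbour along time.

```python
import numpy as np

def feature_pyramid(feats, out_dim=16, seed=0):
    """Feature-pyramid-style enhancement (sketch): project multi-scale
    feature maps (fine to coarse) to a common channel dim, then merge
    them coarse-to-fine via upsampling plus lateral addition."""
    rng = np.random.default_rng(seed)
    # Lateral projections to a common channel dimension (learned in practice).
    laterals = [f @ rng.normal(size=(f.shape[-1], out_dim)) * 0.1 for f in feats]
    enhanced = [laterals[-1]]               # start from the coarsest level
    for lat in reversed(laterals[:-1]):
        top = enhanced[0]
        # Nearest-neighbour upsample of the coarser map along time, then add.
        idx = np.linspace(0, top.shape[0] - 1, lat.shape[0]).round().astype(int)
        enhanced.insert(0, lat + top[idx])
    return enhanced

# Three levels of frame-level features with decreasing time resolution.
rng = np.random.default_rng(1)
feats = [rng.normal(size=(t, c)) for t, c in [(40, 8), (20, 16), (10, 32)]]
pyr = feature_pyramid(feats)
print([p.shape for p in pyr])   # every level now shares the channel dim
```

Each enhanced level keeps its own time resolution but mixes in coarser, more semantic context from above, which is what lets embeddings be pooled from multiple time scales.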
Combining Residual Networks with LSTMs for Lipreading
We propose an end-to-end deep learning architecture for word-level visual
speech recognition. The system is a combination of spatiotemporal
convolutional, residual and bidirectional Long Short-Term Memory networks. We
train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging
database with a 500-word target vocabulary consisting of 1.28-second video
excerpts from BBC TV broadcasts. The proposed network attains a word accuracy
of 83.0%, a 6.8% absolute improvement over the previous state of the art,
without using information about word boundaries during training or testing.
Comment: Submitted to Interspeech 201
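The pipeline the abstract describes (spatiotemporal convolution, per-frame residual backbone, bidirectional recurrence over time, word classifier) can be traced as a shape-flow sketch. Every operation below is a random linear or running-mean stand-in, not the paper's layers; only the tensor shapes mirror the architecture.

```python
import numpy as np

def lipreading_pipeline(video, n_classes=500, seed=0):
    """Shape-flow sketch: front-end features per frame, a per-frame
    backbone, a bidirectional summary over time (forward and backward
    running means standing in for a BiLSTM), and a word classifier."""
    rng = np.random.default_rng(seed)
    T, H, W = video.shape
    # Spatiotemporal front-end stand-in: one 256-d feature map per frame.
    frames = video.reshape(T, H * W) @ rng.normal(size=(H * W, 256)) * 0.01
    # Per-frame ResNet-style backbone stand-in with a ReLU nonlinearity.
    frames = np.maximum(frames @ rng.normal(size=(256, 256)) * 0.1, 0.0)
    # Bidirectional recurrence stand-in: prefix and suffix means, concatenated.
    fwd = np.cumsum(frames, axis=0) / np.arange(1, T + 1)[:, None]
    bwd = np.cumsum(frames[::-1], axis=0)[::-1] / np.arange(T, 0, -1)[:, None]
    seq = np.concatenate([fwd, bwd], axis=1)          # (T, 512)
    # Temporal pooling, then a linear classifier over the word vocabulary.
    logits = seq.mean(axis=0) @ rng.normal(size=(512, n_classes))
    return logits

clip = np.random.default_rng(2).normal(size=(29, 32, 32))  # T frames of H x W
logits = lipreading_pipeline(clip)
print(logits.shape)   # one score per word in the 500-word vocabulary
```

Because classification happens after pooling over all frames, no word-boundary annotation is needed at train or test time, matching the claim in the abstract.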