In this paper, we present a methodology for learning robust multimodal person representations optimized for open-set audio-visual speaker verification. Distance Metric Learning (DML) approaches have typically dominated this problem space, owing to their strong performance on new and unseen classes. In this work, we explore multitask learning techniques to further boost the performance of the DML approach, and we show that an auxiliary task with weak labels can increase the compactness of the learned speaker representation.
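As a minimal illustration of such a multitask setup, the PyTorch sketch below combines a primary metric-learning loss with an auxiliary weak-label classifier. The head design, the cross-entropy auxiliary task, and the weighting term `lambda_aux` are assumptions made for exposition, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultitaskHead(nn.Module):
    """Joint objective: a metric-learning loss on the embedding plus an
    auxiliary classifier trained on weak labels (illustrative sketch)."""

    def __init__(self, embed_dim: int, num_weak_classes: int,
                 lambda_aux: float = 0.1):
        super().__init__()
        self.aux_classifier = nn.Linear(embed_dim, num_weak_classes)
        self.lambda_aux = lambda_aux  # assumed auxiliary weight

    def forward(self, embeddings, dml_loss, weak_labels):
        # dml_loss: the primary metric-learning loss, already computed
        # on `embeddings` (e.g., a GE2E-style loss as sketched below).
        aux_logits = self.aux_classifier(embeddings)
        aux_loss = F.cross_entropy(aux_logits, weak_labels)
        # The weak-label task acts as a regularizer that encourages
        # more compact speaker clusters in the embedding space.
        return dml_loss + self.lambda_aux * aux_loss
```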
We also extend the Generalized End-to-End (GE2E) loss to multimodal inputs and demonstrate that it can achieve competitive performance in the audio-visual space.
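One way such a multimodal extension could look is sketched below, assuming fused audio-visual embeddings arranged as [speakers, utterances, dim]. The concatenation-plus-projection fusion and the tensor layout are illustrative assumptions; the exclusive-centroid similarity and softmax structure follow the original GE2E formulation (Wan et al., 2018).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalGE2ELoss(nn.Module):
    """GE2E loss applied to fused audio-visual embeddings (sketch).
    The fusion step here, concatenation followed by a linear projection,
    is an assumption; the paper's exact fusion may differ."""

    def __init__(self, audio_dim: int, visual_dim: int, embed_dim: int):
        super().__init__()
        self.fuse = nn.Linear(audio_dim + visual_dim, embed_dim)
        # Learnable scale and bias of the GE2E similarity.
        self.w = nn.Parameter(torch.tensor(10.0))
        self.b = nn.Parameter(torch.tensor(-5.0))

    def forward(self, audio_emb, visual_emb):
        # audio_emb, visual_emb: [N speakers, M utterances, D_a / D_v]
        N, M, _ = audio_emb.shape
        e = F.normalize(
            self.fuse(torch.cat([audio_emb, visual_emb], dim=-1)), dim=-1)

        centroids = F.normalize(e.mean(dim=1), dim=-1)  # [N, D]
        # Exclusive centroids: leave the current utterance out when
        # comparing an embedding against its own speaker's centroid.
        excl = F.normalize(
            (e.sum(dim=1, keepdim=True) - e) / (M - 1), dim=-1)  # [N, M, D]

        w = self.w.clamp(min=1e-6)  # keep the similarity scale positive
        sim = w * torch.einsum('nmd,kd->nmk', e, centroids) + self.b  # [N, M, N]
        own = w * (e * excl).sum(dim=-1) + self.b                     # [N, M]
        idx = torch.arange(N, device=e.device)
        sim[idx, :, idx] = own  # swap in the exclusive same-speaker terms

        labels = idx.unsqueeze(1).expand(N, M).reshape(-1)
        return F.cross_entropy(sim.reshape(N * M, N), labels)
```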
Finally, we introduce a random non-synchronous audio-visual sampling strategy at training time, which we show improves generalization.
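Under one plausible reading of this strategy, audio and face inputs are drawn from independently chosen time steps of the same identity rather than from time-aligned positions, as in the sketch below; the function name and the per-video list structure are hypothetical.

```python
import random

def sample_non_synchronous_pair(audio_segments, face_frames):
    """Draw an audio clip and a face frame for the same identity at
    independently sampled time steps (hypothetical helper; the
    per-video lists are an assumed data layout)."""
    a_idx = random.randrange(len(audio_segments))
    v_idx = random.randrange(len(face_frames))
    # With high probability a_idx != v_idx, so the pair is non-synchronous;
    # this discourages reliance on lip-sync cues and pushes the joint
    # embedding to capture identity rather than temporal alignment.
    return audio_segments[a_idx], face_frames[v_idx]
```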
Our network achieves state-of-the-art performance for speaker verification, reporting 0.244%, 0.252%, and 0.441% Equal Error Rate (EER) on the three official trial lists of VoxCeleb1-O/E/H, which are, to our knowledge, the best published results on VoxCeleb1-E and VoxCeleb1-H.