Audio-Visual Speaker Verification via Joint Cross-Attention
Speaker verification has been widely explored using speech signals and has shown significant improvement with deep models. Recently, there has been a surge of interest in combining faces and voices, as they can offer more complementary and comprehensive information than speech signals alone. Though current methods for fusing faces and voices have shown improvement over the individual face or voice modalities, the potential of audio-visual fusion is not fully explored for speaker verification. Most existing methods based on audio-visual fusion rely either on score-level fusion or on simple feature concatenation. In this work, we explore cross-modal joint attention to fully leverage both the inter-modal complementary information and the intra-modal information for speaker verification. Specifically, we estimate the cross-attention weights based on the correlation between the joint feature representation and the individual feature representations, in order to effectively capture both intra-modal and inter-modal relationships among the faces and voices. We show that efficiently leveraging the intra- and inter-modal relationships significantly improves the performance of audio-visual fusion for speaker verification. The performance of the proposed approach is evaluated on the VoxCeleb1 dataset. Results show that the proposed approach significantly outperforms state-of-the-art audio-visual fusion methods for speaker verification.
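As a rough illustration of the idea, the sketch below computes attention weights from the correlation between a joint audio-visual representation and each individual modality; the module name, dimensions, and residual connections are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of joint cross-attention for audio-visual fusion.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointCrossAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.joint_proj = nn.Linear(2 * dim, dim)  # joint representation
        self.scale = dim ** -0.5

    def forward(self, audio, visual):
        # audio, visual: (batch, seq_len, dim)
        joint = self.joint_proj(torch.cat([audio, visual], dim=-1))
        # correlation between the joint representation and each modality
        attn_a = F.softmax(audio @ joint.transpose(1, 2) * self.scale, dim=-1)
        attn_v = F.softmax(visual @ joint.transpose(1, 2) * self.scale, dim=-1)
        # re-weight each modality with joint-conditioned attention;
        # the residual keeps intra-modal information
        audio_att = attn_a @ joint + audio
        visual_att = attn_v @ joint + visual
        # pool over time and concatenate into a speaker embedding
        return torch.cat([audio_att.mean(1), visual_att.mean(1)], dim=-1)
```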
Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification
In this paper, we present a methodology for achieving robust multimodal person representations optimized for open-set audio-visual speaker verification. Distance Metric Learning (DML) approaches have typically dominated this problem space, owing to strong performance on new and unseen classes. In our work, we explore multitask learning techniques to further boost the performance of the DML approach and show that an auxiliary task with weak labels can increase the compactness of the learned speaker representation. We also extend the Generalized End-to-End loss (GE2E) to multimodal inputs and demonstrate that it can achieve competitive performance in an audio-visual space. Finally, we introduce a random non-synchronous audio-visual sampling strategy during training that has been shown to improve generalization. Our network achieves state-of-the-art performance for speaker verification, reporting 0.244%, 0.252%, and 0.441% Equal Error Rate (EER) on the three official trial lists of VoxCeleb1-O/E/H, which are, to our knowledge, the best published results on VoxCeleb1-E and VoxCeleb1-H.
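The following is a minimal sketch of how a GE2E-style loss might be applied to fused audio-visual embeddings; the batch layout (speakers x clips), the scale/bias initialization, and the shared-centroid simplification are assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of a GE2E-style loss over multimodal embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalGE2ELoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(10.0))  # learnable scale
        self.b = nn.Parameter(torch.tensor(-5.0))  # learnable bias

    def forward(self, emb):
        # emb: (num_speakers N, num_clips M, dim), fused audio-visual embeddings
        N, M, _ = emb.shape
        emb = F.normalize(emb, dim=-1)
        # speaker centroids (the original GE2E excludes each clip from its
        # own centroid; that refinement is omitted here for brevity)
        centroids = F.normalize(emb.mean(dim=1), dim=-1)  # (N, dim)
        # scaled cosine similarity of every clip to every speaker centroid
        sim = self.w * emb.reshape(N * M, -1) @ centroids.t() + self.b
        labels = torch.arange(N).repeat_interleave(M)
        return F.cross_entropy(sim, labels)
```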
CN-Celeb-AV: A Multi-Genre Audio-Visual Dataset for Person Recognition
Audio-visual person recognition (AVPR) has received extensive attention.
However, most datasets used for AVPR research so far were collected in constrained environments and thus cannot reflect the true performance of AVPR systems in real-world scenarios. To meet the demand for research on AVPR in unconstrained conditions, this paper presents a multi-genre AVPR dataset collected 'in the wild', named CN-Celeb-AV. This dataset contains more than 419k video segments from 1,136 persons drawn from public media. In particular, we put emphasis on two real-world complexities: (1) data in multiple genres, and (2) segments with partial information. A comprehensive study was conducted to compare CN-Celeb-AV with two popular public AVPR benchmark datasets, and the results demonstrate that CN-Celeb-AV is more in line with real-world scenarios and can be regarded as a new benchmark dataset for AVPR research. The dataset also includes a development set that can be used to boost the performance of AVPR systems in real-life situations. The dataset is free for researchers and can be downloaded from http://cnceleb.org/. Comment: INTERSPEECH 202
Evaluation and analysis of hybrid intelligent pattern recognition techniques for speaker identification
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The rapid momentum of technological progress in recent years has led to a tremendous rise in the use of biometric authentication systems. The objective of this research is to investigate the problem of identifying a speaker from their voice regardless of the content (i.e. text-independent), and to design efficient methods of combining face and voice to produce a robust authentication system.
A novel approach to speaker identification is developed using wavelet analysis and multiple neural networks, including the Probabilistic Neural Network (PNN), General Regression Neural Network (GRNN), and Radial Basis Function Neural Network (RBF NN) with an AND voting scheme. This approach is tested on the GRID and VidTIMIT corpora, and comprehensive test results have been validated against state-of-the-art approaches. The system was found to be competitive: it improved the recognition rate by 15% compared to classical Mel-Frequency Cepstral Coefficients (MFCC), and reduced the recognition time by 40% compared to the Back Propagation Neural Network (BPNN), Gaussian Mixture Models (GMM), and Principal Component Analysis (PCA).
Another novel approach, using vowel formant analysis, is implemented with Linear Discriminant Analysis (LDA). Vowel-formant-based speaker identification is well suited to real-time implementation and requires only a few bytes of information to be stored for each speaker, making it both storage- and time-efficient. Tested on GRID and VidTIMIT, the proposed scheme was found to be 85.05% accurate when Linear Predictive Coding (LPC) is used to extract the vowel formants, which is much higher than the accuracy of BPNN and GMM. Since the proposed scheme does not require any training time other than creating a small database of vowel formants, it is faster as well. Furthermore, an increasing number of speakers makes it difficult for BPNN and GMM to sustain their accuracy, but the proposed score-based methodology stays almost linear.
Finally, a novel audio-visual fusion-based identification system is implemented using GMM and MFCC for speaker identification and PCA for face recognition. The results of speaker identification and face recognition are fused at different levels, namely the feature, score, and decision levels. Both the score-level and decision-level (with OR voting) fusions were shown to outperform feature-level fusion in terms of accuracy and error resilience, in line with the distinct nature of the two modalities, whose individual characteristics are lost when combined at the feature level. The GRID and VidTIMIT test results validate that the proposed scheme is one of the best candidates for the fusion of face and voice due to its low computational time and high recognition accuracy.
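To make the fusion levels concrete, here is a small illustrative sketch of score-level fusion and decision-level OR voting over per-identity match scores; the weighting and threshold values are assumptions, not the thesis's tuned parameters.

```python
# Illustrative sketch of score-level and decision-level (OR) fusion.
import numpy as np

def score_level_fusion(voice_score, face_score, alpha=0.5):
    """Weighted sum of normalized per-identity scores from each modality."""
    return alpha * voice_score + (1.0 - alpha) * face_score

def decision_level_fusion_or(voice_scores, face_scores, threshold=0.5):
    """OR voting: accept an identity if either modality accepts it."""
    voice_decision = np.asarray(voice_scores) >= threshold
    face_decision = np.asarray(face_scores) >= threshold
    return voice_decision | face_decision

# Example: per-identity match scores in [0, 1] from a GMM/MFCC voice
# classifier and a PCA face classifier (values are made up).
voice = np.array([0.82, 0.31, 0.47])
face = np.array([0.40, 0.75, 0.52])
print(score_level_fusion(voice, face))        # fused scores per identity
print(decision_level_fusion_or(voice, face))  # boolean accept decisions
```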
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
In the context of Audio-Visual Question Answering (AVQA) tasks, the audio and visual modalities can be learnt at three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings: the audio-visual (AV) information passing through the network is not aligned at the Spatial and Temporal levels, and inter-modal (audio and visual) Semantic information is often not balanced within a context; this results in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment (CAD) network that addresses these challenges by i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment at the Spatial level; ii) proposing a pre-training technique for dynamic audio and visual alignment at the Temporal level in a self-supervised setting; and iii) introducing a cross-attention mechanism to balance audio and visual information at the Semantic level. The proposed CAD network improves overall performance over state-of-the-art methods by 9.4% on average on the MUSIC-AVQA dataset. We also demonstrate that our proposed contributions to AVQA can be added to existing methods to improve their performance without additional complexity requirements. Comment: Accepted to the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 202
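As a hedged sketch of the Semantic-level idea, the snippet below balances the two modalities with bidirectional cross-attention; it is an assumption-laden illustration, not the CAD implementation.

```python
# Sketch of balancing audio and visual features via cross-attention.
# Dimensions, head count, and residuals are illustrative assumptions.
import torch.nn as nn

class SemanticBalancer(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # audio attends to visual context, and vice versa
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):
        # audio, visual: (batch, seq_len, dim)
        audio_ctx, _ = self.a2v(query=audio, key=visual, value=visual)
        visual_ctx, _ = self.v2a(query=visual, key=audio, value=audio)
        # residual connections keep each modality's own semantics
        return audio + audio_ctx, visual + visual_ctx
```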
Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser
Audio-visual learning has been a major pillar of multi-modal machine learning, where the community has mostly focused on the modality-aligned setting, i.e., the audio and visual modalities are both assumed to signal the prediction target. With the Look, Listen, and Parse (LLP) dataset, we investigate the under-explored unaligned setting, where the goal is to recognize audio and visual events in a video with only weak labels observed. Such weak video-level labels only tell what events happen, without indicating the modality in which they are perceived (audio, visual, or both). To enhance learning in this challenging setting, we incorporate large-scale contrastively pre-trained models as modality teachers. A simple, effective, and generic method, termed Visual-Audio Label Elaboration (VALOR), is introduced to harvest modality labels for the training events. Empirical studies show that the harvested labels significantly improve an attentional baseline by 8.0 in average F-score (Type@AV). Surprisingly, we find that modality-independent teachers outperform their modality-fused counterparts, since they are insulated from noise in the other, potentially unaligned, modality. Moreover, our best model achieves a new state-of-the-art on all metrics of LLP by a substantial margin (+5.4 F-score for Type@AV). VALOR further generalizes to Audio-Visual Event Localization and achieves a new state-of-the-art there as well. Code is available at: https://github.com/Franklin905/VALOR.
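A plausible, simplified reading of the label-harvesting step is sketched below: per-segment features from modality-independent teachers are compared against the text embedding of a weak video-level event label, and a similarity threshold decides where the event is seen or heard. The function, threshold, and embedding shapes are all illustrative assumptions; see the released code for the actual method.

```python
# Hedged sketch of harvesting modality labels with modality-independent
# teachers. Thresholds and the CLIP/CLAP-style embeddings are assumptions.
import torch.nn.functional as F

def harvest_modality_labels(event_text_emb, visual_emb, audio_emb, thresh=0.2):
    """
    event_text_emb: (dim,)   text embedding of a video-level event label
    visual_emb:     (T, dim) per-segment features from a visual teacher
    audio_emb:      (T, dim) per-segment features from an audio teacher
    Returns per-segment boolean labels saying where the event is seen/heard.
    """
    e = F.normalize(event_text_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    visual_label = (v @ e) > thresh  # is the event visible in segment t?
    audio_label = (a @ e) > thresh   # is the event audible in segment t?
    return visual_label, audio_label
```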