UPC multimodal speaker diarization system for the 2018 Albayzin challenge
This paper presents the UPC system proposed for the Multimodal Speaker Diarization task of the 2018 Albayzin Challenge. The approach processes the speech and image signals individually. In the speech domain, speaker diarization is performed using identity embeddings created by a triplet-loss DNN that takes i-vectors as input. The triplet DNN is trained with an additional regularization loss that minimizes the variance of both positive and negative distances. A sliding window is then used to compare speech segments with enrollment speaker targets using the cosine distance between embeddings. To detect identities from the face modality, a face detector followed by a face tracker is applied to the videos. For each cropped face, a feature vector is obtained using a deep neural network based on the ResNet-34 architecture, trained with a metric-learning triplet loss (available from the dlib library). For each track, the face feature vector is obtained by averaging the features of each frame in that track. This feature vector is then compared with the features extracted from the images of the enrollment identities. The proposed system is evaluated on the RTVE2018 database.
Peer Reviewed. Postprint (published version).
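The per-track face identification step described above, averaging per-frame embeddings and scoring the result against enrollment identities with cosine similarity, can be sketched as follows. This is a minimal illustration under stated assumptions; the function and variable names are hypothetical, not taken from the paper's code.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_track(frame_embeddings, enrollment):
    """Average the per-frame face embeddings over a track, then pick the
    enrollment identity with the highest cosine similarity.

    frame_embeddings: list of 1-D embedding arrays, one per frame of the track.
    enrollment: dict mapping identity name -> reference embedding array.
    Returns (best_identity, score_dict).
    """
    track_vec = np.mean(frame_embeddings, axis=0)  # one vector per track
    scores = {name: cosine_similarity(track_vec, ref)
              for name, ref in enrollment.items()}
    return max(scores, key=scores.get), scores
```

The same cosine-distance comparison applies in the speech domain, with sliding-window speech embeddings in place of the averaged face-track vector.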
Multimodal speaker diarization using oriented optical flow histograms
Speaker diarization is the task of partitioning an input stream into speaker-homogeneous regions, or in other words, determining "who spoke when." While approaches to this problem have traditionally relied entirely on the audio stream, the availability of accompanying video streams in recent diarization corpora has prompted the study of methods based on multimodal audio-visual features. In this work, we propose the use of robust video features based on oriented optical flow histograms. Using the state-of-the-art ICSI diarization system, we show that, when combined with standard audio features, these features improve the diarization error rate by 14% over an audio-only baseline.
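An orientation histogram of optical flow, the video feature named above, can be sketched roughly as follows: per-pixel flow vectors are binned by direction and weighted by magnitude. This is a minimal NumPy illustration under assumed conventions (bin count, normalization); the paper's actual feature extraction may differ.

```python
import numpy as np

def flow_orientation_histogram(flow, n_bins=8):
    """Histogram of optical-flow orientations, weighted by flow magnitude.

    flow: (H, W, 2) array of per-pixel (dx, dy) displacements.
    Returns an L1-normalized histogram of length n_bins.
    """
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(dx, dy)                       # flow magnitude per pixel
    ang = np.arctan2(dy, dx) % (2 * np.pi)       # direction in [0, 2*pi)
    bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins, weights=mag, minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist   # L1-normalize
```

A histogram like this, computed per video frame (or per region), yields a compact motion descriptor that can be concatenated with standard audio features for diarization.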
Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization
The scarcity of labeled audio-visual datasets is a constraint for training
superior audio-visual speaker diarization systems. To improve the performance
of audio-visual speaker diarization, we leverage pre-trained supervised and
self-supervised speech models for audio-visual speaker diarization.
Specifically, we adopt supervised (ResNet and ECAPA-TDNN) and self-supervised
pre-trained models (WavLM and HuBERT) as the speaker and audio embedding
extractors in an end-to-end audio-visual speaker diarization (AVSD) system.
Then we explore the effectiveness of different frameworks, including
Transformer, Conformer, and cross-attention mechanism, in the audio-visual
decoder. To mitigate the degradation of performance caused by separate
training, we jointly train the audio encoder, speaker encoder, and audio-visual
decoder in the AVSD system. Experiments on the MISP dataset demonstrate that
the proposed method achieves superior performance; it obtained third place in
the MISP Challenge 2022.
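The cross-attention mechanism explored in the audio-visual decoder can be illustrated with a minimal single-head sketch in which visual frames query the audio sequence. This is a simplified NumPy illustration under stated assumptions, not the authors' implementation, which uses learned projections, multiple heads, and trained encoders.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, audio):
    """Single-head scaled dot-product cross-attention.

    visual: (Tv, d) visual-frame features, used as queries.
    audio:  (Ta, d) audio-frame features, used as keys and values.
    Returns (Tv, d) audio features aligned to the visual frames.
    """
    d_k = visual.shape[-1]
    scores = visual @ audio.T / np.sqrt(d_k)   # (Tv, Ta) similarity
    weights = softmax(scores, axis=-1)         # attention over audio frames
    return weights @ audio                     # convex combination of audio
```

In a full AVSD decoder, the fused output would be passed to further layers that predict per-frame speaker activity; jointly training the encoders with this decoder is what the paper investigates.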
The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition
The Multi-modal Information based Speech Processing (MISP) challenge aims to
extend the application of signal processing technology in specific scenarios by
promoting the research into wake-up words, speaker diarization, speech
recognition, and other technologies. The MISP2022 challenge has two tracks: 1)
audio-visual speaker diarization (AVSD), aiming to solve "who spoke when"
using both audio and visual data; 2) a novel audio-visual diarization and
recognition (AVDR) task that focuses on addressing "who spoke what when"
using audio-visual speaker diarization results. Both tracks focus on the Chinese
language and use far-field audio and video from real home-TV scenarios: 2-6
people communicating with each other with TV noise in the background. This paper
introduces the dataset, track settings, and baselines of the MISP2022
challenge. Our analyses of experiments and examples indicate the good
performance of the AVDR baseline system, and the potential difficulties in this
challenge due to, e.g., the far-field video quality, the presence of TV noise
in the background, and the indistinguishable speakers.
Comment: 5 pages, 4 figures, to be published in ICASSP202
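Diarization tracks like AVSD are typically scored with the diarization error rate (DER). A simplified frame-level approximation can be sketched as follows; the helper below is hypothetical and illustrative only, since official DER scoring works on time segments with forgiveness collars and overlapped-speech handling.

```python
from itertools import permutations

def frame_der(ref, hyp):
    """Simplified frame-level diarization error rate.

    ref, hyp: equal-length sequences of speaker labels per frame,
    with None marking non-speech. Hypothesis speaker labels are
    anonymous, so we score under the best one-to-one mapping of
    hypothesis speakers onto reference speakers.
    """
    ref_spk = sorted({s for s in ref if s is not None})
    hyp_spk = sorted({s for s in hyp if s is not None})
    # pad with None so extra hypothesis speakers can map to "no one"
    candidates = ref_spk + [None] * max(0, len(hyp_spk) - len(ref_spk))
    best_errors = len(ref)
    for perm in permutations(candidates, len(hyp_spk)):
        mapping = dict(zip(hyp_spk, perm))
        errors = sum(1 for r, h in zip(ref, hyp)
                     if r != (mapping.get(h) if h is not None else None))
        best_errors = min(best_errors, errors)
    scored = sum(1 for r in ref if r is not None)  # reference speech frames
    return best_errors / scored if scored else 0.0
```

The exhaustive mapping search is exponential in the number of speakers; real scorers solve the same assignment with the Hungarian algorithm.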