UPC multimodal speaker diarization system for the 2018 Albayzin challenge
This paper presents the UPC system proposed for the Multimodal Speaker Diarization task of the 2018 Albayzin Challenge. The approach processes the speech and image signals individually. In the speech domain, speaker diarization is performed using identity embeddings created by a triplet-loss DNN that takes i-vectors as input. The triplet DNN is trained with an additional regularization loss that minimizes the variance of both positive and negative distances. A sliding window is then used to compare speech segments with enrollment speaker targets using the cosine distance between embeddings. To detect identities from the face modality, a face detector followed by a face tracker is applied to the videos. For each cropped face, a feature vector is obtained using a deep neural network based on the ResNet-34 architecture, trained with a metric-learning triplet loss (available from the dlib library). For each track, the face feature vector is obtained by averaging the features of each frame in that track. This feature vector is then compared with the features extracted from the images of the enrollment identities. The proposed system is evaluated on the RTVE2018 database.
Peer Reviewed. Postprint (published version).
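The per-track face identification step described above, averaging per-frame embeddings and scoring the result against enrollment identities with cosine similarity, can be sketched as follows. This is a minimal illustration under stated assumptions; the function and variable names are hypothetical, not taken from the paper's code.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_track(frame_embeddings, enrollment):
    """Average the per-frame face embeddings over a track, then pick the
    enrollment identity with the highest cosine similarity.

    frame_embeddings: list of 1-D embedding arrays, one per frame of the track.
    enrollment: dict mapping identity name -> reference embedding array.
    Returns (best_identity, score_dict).
    """
    track_vec = np.mean(frame_embeddings, axis=0)  # one vector per track
    scores = {name: cosine_similarity(track_vec, ref)
              for name, ref in enrollment.items()}
    return max(scores, key=scores.get), scores
```

The same cosine-distance comparison applies in the speech domain, with sliding-window speech embeddings in place of the averaged face-track vector.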
Multimodal speaker diarization using oriented optical flow histograms
Speaker diarization is the task of partitioning an input stream into speaker-homogeneous regions, or in other words, determining "who spoke when." While approaches to this problem have traditionally relied entirely on the audio stream, the availability of accompanying video streams in recent diarization corpora has prompted the study of methods based on multimodal audio-visual features. In this work, we propose the use of robust video features based on oriented optical flow histograms. Using the state-of-the-art ICSI diarization system, we show that, when combined with standard audio features, these features improve the diarization error rate by 14% over an audio-only baseline.
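An orientation histogram of optical flow, the video feature named above, can be sketched roughly as follows: per-pixel flow vectors are binned by direction and weighted by magnitude. This is a minimal NumPy illustration under assumed conventions (bin count, normalization); the paper's actual feature extraction may differ.

```python
import numpy as np

def flow_orientation_histogram(flow, n_bins=8):
    """Histogram of optical-flow orientations, weighted by flow magnitude.

    flow: (H, W, 2) array of per-pixel (dx, dy) displacements.
    Returns an L1-normalized histogram of length n_bins.
    """
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(dx, dy)                       # flow magnitude per pixel
    ang = np.arctan2(dy, dx) % (2 * np.pi)       # direction in [0, 2*pi)
    bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins, weights=mag, minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist   # L1-normalize
```

A histogram like this, computed per video frame (or per region), yields a compact motion descriptor that can be concatenated with standard audio features for diarization.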
Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization
The scarcity of labeled audio-visual datasets is a constraint for training
superior audio-visual speaker diarization systems. To improve the performance
of audio-visual speaker diarization, we leverage pre-trained supervised and
self-supervised speech models for audio-visual speaker diarization.
Specifically, we adopt supervised (ResNet and ECAPA-TDNN) and self-supervised
pre-trained models (WavLM and HuBERT) as the speaker and audio embedding
extractors in an end-to-end audio-visual speaker diarization (AVSD) system.
Then we explore the effectiveness of different frameworks, including
Transformer, Conformer, and cross-attention mechanism, in the audio-visual
decoder. To mitigate the degradation of performance caused by separate
training, we jointly train the audio encoder, speaker encoder, and audio-visual
decoder in the AVSD system. Experiments on the MISP dataset demonstrate that
the proposed method achieves superior performance; it obtained third place in
the MISP Challenge 2022.
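The cross-attention mechanism explored in the audio-visual decoder can be illustrated with a minimal single-head sketch in which visual frames query the audio sequence. This is a simplified NumPy illustration under stated assumptions, not the authors' implementation, which uses learned projections, multiple heads, and trained encoders.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, audio):
    """Single-head scaled dot-product cross-attention.

    visual: (Tv, d) visual-frame features, used as queries.
    audio:  (Ta, d) audio-frame features, used as keys and values.
    Returns (Tv, d) audio features aligned to the visual frames.
    """
    d_k = visual.shape[-1]
    scores = visual @ audio.T / np.sqrt(d_k)   # (Tv, Ta) similarity
    weights = softmax(scores, axis=-1)         # attention over audio frames
    return weights @ audio                     # convex combination of audio
```

In a full AVSD decoder, the fused output would be passed to further layers that predict per-frame speaker activity; jointly training the encoders with this decoder is what the paper investigates.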
The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition
The Multi-modal Information based Speech Processing (MISP) challenge aims to
extend the application of signal processing technology in specific scenarios by
promoting the research into wake-up words, speaker diarization, speech
recognition, and other technologies. The MISP2022 challenge has two tracks: 1)
audio-visual speaker diarization (AVSD), aiming to solve "who spoke when"
using both audio and visual data; 2) a novel audio-visual diarization and
recognition (AVDR) task that focuses on addressing "who spoke what when"
using audio-visual speaker diarization results. Both tracks focus on the Chinese
language and use far-field audio and video from real home-TV scenarios: 2-6
people communicating with each other with TV noise in the background. This paper
introduces the dataset, track settings, and baselines of the MISP2022
challenge. Our analyses of experiments and examples indicate the good
performance of the AVDR baseline system, and the potential difficulties in this
challenge due to, e.g., the far-field video quality, the presence of TV noise
in the background, and the indistinguishable speakers.
Comment: 5 pages, 4 figures, to be published in ICASSP202
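Diarization tracks like AVSD are typically scored with the diarization error rate (DER). A simplified frame-level approximation can be sketched as follows; the helper below is hypothetical and illustrative only, since official DER scoring works on time segments with forgiveness collars and overlapped-speech handling.

```python
from itertools import permutations

def frame_der(ref, hyp):
    """Simplified frame-level diarization error rate.

    ref, hyp: equal-length sequences of speaker labels per frame,
    with None marking non-speech. Hypothesis speaker labels are
    anonymous, so we score under the best one-to-one mapping of
    hypothesis speakers onto reference speakers.
    """
    ref_spk = sorted({s for s in ref if s is not None})
    hyp_spk = sorted({s for s in hyp if s is not None})
    # pad with None so extra hypothesis speakers can map to "no one"
    candidates = ref_spk + [None] * max(0, len(hyp_spk) - len(ref_spk))
    best_errors = len(ref)
    for perm in permutations(candidates, len(hyp_spk)):
        mapping = dict(zip(hyp_spk, perm))
        errors = sum(1 for r, h in zip(ref, hyp)
                     if r != (mapping.get(h) if h is not None else None))
        best_errors = min(best_errors, errors)
    scored = sum(1 for r in ref if r is not None)  # reference speech frames
    return best_errors / scored if scored else 0.0
```

The exhaustive mapping search is exponential in the number of speakers; real scorers solve the same assignment with the Hungarian algorithm.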