363 research outputs found
Improving speaker turn embedding by crossmodal transfer learning from face embedding
Learning speaker turn embeddings has shown considerable improvement in
situations where conventional speaker modeling approaches fail. However, this
improvement is relatively limited compared to the gains observed in face
embedding learning, which has proven very successful for face verification
and clustering tasks. Assuming that faces and voices of the same identities
share some latent properties (such as age, gender, and ethnicity), we propose
three transfer learning approaches that leverage knowledge from the face
domain (learned from thousands of images and identities) for tasks in the
speaker domain. These approaches, namely target embedding transfer, relative
distance transfer, and clustering structure transfer, use the structure of the
source face embedding space at different granularities as regularization terms
on the target speaker turn embedding space during optimization. Our methods
are evaluated on two public broadcast corpora and yield promising advances
over competitive baselines in verification and audio clustering tasks,
especially when dealing with short speaker utterances. The analysis of the
results also gives insight into characteristics of the embedding spaces and
shows their potential applications.
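The abstract names the three transfer terms but gives no formulas. As a rough illustration, here is a minimal PyTorch sketch of what the first two regularizers could look like; the function names and the use of MSE and Euclidean distances are assumptions for illustration, not the paper's exact losses:

import torch
import torch.nn.functional as F

def target_embedding_transfer(speaker_emb, face_emb):
    # Assumed form: pull each speaker turn embedding toward the fixed
    # face embedding of the same identity, so the speaker space inherits
    # the source space's layout directly.
    return F.mse_loss(speaker_emb, face_emb.detach())

def relative_distance_transfer(speaker_emb, face_emb):
    # Assumed form: match pairwise distances rather than absolute
    # positions, transferring only the relational structure of the
    # face embedding space at a coarser granularity.
    d_speaker = torch.cdist(speaker_emb, speaker_emb)
    d_face = torch.cdist(face_emb, face_emb).detach()
    return F.mse_loss(d_speaker, d_face)

Either term would presumably be added, with a weighting coefficient, to the main loss used to train the speaker turn embedding.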
Processing and Linking Audio Events in Large Multimedia Archives: The EU inEvent Project
In the inEvent EU project [1], we aim at structuring, retrieving, and sharing large archives of networked, dynamically changing multimedia recordings, mainly consisting of meetings, videoconferences, and lectures. More specifically, we are developing an integrated system that performs audiovisual processing of multimedia recordings and labels them in terms of interconnected "hyper-events" (a notion inspired by hyper-texts). Each hyper-event is composed of simpler facets, including audio-video recordings and metadata, which are then easier to search, retrieve, and share. In the present paper, we mainly cover the audio processing aspects of the system, including speech recognition, speaker diarization and linking (across recordings), the use of these features for hyper-event indexing and recommendation, and the search portal. We present initial results for feature extraction from lecture recordings using the TED talks. Index Terms: networked multimedia events; audio processing; speech recognition; speaker diarization and linking; multimedia indexing and searching; hyper-events.
Unsupervised Speaker Identification in TV Broadcast Based on Written Names
Identifying speakers in TV broadcast in an unsupervised way (i.e., without biometric models) is a solution for avoiding costly annotations. Existing methods usually use pronounced names as a source of names for identifying speech clusters provided by a diarization step, but this source is too imprecise to provide sufficient confidence. To overcome this issue, another source of names can be used: the names written in a title block in the image track. We first compared these two sources of names on their ability to provide the name of the speakers in TV broadcast. This study shows that written names, thanks to their high precision, are better suited for identifying the current speaker. We also propose two approaches for finding speaker identity based only on names written in the image track. With the "late naming" approach, we propose different propagations of written names onto clusters. Our second proposition, "early naming", modifies the speaker diarization module (agglomerative clustering) by adding constraints that prevent two clusters with different associated written names from being merged. These methods were tested on the REPERE corpus phase 1, containing 3 hours of annotated videos. Our best "late naming" system reaches an F-measure of 73.1%. "Early naming" improves over this result both in terms of identification error rate and stability of the clustering stopping criterion. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models trained on matching development data and additional TV and radio data only provided a 57.2% F-measure.
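To make the "early naming" constraint concrete, here is a minimal, self-contained sketch of single-linkage agglomerative clustering in which clusters carrying different written names can never merge; the data structures, linkage choice, and threshold are illustrative assumptions, not the paper's implementation:

import itertools

def constrained_agglomerative(embeddings, names, distance, threshold):
    """Sketch of 'early naming': agglomerative clustering over speech
    turns, where two clusters with different written names never merge.

    embeddings: dict turn_id -> feature vector
    names:      dict turn_id -> written name or None (from title blocks)
    distance:   callable(vec_a, vec_b) -> float
    """
    clusters = {i: {i} for i in embeddings}        # cluster id -> turn ids
    cluster_name = {i: names.get(i) for i in embeddings}

    def linkage(a, b):                             # single linkage
        return min(distance(embeddings[x], embeddings[y])
                   for x in clusters[a] for y in clusters[b])

    while len(clusters) > 1:
        candidates = [
            (linkage(a, b), a, b)
            for a, b in itertools.combinations(clusters, 2)
            # the naming constraint: conflicting names block the merge
            if not (cluster_name[a] and cluster_name[b]
                    and cluster_name[a] != cluster_name[b])
        ]
        candidates = [c for c in candidates if c[0] < threshold]
        if not candidates:                         # stopping criterion
            break
        _, a, b = min(candidates)
        clusters[a] |= clusters.pop(b)             # merge b into a
        name_b = cluster_name.pop(b)
        cluster_name[a] = cluster_name[a] or name_b
    return clusters, cluster_name

Clusters that end up with an associated name are identified directly; unnamed clusters could then receive names by propagation, in the spirit of "late naming".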
The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition
The Multi-modal Information based Speech Processing (MISP) challenge aims to
extend the application of signal processing technology in specific scenarios
by promoting research into wake-up words, speaker diarization, speech
recognition, and other technologies. The MISP2022 challenge has two tracks:
1) audio-visual speaker diarization (AVSD), aiming to solve "who spoke when"
using both audio and visual data; 2) a novel audio-visual diarization and
recognition (AVDR) task that focuses on addressing "who spoke what when"
using audio-visual speaker diarization results. Both tracks focus on the
Chinese language and use far-field audio and video in real home-TV scenarios:
2-6 people communicating with each other with TV noise in the background.
This paper introduces the dataset, track settings, and baselines of the
MISP2022 challenge. Our analyses of experiments and examples indicate the
good performance of the AVDR baseline system, as well as potential
difficulties in this challenge due to, e.g., the far-field video quality, the
presence of TV noise in the background, and indistinguishable speakers.
BER: Balanced Error Rate For Speaker Diarization
DER (diarization error rate) is the primary metric for evaluating diarization
performance, but it faces a dilemma: errors in short utterances or segments
tend to be overwhelmed by those in longer ones. Short segments, e.g., 'yes'
or 'no', still carry semantic information. Besides, DER overlooks errors for
speakers who talk less. Although JER (Jaccard error rate) balances speaker
errors, it still suffers from the same dilemma. Considering that duration
error, segment error, and speaker-weighted error together constitute a
complete diarization evaluation, we propose a Balanced Error Rate (BER) for
speaker diarization. First, we propose a segment-level error rate (SER) that
uses connected sub-graphs and an adaptive IoU threshold to obtain accurate
segment matching. Second, to evaluate diarization in a unified way, we adopt
a speaker-specific harmonic mean between the duration and segment error
rates, followed by a speaker-weighted average. Third, we analyze our metric
on real datasets via a modularized system, EEND, and a multi-modal method.
SER and BER are publicly available at https://github.com/X-LANCE/BER.
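Based only on the combination rule described in the abstract, here is a minimal sketch of how the final score could be assembled; the exact error definitions and weights are in the paper and the linked repository, so the dict-based interface below is an assumption:

def harmonic_mean(a, b):
    # Speaker-specific harmonic mean of the two error views; defined as
    # 0 when both errors vanish, to avoid division by zero.
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

def balanced_error_rate(duration_err, segment_err, weights):
    """duration_err, segment_err, weights: dicts keyed by speaker id.
    Combines duration- and segment-level error rates per speaker, then
    takes a speaker-weighted average across all speakers."""
    total = sum(weights.values())
    return sum(weights[s] * harmonic_mean(duration_err[s], segment_err[s])
               for s in duration_err) / total

With uniform weights this reduces to a plain average of per-speaker harmonic means, which is one way the metric could avoid letting talkative speakers dominate, as DER does.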
- …