AVA-AVD: Audio-Visual Speaker Diarization in the Wild
Audio-visual speaker diarization aims at detecting "who spoke when" using
both auditory and visual signals. Existing audio-visual diarization datasets
are mainly focused on indoor environments like meeting rooms or news studios,
which are quite different from in-the-wild videos in many scenarios such as
movies, documentaries, and audience sitcoms. To develop diarization methods for
these challenging videos, we create the AVA Audio-Visual Diarization (AVA-AVD)
dataset. Our experiments demonstrate that adding AVA-AVD to the training set
produces significantly better diarization models for in-the-wild videos, even
though the dataset is relatively small. Moreover, this benchmark is challenging due
to the diverse scenes, complicated acoustic conditions, and completely
off-screen speakers. As a first step towards addressing the challenges, we
design the Audio-Visual Relation Network (AVR-Net) which introduces a simple
yet effective modality mask to capture discriminative information based on face
visibility. Experiments show that our method not only outperforms
state-of-the-art methods but is also more robust as the ratio of off-screen
speakers varies. Our data and code have been made publicly available at
https://github.com/showlab/AVA-AVD.
Comment: ACMMM 202
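The modality-mask idea described above can be illustrated with a minimal sketch. This is a hypothetical pure-Python stand-in for the tensor operations in AVR-Net, not the authors' implementation; the function name and feature layout are assumptions for illustration only.

```python
def fuse_with_modality_mask(audio_feat, visual_feat, face_visible):
    """Hypothetical sketch of a modality mask: when the speaker's face is
    off-screen, the visual branch is zeroed out so the relation model
    relies on audio alone (illustrative stand-in for tensor ops)."""
    mask = 1.0 if face_visible else 0.0
    # Element-wise gate on the visual features, then concatenate.
    gated_visual = [v * mask for v in visual_feat]
    return audio_feat + gated_visual

# Face on-screen: both modalities contribute.
print(fuse_with_modality_mask([0.5, 0.2], [0.9, 0.1], True))   # [0.5, 0.2, 0.9, 0.1]
# Face off-screen: visual features are masked to zero.
print(fuse_with_modality_mask([0.5, 0.2], [0.9, 0.1], False))  # [0.5, 0.2, 0.0, 0.0]
```

The gate lets a single model handle both visible and fully off-screen speakers without switching architectures, which is the robustness property the abstract highlights.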
One-Class Classification: Taxonomy of Study and Review of Techniques
One-class classification (OCC) algorithms aim to build classification models
when the negative class is either absent, poorly sampled or not well defined.
This unique situation constrains the learning of efficient classifiers by
defining the class boundary with knowledge of the positive class alone. The OCC
problem has been considered and applied under many research themes, such as
outlier/novelty detection and concept learning. In this paper, we present a
unified view of the general problem of OCC by presenting a taxonomy of study
for OCC problems, based on the availability of training data, the algorithms
used, and the application domains. We further delve into each
of the categories of the proposed taxonomy and present a comprehensive
literature review of the OCC algorithms, techniques and methodologies with a
focus on their significance, limitations and applications. We conclude our
paper by discussing some open research problems in the field of OCC and present
our vision for future research.
Comment: 24 pages + 11 pages of references, 8 figures
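The core OCC setting, fitting a decision boundary from positive samples only, can be sketched in a few lines. This toy model (mean plus a k-sigma radius) is an illustrative assumption, not any specific algorithm from the survey:

```python
from statistics import mean, pstdev

def fit_one_class(positives, k=3.0):
    """Toy one-class model: learn a center and radius from positive
    samples only -- no negative class is ever seen, as in OCC.
    The k-standard-deviation radius is an illustrative choice."""
    center = mean(positives)
    radius = k * pstdev(positives)
    return center, radius

def is_inlier(x, center, radius):
    # Accept anything inside the learned boundary; reject the rest
    # as an outlier/novelty.
    return abs(x - center) <= radius

center, radius = fit_one_class([4.9, 5.1, 5.0, 4.8, 5.2])
print(is_inlier(5.05, center, radius))  # True  -- close to the positive class
print(is_inlier(9.0, center, radius))   # False -- flagged as novel
```

Real OCC methods (one-class SVMs, density estimators, etc.) learn far richer boundaries, but all share this structure: the boundary is defined entirely by the positive class.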
A Unified Framework for Modality-Agnostic Deepfakes Detection
As AI-generated content (AIGC) thrives, deepfakes have expanded from
single-modality falsification to cross-modal fake content creation, where
either audio or visual components can be manipulated. While using two unimodal
detectors can detect audio-visual deepfakes, cross-modal forgery clues could be
overlooked. Existing multimodal deepfake detection methods typically establish
correspondence between the audio and visual modalities for binary real/fake
classification, and require the co-occurrence of both modalities. However, in
real-world multi-modal applications, missing modality scenarios may occur where
either modality is unavailable. In such cases, audio-visual detection methods
are less practical than two independent unimodal methods. Moreover, the
detector cannot always know the number or type of manipulated modalities
beforehand, necessitating a fake-modality-agnostic audio-visual detector. In
this work, we introduce a comprehensive framework that is agnostic to fake
modalities, which facilitates the identification of multimodal deepfakes and
handles situations with missing modalities, regardless of the manipulations
embedded in audio, video, or even cross-modal forms. To enhance the modeling of
cross-modal forgery clues, we employ audio-visual speech recognition (AVSR) as
a preliminary task. This efficiently extracts speech correlations across
modalities, a feature challenging for deepfakes to replicate. Additionally, we
propose a dual-label detection approach that follows the structure of AVSR to
support the independent detection of each modality. Extensive experiments on
three audio-visual datasets show that our scheme outperforms state-of-the-art
detection methods with promising performance on modality-agnostic audio/video
deepfakes.
Comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible.
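The dual-label idea, an independent real/fake decision per modality that degrades gracefully when one stream is missing, can be sketched as follows. The scores, threshold, and function name are illustrative assumptions, not the paper's actual detection heads:

```python
def dual_label_decision(audio_score, visual_score, threshold=0.5):
    """Sketch of dual-label detection: each modality gets its own
    real/fake label, and a missing modality (None) is simply skipped,
    so the detector remains usable when only one stream is available.
    Scores and threshold here are illustrative."""
    labels = {}
    for name, score in (("audio", audio_score), ("visual", visual_score)):
        if score is None:
            labels[name] = "missing"
        else:
            labels[name] = "fake" if score >= threshold else "real"
    return labels

# Cross-modal case: only the audio track was manipulated.
print(dual_label_decision(0.9, 0.2))   # {'audio': 'fake', 'visual': 'real'}
# Missing-modality case: no audio available at all.
print(dual_label_decision(None, 0.7))  # {'audio': 'missing', 'visual': 'fake'}
```

Per-modality labels are what make the detector agnostic to which modality was faked, in contrast to a single binary real/fake output over the fused pair.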
Toward Open-Set Face Recognition
Much research has been conducted on both face identification and face
verification, with greater focus on the latter. Research on face identification
has mostly focused on using closed-set protocols, which assume that all probe
images used in evaluation contain identities of subjects that are enrolled in
the gallery. Real systems, however, where only a fraction of probe sample
identities are enrolled in the gallery, cannot make this closed-set assumption.
Instead, they must assume an open set of probe samples and be able to
reject/ignore those that correspond to unknown identities. In this paper, we
address the widespread misconception that thresholding verification-like scores
is a good way to solve the open-set face identification problem, by formulating
an open-set face identification protocol and evaluating different strategies
for assessing similarity. Our open-set identification protocol is based on the
canonical Labeled Faces in the Wild (LFW) dataset. In addition to the known
identities, we introduce the concepts of known unknowns (known, but
uninteresting persons) and unknown unknowns (people never seen before) to the
biometric community. We compare three algorithms for assessing similarity in a
deep feature space under an open-set protocol: thresholded verification-like
scores, linear discriminant analysis (LDA) scores, and extreme value machine
(EVM) probabilities. Our findings suggest that thresholding EVM probabilities,
which are open-set by design, outperforms thresholding verification-like
scores.
Comment: Accepted for publication in the CVPR 2017 Biometrics Workshop.
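The simplest of the compared strategies, thresholding similarity scores to reject unknowns, can be sketched as below. The gallery, feature dimensionality, and threshold value are illustrative assumptions; the paper's point is precisely that naive score thresholding is often outperformed by open-set-aware models such as the EVM:

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def open_set_identify(probe, gallery, threshold=0.8):
    """Open-set identification sketch: return the best-matching enrolled
    identity only if its similarity clears a threshold; otherwise reject
    the probe as an unknown. The threshold value is illustrative."""
    best_id, best_sim = None, -1.0
    for identity, template in gallery.items():
        sim = cosine(probe, template)
        if sim > best_sim:
            best_id, best_sim = identity, sim
    return best_id if best_sim >= threshold else "unknown"

gallery = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
print(open_set_identify([0.9, 0.1], gallery))  # 'alice'
print(open_set_identify([0.7, 0.7], gallery))  # 'unknown' (ambiguous probe rejected)
```

Under a closed-set protocol the second probe would be forced onto its nearest gallery identity; the rejection branch is exactly what distinguishes the open-set setting.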