21 research outputs found
Learning to detect video events from zero or very few video examples
In this work we deal with the problem of high-level event detection in video.
Specifically, we study the challenging problems of i) learning to detect video
events from solely a textual description of the event, without using any
positive video examples, and ii) additionally exploiting very few positive
training samples together with a small number of "related" videos. For
learning only from an event's textual description, we first identify a general
learning framework and then study the impact of different design choices for
various stages of this framework. For additionally learning from example
videos, when true positive training samples are scarce, we employ an extension
of the Support Vector Machine that allows us to exploit "related" event
videos by automatically introducing different weights for subsets of the videos
in the overall training set. Experimental evaluations performed on the
large-scale TRECVID MED 2014 video dataset provide insight on the effectiveness
of the proposed methods.
Comment: Image and Vision Computing Journal, Elsevier, 2015, accepted for
publication
Maximum Margin Learning Under Uncertainty
PhD
In this thesis we study the problem of learning under uncertainty using the statistical
learning paradigm. We first propose a linear maximum margin classifier that deals
with uncertainty in data input. More specifically, we reformulate the standard Support
Vector Machine (SVM) framework such that each training example can be modeled
by a multi-dimensional Gaussian distribution described by its mean vector and its
covariance matrix -- the latter modeling the uncertainty. We address the classification
problem and define a cost function that is the expected value of the classical SVM
cost when data samples are drawn from the multi-dimensional Gaussian distributions
that form the set of the training examples. Our formulation approximates the classical
SVM formulation when the training examples are isotropic Gaussians with variance
tending to zero. We arrive at a convex optimization problem, which we solve
efficiently in the primal form using a stochastic gradient descent approach. The resulting
classifier, which we name SVM with Gaussian Sample Uncertainty (SVM-GSU), is
tested on synthetic data and five publicly available and popular datasets; namely, the
MNIST, WDBC, DEAP, TV News Channel Commercial Detection, and TRECVID
MED datasets. Experimental results verify the effectiveness of the proposed method.
Next, we extend the aforementioned linear classifier so as to lead to non-linear decision
boundaries, using the RBF kernel. This extension, which uses isotropic input
uncertainty and which we name Kernel SVM with Isotropic Gaussian Sample Uncertainty
(KSVM-iGSU), is used in the problems of video event detection and video aesthetic
quality assessment. The experimental results show that exploiting input uncertainty,
especially in problems where only a limited number of positive training examples are
provided, can lead to better classification, detection, or retrieval performance. Finally,
we present a preliminary study on how the above ideas can be used under the deep
convolutional neural networks learning paradigm so as to exploit inherent sources of
uncertainty, such as spatial pooling operations, that are usually used in deep networks.
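The cost described in the abstract above, the expected value of the SVM hinge loss when an input is drawn from a Gaussian, can be illustrated with a minimal sketch. It relies on the standard identity E[max(0, Z)] = m Φ(m/s) + s φ(m/s) for Z ~ N(m, s²); the thesis may write its cost in a different but equivalent notation. The function names, the toy numbers, and the Monte Carlo check (restricted here to a diagonal covariance) are assumptions made for illustration, not the author's code.

```python
import math
import random


def expected_hinge_loss(w, b, mean, cov, y):
    """E[max(0, 1 - y (w.x + b))] for x ~ N(mean, cov), in closed form."""
    d = len(w)
    # 1 - y (w.x + b) is Gaussian with this mean ...
    margin = 1.0 - y * (sum(w[i] * mean[i] for i in range(d)) + b)
    # ... and this variance (w^T cov w).
    var = sum(w[i] * cov[i][j] * w[j] for i in range(d) for j in range(d))
    if var <= 1e-18:
        # No uncertainty: falls back to the classical hinge loss.
        return max(0.0, margin)
    s = math.sqrt(var)
    t = margin / s
    cdf = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    return margin * cdf + s * pdf


def monte_carlo_hinge_loss(w, b, mean, var_diag, y, n_samples=100_000, seed=0):
    """Sampling estimate of the same expectation (diagonal covariance only)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, var_diag)]
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        total += max(0.0, 1.0 - y * score)
    return total / n_samples


if __name__ == "__main__":
    w, b, y = [1.0, -2.0], 0.5, 1
    mean = [0.3, 0.1]
    cov = [[0.04, 0.0], [0.0, 0.09]]
    print(expected_hinge_loss(w, b, mean, cov, y))
    print(monte_carlo_hinge_loss(w, b, mean, [0.04, 0.09], y))
```

With a zero covariance matrix the closed form reduces to the ordinary hinge loss, which mirrors the abstract's remark that the formulation approximates the classical SVM as the variance tends to zero.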
Improving Fairness using Vision-Language Driven Image Augmentation
Fairness is crucial when training a deep-learning discriminative model,
especially in the facial domain. Models tend to correlate specific
characteristics (such as age and skin color) with unrelated attributes
(downstream tasks), resulting in biases which do not correspond to reality. It
is common knowledge that these correlations are present in the data and are
then transferred to the models during training. This paper proposes a method to
mitigate these correlations to improve fairness. To do so, we learn
interpretable and meaningful paths lying in the semantic space of a pre-trained
diffusion model (DiffAE) -- such paths being supervised by contrastive text
dipoles. That is, we learn to edit protected characteristics (age and skin
color). These paths are then applied to augment images to improve the fairness
of a given dataset. We test the proposed method on CelebA-HQ and UTKFace on
several downstream tasks with age and skin color as protected characteristics.
As a proxy for fairness, we compute the difference in accuracy with respect to
the protected characteristics. Quantitative results show how the augmented
images help the model improve the overall accuracy, the aforementioned metric,
and the disparity of equal opportunity. Code is available at:
https://github.com/Moreno98/Vision-Language-Bias-Control.
Comment: Accepted for publication in WACV 202
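The two fairness proxies named in the abstract above, the difference in accuracy across protected groups and the disparity of equal opportunity, can be sketched as follows. This is a minimal illustration that assumes binary labels and takes equal-opportunity disparity to be the gap in true-positive rate between groups, which is the common definition; the paper's exact computation may differ. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group accuracy and true-positive rate for binary labels (0/1)."""
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "pos": 0, "tp": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        s = stats[g]
        s["n"] += 1
        s["correct"] += int(t == p)
        if t == 1:
            s["pos"] += 1
            s["tp"] += int(p == 1)
    return {
        g: {
            "accuracy": s["correct"] / s["n"],
            # TPR is undefined for a group with no positive samples.
            "tpr": s["tp"] / s["pos"] if s["pos"] else float("nan"),
        }
        for g, s in stats.items()
    }


def fairness_gaps(y_true, y_pred, groups):
    """Largest between-group gap in accuracy and in true-positive rate."""
    rates = group_rates(y_true, y_pred, groups)
    accs = [r["accuracy"] for r in rates.values()]
    tprs = [r["tpr"] for r in rates.values()]
    return {
        "accuracy_gap": max(accs) - min(accs),
        "equal_opportunity_gap": max(tprs) - min(tprs),
    }


if __name__ == "__main__":
    # Toy data: the protected attribute is a hypothetical age group.
    y_true = [1, 1, 0, 0, 1, 1, 0, 0]
    y_pred = [1, 1, 0, 1, 1, 0, 0, 1]
    groups = ["young"] * 4 + ["old"] * 4
    print(fairness_gaps(y_true, y_pred, groups))
```

Smaller gaps indicate that the classifier behaves more uniformly across the protected groups.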
HyperReenact: One-Shot Reenactment via Jointly Learning to Refine and Retarget Faces
In this paper, we present our method for neural face reenactment, called
HyperReenact, that aims to generate realistic talking head images of a source
identity, driven by a target facial pose. Existing state-of-the-art face
reenactment methods train controllable generative models that learn to
synthesize realistic facial images, yet they either produce reenacted faces that
are prone to significant visual artifacts, especially under the challenging
condition of extreme head pose changes, or require expensive few-shot
fine-tuning to better preserve the source identity characteristics. We propose
to address these limitations by leveraging the photorealistic generation
ability and the disentangled properties of a pretrained StyleGAN2 generator, by
first inverting the real images into its latent space and then using a
hypernetwork to perform: (i) refinement of the source identity characteristics
and (ii) facial pose re-targeting, thereby eliminating the dependence on
external editing methods that typically produce artifacts. Our method operates
under the one-shot setting (i.e., using a single source frame) and allows for
cross-subject reenactment, without requiring any subject-specific fine-tuning.
We compare our method both quantitatively and qualitatively against several
state-of-the-art techniques on the standard benchmarks of VoxCeleb1 and
VoxCeleb2, demonstrating the superiority of our approach in producing
artifact-free images, exhibiting remarkable robustness even under extreme head
pose changes. We make the code and the pretrained models publicly available at:
https://github.com/StelaBou/HyperReenact
Comment: Accepted for publication in ICCV 2023. Project page:
https://stelabou.github.io/hyperreenact.github.io/ Code:
https://github.com/StelaBou/HyperReenac
DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval
In this paper, we address the problem of high performance and computationally
efficient content-based video retrieval in large-scale datasets. Current
methods typically propose either: (i) fine-grained approaches employing
spatio-temporal representations and similarity calculations, achieving high
performance at a high computational cost or (ii) coarse-grained approaches
representing/indexing videos as global vectors, where the spatio-temporal
structure is lost, providing low performance but also having low computational
cost. In this work, we propose a Knowledge Distillation framework, which we
call Distill-and-Select (DnS), that starting from a well-performing
fine-grained Teacher Network learns: a) Student Networks at different retrieval
performance and computational efficiency trade-offs and b) a Selection Network
that at test time rapidly directs samples to the appropriate student to
maintain both high retrieval performance and high computational efficiency. We
train several students with different architectures and arrive at different
trade-offs of performance and efficiency, i.e., speed and storage requirements,
including fine-grained students that store index videos using binary
representations. Importantly, the proposed scheme allows Knowledge Distillation
in large, unlabelled datasets -- this leads to good students. We evaluate DnS
on five public datasets on three different video retrieval tasks and
demonstrate a) that our students achieve state-of-the-art performance in
several cases and b) that our DnS framework provides an excellent trade-off
between retrieval performance, computational speed, and storage space. In
specific configurations, our method achieves mAP similar to the teacher's but
is 20 times faster and requires 240 times less storage space. Our collected
dataset and implementation are publicly available:
https://github.com/mever-team/distill-and-select
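The routing idea in the abstract above, scoring everything with a cheap student and sending only selected samples to a more expensive one, can be pictured with a toy sketch. Everything here is an assumption made for illustration: the similarity functions, the data layout, and in particular the selector, which is a hand-written ambiguity heuristic rather than the learned Selection Network. It is not the API of the linked repository.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def coarse_sim(query, item):
    """Cheap student stand-in: one dot product between global vectors."""
    return dot(query["global"], item["global"])


def fine_sim(query, item):
    """Costly student stand-in: best frame match, averaged over query frames."""
    best = [max(dot(q, f) for f in item["frames"]) for q in query["frames"]]
    return sum(best) / len(best)


def ambiguity(coarse_score):
    """Hypothetical selector: scores near 0.5 are treated as least reliable."""
    return -abs(coarse_score - 0.5)


def route_and_score(query, database, budget):
    """Score all items coarsely, then re-score a `budget` fraction finely."""
    coarse = [coarse_sim(query, item) for item in database]
    k = int(round(budget * len(database)))
    order = sorted(range(len(database)), key=lambda i: ambiguity(coarse[i]), reverse=True)
    chosen = sorted(order[:k])
    scores = list(coarse)
    for i in chosen:
        scores[i] = fine_sim(query, database[i])
    return scores, chosen


if __name__ == "__main__":
    query = {"global": [1.0, 0.0], "frames": [[1.0, 0.0], [0.0, 1.0]]}
    database = [
        {"global": [0.9, 0.1], "frames": [[1.0, 0.0]]},
        {"global": [0.5, 0.5], "frames": [[0.0, 1.0]]},
        {"global": [0.1, 0.9], "frames": [[0.0, 1.0]]},
        {"global": [0.6, 0.4], "frames": [[0.7, 0.7]]},
    ]
    print(route_and_score(query, database, budget=0.5))
```

The `budget` parameter stands in for the performance versus cost trade-off: a budget of 0 uses only the coarse student, and a budget of 1 re-scores every item with the fine one.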
Parts of Speech-Grounded Subspaces in Vision-Language Models
Latent image representations arising from vision-language models have proved
immensely useful for a variety of downstream tasks. However, their utility is
limited by their entanglement with respect to different visual attributes. For
instance, recent work has shown that CLIP image representations are often
biased toward specific visual properties (such as objects or actions) in an
unpredictable manner. In this paper, we propose to separate representations of
the different visual modalities in CLIP's joint vision-language space by
leveraging the association between parts of speech and specific visual modes of
variation (e.g. nouns relate to objects, adjectives describe appearance). This
is achieved by formulating an appropriate component analysis model that learns
subspaces capturing variability corresponding to a specific part of speech,
while jointly minimising variability to the rest. Such a subspace yields
disentangled representations of the different visual properties of an image or
text in closed form while respecting the underlying geometry of the manifold on
which the representations lie. What's more, we show the proposed model
additionally facilitates learning subspaces corresponding to specific visual
appearances (e.g. artists' painting styles), which enables the selective
removal of entire visual themes from CLIP-based text-to-image synthesis. We
validate the model both qualitatively, by visualising the subspace projections
with a text-to-image model and by preventing the imitation of artists' styles,
and quantitatively, through class invariance metrics and improvements to
baseline zero-shot classification.
Comment: Accepted at NeurIPS 202
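The selective removal of a visual theme described in the abstract above can be pictured, in a simplified form, as projecting an embedding onto the orthogonal complement of a subspace. The sketch below uses a plain Euclidean projection with a hand-picked orthonormal basis; the paper learns its subspaces with a component analysis model and accounts for the geometry of the manifold, neither of which is reproduced here. All names and numbers are hypothetical.

```python
import math


def normalize(v):
    """Rescale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def remove_subspace(x, basis):
    """Remove from x its component in span(basis); basis must be orthonormal."""
    out = list(x)
    for u in basis:
        coeff = sum(a * b for a, b in zip(x, u))
        out = [o - coeff * a for o, a in zip(out, u)]
    return out


if __name__ == "__main__":
    embedding = [3.0, 4.0, 5.0]
    # Hypothetical 2-D "theme" subspace spanned by the first two axes.
    theme_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    cleaned = normalize(remove_subspace(embedding, theme_basis))
    print(cleaned)
```

Renormalizing after the projection keeps the result on the unit sphere, which is a crude stand-in for the geometry-aware treatment mentioned in the abstract.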