Neural Network Based Reinforcement Learning for Audio-Visual Gaze Control in Human-Robot Interaction
This paper introduces a novel neural network-based reinforcement learning
approach for robot gaze control. Our approach enables a robot to learn and
adapt its gaze control strategy for human-robot interaction without the use of
external sensors or human supervision. The robot learns to focus its attention
on groups of people from its own audio-visual experiences, independently of the
number of people, their positions, and their physical appearances. In
particular, we use a recurrent neural network architecture in
combination with Q-learning to find an optimal action-selection policy; we
pre-train the network using a simulated environment that mimics realistic
scenarios involving speaking/silent participants, thus avoiding the need for
tedious sessions of a robot interacting with people. Our experimental
evaluation suggests that the proposed method is robust to parameter
estimation, i.e. the parameter values yielded by the method do not have a
decisive impact on performance. The best results are obtained when
audio and visual information are used jointly. Experiments with the Nao robot
indicate that our framework is a step forward towards the autonomous learning
of socially acceptable gaze behavior.
Comment: Paper submitted to Pattern Recognition Letters
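To make the learning setup concrete, below is a minimal PyTorch sketch of a recurrent Q-network over audio-visual observation features with epsilon-greedy action selection; the feature dimensions, action set, and module names are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of a recurrent Q-network for discrete gaze actions
# (feature sizes and the action set are illustrative, not the paper's setup).
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    def __init__(self, obs_dim=64, hidden_dim=128, n_actions=5):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)   # fuse audio-visual features
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(hidden_dim, n_actions)  # one Q-value per gaze action

    def forward(self, obs_seq, h0=None):
        # obs_seq: (batch, time, obs_dim) sequence of audio-visual observations
        x = torch.relu(self.encoder(obs_seq))
        out, h = self.gru(x, h0)
        return self.q_head(out), h                      # Q-values at every time step

def select_action(q_values, epsilon=0.1):
    # Epsilon-greedy policy over the Q-values of the last time step.
    if torch.rand(1).item() < epsilon:
        return torch.randint(q_values.shape[-1], (1,)).item()
    return q_values[:, -1].argmax(dim=-1).item()

# Usage on a dummy observation sequence:
net = RecurrentQNet()
q, _ = net(torch.randn(1, 10, 64))
action = select_action(q)
```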
Face Aging via Diffusion-based Editing
In this paper, we address the problem of face aging: generating past or
future facial images by incorporating age-related changes to the given face.
Previous aging methods rely solely on human facial image datasets and are thus
constrained by their inherent scale and bias. This restricts them to a limited
generatable age range and leaves them unable to handle large age gaps.
We propose FADING, a novel approach to address Face Aging via DIffusion-based
editiNG. We go beyond existing methods by leveraging the rich prior of
large-scale language-image diffusion models. First, we specialize a pre-trained
diffusion model for the task of face age editing by using an age-aware
fine-tuning scheme. Next, we invert the input image to latent noise and obtain
optimized null text embeddings. Finally, we perform text-guided local age
editing via attention control. The quantitative and qualitative analyses
demonstrate that our method outperforms existing approaches with respect to
aging accuracy, attribute preservation, and aging quality.
Comment: Accepted at BMVC 202
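As a rough illustration of text-guided age editing with a diffusion model, the sketch below uses an off-the-shelf image-to-image pipeline and encodes the target age in the prompt; it does not reproduce the paper's age-aware fine-tuning, null-text inversion, or attention control, and the model name and prompt template are assumptions.

```python
# Simplified, hypothetical illustration of text-guided age editing with an
# off-the-shelf diffusion pipeline; the model name and prompt template are
# assumptions, and the paper's full method is not reproduced here.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def age_edit(face: Image.Image, target_age: int, strength: float = 0.5) -> Image.Image:
    # Encode the target age directly in the text prompt; lower strength keeps
    # more of the input identity, higher strength applies stronger edits.
    prompt = f"photo of a {target_age} year old person"
    return pipe(prompt=prompt, image=face, strength=strength,
                guidance_scale=7.5).images[0]

# aged = age_edit(Image.open("face.png").convert("RGB"), target_age=70)
```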
CANU-ReID: A Conditional Adversarial Network for Unsupervised Person Re-IDentification
Unsupervised person re-ID is the task of identifying people on a target data
set for which the ID labels are unavailable during training. In this paper, we
propose to unify two trends in unsupervised person re-ID: clustering &
fine-tuning and adversarial learning. On one side, clustering groups training
images into pseudo-ID labels, and uses them to fine-tune the feature extractor.
On the other side, adversarial learning is used, inspired by domain adaptation,
to match distributions from different domains. Since target data is distributed
across different camera viewpoints, we propose to model each camera as an
independent domain, and aim to learn domain-independent features.
Since straightforward adversarial learning yields negative transfer, we
introduce a conditioning vector to mitigate this undesirable effect. In our
framework, the centroid of the cluster to which the visual sample belongs is
used as the conditioning vector of our conditional adversarial network; the
vector is permutation invariant (cluster ordering does not matter) and its
size is independent of the number of clusters. To our knowledge, we are the
first to propose the use of conditional adversarial networks for unsupervised
person re-ID. We evaluate the proposed architecture on top of two
state-of-the-art clustering-based unsupervised person re-identification (re-ID)
methods in four different experimental settings with three different datasets,
and set new state-of-the-art performance on all four of them. Our code and
model will be made publicly available at
https://team.inria.fr/perception/canu-reid/
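A minimal sketch of the conditioning idea, assuming a gradient-reversal formulation of the adversarial camera classifier: the discriminator receives the sample feature concatenated with the centroid of its pseudo-ID cluster, so the conditioning vector's size depends only on the feature dimension, not on the number of clusters. Dimensions and names below are illustrative.

```python
# Minimal sketch of camera-adversarial training conditioned on pseudo-ID
# cluster centroids; dimensions, names, and the gradient-reversal form are
# assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lamb * grad_out, None   # reverse gradients flowing to the backbone

class CameraDiscriminator(nn.Module):
    def __init__(self, feat_dim=2048, n_cameras=6):
        super().__init__()
        # The conditioning vector is the centroid of the sample's cluster, so its
        # size equals the feature dimension and is independent of the cluster count.
        self.net = nn.Sequential(
            nn.Linear(feat_dim * 2, 512), nn.ReLU(),
            nn.Linear(512, n_cameras),
        )

    def forward(self, feats, centroids, lamb=1.0):
        feats = GradReverse.apply(feats, lamb)
        return self.net(torch.cat([feats, centroids], dim=1))

# feats = backbone(images); centroids[i] = centroid of sample i's pseudo-ID cluster
# adv_loss = F.cross_entropy(disc(feats, centroids), camera_ids)
```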
Predictive Coding For Animation-Based Video Compression
We address the problem of efficiently compressing video for conferencing-type
applications. We build on recent approaches based on image animation, which can
achieve good reconstruction quality at very low bitrate by representing face
motions with a compact set of sparse keypoints. However, these methods encode
video in a frame-by-frame fashion, i.e. each frame is reconstructed from a
reference frame, which limits the reconstruction quality when more bandwidth
is available. Instead, we propose a predictive coding scheme that uses image
animation as a predictor, and codes the residual with respect to the actual
target frame. The residuals can in turn be coded in a predictive manner, thus
efficiently removing temporal dependencies. Our experiments indicate a
significant bitrate gain, in excess of 70% compared to the HEVC video standard
and over 30% compared to VVC, on a dataset of talking-head videos.
Comment: Accepted paper: ICIP 202
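The coding loop can be sketched as follows, with a stub animation predictor and uniform quantization standing in for the learned keypoint-based animation model and the actual residual codec; function names and the quantization step are assumptions.

```python
# Conceptual sketch of the predictive-coding loop with a stub animation
# predictor and uniform residual quantization standing in for the learned
# components; names and the quantization step are assumptions.
import numpy as np

def animate(reference: np.ndarray, keypoints) -> np.ndarray:
    # Stand-in for the keypoint-driven image-animation predictor.
    return reference.astype(np.float32)

def quantize(x: np.ndarray, step: float = 8.0) -> np.ndarray:
    # Stand-in for the residual codec: uniform quantization.
    return np.round(x / step) * step

def encode_sequence(frames, keypoints_per_frame, reference):
    prev_residual = np.zeros(reference.shape, dtype=np.float32)
    reconstructed = []
    for frame, kp in zip(frames, keypoints_per_frame):
        prediction = animate(reference, kp)               # animation-based prediction
        residual = frame.astype(np.float32) - prediction  # residual w.r.t. the target
        # Code the residual predictively: only its change w.r.t. the previous
        # residual is quantized/transmitted, removing temporal redundancy.
        coded = quantize(residual - prev_residual)
        prev_residual = prev_residual + coded
        reconstructed.append(np.clip(prediction + prev_residual, 0, 255).astype(np.uint8))
    return reconstructed
```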
Budget-Aware Adapters for Multi-Domain Learning
Multi-Domain Learning (MDL) refers to the problem of learning a set of models
derived from a common deep architecture, each one specialized to perform a task
in a certain domain (e.g., photos, sketches, paintings). This paper tackles MDL
with a particular interest in obtaining domain-specific models with an
adjustable budget in terms of the number of network parameters and
computational complexity. Our intuition is that, since in real applications the
number of domains and tasks can be very large, an effective MDL approach should
focus not only on accuracy but also on having as few parameters as possible. To
implement this idea we derive specialized deep models for each domain by
adapting a pre-trained architecture but, differently from other methods, we
propose a novel strategy to automatically adjust the computational complexity
of the network. To this aim, we introduce Budget-Aware Adapters that select the
most relevant feature channels to better handle data from a novel domain.
Constraints on the number of active switches are imposed to obtain a network
that respects the desired complexity budget. Experimentally, we show that
our approach leads to recognition accuracy competitive with state-of-the-art
approaches but with much lighter networks both in terms of storage and
computation.
Comment: ICCV 201
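A small sketch of the channel-switch idea, assuming soft sigmoid gates and a simple penalty on the fraction of active switches; the gating and penalty forms here are illustrative, not the paper's exact formulation.

```python
# Illustrative sketch of a budget-aware channel adapter: learnable per-channel
# switches gate a backbone layer's output, and a penalty keeps the fraction of
# active switches under the target budget; the exact gating/penalty forms are
# assumptions, not the paper's formulation.
import torch
import torch.nn as nn

class BudgetAwareAdapter(nn.Module):
    def __init__(self, n_channels: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_channels))  # one switch per channel

    def forward(self, x):
        gates = torch.sigmoid(self.logits)       # soft switches in (0, 1)
        return x * gates.view(1, -1, 1, 1)       # gate (B, C, H, W) feature maps

    def budget_loss(self, budget: float):
        # Penalize exceeding the allowed fraction of active channels.
        return torch.relu(torch.sigmoid(self.logits).mean() - budget)

# adapter = BudgetAwareAdapter(256)
# y = adapter(backbone_features)                          # features of a novel domain
# loss = task_loss + 0.1 * adapter.budget_loss(budget=0.5)
```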