SynthDistill: Face Recognition with Knowledge Distillation from Synthetic Data
State-of-the-art face recognition networks are often computationally
expensive and cannot be used for mobile applications. Training lightweight face
recognition models also requires large identity-labeled datasets. Meanwhile,
there are privacy and ethical concerns with collecting and using large face
recognition datasets. While generating synthetic datasets for training face
recognition models is an alternative option, it is challenging to generate
synthetic data with sufficient intra-class variations. In addition, there is
still a considerable gap between the performance of models trained on real and
synthetic data. In this paper, we propose a new framework (named SynthDistill)
to train lightweight face recognition models by distilling the knowledge of a
pretrained teacher face recognition model using synthetic data. We use a
pretrained face generator network to generate synthetic face images and use the
synthesized images to train a lightweight student network. We use synthetic
face images without identity labels, mitigating the difficulty of generating
sufficient intra-class variation in synthetic datasets. Instead, we propose a
novel dynamic sampling strategy over the intermediate latent space of the face
generator network, which injects new variations of challenging images while
continuing to explore new face images in the training batch. The results on five different
face recognition datasets demonstrate the superiority of our lightweight model
compared to models trained on previous synthetic datasets, achieving a
verification accuracy of 99.52% on the LFW dataset with a lightweight network.
The results also show that our proposed framework significantly reduces the gap
between training with real and synthetic data. The source code for replicating
the experiments is publicly released.
Comment: Accepted in the IEEE International Joint Conference on Biometrics (IJCB 2023).
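To make the label-free distillation idea concrete, here is a minimal sketch of one training step, assuming a pretrained face `generator`, a frozen `teacher` network, and a lightweight `student`. All names are illustrative, the cosine-distance loss is an assumption rather than the paper's exact objective, and the paper's dynamic latent sampling is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def distill_step(generator, teacher, student, optimizer,
                 batch_size=32, latent_dim=512, device="cuda"):
    # Sample latents and synthesize a batch of unlabeled face images.
    z = torch.randn(batch_size, latent_dim, device=device)
    with torch.no_grad():
        images = generator(z)        # synthetic faces, no identity labels
        target = teacher(images)     # teacher embeddings act as soft targets

    pred = student(images)
    # Match student embeddings to the teacher's (cosine distance here;
    # the exact loss is an assumption, not taken from the paper).
    loss = (1 - F.cosine_similarity(pred, target, dim=-1)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```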
Audio-Visual Learning for Scene Understanding
Multimodal deep learning aims to combine the complementary information of different modalities. Among all modalities, audio and video are the predominant ones that humans use to explore the world. In this thesis, we focus on audio-visual deep learning so that our networks mimic how humans perceive the world.
Our research covers images, audio signals, and acoustic images. The latter provide spatial audio information and are obtained from a planar microphone array whose raw signals are combined with a beamforming algorithm. They better mimic the human auditory system, which cannot be replicated with a single microphone, since one microphone alone provides no spatial sound cues.
However, since microphone arrays are not widespread, we also study how to handle the missing spatialized audio modality at test time.
As a solution, we propose to distill acoustic-image content into audio features during training in order to handle their absence at test time. We do this for supervised audio classification using the generalized distillation framework, which we also extend to self-supervised learning.
Next, we devise a method for reconstructing acoustic images given a single microphone and an RGB frame. Thus, when only a standard video is available, we can synthesize spatial audio, which is useful for many audio-visual tasks, including sound localization.
Lastly, as another example of restoring one modality from the available ones, we inpaint degraded images using audio features, reconstructing the missing region to be not only visually plausible but also semantically consistent with the related sound. This also covers cross-modal generation in the limit case of a completely missing or hidden visual modality: our method deals with it naturally, being able to generate images from sound alone.
In summary, we show how audio can help visual learning and vice versa, by transferring knowledge between the two modalities at training time, in order to distill, reconstruct, or restore the missing modality at test time.
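As a rough illustration of the generalized distillation step described above, the sketch below trains an audio-only student to imitate features a teacher extracted from acoustic images; the module definitions, feature dimensions, and loss weighting are assumptions, not taken from the thesis:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The teacher sees acoustic images (spatial audio) at training time,
# while the student only gets raw single-microphone audio, so the
# spatialized modality can be missing at test time.

class AudioStudent(nn.Module):
    def __init__(self, feat_dim=128, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, feat_dim))
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, audio):                 # audio: (batch, 1, samples)
        feat = self.encoder(audio)
        return feat, self.classifier(feat)

def distillation_loss(student_feat, student_logits, teacher_feat,
                      labels, alpha=0.5):
    # Combine the supervised objective with feature imitation of the
    # acoustic-image teacher (generalized distillation).
    ce = F.cross_entropy(student_logits, labels)
    imitation = F.mse_loss(student_feat, teacher_feat.detach())
    return alpha * ce + (1 - alpha) * imitation
```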
Self-Supervised GAN Compression
Deep learning's success has led to larger and larger models to handle more
and more complex tasks; trained models can contain millions of parameters.
These large models are compute- and memory-intensive, which makes them
challenging to deploy under tight latency, throughput, and storage
constraints. Some model compression methods have been successfully applied to
image classification and detection or language models, but there has been very
little work compressing generative adversarial networks (GANs) performing
complex tasks. In this paper, we show that a standard model compression
technique, weight pruning, cannot be applied to GANs using existing methods. We
then develop a self-supervised compression technique which uses the trained
discriminator to supervise the training of a compressed generator. We show that
this framework maintains compelling performance at high degrees of sparsity, can be
easily applied to new tasks and models, and enables meaningful comparisons
between different pruning granularities.
Comment: The appendix for this paper is in the following repository:
https://gitlab.com/dxxz/Self-Supervised-GAN-Compression-Appendi
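A minimal sketch of the discriminator-supervised compression idea, assuming a dense generator `dense_G`, its pruned copy `pruned_G`, and the already-trained discriminator `trained_D`; the L1 reconstruction term tying pruned outputs to dense ones is an added assumption for illustration, not necessarily the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def compress_step(dense_G, pruned_G, trained_D, optimizer, z):
    with torch.no_grad():
        reference = dense_G(z)       # output of the uncompressed generator

    fake = pruned_G(z)
    # The trained discriminator scores the pruned generator's samples;
    # a non-saturating GAN loss pushes them toward the real manifold.
    adv = F.softplus(-trained_D(fake)).mean()
    # Optional reconstruction term against the dense generator's output
    # (an assumption, included here for stability of the sketch).
    rec = F.l1_loss(fake, reference)

    loss = adv + rec
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```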
Massively Parallel Video Networks
We introduce a class of causal video understanding models that aims to
improve efficiency of video processing by maximising throughput, minimising
latency, and reducing the number of clock cycles. Leveraging operation
pipelining and multi-rate clocks, these models perform a minimal amount of
computation (e.g. as few as four convolutional layers) for each frame per
timestep to produce an output. The models are still very deep, with dozens of
such operations being performed but in a pipelined fashion that enables
depth-parallel computation. We illustrate the proposed principles by applying
them to existing image architectures and analyse their behaviour on two video
tasks: action recognition and human keypoint localisation. The results show
that a significant degree of parallelism, and implicitly speedup, can be
achieved with little loss in performance.
Comment: Fixed typos in densenet model definition in appendix.
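The pipelining principle can be illustrated with a short sketch: each stage consumes the activation its predecessor produced on the previous frame, so all stages could run in parallel, at the cost of a depth-proportional output delay. The stage modules and buffering scheme here are illustrative, not the paper's architectures:

```python
import torch
import torch.nn as nn

class PipelinedNet(nn.Module):
    def __init__(self, stages):
        super().__init__()
        self.stages = nn.ModuleList(stages)
        # One buffered activation between consecutive stages.
        self.buffers_ = [None] * len(stages)

    def step(self, frame):
        # Stage i reads what stage i-1 produced on the *previous* frame,
        # so each timestep costs one stage of compute per stage; the loop
        # below only emulates the parallel schedule sequentially.
        inputs = [frame] + self.buffers_[:-1]
        self.buffers_ = [stage(x) if x is not None else None
                         for stage, x in zip(self.stages, inputs)]
        # Output is None for the first len(stages)-1 frames (pipeline fill).
        return self.buffers_[-1]
```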
A Survey of Face Recognition
Recent years have witnessed breakthroughs in face recognition (FR) with deep
convolutional neural networks. Dozens of papers in the field of FR are
published every year. Some of them have been deployed in industry and play an
important role in daily life, in applications such as device unlocking and
mobile payment. This paper provides an introduction to face recognition,
including its history, pipeline, algorithms based on conventional handcrafted
features or deep learning, mainstream training and evaluation datasets, and
related applications. We have analyzed and compared as many state-of-the-art
works as possible, and carefully designed a set of experiments to study the
effect of backbone size and data distribution. This survey is the companion
material for the tutorial "The Practical Face Recognition Technology in the
Industrial World" at FG2023.
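As a concrete illustration of the verification step in the standard FR pipeline the survey describes, the sketch below embeds two aligned face crops and thresholds their cosine similarity; the `embedder` backbone and the threshold value are placeholders, since the operating threshold is dataset-dependent:

```python
import torch
import torch.nn.functional as F

def verify(embedder, face_a, face_b, threshold=0.35):
    # Embed both aligned crops and L2-normalize the embeddings.
    with torch.no_grad():
        ea = F.normalize(embedder(face_a), dim=-1)
        eb = F.normalize(embedder(face_b), dim=-1)
    # Cosine similarity of unit vectors is their dot product.
    similarity = (ea * eb).sum(dim=-1)
    return similarity > threshold, similarity
```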
Survey on Controllable Image Synthesis with Deep Learning
Image synthesis has attracted growing research interest in academic and
industrial communities. Deep learning technologies, especially generative
models, have greatly inspired controllable image synthesis approaches and
applications, which aim to generate particular visual content from latent
prompts. To further investigate the low-level controllable image synthesis
problem, which is crucial for fine image rendering and editing tasks, we
present a survey of recent works on 3D controllable image synthesis using deep
learning. We first introduce the datasets and evaluation metrics for 3D
controllable image synthesis. Then, we review the state-of-the-art research on
geometrically controllable image synthesis in two aspects: 1)
viewpoint/pose-controllable image synthesis; 2) structure/shape-controllable
image synthesis. Furthermore, photometrically controllable image synthesis
approaches are also reviewed for 3D relighting research. While the emphasis is
on 3D controllable image synthesis algorithms, related applications, products,
and resources are also briefly summarized for practitioners.
Comment: 19 pages, 17 figures