3,282 research outputs found
Detach and Adapt: Learning Cross-Domain Disentangled Deep Representation
While representation learning aims to derive interpretable features for
describing visual data, representation disentanglement further results in such
features so that particular image attributes can be identified and manipulated.
However, one cannot easily address this task without observing ground truth
annotation for the training data. To address this problem, we propose a novel
deep learning model of Cross-Domain Representation Disentangler (CDRD). By
observing fully annotated source-domain data and unlabeled target-domain data
of interest, our model bridges the information across data domains and
transfers the attribute information accordingly. Thus, cross-domain joint
feature disentanglement and adaptation can be jointly performed. In the
experiments, we provide qualitative results to verify our disentanglement
capability. Moreover, we further confirm that our model can be applied for
solving classification tasks of unsupervised domain adaptation, and performs
favorably against state-of-the-art image disentanglement and translation
methods.Comment: CVPR 2018 Spotligh
Disentangled Speech Embeddings using Cross-modal Self-supervision
The objective of this paper is to learn representations of speaker identity
without access to manually annotated data. To do so, we develop a
self-supervised learning objective that exploits the natural cross-modal
synchrony between faces and audio in video. The key idea behind our approach is
to tease apart--without annotation--the representations of linguistic content
and speaker identity. We construct a two-stream architecture which: (1) shares
low-level features common to both representations; and (2) provides a natural
mechanism for explicitly disentangling these factors, offering the potential
for greater generalisation to novel combinations of content and identity and
ultimately producing speaker identity representations that are more robust. We
train our method on a large-scale audio-visual dataset of talking heads `in the
wild', and demonstrate its efficacy by evaluating the learned speaker
representations for standard speaker recognition performance.Comment: ICASSP 2020. The first three authors contributed equally to this wor
Empirically Analyzing the Effect of Dataset Biases on Deep Face Recognition Systems
It is unknown what kind of biases modern in the wild face datasets have
because of their lack of annotation. A direct consequence of this is that total
recognition rates alone only provide limited insight about the generalization
ability of a Deep Convolutional Neural Networks (DCNNs). We propose to
empirically study the effect of different types of dataset biases on the
generalization ability of DCNNs. Using synthetically generated face images, we
study the face recognition rate as a function of interpretable parameters such
as face pose and light. The proposed method allows valuable details about the
generalization performance of different DCNN architectures to be observed and
compared. In our experiments, we find that: 1) Indeed, dataset bias has a
significant influence on the generalization performance of DCNNs. 2) DCNNs can
generalize surprisingly well to unseen illumination conditions and large
sampling gaps in the pose variation. 3) Using the presented methodology we
reveal that the VGG-16 architecture outperforms the AlexNet architecture at
face recognition tasks because it can much better generalize to unseen face
poses, although it has significantly more parameters. 4) We uncover a main
limitation of current DCNN architectures, which is the difficulty to generalize
when different identities to not share the same pose variation. 5) We
demonstrate that our findings on synthetic data also apply when learning from
real-world data. Our face image generator is publicly available to enable the
community to benchmark other DCNN architectures.Comment: Accepted to CVPR 2018 Workshop on Analysis and Modeling of Faces and
Gestures (AMFG
Emergence of Invariance and Disentanglement in Deep Representations
Using established principles from Statistics and Information Theory, we show
that invariance to nuisance factors in a deep neural network is equivalent to
information minimality of the learned representation, and that stacking layers
and injecting noise during training naturally bias the network towards learning
invariant representations. We then decompose the cross-entropy loss used during
training and highlight the presence of an inherent overfitting term. We propose
regularizing the loss by bounding such a term in two equivalent ways: One with
a Kullbach-Leibler term, which relates to a PAC-Bayes perspective; the other
using the information in the weights as a measure of complexity of a learned
model, yielding a novel Information Bottleneck for the weights. Finally, we
show that invariance and independence of the components of the representation
learned by the network are bounded above and below by the information in the
weights, and therefore are implicitly optimized during training. The theory
enables us to quantify and predict sharp phase transitions between underfitting
and overfitting of random labels when using our regularized loss, which we
verify in experiments, and sheds light on the relation between the geometry of
the loss function, invariance properties of the learned representation, and
generalization error.Comment: Deep learning, neural network, representation, flat minima,
information bottleneck, overfitting, generalization, sufficiency, minimality,
sensitivity, information complexity, stochastic gradient descent,
regularization, total correlation, PAC-Baye
Disentangling Factors of Variation by Mixing Them
We propose an approach to learn image representations that consist of
disentangled factors of variation without exploiting any manual labeling or
data domain knowledge. A factor of variation corresponds to an image attribute
that can be discerned consistently across a set of images, such as the pose or
color of objects. Our disentangled representation consists of a concatenation
of feature chunks, each chunk representing a factor of variation. It supports
applications such as transferring attributes from one image to another, by
simply mixing and unmixing feature chunks, and classification or retrieval
based on one or several attributes, by considering a user-specified subset of
feature chunks. We learn our representation without any labeling or knowledge
of the data domain, using an autoencoder architecture with two novel training
objectives: first, we propose an invariance objective to encourage that
encoding of each attribute, and decoding of each chunk, are invariant to
changes in other attributes and chunks, respectively; second, we include a
classification objective, which ensures that each chunk corresponds to a
consistently discernible attribute in the represented image, hence avoiding
degenerate feature mappings where some chunks are completely ignored. We
demonstrate the effectiveness of our approach on the MNIST, Sprites, and CelebA
datasets.Comment: CVPR 201
- …