Understanding Deep Networks via Extremal Perturbations and Smooth Masks
The problem of attribution is concerned with identifying the parts of an
input that are responsible for a model's output. An important family of
attribution methods is based on measuring the effect of perturbations applied
to the input. In this paper, we discuss some of the shortcomings of existing
approaches to perturbation analysis and address them by introducing the concept
of extremal perturbations, which are theoretically grounded and interpretable.
We also introduce a number of technical innovations to compute extremal
perturbations, including a new area constraint and a parametric family of
smooth perturbations, which allow us to remove all tunable hyper-parameters
from the optimization problem. We analyze the effect of perturbations as a
function of their area, demonstrating excellent sensitivity to the spatial
properties of the deep neural network under stimulation. We also extend
perturbation analysis to the intermediate layers of a network. This application
allows us to identify the salient channels necessary for classification, which,
when visualized using feature inversion, can be used to elucidate model
behavior. Lastly, we introduce TorchRay, an interpretability library built on
PyTorch.
Comment: Accepted at ICCV 2019 as oral; supp mat at
http://ruthcfong.github.io/files/fong19_extremal_supps.pd
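The area constraint described above lends itself to a compact sketch: rank the soft mask values and penalize their deviation from a template that is 1 on the top fraction of entries and 0 elsewhere, so a mask of exactly the target area incurs no penalty. This is an illustrative reading of the abstract, not the TorchRay implementation; the function name and details are ours.

```python
def area_loss(mask, area_fraction):
    """Penalty driving a soft mask (values in [0, 1]) toward covering a
    fixed fraction of the input: sort values descending and compare them
    against a reference vector of ones followed by zeros.
    Illustrative sketch of the area constraint, not the authors' code."""
    n = len(mask)
    k = round(area_fraction * n)
    ranked = sorted(mask, reverse=True)
    reference = [1.0] * k + [0.0] * (n - k)
    return sum((r - t) ** 2 for r, t in zip(ranked, reference)) / n
```

A binary mask covering exactly the requested fraction scores zero, while over- or under-sized masks are penalized, which is what lets the area act as a controllable analysis knob rather than a tuned hyper-parameter.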
Labelling unlabelled videos from scratch with multi-modal self-supervision
A large part of the current success of deep learning lies in the
effectiveness of data -- more precisely: labelled data. Yet, labelling a
dataset with human annotation continues to carry high costs, especially for
videos. While recent methods in the image domain have made it possible to
generate meaningful (pseudo-)labels for unlabelled datasets without
supervision, this development is missing in the video domain, where learning
feature representations is the current focus. In this work, we a) show that
unsupervised labelling of a video dataset does not come for free from strong
feature encoders and b) propose a novel clustering method that allows
pseudo-labelling of a video dataset without any human annotations, by
leveraging the natural correspondence between the audio and visual modalities.
An extensive analysis shows that the resulting clusters have high semantic
overlap with ground-truth human labels. We further introduce the first
benchmarking results on unsupervised labelling of the common video datasets
Kinetics, Kinetics-Sound, VGG-Sound and AVE.
Comment: Accepted to NeurIPS 2020. Project page:
https://www.robots.ox.ac.uk/~vgg/research/selavi, code:
https://github.com/facebookresearch/selav
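The audio-visual correspondence exploited above can be sketched in its simplest form: fuse per-modality cluster probabilities and take the argmax as the pseudo-label, so that agreement between the two modalities decides the assignment. This is a deliberately minimal averaging sketch of the idea in the abstract, not the actual SeLaVi method; all names are ours.

```python
def multimodal_pseudo_labels(visual_probs, audio_probs):
    """Assign each clip a pseudo-label by fusing per-modality cluster
    probability vectors, exploiting the natural audio-visual
    correspondence. Simple averaging sketch, not the SeLaVi algorithm."""
    labels = []
    for v, a in zip(visual_probs, audio_probs):
        fused = [(x + y) / 2 for x, y in zip(v, a)]  # average the modalities
        labels.append(fused.index(max(fused)))       # argmax cluster id
    return labels
```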
Support-set bottlenecks for video-text representation learning
The dominant paradigm for learning video-text representations -- noise
contrastive learning -- increases the similarity of the representations of
pairs of samples that are known to be related, such as text and video from the
same sample, and pushes away the representations of all other pairs. We posit
that this last behaviour is too strict, enforcing dissimilar representations
even for samples that are semantically related -- for example, visually similar
videos or ones that share the same depicted action. In this paper, we propose a
novel method that alleviates this by leveraging a generative model to naturally
push these related samples together: each sample's caption must be
reconstructed as a weighted combination of other support samples' visual
representations. This simple idea ensures that representations are not
overly specialized to individual samples, are reusable across the dataset, and
explicitly encode semantics shared between samples, unlike noise contrastive
learning. Our proposed method outperforms others by a large margin on MSR-VTT,
VATEX, ActivityNet and MSVD for video-to-text and text-to-video retrieval.
Comment: Accepted as spotlight paper at the International Conference on
Learning Representations (ICLR) 202
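The weighted-combination idea above can be sketched as cross-attention: a caption embedding attends over other samples' visual embeddings and is reconstructed as the attention-weighted sum, so the reconstruction is forced to reuse support samples rather than memorize its own. This is an illustrative single-head sketch of the support-set idea, not the authors' architecture; function names are ours.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def reconstruct_caption(caption, support_visuals):
    """Reconstruct a caption embedding as an attention-weighted convex
    combination of OTHER samples' visual embeddings (the support set).
    Illustrative sketch of the bottleneck, not the released model."""
    # Dot-product attention scores against each support visual embedding.
    scores = [sum(c * v for c, v in zip(caption, vis)) for vis in support_visuals]
    weights = softmax(scores)
    dim = len(caption)
    return [sum(w * vis[d] for w, vis in zip(weights, support_visuals))
            for d in range(dim)]
```

Because the output is a convex combination of support features, no single sample can be reconstructed exactly from itself, which is what discourages overly sample-specific representations.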
Learning and interpreting deep representations from multi-modal data
Deep learning has resulted in ground-breaking progress in a variety of domains, from core machine learning tasks such as image, language, and video understanding, to real-world industries such as medicine, autonomous driving, and agriculture. Its success has been driven by providing neural networks with manual supervision from large-scale labelled datasets such as ImageNet to automatically learn hierarchical data representations. However, obtaining large-scale labelled data is often a very time-consuming and expensive process. To address this challenge, we push the limits of self-supervision from multi-modal video data. Video data usually contains multiple freely available modalities, such as images, audio, transcribed speech and textual captions. These modalities often share redundant semantic information and can therefore serve as pseudo-labels to supervise each other for representation learning, without the need for manual human labels. Without relying on labelled data, we are able to train these deep representations on very large-scale video data comprising millions of video clips collected from the Internet. We show the scalability benefits of multi-modal self-supervision by establishing new state-of-the-art performance in a variety of domains: video action recognition, text-to-video retrieval, text-to-image retrieval and audio classification. We also introduce other technical innovations in terms of data transformations, model architecture and loss functions to further improve the learning of these deep video representations using multi-modal self-supervision. A secondary contribution of this thesis is new tools to improve the interpretability of deep representations, given that it is notoriously difficult to decipher the key features encoded in them. For images, we show how perturbation analysis can be used to analyze the intermediate representations of a network.
For videos, we propose a novel clustering method using the Sinkhorn-Knopp algorithm to map deep video representations to human-interpretable semantic pseudo-labels. The contributions in this thesis are steps towards unlocking both the scalability and interpretability of deep video representation learning.
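The Sinkhorn-Knopp step named above can be sketched as alternately normalizing the rows and columns of a non-negative score matrix, so that every sample gets a proper assignment distribution while every cluster receives equal total mass. This is a generic sketch of the algorithm, not the thesis implementation; the balanced-assignment use follows the clustering setting described in the text.

```python
def sinkhorn_knopp(scores, n_iters=50):
    """Balanced soft cluster assignment via Sinkhorn-Knopp iterations on a
    non-negative (n_samples x n_clusters) score matrix: column steps push
    every cluster toward equal mass, row steps keep each sample's
    assignment a probability distribution. Generic sketch of the named
    algorithm, not the thesis code."""
    n, k = len(scores), len(scores[0])
    q = [row[:] for row in scores]
    for _ in range(n_iters):
        # Column normalization: scale each cluster's column to mass 1/k.
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col[j] * k) for j in range(k)] for i in range(n)]
        # Row normalization: each sample's assignment sums to 1.
        for i in range(n):
            s = sum(q[i])
            q[i] = [x / s for x in q[i]]
    return q
```

At convergence the rows are distributions and the columns each hold roughly n/k samples' worth of mass, which prevents the degenerate solution of every sample collapsing into one cluster.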
On Compositions of Transformations in Contrastive Self-Supervised Learning
In the image domain, excellent representations can be learned by inducing
invariance to content-preserving transformations via noise contrastive
learning. In this paper, we generalize contrastive learning to a wider set of
transformations, and their compositions, for which either invariance or
distinctiveness is sought. We show that it is not immediately obvious how
existing methods such as SimCLR can be extended to do so. Instead, we introduce
a number of formal requirements that all contrastive formulations must satisfy,
and propose a practical construction which satisfies these requirements. In
order to maximise the reach of this analysis, we express all components of
noise contrastive formulations as the choice of certain generalized data
transformations (GDTs), including data sampling. We then consider
videos as an example of data in which a large variety of transformations are
applicable, accounting for the extra modalities -- for which we analyze audio
and text -- and the dimension of time. We find that being invariant to certain
transformations and distinctive to others is critical to learning effective
video representations, improving the state-of-the-art for multiple benchmarks
by a large margin, and even surpassing supervised pretraining.
Comment: Accepted to ICCV 2021. Code and pretrained models are available at
https://github.com/facebookresearch/GD
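The invariance-versus-distinctiveness choice above can be sketched with a plain InfoNCE-style loss in which the transformation that produced each candidate decides whether it is a positive (pulled toward the anchor) or a negative (pushed away). This is a generic sketch of the framing, not the released GDT code; names and the temperature value are ours.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(anchor, candidates, positive_ids, temperature=0.1):
    """InfoNCE-style loss: candidates produced by transformations we want
    invariance to are listed in positive_ids; all others act as negatives,
    enforcing distinctiveness. Generic sketch of the GDT framing."""
    logits = [cosine(anchor, c) / temperature for c in candidates]
    m = max(logits)                      # shift for numerical stability
    exps = [math.exp(l - m) for l in logits]
    pos = sum(exps[i] for i in positive_ids)
    return -math.log(pos / sum(exps))
```

Marking a transformation's output as positive or negative is exactly the design choice the paper studies: e.g. time-shift might be treated as distinctive for video while modality change (video vs. audio of the same clip) is treated as invariant.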
Axis patterning by BMPs: cnidarian network reveals evolutionary constraints
BMP signaling plays a crucial role in the establishment of the dorso-ventral body axis in bilaterally symmetric animals. However, the topologies of the bone morphogenetic protein (BMP) signaling networks vary drastically in different animal groups, raising questions about the evolutionary constraints and evolvability of BMP signaling systems. Using loss-of-function analysis and mathematical modeling, we show that two signaling centers expressing different BMPs and BMP antagonists maintain the secondary axis of the sea anemone Nematostella. We demonstrate that BMP signaling is required for asymmetric Hox gene expression and mesentery formation. Computational analysis reveals that network parameters related to BMP4 and Chordin are constrained both in Nematostella and Xenopus, while those describing the BMP signaling modulators can vary significantly. Notably, only chordin, but not bmp4, expression needs to be spatially restricted for robust signaling gradient formation. Our data provide an explanation for the evolvability of BMP signaling systems in axis formation throughout Eumetazoa.
Space-Time Crop & Attend: improving cross-modal video representation learning
The quality of the image representations obtained from self-supervised learning depends strongly on the type of data augmentations used in the learning formulation. Recent papers have ported these methods from still images to videos and found that leveraging both audio and video signals yields strong gains; however, they did not find that spatial augmentations such as cropping, which are very important for still images, work as well for videos. In this paper, we improve these formulations in two ways unique to the spatio-temporal aspect of videos. First, for space, we show that spatial augmentations such as cropping do work well for videos too, but that previous implementations, due to the high processing and memory cost, could not do this at a scale sufficient for it to work well. To address this issue, we first introduce Feature Crop, a method to simulate such augmentations much more efficiently directly in feature space. Second, we show that as opposed to naïve average pooling, the use of transformer-based attention improves performance significantly, and is well suited for processing feature crops. Combining both of our discoveries into a new method, Space-Time Crop & Attend (STiCA), we achieve state-of-the-art performance across multiple video-representation learning benchmarks. In particular, we achieve new state-of-the-art accuracies of 67.0% on HMDB-51 and 93.1% on UCF-101 when pre-training on Kinetics-400. Code and pretrained models are available.
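The two ingredients above can be sketched together: Feature Crop takes a spatial window directly out of an already-computed feature map (avoiding a second encoder pass over cropped pixels), and the crop is then pooled with softmax attention instead of plain averaging. This is an illustrative single-head sketch of the idea, assuming a feature map laid out as H x W x D nested lists; it is not the STiCA implementation, and the mean-feature query is our simplification.

```python
import math

def feature_crop(feature_map, top, left, height, width):
    """Crop a spatial window directly from an (H x W x D) feature map,
    simulating a spatial crop augmentation in feature space rather than
    re-encoding cropped pixels. Sketch of the Feature Crop idea."""
    return [row[left:left + width] for row in feature_map[top:top + height]]

def attend_pool(crop):
    """Attention pooling over a cropped feature window: score each position
    against the mean feature and take a softmax-weighted sum, instead of
    naive average pooling. Illustrative single-head sketch; the mean-
    feature query is our simplification, not the paper's transformer."""
    feats = [f for row in crop for f in row]          # flatten positions
    dim = len(feats[0])
    query = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
    scores = [sum(q * x for q, x in zip(query, f)) for f in feats]
    m = max(scores)                                   # stable softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * f[d] for w, f in zip(weights, feats)) for d in range(dim)]
```

Because cropping happens on features, many crops per clip are cheap, which is what makes the augmentation usable at the scale the abstract argues is necessary.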