23 research outputs found

    Understanding Deep Networks via Extremal Perturbations and Smooth Masks

    The problem of attribution is concerned with identifying the parts of an input that are responsible for a model's output. An important family of attribution methods is based on measuring the effect of perturbations applied to the input. In this paper, we discuss some of the shortcomings of existing approaches to perturbation analysis and address them by introducing the concept of extremal perturbations, which are theoretically grounded and interpretable. We also introduce a number of technical innovations to compute extremal perturbations, including a new area constraint and a parametric family of smooth perturbations, which allow us to remove all tunable hyper-parameters from the optimization problem. We analyze the effect of perturbations as a function of their area, demonstrating excellent sensitivity to the spatial properties of the deep neural network under stimulation. We also extend perturbation analysis to the intermediate layers of a network. This application allows us to identify the salient channels necessary for classification, which, when visualized using feature inversion, can be used to elucidate model behavior. Lastly, we introduce TorchRay, an interpretability library built on PyTorch. Comment: Accepted at ICCV 2019 as oral; supp mat at http://ruthcfong.github.io/files/fong19_extremal_supps.pd
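
    As a rough illustration of the perturbation idea described above, the sketch below optimizes a low-resolution mask, upsamples it to obtain a smooth full-size mask, and penalizes deviation from a target area. It is not the TorchRay implementation: the paper uses a parametric smooth mask family and a hard area constraint, whereas this sketch substitutes a simple soft penalty, and all names and hyper-parameter values are assumptions.

```python
import torch
import torch.nn.functional as F

def extremal_mask(model, image, target_class, area=0.1, steps=200, lr=0.05):
    """Optimize a smooth mask that keeps roughly `area` of the image while
    maximizing the classifier's score for `target_class` (illustrative only)."""
    mask_logits = torch.zeros(1, 1, 14, 14, requires_grad=True)  # low-res mask parameters
    opt = torch.optim.Adam([mask_logits], lr=lr)
    for _ in range(steps):
        # Upsampling the low-resolution logits yields a smooth full-size mask.
        mask = torch.sigmoid(F.interpolate(mask_logits, size=image.shape[-2:],
                                           mode="bilinear", align_corners=False))
        preserved = mask * image                   # "preservation" perturbation
        score = model(preserved)[0, target_class]
        area_penalty = (mask.mean() - area) ** 2   # soft stand-in for the hard area constraint
        loss = -score + 10.0 * area_penalty
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mask.detach()
```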

    Labelling unlabelled videos from scratch with multi-modal self-supervision

    A large part of the current success of deep learning lies in the effectiveness of data -- more precisely: labelled data. Yet, labelling a dataset with human annotation continues to carry high costs, especially for videos. While in the image domain recent methods have made it possible to generate meaningful (pseudo-)labels for unlabelled datasets without supervision, this development is missing for the video domain, where learning feature representations is the current focus. In this work, we a) show that unsupervised labelling of a video dataset does not come for free from strong feature encoders and b) propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations, by leveraging the natural correspondence between the audio and visual modalities. An extensive analysis shows that the resulting clusters have high semantic overlap with ground-truth human labels. We further introduce the first benchmarking results on unsupervised labelling of the common video datasets Kinetics, Kinetics-Sound, VGG-Sound and AVE. Comment: Accepted to NeurIPS 2020. Project page: https://www.robots.ox.ac.uk/~vgg/research/selavi, code: https://github.com/facebookresearch/selav
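
    Pseudo-labelling a dataset by clustering requires the clusters to be balanced rather than collapsing onto a few labels; a standard way to achieve this is Sinkhorn-Knopp normalization of the sample-to-cluster similarities. The sketch below is a generic, hypothetical version of such balanced assignment, not the released selavi code; the prototype matrix and the way the two modalities are averaged in the usage comment are assumptions.

```python
import torch

def sinkhorn_assign(scores, n_iters=3, eps=0.05):
    """Balanced soft cluster assignment via Sinkhorn-Knopp normalization.
    `scores` holds (num_samples, num_clusters) similarity logits."""
    Q = torch.exp(scores / eps).t()              # (K, N)
    Q = Q / Q.sum()
    K, N = Q.shape
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=1, keepdim=True) / K   # equal total mass per cluster
        Q = Q / Q.sum(dim=0, keepdim=True) / N   # proper distribution per sample
    return (Q * N).t()                           # rows are soft pseudo-label distributions

# Hypothetical usage: require audio and video to agree by averaging their
# similarities to a shared set of cluster prototypes before assignment.
# q = sinkhorn_assign(0.5 * (video_feats @ protos.t() + audio_feats @ protos.t()))
# pseudo_labels = q.argmax(dim=1)
```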

    Support-set bottlenecks for video-text representation learning

    The dominant paradigm for learning video-text representations -- noise contrastive learning -- increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs. We posit that this last behaviour is too strict, enforcing dissimilar representations even for samples that are semantically related -- for example, visually similar videos or ones that share the same depicted action. In this paper, we propose a novel method that alleviates this by leveraging a generative model to naturally push these related samples together: each sample's caption must be reconstructed as a weighted combination of other support samples' visual representations. This simple idea ensures that representations are not overly specialized to individual samples, are reusable across the dataset, and explicitly encode semantics shared between samples, unlike noise contrastive learning. Our proposed method outperforms others by a large margin on MSR-VTT, VATEX, ActivityNet and MSVD for video-to-text and text-to-video retrieval. Comment: Accepted as spotlight paper at the International Conference on Learning Representations (ICLR) 202
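
    To make the support-set idea above concrete, the sketch below reconstructs each caption embedding as an attention-weighted combination of the other samples' video embeddings. It is a simplified, embedding-space stand-in for the caption reconstruction the abstract describes, not the paper's model; the temperature value, the mean-squared-error objective, and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def support_set_loss(text_embs, video_embs, temperature=0.1):
    """Reconstruct each caption embedding as an attention-weighted combination
    of the *other* samples' video embeddings (simplified illustration)."""
    sims = text_embs @ video_embs.t() / temperature          # (B, B) similarities
    self_mask = torch.eye(sims.size(0), dtype=torch.bool)
    weights = sims.masked_fill(self_mask, float("-inf")).softmax(dim=-1)
    reconstruction = weights @ video_embs                    # weighted support combination
    return F.mse_loss(reconstruction, text_embs)
```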

    Learning and interpreting deep representations from multi-modal data

    Deep learning has resulted in ground-breaking progress in a variety of domains, from core machine learning tasks such as image, language, and video understanding, to real-world industries such as medicine, autonomous driving, and agriculture. Its success has been driven by providing neural networks with manual supervision from large-scale labelled datasets such as ImageNet to automatically learn hierarchical data representations. However, obtaining large-scale labelled data is often a very time-consuming and expensive process. To address this challenge, we push the limits of self-supervision from multi-modal video data. Video data usually contain multiple freely available modalities, such as images, audio, transcribed speech and textual captions. These modalities often share redundant semantic information and can therefore serve as pseudo-labels to supervise each other for representation learning, without necessitating the use of manual human labels. Without the reliance on labelled data, we are able to train these deep representations on very large-scale video data comprising millions of video clips collected from the Internet. We show the scalability benefits of multi-modal self-supervision by establishing new state-of-the-art performance in a variety of domains: video action recognition, text-to-video retrieval, text-to-image retrieval and audio classification. We also introduce other technical innovations in terms of data transformations, model architecture and loss functions to further improve learning these deep video representations using multi-modal self-supervision. A secondary contribution of this thesis is a set of new tools to improve the interpretability of deep representations, given that it is notoriously difficult to decipher the key features they encode. For images, we show how perturbation analysis can be used to analyze the intermediate representations of a network. For videos, we propose a novel clustering method using the Sinkhorn-Knopp algorithm to map deep video representations to human-interpretable semantic pseudo-labels. The contributions in this thesis are steps towards unlocking both the scalability and interpretability of deep video representation learning.
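
    One concrete way modalities can "supervise each other", as described above, is a symmetric cross-modal contrastive objective in which the audio and visual streams of the same clip form a positive pair. The sketch below is a generic illustration of that idea, not code from the thesis; the temperature value and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_modal_nce(video_embs, audio_embs, temperature=0.07):
    """Symmetric InfoNCE in which audio and video from the same clip act as
    each other's labels, so no human annotation is needed (illustrative)."""
    v = F.normalize(video_embs, dim=-1)
    a = F.normalize(audio_embs, dim=-1)
    logits = v @ a.t() / temperature               # (B, B) cross-modal similarities
    targets = torch.arange(v.size(0))              # the matching clip is the positive
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```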

    On Compositions of Transformations in Contrastive Self-Supervised Learning

    In the image domain, excellent representations can be learned by inducing invariance to content-preserving transformations via noise contrastive learning. In this paper, we generalize contrastive learning to a wider set of transformations, and their compositions, for which either invariance or distinctiveness is sought. We show that it is not immediately obvious how existing methods such as SimCLR can be extended to do so. Instead, we introduce a number of formal requirements that all contrastive formulations must satisfy, and propose a practical construction which satisfies these requirements. In order to maximise the reach of this analysis, we express all components of noise contrastive formulations as the choice of certain generalized data transformations (GDTs), including data sampling. We then consider videos as an example of data in which a large variety of transformations are applicable, accounting for the extra modalities -- for which we analyze audio and text -- and the dimension of time. We find that being invariant to certain transformations and distinctive to others is critical to learning effective video representations, improving the state of the art on multiple benchmarks by a large margin, and even surpassing supervised pretraining. Comment: Accepted to ICCV 2021. Code and pretrained models are available at https://github.com/facebookresearch/GD
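
    To show how invariance and distinctiveness can coexist in one contrastive objective, the sketch below treats two samples as positives only when they come from the same clip and share the same time direction, so the representation is pushed to be invariant to, e.g., cropping or modality but distinctive to time reversal. This is an illustrative construction in the spirit of the framework described above, not the released code; all names and the exact loss form are assumptions.

```python
import torch
import torch.nn.functional as F

def invariance_distinctiveness_loss(embs, clip_ids, time_reversed, temperature=0.07):
    """Contrastive loss where positives must share the clip AND the time
    direction: invariance to crops/modality, distinctiveness to time reversal.
    `embs`: (N, D) float; `clip_ids`, `time_reversed`: (N,) integer tensors."""
    z = F.normalize(embs, dim=-1)
    sims = z @ z.t() / temperature
    positives = (clip_ids[:, None] == clip_ids[None, :]) & \
                (time_reversed[:, None] == time_reversed[None, :])
    diag = torch.eye(len(z), dtype=torch.bool)
    positives = positives & ~diag                  # a sample is not its own positive
    log_prob = sims.masked_fill(diag, float("-inf")).log_softmax(dim=-1)
    has_pos = positives.any(dim=-1)                # anchors with at least one positive
    per_anchor = (log_prob * positives).sum(-1)[has_pos] / positives.sum(-1)[has_pos]
    return -per_anchor.mean()
```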

    Axis patterning by BMPs: cnidarian network reveals evolutionary constraints

    BMP signaling plays a crucial role in the establishment of the dorso-ventral body axis in bilaterally symmetric animals. However, the topologies of the bone morphogenetic protein (BMP) signaling networks vary drastically in different animal groups, raising questions about the evolutionary constraints and evolvability of BMP signaling systems. Using loss-of-function analysis and mathematical modeling, we show that two signaling centers expressing different BMPs and BMP antagonists maintain the secondary axis of the sea anemone Nematostella. We demonstrate that BMP signaling is required for asymmetric Hox gene expression and mesentery formation. Computational analysis reveals that network parameters related to BMP4 and Chordin are constrained both in Nematostella and Xenopus, while those describing the BMP signaling modulators can vary significantly. Notably, only chordin expression, but not bmp4 expression, needs to be spatially restricted for robust signaling gradient formation. Our data provide an explanation of the evolvability of BMP signaling systems in axis formation throughout Eumetazoa.
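
    For readers unfamiliar with how a spatially restricted antagonist can shape a morphogen gradient, the toy model below simulates a generic one-dimensional BMP/Chordin pair: BMP is produced everywhere, Chordin only in a small source region, and irreversible binding depletes free BMP near that source. This is a textbook-style illustration only, not the fitted Nematostella network analyzed in the paper, and every parameter value is an arbitrary assumption.

```python
import numpy as np

def bmp_chordin_toy(n=200, steps=20000, dt=0.01, dx=1.0,
                    D=1.0, k_bind=0.1, prod_bmp=0.1, prod_chd=0.1, deg=0.01):
    """Generic 1-D BMP/Chordin toy model with no-flux boundaries (not the
    paper's model): a restricted Chordin source grades the free-BMP profile."""
    def laplacian(u):
        padded = np.concatenate(([u[0]], u, [u[-1]]))  # reflective = no-flux edges
        return (padded[:-2] - 2.0 * u + padded[2:]) / dx**2

    bmp = np.zeros(n)
    chd = np.zeros(n)
    chd_source = np.zeros(n)
    chd_source[: n // 10] = prod_chd                   # spatially restricted chordin
    for _ in range(steps):
        binding = k_bind * bmp * chd                   # irreversible BMP-Chordin binding
        bmp = bmp + dt * (D * laplacian(bmp) + prod_bmp - binding - deg * bmp)
        chd = chd + dt * (D * laplacian(chd) + chd_source - binding - deg * chd)
    return bmp                                         # low near the Chordin source, high far away

# e.g. profile = bmp_chordin_toy(); profile[0] < profile[-1]
```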

    Space-Time Crop & Attend: improving cross-modal video representation learning

    The quality of the image representations obtained from self-supervised learning depends strongly on the type of data augmentations used in the learning formulation. Recent papers have ported these methods from still images to videos and found that leveraging both audio and video signals yields strong gains; however, they did not find that spatial augmentations such as cropping, which are very important for still images, work as well for videos. In this paper, we improve these formulations in two ways unique to the spatio-temporal aspect of videos. First, for space, we show that spatial augmentations such as cropping do work well for videos too, but that previous implementations, due to the high processing and memory cost, could not do this at a scale sufficient for it to work well. To address this issue, we first introduce Feature Crop, a method to simulate such augmentations much more efficiently directly in feature space. Second, we show that, as opposed to naïve average pooling, the use of transformer-based attention improves performance significantly, and is well suited for processing feature crops. Combining both of our discoveries into a new method, Space-Time Crop & Attend (STiCA), we achieve state-of-the-art performance across multiple video-representation learning benchmarks. In particular, we achieve new state-of-the-art accuracies of 67.0% on HMDB-51 and 93.1% on UCF-101 when pre-training on Kinetics-400. Code and pretrained models are available.
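
    A rough sketch of the two ingredients named above: cropping applied directly to a backbone's spatio-temporal feature map rather than the raw video, and attention pooling of the resulting tokens via a learned query. This is an illustrative module, not the released STiCA code; the use of nn.TransformerEncoderLayer, the crop-box format, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class FeatureCropAttend(nn.Module):
    """Crop a spatio-temporal feature map (cheap stand-in for pixel-space
    augmentation) and pool the crop with attention via a learned query token."""
    def __init__(self, dim=512, heads=4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                  batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, dim))

    def forward(self, feats, crop_box):
        # feats: (B, C, T, H, W) features from a video backbone
        t0, t1, y0, y1, x0, x1 = crop_box
        crop = feats[:, :, t0:t1, y0:y1, x0:x1]           # Feature Crop in feature space
        tokens = crop.flatten(2).transpose(1, 2)          # (B, T'*H'*W', C)
        tokens = torch.cat([self.query.expand(tokens.size(0), -1, -1), tokens], dim=1)
        return self.encoder(tokens)[:, 0]                 # attention pooling, not averaging

# e.g. FeatureCropAttend()(torch.randn(2, 512, 4, 7, 7), (0, 4, 1, 6, 1, 6)) has shape (2, 512)
```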