Understanding Deep Networks via Extremal Perturbations and Smooth Masks
The problem of attribution is concerned with identifying the parts of an
input that are responsible for a model's output. An important family of
attribution methods is based on measuring the effect of perturbations applied
to the input. In this paper, we discuss some of the shortcomings of existing
approaches to perturbation analysis and address them by introducing the concept
of extremal perturbations, which are theoretically grounded and interpretable.
We also introduce a number of technical innovations to compute extremal
perturbations, including a new area constraint and a parametric family of
smooth perturbations, which allow us to remove all tunable hyper-parameters
from the optimization problem. We analyze the effect of perturbations as a
function of their area, demonstrating excellent sensitivity to the spatial
properties of the deep neural network under stimulation. We also extend
perturbation analysis to the intermediate layers of a network. This application
allows us to identify the salient channels necessary for classification, which,
when visualized using feature inversion, can be used to elucidate model
behavior. Lastly, we introduce TorchRay, an interpretability library built on
PyTorch.
Comment: Accepted at ICCV 2019 as oral; supp mat at
http://ruthcfong.github.io/files/fong19_extremal_supps.pd
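The area constraint described above lends itself to a compact sketch: rank the soft mask values and penalize their deviation from a template that is 1 on the top fraction of entries and 0 elsewhere, so a mask of exactly the target area incurs no penalty. This is an illustrative reading of the abstract, not the TorchRay implementation; the function name and details are ours.

```python
def area_loss(mask, area_fraction):
    """Penalty driving a soft mask (values in [0, 1]) toward covering a
    fixed fraction of the input: sort values descending and compare them
    against a reference vector of ones followed by zeros.
    Illustrative sketch of the area constraint, not the authors' code."""
    n = len(mask)
    k = round(area_fraction * n)
    ranked = sorted(mask, reverse=True)
    reference = [1.0] * k + [0.0] * (n - k)
    return sum((r - t) ** 2 for r, t in zip(ranked, reference)) / n
```

A binary mask covering exactly the requested fraction scores zero, while over- or under-sized masks are penalized, which is what lets the area act as a controllable analysis knob rather than a tuned hyper-parameter.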
Labelling unlabelled videos from scratch with multi-modal self-supervision
A large part of the current success of deep learning lies in the
effectiveness of data -- more precisely: labelled data. Yet, labelling a
dataset with human annotation continues to carry high costs, especially for
videos. While recent methods in the image domain have made it possible to
generate meaningful (pseudo-)labels for unlabelled datasets without
supervision, this development is missing in the video domain, where learning
feature representations is the current focus. In this work, we a) show that
unsupervised labelling of a video dataset does not come for free from strong
feature encoders and b) propose a novel clustering method that allows
pseudo-labelling of a video dataset without any human annotations, by
leveraging the natural correspondence between the audio and visual modalities.
An extensive analysis shows that the resulting clusters have high semantic
overlap with ground-truth human labels. We further introduce the first
benchmarking results on unsupervised labelling of the common video datasets
Kinetics, Kinetics-Sound, VGG-Sound and AVE.
Comment: Accepted to NeurIPS 2020. Project page:
https://www.robots.ox.ac.uk/~vgg/research/selavi, code:
https://github.com/facebookresearch/selav
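The audio-visual correspondence exploited above can be sketched in its simplest form: fuse per-modality cluster probabilities and take the argmax as the pseudo-label, so that agreement between the two modalities decides the assignment. This is a deliberately minimal averaging sketch of the idea in the abstract, not the actual SeLaVi method; all names are ours.

```python
def multimodal_pseudo_labels(visual_probs, audio_probs):
    """Assign each clip a pseudo-label by fusing per-modality cluster
    probability vectors, exploiting the natural audio-visual
    correspondence. Simple averaging sketch, not the SeLaVi algorithm."""
    labels = []
    for v, a in zip(visual_probs, audio_probs):
        fused = [(x + y) / 2 for x, y in zip(v, a)]  # average the modalities
        labels.append(fused.index(max(fused)))       # argmax cluster id
    return labels
```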
Support-set bottlenecks for video-text representation learning
The dominant paradigm for learning video-text representations -- noise
contrastive learning -- increases the similarity of the representations of
pairs of samples that are known to be related, such as text and video from the
same sample, and pushes away the representations of all other pairs. We posit
that this last behaviour is too strict, enforcing dissimilar representations
even for samples that are semantically related -- for example, visually similar
videos or ones that share the same depicted action. In this paper, we propose a
novel method that alleviates this by leveraging a generative model to naturally
push these related samples together: each sample's caption must be
reconstructed as a weighted combination of other support samples' visual
representations. This simple idea ensures that representations are not
overly specialized to individual samples, are reusable across the dataset, and
explicitly encode semantics shared between samples, unlike noise contrastive
learning. Our proposed method outperforms others by a large margin on MSR-VTT,
VATEX, ActivityNet and MSVD for video-to-text and text-to-video retrieval.
Comment: Accepted as spotlight paper at the International Conference on
Learning Representations (ICLR) 202
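The weighted-combination idea above can be sketched as cross-attention: a caption embedding attends over other samples' visual embeddings and is reconstructed as the attention-weighted sum, so the reconstruction is forced to reuse support samples rather than memorize its own. This is an illustrative single-head sketch of the support-set idea, not the authors' architecture; function names are ours.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def reconstruct_caption(caption, support_visuals):
    """Reconstruct a caption embedding as an attention-weighted convex
    combination of OTHER samples' visual embeddings (the support set).
    Illustrative sketch of the bottleneck, not the released model."""
    # Dot-product attention scores against each support visual embedding.
    scores = [sum(c * v for c, v in zip(caption, vis)) for vis in support_visuals]
    weights = softmax(scores)
    dim = len(caption)
    return [sum(w * vis[d] for w, vis in zip(weights, support_visuals))
            for d in range(dim)]
```

Because the output is a convex combination of support features, no single sample can be reconstructed exactly from itself, which is what discourages overly sample-specific representations.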
Learning and interpreting deep representations from multi-modal data
Deep learning has resulted in ground-breaking progress in a variety of domains, from core machine learning tasks such as image, language, and video understanding, to real-world industries such as medicine, autonomous driving, and agriculture. Its success has been driven by providing neural networks with manual supervision from large-scale labelled datasets such as ImageNet to automatically learn hierarchical data representations. However, obtaining large-scale labelled data is often a very time-consuming and expensive process. To address this challenge, we push the limits of self-supervision from multi-modal video data. Video data usually contains multiple freely available modalities, such as images, audio, transcribed speech and textual captions. These modalities often share redundant semantic information and can therefore serve as pseudo-labels to supervise each other for representation learning, without the need for manual human labels. Without relying on labelled data, we are able to train these deep representations on very large-scale video data comprising millions of video clips collected from the Internet. We show the scalability benefits of multi-modal self-supervision by establishing new state-of-the-art performance in a variety of domains: video action recognition, text-to-video retrieval, text-to-image retrieval and audio classification. We also introduce other technical innovations in terms of data transformations, model architecture and loss functions to further improve the learning of these deep video representations using multi-modal self-supervision. A secondary contribution of this thesis is new tools to improve the interpretability of deep representations, given that it is notoriously difficult to decipher the key features encoded in them. For images, we show how perturbation analysis can be used to analyze the intermediate representations of a network.
For videos, we propose a novel clustering method using the Sinkhorn-Knopp algorithm to map deep video representations to human-interpretable semantic pseudo-labels. The contributions in this thesis are steps towards unlocking both the scalability and interpretability of deep video representation learning.
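The Sinkhorn-Knopp step named above can be sketched as alternately normalizing the rows and columns of a non-negative score matrix, so that every sample gets a proper assignment distribution while every cluster receives equal total mass. This is a generic sketch of the algorithm, not the thesis implementation; the balanced-assignment use follows the clustering setting described in the text.

```python
def sinkhorn_knopp(scores, n_iters=50):
    """Balanced soft cluster assignment via Sinkhorn-Knopp iterations on a
    non-negative (n_samples x n_clusters) score matrix: column steps push
    every cluster toward equal mass, row steps keep each sample's
    assignment a probability distribution. Generic sketch of the named
    algorithm, not the thesis code."""
    n, k = len(scores), len(scores[0])
    q = [row[:] for row in scores]
    for _ in range(n_iters):
        # Column normalization: scale each cluster's column to mass 1/k.
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col[j] * k) for j in range(k)] for i in range(n)]
        # Row normalization: each sample's assignment sums to 1.
        for i in range(n):
            s = sum(q[i])
            q[i] = [x / s for x in q[i]]
    return q
```

At convergence the rows are distributions and the columns each hold roughly n/k samples' worth of mass, which prevents the degenerate solution of every sample collapsing into one cluster.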
On Compositions of Transformations in Contrastive Self-Supervised Learning
In the image domain, excellent representations can be learned by inducing
invariance to content-preserving transformations via noise contrastive
learning. In this paper, we generalize contrastive learning to a wider set of
transformations, and their compositions, for which either invariance or
distinctiveness is sought. We show that it is not immediately obvious how
existing methods such as SimCLR can be extended to do so. Instead, we introduce
a number of formal requirements that all contrastive formulations must satisfy,
and propose a practical construction which satisfies these requirements. In
order to maximise the reach of this analysis, we express all components of
noise contrastive formulations as the choice of certain generalized data
transformations (GDTs), including data sampling. We then consider
videos as an example of data in which a large variety of transformations are
applicable, accounting for the extra modalities -- for which we analyze audio
and text -- and the dimension of time. We find that being invariant to certain
transformations and distinctive to others is critical to learning effective
video representations, improving the state-of-the-art for multiple benchmarks
by a large margin, and even surpassing supervised pretraining.
Comment: Accepted to ICCV 2021. Code and pretrained models are available at
https://github.com/facebookresearch/GD
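The invariance-versus-distinctiveness choice above can be sketched with a plain InfoNCE-style loss in which the transformation that produced each candidate decides whether it is a positive (pulled toward the anchor) or a negative (pushed away). This is a generic sketch of the framing, not the released GDT code; names and the temperature value are ours.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(anchor, candidates, positive_ids, temperature=0.1):
    """InfoNCE-style loss: candidates produced by transformations we want
    invariance to are listed in positive_ids; all others act as negatives,
    enforcing distinctiveness. Generic sketch of the GDT framing."""
    logits = [cosine(anchor, c) / temperature for c in candidates]
    m = max(logits)                      # shift for numerical stability
    exps = [math.exp(l - m) for l in logits]
    pos = sum(exps[i] for i in positive_ids)
    return -math.log(pos / sum(exps))
```

Marking a transformation's output as positive or negative is exactly the design choice the paper studies: e.g. time-shift might be treated as distinctive for video while modality change (video vs. audio of the same clip) is treated as invariant.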
Axis patterning by BMPs: cnidarian network reveals evolutionary constraints
BMP signaling plays a crucial role in the establishment of the dorso-ventral body axis in bilaterally symmetric animals. However, the topologies of the bone morphogenetic protein (BMP) signaling networks vary drastically in different animal groups, raising questions about the evolutionary constraints and evolvability of BMP signaling systems. Using loss-of-function analysis and mathematical modeling, we show that two signaling centers expressing different BMPs and BMP antagonists maintain the secondary axis of the sea anemone Nematostella. We demonstrate that BMP signaling is required for asymmetric Hox gene expression and mesentery formation. Computational analysis reveals that network parameters related to BMP4 and Chordin are constrained both in Nematostella and Xenopus, while those describing the BMP signaling modulators can vary significantly. Notably, only chordin, but not bmp4, expression needs to be spatially restricted for robust signaling gradient formation. Our data provide an explanation for the evolvability of BMP signaling systems in axis formation throughout Eumetazoa.
Space-Time Crop & Attend: improving cross-modal video representation learning
The quality of the image representations obtained from self-supervised learning depends strongly on the type of data augmentations used in the learning formulation. Recent papers have ported these methods from still images to videos and found that leveraging both audio and video signals yields strong gains; however, they did not find that spatial augmentations such as cropping, which are very important for still images, work as well for videos. In this paper, we improve these formulations in two ways unique to the spatio-temporal aspect of videos. First, for space, we show that spatial augmentations such as cropping do work well for videos too, but that previous implementations, due to the high processing and memory cost, could not do this at a scale sufficient for it to work well. To address this issue, we first introduce Feature Crop, a method to simulate such augmentations much more efficiently directly in feature space. Second, we show that as opposed to naïve average pooling, the use of transformer-based attention improves performance significantly, and is well suited for processing feature crops. Combining both of our discoveries into a new method, Space-Time Crop & Attend (STiCA), we achieve state-of-the-art performance across multiple video-representation learning benchmarks. In particular, we achieve new state-of-the-art accuracies of 67.0% on HMDB-51 and 93.1% on UCF-101 when pre-training on Kinetics-400. Code and pretrained models are available.
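The two ingredients above can be sketched together: Feature Crop takes a spatial window directly out of an already-computed feature map (avoiding a second encoder pass over cropped pixels), and the crop is then pooled with softmax attention instead of plain averaging. This is an illustrative single-head sketch of the idea, assuming a feature map laid out as H x W x D nested lists; it is not the STiCA implementation, and the mean-feature query is our simplification.

```python
import math

def feature_crop(feature_map, top, left, height, width):
    """Crop a spatial window directly from an (H x W x D) feature map,
    simulating a spatial crop augmentation in feature space rather than
    re-encoding cropped pixels. Sketch of the Feature Crop idea."""
    return [row[left:left + width] for row in feature_map[top:top + height]]

def attend_pool(crop):
    """Attention pooling over a cropped feature window: score each position
    against the mean feature and take a softmax-weighted sum, instead of
    naive average pooling. Illustrative single-head sketch; the mean-
    feature query is our simplification, not the paper's transformer."""
    feats = [f for row in crop for f in row]          # flatten positions
    dim = len(feats[0])
    query = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
    scores = [sum(q * x for q, x in zip(query, f)) for f in feats]
    m = max(scores)                                   # stable softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * f[d] for w, f in zip(weights, feats)) for d in range(dim)]
```

Because cropping happens on features, many crops per clip are cheap, which is what makes the augmentation usable at the scale the abstract argues is necessary.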