8 research outputs found

Disability-first Dataset Creation: Lessons from Constructing a Dataset for Teachable Object Recognition with Blind and Low Vision Data Collectors

Author: Cutrell E.
Harris M. T.
Hofmann K.
Massiceti D.
Morrison C.
Stumpf S.
Theodorou L.
Zintgraf L.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 17/10/2021

Artificial Intelligence (AI) for accessibility is a rapidly growing area, requiring datasets that are inclusive of the disabled users that assistive technology aims to serve. We offer insights from a multi-disciplinary project that constructed a dataset for teachable object recognition with people who are blind or low vision. Teachable object recognition enables users to teach a model objects that are of interest to them, e.g., their white cane or own sunglasses, by providing example images or videos of objects. In this paper, we make the following contributions: 1) a disability-first procedure to support blind and low vision data collectors to produce good quality data, using video rather than images; 2) a validation and evolution of this procedure through a series of data collection phases; and 3) a set of questions to orient researchers involved in creating datasets toward reflecting on the needs of their participant community.

City Research Online

Enlighten

FLIPDIAL: A generative model for two-way visual dialogue

Author: Dokania PK
Massiceti D
Narayanaswamy S
Torr PH
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2018

We present FLIPDIAL, a generative model for Visual Dialogue that simultaneously plays the role of both participants in a visually-grounded dialogue. Given context in the form of an image and an associated caption summarising the contents of the image, FLIPDIAL learns both to answer questions and put forward questions, capable of generating entire sequences of dialogue (question-answer pairs) which are diverse and relevant to the image. To do this, FLIPDIAL relies on a simple but surprisingly powerful idea: it uses convolutional neural networks (CNNs) to encode entire dialogues directly, implicitly capturing dialogue context, and conditional VAEs to learn the generative model. FLIPDIAL outperforms the state-of-the-art model in the sequential answering task (1VD) on the VisDial dataset by 5 points in Mean Rank using the generated answers. We are the first to extend this paradigm to full two-way visual dialogue (2VD), where our model is capable of generating both questions and answers in sequence based on a visual input, for which we propose a set of novel evaluation measures and metrics.

Oxford University Research Archive

FLIPDIAL: A generative model for two-way visual dialogue

Author: Dokania PK
Massiceti D
Narayanaswamy S
Torr PH
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 17/12/2018

Oxford University Research Archive

Visual dialogue without vision or dialogue

Author: Dokania P
Massiceti D
Siddharth N
Torr P
Publication venue: Neural Information Processing Systems Foundation
Publication date: 01/01/2018

We characterise some of the quirks and shortcomings in the exploration of visual dialogue (VD), a sequential question-answering task where the questions and corresponding answers are related through given visual stimuli. To do so, we develop an embarrassingly simple method based on canonical correlation analysis (CCA) that, on the standard dataset, achieves near state-of-the-art performance on mean rank (MR). In direct contrast to current complex and over-parametrised architectures that are both compute and time intensive, our method ignores the visual stimuli, ignores the sequencing of dialogue, does not need gradients, uses off-the-shelf feature extractors, has at least an order of magnitude fewer parameters, and learns in practically no time. We argue that these results are indicative of issues in current approaches to visual dialogue and conduct analyses to highlight implicit dataset biases and effects of over-constrained evaluation metrics. Our code is publicly available.

Oxford University Research Archive

Random forests versus neural networks - What's best for camera localization?

Author: Brachmann E
Krull A
Massiceti D
Rother C
Torr PHS
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017

This work addresses the task of camera localization in a known 3D scene given a single input RGB image. State-of-the-art approaches accomplish this in two steps: first, regressing for every pixel in the image its 3D scene coordinate and, subsequently, using these coordinates to estimate the final 6D camera pose via RANSAC. To solve the first step, Random Forests (RFs) are typically used. On the other hand, Neural Networks (NNs) reign in many dense regression tasks, but are not test-time efficient. We ask the question: which of the two is best for camera localization? To address this, we make two method contributions: (1) a test-time efficient NN architecture which we term a ForestNet that is derived and initialized from a RF, and (2) a new fully-differentiable robust averaging technique for regression ensembles which can be trained end-to-end with a NN. Our experimental findings show that for scene coordinate regression, traditional NN architectures are superior to test-time efficient RFs and ForestNets; however, this does not translate to final 6D camera pose accuracy, where RFs and ForestNets perform slightly better. To summarize, our best method, a ForestNet with a robust average, which has an equivalent fast and lightweight RF, improves over the state-of-the-art for camera localization on the 7-Scenes dataset [1]. While this work focuses on scene coordinate regression for camera localization, our innovations may also be applied to other continuous regression tasks.

Oxford University Research Archive

Bottom-up top-down cues for weakly-supervised semantic segmentation

Author: Cheng M-M
Dokania PK
Hou Q
Massiceti D
Torr PHS
Wei Y
Publication venue: Springer, Cham
Publication date: 09/04/2017

We consider the task of learning a classifier for semantic segmentation using weak supervision in the form of image labels specifying objects present in the image. Our method uses deep convolutional neural networks (CNNs) and adopts an Expectation-Maximization (EM) based approach. We focus on the following three aspects of EM: (i) initialization; (ii) latent posterior estimation (E-step); and (iii) the parameter update (M-step). We show that saliency and attention maps, bottom-up and top-down cues respectively, of images with single objects (simple images) provide highly reliable cues to learn an initialization for the EM. Intuitively, given weak supervision, we first learn to segment simple images and then move towards the complex ones. Next, for updating the parameters (M-step), we propose to minimize the combination of the standard softmax loss and the KL divergence between the latent posterior distribution (obtained using the E-step) and the likelihood given by the CNN. This combination is more robust to wrong predictions made by the E-step of the EM algorithm. Extensive experiments and discussions show that our method is very simple and intuitive, and outperforms the state-of-the-art method with a very high margin of 3.7% and 3.9% on the PASCAL VOC12 train and test sets respectively, thus setting new state-of-the-art results.

arXiv.org e-Print Archive

Oxford University Research Archive

Memory Efficient Meta-Learning with Large Images

Author: Bronskill J
Hofmann K
Massiceti D
Nowozin S
Patacchiola M
Turner RE

Meta-learning approaches to few-shot classification are computationally efficient at test time, requiring just a few optimization steps or a single forward pass to learn a new task, but they remain highly memory-intensive to train. This limitation arises because a task's entire support set, which can contain up to 1000 images, must be processed before an optimization step can be taken. Harnessing the performance gains offered by large images thus requires either parallelizing the meta-learner across multiple GPUs, which may not be available, or trade-offs between task and image size when memory constraints apply. We improve on both options by proposing LITE, a general and memory efficient episodic training scheme that enables meta-training on large tasks composed of large images on a single GPU. We achieve this by observing that the gradients for a task can be decomposed into a sum of gradients over the task's training images. This enables us to perform a forward pass on a task's entire training set but realize significant memory savings by back-propagating only a random subset of these images, which we show is an unbiased approximation of the full gradient. We use LITE to train meta-learners and demonstrate new state-of-the-art accuracy on the real-world ORBIT benchmark and 3 of the 4 parts of the challenging VTAB+MD benchmark relative to leading meta-learners. LITE also enables meta-learners to be competitive with transfer learning approaches but at a fraction of the test-time computational cost, thus serving as a counterpoint to the recent narrative that transfer learning is all you need for few-shot classification.
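
The subset back-propagation idea in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy embedding (`tanh` of a dot product), the sum pooling, the squared-error loss, and the rescaling by N/H are all assumptions chosen so the example stays self-contained. It only shows that when a task gradient is a sum of per-image terms, running the forward pass on every image but back-propagating through a uniformly sampled subset (rescaled by N/H) gives an estimate whose average over all possible subsets equals the full gradient.

```python
import itertools
import math

def embed(weights, image):
    """Toy per-image embedding (assumed stand-in for a feature extractor)."""
    return math.tanh(sum(w * x for w, x in zip(weights, image)))

def embed_grad(weights, image):
    """Gradient of the toy embedding with respect to the weights."""
    e = embed(weights, image)
    return [(1.0 - e * e) * x for x in image]

def upstream_signal(weights, images, target):
    """Forward pass over ALL images: d(loss)/d(pooled) for loss = 0.5 * (pooled - target)^2."""
    pooled = sum(embed(weights, img) for img in images)
    return pooled - target

def full_gradient(weights, images, target):
    """Exact gradient: back-propagate through every image."""
    upstream = upstream_signal(weights, images, target)
    grads = [embed_grad(weights, img) for img in images]
    return [upstream * sum(g[d] for g in grads) for d in range(len(weights))]

def subset_gradient(weights, images, target, subset):
    """Forward on all images, back-propagate only through `subset`, rescaled by N/H."""
    upstream = upstream_signal(weights, images, target)
    scale = len(images) / len(subset)
    grads = [embed_grad(weights, images[i]) for i in subset]
    return [upstream * scale * sum(g[d] for g in grads) for d in range(len(weights))]

def mean_over_all_subsets(weights, images, target, subset_size):
    """Average the subset estimate over every subset of the given size."""
    subsets = list(itertools.combinations(range(len(images)), subset_size))
    total = [0.0] * len(weights)
    for subset in subsets:
        estimate = subset_gradient(weights, images, target, subset)
        total = [t + e for t, e in zip(total, estimate)]
    return [t / len(subsets) for t in total]
```

Comparing `full_gradient(...)` with `mean_over_all_subsets(...)` on a handful of toy images is a quick way to check the unbiasedness claim numerically; in practice the memory saving comes from storing activations only for the sampled subset, which this toy does not model.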

CUED - Cambridge University Engineering Department

Stereosonic vision: Exploring visual-to-auditory sensory substitution mappings in an immersive virtual reality navigation paradigm

Author: Daniela Massiceti
Joram Jacob van Rheede
Stephen Lloyd Hicks
Publication venue: 'Public Library of Science (PLoS)'

Crossref