Multimodal One-Shot Learning of Speech and Images
Imagine a robot is shown new concepts visually together with spoken tags,
e.g. "milk", "eggs", "butter". After seeing one paired audio-visual example per
class, it is shown a new set of unseen instances of these objects, and asked to
pick the "milk". Without receiving any hard labels, could it learn to match the
new continuous speech input to the correct visual instance? Although unimodal
one-shot learning has been studied, where one labelled example in a single
modality is given per class, this example motivates multimodal one-shot
learning. Our main contribution is to formally define this task, and to propose
several baseline and advanced models. We use a dataset of paired spoken and
visual digits to specifically investigate recent advances in Siamese
convolutional neural networks. Our best Siamese model achieves twice the
accuracy of a nearest neighbour model using pixel-distance over images and
dynamic time warping over speech in 11-way cross-modal matching.
Comment: 5 pages, 1 figure, 3 tables; accepted to ICASSP 201
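The nearest-neighbour baseline named in this abstract can be illustrated with a minimal sketch. It assumes one plausible reading of the setup: the spoken query is first matched to the closest support utterance by dynamic time warping, and the image paired with that utterance is then matched to the closest candidate image by pixel distance. The function names, the feature shapes and the two-step routing are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def dtw_distance(a, b):
    """Accumulated DTW cost between two feature sequences of shape (T, D)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-to-frame distance
            cost[i, j] = local + min(cost[i - 1, j],      # step in a only
                                     cost[i, j - 1],      # step in b only
                                     cost[i - 1, j - 1])  # step in both
    return float(cost[n, m])

def pixel_distance(x, y):
    """Euclidean distance between two images of equal shape."""
    return float(np.linalg.norm(x.ravel() - y.ravel()))

def cross_modal_match(query_speech, support, candidates):
    """Return the index of the candidate image matched to the spoken query.

    support: list of (speech, image) pairs, one per class (the one-shot set).
    candidates: list of unseen images, assumed to hold one per class.
    """
    # Step 1 (speech to speech): closest support utterance under DTW.
    nearest = min(range(len(support)),
                  key=lambda k: dtw_distance(query_speech, support[k][0]))
    # Step 2 (image to image): use the paired support image as a bridge.
    bridge_image = support[nearest][1]
    return min(range(len(candidates)),
               key=lambda k: pixel_distance(bridge_image, candidates[k]))
```

No labels are used at any point: the only supervision is the pairing inside the support set, which is what lets the query cross from speech to images.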
One-Shot Learning for Semantic Segmentation
Low-shot learning methods for image classification support learning from
sparse data. We extend these techniques to support dense semantic image
segmentation. Specifically, we train a network that, given a small set of
annotated images, produces parameters for a Fully Convolutional Network (FCN).
We use this FCN to perform dense pixel-level prediction on a test image for the
new semantic class. Our architecture shows a 25% relative meanIoU improvement
compared to the best baseline methods for one-shot segmentation on unseen
classes in the PASCAL VOC 2012 dataset and is at least 3 times faster.
Comment: To appear in the proceedings of the British Machine Vision Conference
(BMVC) 2017. The code is available at https://github.com/lzzcd001/OSLS
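To make the "relative meanIoU improvement" in this abstract concrete, the sketch below computes intersection-over-union for binary masks, averages it, and expresses a gain relative to a baseline. The function names and the convention for empty masks are assumptions, and the numbers in the usage note are hypothetical, not results from the paper.

```python
import numpy as np

def binary_iou(pred, target):
    """Intersection-over-union of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(np.logical_and(pred, target).sum() / union)

def mean_iou(preds, targets):
    """Average IoU over paired predicted and ground-truth masks."""
    return float(np.mean([binary_iou(p, t) for p, t in zip(preds, targets)]))

def relative_improvement(baseline, ours):
    """Gain over the baseline, as a fraction of the baseline score."""
    return (ours - baseline) / baseline
```

With hypothetical scores, `relative_improvement(0.32, 0.40)` is approximately 0.25: a 25% relative gain, although the absolute gain is only 8 points of IoU.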