The Devil is in the Tails: Fine-grained Classification in the Wild
The world is long-tailed. What does this mean for computer vision and visual
recognition? The two main implications are (1) the number of categories we need
to consider in applications can be very large, and (2) the number of training
examples for most categories can be very small. Current visual recognition
algorithms have achieved excellent classification accuracy. However, they
require many training examples to reach peak performance, which suggests that
long-tailed distributions will not be dealt with well. We analyze this question
in the context of eBird, a large fine-grained classification dataset, and a
state-of-the-art deep network classification algorithm. We find that (a) peak
classification performance on well-represented categories is excellent, (b)
given enough data, classification performance suffers only minimally from an
increase in the number of classes, (c) classification performance decays
precipitously as the number of training examples decreases, and (d) surprisingly,
transfer learning is virtually absent in current methods. Our findings suggest
that our community should come to grips with the question of long tails.
VoxCeleb2: Deep Speaker Recognition
The objective of this paper is speaker recognition under noisy and
unconstrained conditions.
We make two key contributions. First, we introduce a very large-scale
audio-visual speaker recognition dataset collected from open-source media.
Using a fully automated pipeline, we curate VoxCeleb2 which contains over a
million utterances from over 6,000 speakers. This is several times larger than
any publicly available speaker recognition dataset.
Second, we develop and compare Convolutional Neural Network (CNN) models and
training strategies that can effectively recognise identities from voice under
various conditions. The models trained on the VoxCeleb2 dataset surpass the
performance of previous works on a benchmark dataset by a significant margin.
Comment: To appear in Interspeech 2018. The audio-visual dataset can be
downloaded from http://www.robots.ox.ac.uk/~vgg/data/voxceleb2 .
1806.05622v2: minor fixes; 5 pages