The Devil is in the Tails: Fine-grained Classification in the Wild
The world is long-tailed. What does this mean for computer vision and visual
recognition? The two main implications are (1) the number of categories we need
to consider in applications can be very large, and (2) the number of training
examples for most categories can be very small. Current visual recognition
algorithms have achieved excellent classification accuracy. However, they
require many training examples to reach peak performance, which suggests that
they will handle long-tailed distributions poorly. We analyze this question
in the context of eBird, a large fine-grained classification dataset, and a
state-of-the-art deep network classification algorithm. We find that (a) peak
classification performance on well-represented categories is excellent, (b)
given enough data, classification performance suffers only minimally from an
increase in the number of classes, (c) classification performance decays
precipitously as the number of training examples decreases, and (d) surprisingly,
transfer learning is virtually absent in current methods. Our findings suggest
that our community should come to grips with the question of long tails.
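To make implication (2) concrete, the sketch below generates per-class example counts under an assumed Zipf-like law, n_k = floor(n_max / k^alpha) for class rank k, and measures how many classes fall under a data threshold. The function names and the parameter values (1000 classes, n_max = 10,000, alpha = 1.0, threshold 100) are hypothetical illustrations, not figures from the eBird analysis.

```python
# Illustrative sketch (hypothetical parameters, not the eBird data): per-class
# counts follow a Zipf-like law n_k = floor(n_max / k**alpha) for rank k.


def zipf_class_counts(num_classes: int, n_max: int, alpha: float) -> list[int]:
    """Per-class example counts, ordered from head (rank 1) to tail."""
    # int() truncates, which equals floor() for these positive values.
    return [int(n_max / rank ** alpha) for rank in range(1, num_classes + 1)]


def fraction_below(counts: list[int], threshold: int) -> float:
    """Fraction of classes that have fewer than `threshold` training examples."""
    return sum(1 for c in counts if c < threshold) / len(counts)


if __name__ == "__main__":
    counts = zipf_class_counts(num_classes=1000, n_max=10_000, alpha=1.0)
    print("head:", counts[:3], "tail:", counts[-3:])
    print("share of classes with < 100 examples:", fraction_below(counts, 100))
```

Under these assumed parameters the three head classes have 10000, 5000 and 3333 examples, while 900 of the 1000 classes have fewer than 100, which is the regime where finding (c) says accuracy decays.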
The More You Know: Using Knowledge Graphs for Image Classification
One characteristic that sets humans apart from modern learning-based computer
vision algorithms is the ability to acquire knowledge about the world and use
that knowledge to reason about the visual world. Humans can learn about the
characteristics of objects and the relationships that occur between them to
learn a large variety of visual concepts, often with few examples. This paper
investigates the use of structured prior knowledge in the form of knowledge
graphs and shows that using this knowledge improves performance on image
classification. We build on recent work on end-to-end learning on graphs,
introducing the Graph Search Neural Network as a way of efficiently
incorporating large knowledge graphs into a vision classification pipeline. We
show in a number of experiments that our method outperforms standard neural
network baselines for multi-label classification. Comment: CVPR 201
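The abstract does not spell out how the Graph Search Neural Network selects and updates nodes, so the following is only a hedged, simplified sketch of the general idea of working on a small part of a large knowledge graph: expand a subgraph outward from nodes suggested by the image, then propagate information only inside that subgraph. The breadth-first expansion, the averaging update, the toy graph and the 0.9 detector confidence are all hypothetical stand-ins, not the paper's learned components.

```python
from collections import deque


def expand_subgraph(graph: dict[str, list[str]], seeds: list[str], depth: int) -> set[str]:
    """Breadth-first expansion from the seed nodes, at most `depth` hops out."""
    active = set(seeds)
    frontier = deque((node, 0) for node in seeds)
    while frontier:
        node, hops = frontier.popleft()
        if hops == depth:
            continue  # do not expand beyond the hop budget
        for nbr in graph.get(node, []):
            if nbr not in active:
                active.add(nbr)
                frontier.append((nbr, hops + 1))
    return active


def propagate(graph: dict[str, list[str]], active: set[str],
              state: dict[str, float], steps: int) -> dict[str, float]:
    """Each step, every active node takes the mean of its own state and the
    states of its active neighbours (a stand-in for a learned update)."""
    for _ in range(steps):
        new_state = {}
        for node in active:
            nbrs = [n for n in graph.get(node, []) if n in active]
            values = [state[node]] + [state[n] for n in nbrs]
            new_state[node] = sum(values) / len(values)
        state = new_state
    return state


if __name__ == "__main__":
    # Toy undirected knowledge graph (hypothetical concepts and edges).
    kg = {
        "dog": ["leash", "animal"],
        "leash": ["dog", "person"],
        "person": ["leash"],
        "animal": ["dog", "cat"],
        "cat": ["animal"],
    }
    active = expand_subgraph(kg, seeds=["dog"], depth=1)
    # Assume an image detector reported "dog" with confidence 0.9.
    init = {n: (0.9 if n == "dog" else 0.0) for n in active}
    print(sorted(active), propagate(kg, active, init, steps=1))
```

With one hop, only "dog", "leash" and "animal" become active; after one step the two neighbours each hold roughly 0.45 and "dog" roughly 0.3, while "person" and "cat" are never touched. Restricting updates to the expanded subgraph is what keeps the cost tied to the subgraph rather than to the full knowledge graph.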
Action Recognition in Video Using Sparse Coding and Relative Features
This work presents an approach to category-based action recognition in video
using sparse coding techniques. The proposed approach includes two main
contributions: i) A new method to handle intra-class variations by decomposing
each video into a reduced set of representative atomic action acts or
key-sequences, and ii) A new video descriptor, ITRA: Inter-Temporal Relational
Act Descriptor, that exploits the power of comparative reasoning to capture
relative similarity relations among key-sequences. In terms of the method to
obtain key-sequences, we introduce a loss function that, for each video, leads
to the identification of a sparse set of representative key-frames capturing
both relevant particularities arising in the input video and relevant
generalities arising in the complete class collection. In terms of the method
to obtain the ITRA descriptor, we introduce a novel scheme to quantify relative
intra- and inter-class similarities among local temporal patterns arising in the
videos. The resulting ITRA descriptor proves highly effective at discriminating
among action categories. As a result, the proposed approach
reaches remarkable action recognition performance on several popular benchmark
datasets, outperforming alternative state-of-the-art techniques by a large
margin. Comment: Accepted to CVPR 201
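One way to read "comparative reasoning" over similarities is to keep only the order of similarity scores rather than their values. The sketch below is a hypothetical illustration of that idea, not the ITRA construction: it compares a key-sequence feature against a set of reference patterns (prototypes) and, for every prototype pair, records which one it is closer to. The 2-D features, the prototypes and the use of cosine similarity are all assumptions.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def relative_descriptor(query: list[float], prototypes: list[list[float]]) -> list[int]:
    """For every prototype pair (i, j) with i < j, emit +1 if the query is more
    similar to prototype i than to prototype j, else -1."""
    sims = [cosine(query, p) for p in prototypes]
    return [1 if sims[i] > sims[j] else -1
            for i, j in combinations(range(len(sims)), 2)]


if __name__ == "__main__":
    # Hypothetical 2-D features for three reference patterns and one key-sequence.
    protos = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(relative_descriptor([1.0, 0.2], protos))  # prints [1, 1, -1]
```

Because each entry depends only on which of two similarities is larger, the descriptor is unchanged by any strictly increasing rescaling of the similarity scores; with P prototypes it has P(P-1)/2 entries.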
Learning from the Scene and Borrowing from the Rich: Tackling the Long Tail in Scene Graph Generation
Despite the huge progress in scene graph generation in recent years, the
long-tail distribution of object relationships remains a challenging and
persistent issue. Existing methods largely rely on either external knowledge or
statistical bias information to alleviate this problem. In this paper, we
tackle this issue from two other aspects: (1) scene-object interaction, aiming
at learning specific knowledge from a scene via an additive attention
mechanism; and (2) long-tail knowledge transfer, which tries to transfer the
rich knowledge learned from the head into the tail. Extensive experiments on
the Visual Genome benchmark dataset across three tasks demonstrate that our
method outperforms current state-of-the-art competitors.
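Additive attention scores each key k_i against a query q as score_i = v · tanh(W q + U k_i), normalises the scores with a softmax, and returns the weighted sum of the values. The sketch below implements that standard formula in plain Python. Treating a scene feature as the query and object features as keys and values is an assumption about how such a mechanism could be wired for scene-object interaction, and the hand-set matrices stand in for parameters that would normally be learned.

```python
import math


def matvec(m: list[list[float]], x: list[float]) -> list[float]:
    """Dense matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax (subtracts the max before exponentiating)."""
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(query, keys, values, W, U, v):
    """score_i = v . tanh(W q + U k_i); returns (attention weights, context)."""
    wq = matvec(W, query)
    scores = []
    for k in keys:
        hidden = [math.tanh(a + b) for a, b in zip(wq, matvec(U, k))]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * val[d] for w, val in zip(weights, values)) for d in range(dim)]
    return weights, context


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    scene = [1.0, 0.0]                    # hypothetical scene feature (query)
    objects = [[1.0, 0.0], [-1.0, 0.0]]   # hypothetical object features (keys)
    weights, ctx = additive_attention(scene, objects, objects,
                                      identity, identity, [1.0, 1.0])
    print(weights, ctx)
```

The object whose feature aligns with the scene receives the larger weight; the tanh layer lets query and key live in different spaces, since W and U project each into a shared hidden space before scoring.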
- …