Incremental Learning of Object Detectors without Catastrophic Forgetting
Despite their success for object detection, convolutional neural networks are
ill-equipped for incremental learning, i.e., adapting the original model
trained on a set of classes to additionally detect objects of new classes, in
the absence of the initial training data. They suffer from "catastrophic
forgetting" - an abrupt degradation of performance on the original set of
classes, when the training objective is adapted to the new classes. We present
a method to address this issue, and learn object detectors incrementally, when
neither the original training data nor annotations for the original classes in
the new training set are available. The core of our proposed solution is a loss
function that balances predictions on the new classes against a new
distillation loss, which minimizes the discrepancy between the responses of the
original and the updated networks on the old classes. This incremental learning
can be performed multiple times, for a new set of classes in each step, with a
moderate drop in performance compared to the baseline network trained on the
ensemble of data. We present object detection results on the PASCAL VOC 2007
and COCO datasets, along with a detailed empirical analysis of the approach.
Comment: To appear in ICCV 2017.
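To make the combined objective concrete, here is a minimal sketch in PyTorch: a frozen copy of the original network supervises the updated one on the old classes, while a standard classification loss handles the new ones. The function name, tensor shapes, the L2 form of the distillation term, and the weighting are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def incremental_detection_loss(new_logits, new_targets,
                               old_logits_updated, old_logits_frozen,
                               distill_weight=1.0):
    # Classification loss on the newly annotated classes.
    ce = F.cross_entropy(new_logits, new_targets)
    # Distillation: keep the updated network's old-class responses close
    # to those of the frozen original network (no old data needed).
    distill = F.mse_loss(old_logits_updated, old_logits_frozen.detach())
    return ce + distill_weight * distill

# Toy usage with random tensors standing in for detector outputs.
n, c_old, c_new = 8, 10, 5
new_logits = torch.randn(n, c_new, requires_grad=True)
old_updated = torch.randn(n, c_old, requires_grad=True)
loss = incremental_detection_loss(new_logits,
                                  torch.randint(0, c_new, (n,)),
                                  old_updated,
                                  torch.randn(n, c_old))
loss.backward()
```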
Expanded Parts Model for Semantic Description of Humans in Still Images
We introduce an Expanded Parts Model (EPM) for recognizing human attributes
(e.g. young, short hair, wearing suit) and actions (e.g. running, jumping) in
still images. An EPM is a collection of part templates which are learnt
discriminatively to explain specific scale-space regions in the images (in
human centric coordinates). This is in contrast to current models which consist
of relatively few (i.e., a mixture of) 'average' templates. EPM uses only a
subset of the parts to score an image and scores the image sparsely in space,
i.e., it ignores redundant and random background in an image. To learn our
model, we propose an algorithm which automatically mines parts and learns
corresponding discriminative templates together with their respective locations
from a large number of candidate parts. We validate our method on three recent
challenging datasets of human attributes and actions. We obtain convincing
qualitative and state-of-the-art quantitative results on the three datasets.
Comment: Accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
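As a rough illustration of the sparse scoring idea, the sketch below scores an image using only its top-k responding part templates, so weakly matching background regions contribute nothing. The dot-product responses and the fixed k are simplifying assumptions; the paper's spatial terms and human-centric coordinates are omitted.

```python
import numpy as np

def epm_score(region_feats, part_templates, k=10):
    """region_feats: (R, D) features of candidate scale-space regions,
    part_templates: (P, D) discriminatively learnt templates.
    Returns the image score from the k strongest part responses."""
    # Each part fires on its best-matching region.
    responses = part_templates @ region_feats.T   # (P, R)
    best_per_part = responses.max(axis=1)         # (P,)
    # Sparse scoring: only the top-k parts contribute.
    topk = np.sort(best_per_part)[-k:]
    return topk.mean()

rng = np.random.default_rng(0)
print(epm_score(rng.normal(size=(50, 128)), rng.normal(size=(100, 128))))
```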
Learning Video Object Segmentation with Visual Memory
This paper addresses the task of segmenting moving objects in unconstrained
videos. We introduce a novel two-stream neural network with an explicit memory
module to achieve this. The two streams of the network encode spatial and
temporal features in a video sequence respectively, while the memory module
captures the evolution of objects over time. The module to build a "visual
memory" in video, i.e., a joint representation of all the video frames, is
realized with a convolutional recurrent unit learned from a small number of
training video sequences. Given a video frame as input, our approach assigns
each pixel an object or background label based on the learned spatio-temporal
features as well as the "visual memory" specific to the video, acquired
automatically without any manually-annotated frames. The visual memory is
implemented with convolutional gated recurrent units, which allow spatial
information to be propagated over time. We evaluate our method extensively on two
benchmarks, DAVIS and Freiburg-Berkeley motion segmentation datasets, and show
state-of-the-art results. For example, our approach outperforms the top method
on the DAVIS dataset by nearly 6%. We also provide an extensive ablative
analysis to investigate the influence of each component in the proposed
framework.
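The "visual memory" is built from convolutional gated recurrent units; the cell below is a generic ConvGRU sketch in PyTorch (channel sizes and kernel width are assumptions) showing how the gates, being convolutions, let the hidden state stay spatial while it evolves from frame to frame.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """A single convolutional GRU cell: the gates are convolutions, so the
    hidden state keeps its spatial layout across time steps."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde

# Toy run over a short feature sequence (sizes are assumptions).
cell = ConvGRUCell(in_ch=64, hid_ch=64)
h = torch.zeros(1, 64, 32, 32)               # initial (empty) visual memory
for _ in range(5):                           # one step per video frame
    h = cell(torch.randn(1, 64, 32, 32), h)  # memory evolves over time
```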
On the Importance of Visual Context for Data Augmentation in Scene Understanding
Performing data augmentation for learning deep neural networks is known to be
important for training visual recognition systems. By artificially increasing
the number of training examples, it helps reduce overfitting and improves
generalization. While simple image transformations can already improve
predictive performance in most vision tasks, larger gains can be obtained by
leveraging task-specific prior knowledge. In this work, we consider object
detection, semantic and instance segmentation and augment the training images
by blending objects in existing scenes, using instance segmentation
annotations. We observe that randomly pasting objects on images hurts
performance unless the object is placed in the right context. To resolve this
issue, we propose an explicit context model, implemented as a convolutional
neural network, which predicts whether an image region is suitable for placing
a given object. In our experiments, we show that our approach is able to improve
object detection, semantic and instance segmentation on the PASCAL VOC12 and
COCO datasets, with significant gains in a limited annotation scenario, i.e.
when only one category is annotated. We also show that the method is not
limited to datasets that come with expensive pixel-wise instance annotations
and can be used when only bounding boxes are available, by employing
weakly-supervised learning to approximate instance masks.
Comment: Updated the experimental section. arXiv admin note: substantial text overlap with arXiv:1807.0742
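A hedged sketch of the context-guided pasting step follows: a small CNN scores whether a candidate box suits a given category, and the object is alpha-blended in only when the score clears a threshold. `ContextNet`, the blending, and all shapes here are toy assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class ContextNet(nn.Module):
    """Toy per-category suitability predictor for an image region."""
    def __init__(self, n_categories):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(16, n_categories)

    def forward(self, region):
        return torch.sigmoid(self.head(self.features(region)))

def paste_if_plausible(scene, obj, mask, box, category, net, thresh=0.5):
    """Blend `obj` (with its instance `mask`) into `scene` at `box` only
    if the context model deems the location suitable for `category`."""
    y0, x0, y1, x1 = box
    region = scene[:, :, y0:y1, x0:x1]
    if net(region)[0, category] < thresh:
        return scene  # unsuitable context: skip this augmentation
    blended = scene.clone()
    blended[:, :, y0:y1, x0:x1] = mask * obj + (1 - mask) * region
    return blended

net = ContextNet(n_categories=20)
scene = torch.rand(1, 3, 64, 64)
obj = torch.rand(1, 3, 16, 16)
mask = (torch.rand(1, 1, 16, 16) > 0.5).float()
out = paste_if_plausible(scene, obj, mask, (8, 8, 24, 24), 0, net)
```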
Online Object Tracking with Proposal Selection
Tracking-by-detection approaches are some of the most successful object
trackers in recent years. Their success is largely determined by the detector
model they learn initially and then update over time. However, under
challenging conditions where an object can undergo transformations, e.g.,
severe rotation, these methods are found to be lacking. In this paper, we
address this problem by formulating it as a proposal selection task and making
two contributions. The first is to introduce novel proposals, estimated from
the geometric transformations undergone by the object, to build a rich
candidate set for predicting the object location. The second is a novel
selection strategy using multiple cues, i.e., detection score and an
edgeness score computed from state-of-the-art object edges and motion
boundaries. We extensively evaluate our approach on the visual object tracking
2014 challenge and online tracking benchmark datasets, and show the best
performance.
Comment: ICCV 2015.
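The selection step could look like the sketch below, which ranks each candidate box by a combination of its detection score and its edgeness score. The linear weighting `alpha` is an assumption standing in for the paper's actual cue-combination strategy.

```python
import numpy as np

def select_proposal(proposals, det_scores, edge_scores, alpha=0.5):
    # proposals: (P, 4) boxes, including ones generated from the object's
    # estimated geometric transformations; both score arrays are (P,).
    combined = alpha * det_scores + (1 - alpha) * edge_scores
    return proposals[int(np.argmax(combined))]

rng = np.random.default_rng(0)
boxes = rng.integers(0, 100, size=(30, 4))
print(select_proposal(boxes, rng.random(30), rng.random(30)))
```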
Unsupervised Learning of Artistic Styles with Archetypal Style Analysis
In this paper, we introduce an unsupervised learning approach to
automatically discover, summarize, and manipulate artistic styles from large
collections of paintings. Our method is based on archetypal analysis, which is
an unsupervised learning technique akin to sparse coding with a geometric
interpretation. When applied to deep image representations from a collection of
artworks, it learns a dictionary of archetypal styles, which can be easily
visualized. After training the model, the style of a new image, which is
characterized by local statistics of deep visual features, is approximated by a
sparse convex combination of archetypes. This enables us to interpret which
archetypal styles are present in the input image, and in which proportion.
Finally, our approach allows us to manipulate the coefficients of the latent
archetypal decomposition, and achieve various special effects such as style
enhancement, transfer, and interpolation between multiple archetypes.
Comment: Accepted at NIPS 2018, Montréal, Canada.
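To illustrate the decomposition step, the sketch below recovers convex, sparse coefficients of a style feature over a fixed archetype dictionary using projected gradient descent on the probability simplex. This generic solver is a stand-in for archetypal analysis, and the feature dimensions are made up.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1
    rho = np.nonzero(u > css / (np.arange(len(v)) + 1))[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0)

def style_coefficients(x, archetypes, steps=500, lr=None):
    """x: (D,) deep style feature; archetypes: (K, D) learnt dictionary.
    Returns convex weights w with x ~ archetypes.T @ w."""
    A = archetypes
    if lr is None:
        lr = 1.0 / np.linalg.norm(A @ A.T, 2)  # 1 / Lipschitz constant
    w = np.full(len(A), 1.0 / len(A))
    for _ in range(steps):
        grad = A @ (A.T @ w - x)
        w = project_simplex(w - lr * grad)
    return w

rng = np.random.default_rng(0)
w = style_coefficients(rng.normal(size=128), rng.normal(size=(32, 128)))
print(w.sum(), (w > 1e-6).sum())  # weights sum to 1 and are sparse
```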
P-CNN: Pose-based CNN Features for Action Recognition
This work targets human action recognition in video. While recent methods
typically represent actions by statistics of local video features, here we
argue for the importance of a representation derived from human pose. To this
end we propose a new Pose-based Convolutional Neural Network descriptor (P-CNN)
for action recognition. The descriptor aggregates motion and appearance
information along tracks of human body parts. We investigate different schemes
of temporal aggregation and experiment with P-CNN features obtained both for
automatically estimated and manually annotated human poses. We evaluate our
method on the recent and challenging JHMDB and MPII Cooking datasets. For both
datasets our method shows consistent improvement over the state of the art.
Comment: ICCV, December 2015, Santiago, Chile.
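The aggregation might be sketched as follows: per-frame CNN features cropped around each body part are pooled over time with per-dimension max and min, then concatenated into one video descriptor. The part names, feature size, and this particular pooling choice are illustrative assumptions.

```python
import numpy as np

def pcnn_descriptor(part_feats):
    """part_feats: dict part_name -> (T, D) per-frame CNN features cropped
    around that body part (appearance or flow). Returns one video vector."""
    pooled = []
    for name, feats in sorted(part_feats.items()):
        # Static video descriptor: per-dimension max and min over time.
        pooled.append(feats.max(axis=0))
        pooled.append(feats.min(axis=0))
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
parts = {p: rng.normal(size=(20, 4096))
         for p in ("full_body", "upper_body", "left_hand", "right_hand")}
print(pcnn_descriptor(parts).shape)  # (4 parts x 2 stats x 4096,)
```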
Weakly Supervised Object Localization with Multi-fold Multiple Instance Learning
Object category localization is a challenging problem in computer vision.
Standard supervised training requires bounding box annotations of object
instances. This time-consuming annotation process is sidestepped in weakly
supervised learning. In this case, the supervised information is restricted to
binary labels that indicate the absence/presence of object instances in the
image, without their locations. We follow a multiple-instance learning approach
that iteratively trains the detector and infers the object locations in the
positive training images. Our main contribution is a multi-fold multiple
instance learning procedure, which prevents training from prematurely locking
onto erroneous object locations. This procedure is particularly important when
using high-dimensional representations, such as Fisher vectors and
convolutional neural network features. We also propose a window refinement
method, which improves the localization accuracy by incorporating an objectness
prior. We present a detailed experimental evaluation using the PASCAL VOC 2007
dataset, which verifies the effectiveness of our approach.
Comment: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
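A compact sketch of the multi-fold re-localization loop is given below: windows in each fold are re-localized by a detector trained only on the other folds, so training cannot lock onto its own erroneous locations. The linear SVM, the first-window initialization, and the fold assignment are assumptions standing in for the paper's exact training details.

```python
import numpy as np
from sklearn.svm import LinearSVC

def multifold_mil(pos_windows, neg_feats, n_folds=10, n_iters=5):
    """pos_windows: list over positive images of (W_i, D) candidate-window
    features; neg_feats: (N, D) features from negative images.
    Returns the index of the selected window in each positive image."""
    n_pos = len(pos_windows)
    folds = np.arange(n_pos) % n_folds
    selected = [0] * n_pos  # naive initialization: first window per image
    for _ in range(n_iters):
        for f in range(n_folds):
            train = [i for i in range(n_pos) if folds[i] != f]
            X = np.vstack([pos_windows[i][selected[i]] for i in train]
                          + [neg_feats])
            y = np.r_[np.ones(len(train)), np.zeros(len(neg_feats))]
            clf = LinearSVC().fit(X, y)
            for i in np.nonzero(folds == f)[0]:
                # Re-localize held-out images with the held-out detector.
                scores = clf.decision_function(pos_windows[i])
                selected[i] = int(np.argmax(scores))
    return selected

rng = np.random.default_rng(0)
pos = [rng.normal(size=(rng.integers(5, 15), 64)) for _ in range(20)]
neg = rng.normal(size=(100, 64))
print(multifold_mil(pos, neg, n_folds=5, n_iters=2)[:5])
```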