Robust Visual Tracking via Convolutional Networks
Deep networks have been successfully applied to visual tracking by learning a
generic representation offline from numerous training images. However, the
offline training is time-consuming and the learned generic representation may
be less discriminative for tracking specific objects. In this paper we show
that, even without offline training on a large amount of auxiliary data,
simple two-layer convolutional networks can be powerful enough to develop a
robust representation for visual tracking. In the first frame, we employ the
k-means algorithm to extract a set of normalized patches from the target region
as fixed filters, which are integrated with a series of adaptive contextual filters
surrounding the target to define a set of feature maps in the subsequent
frames. These maps measure similarities between each filter and the useful
local intensity patterns across the target, thereby encoding its local
structural information. Furthermore, all the maps form together a global
representation, which is built on mid-level features, thereby remaining close
to image-level information, and hence the inner geometric layout of the target
is also well preserved. A simple soft shrinkage method with an adaptive
threshold is employed to de-noise the global representation, resulting in a
robust sparse representation. The representation is updated via a simple and
effective online strategy, allowing it to robustly adapt to target appearance
variations. Our convolutional networks have a surprisingly lightweight
structure, yet perform favorably against several state-of-the-art methods on the
CVPR2013 tracking benchmark dataset with 50 challenging videos.
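A minimal sketch of the first-frame pipeline the abstract describes, assuming a grayscale target region, scikit-learn's k-means, and scipy's 2-D convolution; the patch size, filter count, and median-based adaptive threshold are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from scipy.signal import convolve2d
from sklearn.cluster import KMeans

def extract_filters(target_region, patch_size=6, n_filters=100, n_samples=2000, seed=0):
    """Sample normalized patches from the target region and cluster them into fixed filters."""
    h, w = target_region.shape
    rng = np.random.default_rng(seed)
    ys = rng.integers(0, h - patch_size, size=n_samples)
    xs = rng.integers(0, w - patch_size, size=n_samples)
    patches = np.stack([target_region[y:y + patch_size, x:x + patch_size].ravel()
                        for y, x in zip(ys, xs)]).astype(np.float64)
    patches -= patches.mean(axis=1, keepdims=True)   # zero-mean patch normalization
    km = KMeans(n_clusters=n_filters, n_init=3, random_state=seed).fit(patches)
    return km.cluster_centers_.reshape(n_filters, patch_size, patch_size)

def sparse_representation(frame, filters):
    """Convolve each fixed filter with a frame, then soft-shrink to de-noise the maps."""
    maps = np.stack([convolve2d(frame, f, mode='valid') for f in filters])
    tau = np.median(np.abs(maps))                     # adaptive threshold (illustrative)
    return np.sign(maps) * np.maximum(np.abs(maps) - tau, 0.0)  # soft shrinkage
```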
MAVOT: Memory-Augmented Video Object Tracking
We introduce a one-shot learning approach for video object tracking. The
proposed algorithm requires seeing the object to be tracked only once, and
employs an external memory to store and remember the evolving features of the
foreground object as well as backgrounds over time during tracking. With the
relevant memory retrieved and updated at each tracking step, our tracking model is
capable of maintaining long-term memory of the object, and thus can naturally
deal with hard tracking scenarios including partial and total occlusion, motion
changes and large scale and shape variations. In our experiments we use the
ImageNet ILSVRC2015 video detection dataset to train and use the VOT-2016
benchmark to test and compare our Memory-Augmented Video Object Tracking
(MAVOT) model. From the results, we conclude that, given its one-shot property
and simplicity in design, MAVOT is an attractive approach to visual tracking
because it shows good performance on VOT-2016 benchmark and is among the top 5
performers in accuracy and robustness in occlusion, motion changes, and empty
target.
Comment: Submitted to CVPR201
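A minimal sketch of the external-memory read/update mechanism the abstract alludes to, assuming cosine-similarity addressing and a least-recently-used write rule; the slot count, feature dimension, and update rules are illustrative, not MAVOT's published design.

```python
import torch
import torch.nn.functional as F

class TrackingMemory:
    def __init__(self, n_slots=128, dim=256):
        self.keys = torch.zeros(n_slots, dim)   # stored feature keys (foreground/background)
        self.usage = torch.zeros(n_slots)       # how recently each slot was read or written

    def read(self, query):
        """Retrieve a memory summary weighted by similarity to the current query feature."""
        w = F.softmax(F.cosine_similarity(self.keys, query.unsqueeze(0), dim=1), dim=0)
        self.usage = 0.9 * self.usage + w       # decay old usage, reinforce read slots
        return w @ self.keys

    def write(self, feature):
        """Overwrite the least recently used slot with the current frame's feature."""
        slot = torch.argmin(self.usage)
        self.keys[slot] = feature
        self.usage[slot] = 1.0
```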
Selectivity or Invariance: Boundary-aware Salient Object Detection
Typically, a salient object detection (SOD) model faces opposite requirements
in processing object interiors and boundaries. The features of interiors should
be invariant to strong appearance changes so as to pop out the salient object as
a whole, while the features of boundaries should be selective to slight
appearance changes to distinguish salient objects from the background. To address
this selectivity-invariance dilemma, we propose a novel boundary-aware network
with successive dilation for image-based SOD. In this network, the feature
selectivity at boundaries is enhanced by incorporating a boundary localization
stream, while the feature invariance at interiors is guaranteed with a complex
interior perception stream. Moreover, a transition compensation stream is
adopted to amend the probable failures in transitional regions between
interiors and boundaries. In particular, an integrated successive dilation
module is proposed to enhance the feature invariance at interiors and
transitional regions. Extensive experiments on six datasets show that the
proposed approach outperforms 16 state-of-the-art methods.
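A minimal sketch of what a successive dilation module could look like, assuming a chain of 3x3 convolutions with growing dilation rates whose outputs are summed; the exact rates and fusion used in the paper may differ.

```python
import torch.nn as nn

class SuccessiveDilation(nn.Module):
    def __init__(self, channels, rates=(1, 2, 4, 8)):
        super().__init__()
        # each stage sees the previous stage's output, so the receptive
        # field grows successively rather than in parallel branches
        self.stages = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out, feat = 0, x
        for conv in self.stages:
            feat = self.act(conv(feat))
            out = out + feat   # fuse all scales for invariant interior features
        return out
```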
Decoupled Classification Refinement: Hard False Positive Suppression for Object Detection
In this paper, we analyze failure cases of state-of-the-art detectors and
observe that most hard false positives result from classification instead of
localization and they have a large negative impact on the performance of object
detectors. We conjecture there are three factors: (1) Shared feature
representation is not optimal due to the mismatched goals of feature learning
for classification and localization; (2) multi-task learning helps, yet
optimization of the multi-task loss may result in sub-optimal solutions for individual
tasks; (3) large receptive field for different scales leads to redundant
context information for small objects. We demonstrate the potential of detectors'
classification power with a simple, effective, and widely applicable Decoupled
Classification Refinement (DCR) network. In particular, DCR places a separate
classification network in parallel with the localization network (base
detector). With ROI Pooling placed at an early stage of the classification
network, we enforce an adaptive receptive field in DCR. During training, DCR
samples hard false positives from the base detector and trains a strong
classifier to refine classification results. During testing, DCR refines all
boxes from the base detector. Experiments show competitive results on PASCAL
VOC and COCO without any bells and whistles. Our codes are available at:
https://github.com/bowenc0221/Decoupled-Classification-Refinement.
Comment: Under review. arXiv admin note: text overlap with arXiv:1803.0679
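A minimal sketch of the test-time refinement step as the abstract describes it: each box from the base detector is rescored by a separately trained classifier run on the cropped region. The function names (`crop_fn`, `dcr_classifier`) and the multiplicative score fusion are assumptions drawn from the abstract, not the repository's exact code.

```python
import torch

def refine_detections(boxes, det_scores, det_labels, image, dcr_classifier, crop_fn):
    """Rescore every detector box with the decoupled classification network."""
    crops = torch.stack([crop_fn(image, b) for b in boxes])    # ROI crops fed to DCR
    cls_probs = dcr_classifier(crops).softmax(dim=1)           # independent classifier
    refine = cls_probs[torch.arange(len(boxes)), det_labels]   # prob. of detector's class
    return boxes, det_scores * refine                          # suppress hard false positives
```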
Twin-GAN -- Unpaired Cross-Domain Image Translation with Weight-Sharing GANs
We present a framework for translating unlabeled images from one domain into
analogous images in another domain. We employ a progressively growing
skip-connected encoder-generator structure and train it with a GAN loss for
realistic output, a cycle consistency loss for maintaining same-domain
translation identity, and a semantic consistency loss that encourages the
network to keep the input semantic features in the output. We apply our
framework on the task of translating face images, and show that it is capable
of learning semantic mappings for face images with no supervised one-to-one
image mapping.
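A minimal sketch of the three-term generator objective the abstract lists: an adversarial term, a cycle-consistency term, and a semantic-consistency term computed on a feature extractor `feat`. The loss weights and the use of L1 distances are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def generator_loss(x_a, g_ab, g_ba, d_b, feat, w_cyc=10.0, w_sem=1.0):
    """Generator objective for the A->B direction; the B->A direction is symmetric."""
    fake_b = g_ab(x_a)
    logits = d_b(fake_b)
    adv = F.binary_cross_entropy_with_logits(      # GAN loss: fool discriminator B
        logits, torch.ones_like(logits))
    cyc = F.l1_loss(g_ba(fake_b), x_a)             # cycle consistency: A -> B -> A
    sem = F.l1_loss(feat(fake_b), feat(x_a))       # keep the input's semantic features
    return adv + w_cyc * cyc + w_sem * sem
```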
Deep Matching and Validation Network -- An End-to-End Solution to Constrained Image Splicing Localization and Detection
Image splicing is a very common image manipulation technique that is
sometimes used for malicious purposes. A splicing detection and localization
algorithm usually takes an input image and produces a binary decision
indicating whether the input image has been manipulated, and also a
segmentation mask that corresponds to the spliced region. Most existing
splicing detection and localization pipelines suffer from two main
shortcomings: 1) they use handcrafted features that are not robust against
subsequent processing (e.g., compression), and 2) each stage of the pipeline is
usually optimized independently. In this paper we extend the formulation of the
underlying splicing problem to consider two input images, a query image and a
potential donor image. Here the task is to estimate the probability that the
donor image has been used to splice the query image, and obtain the splicing
masks for both the query and donor images. We introduce a novel deep
convolutional neural network architecture, called Deep Matching and Validation
Network (DMVN), which simultaneously localizes and detects image splicing. The
proposed approach does not depend on handcrafted features and uses raw input
images to create deep learned representations. Furthermore, the DMVN is
end-to-end optimized to produce the probability estimates and the
segmentation masks. Our extensive experiments demonstrate that this approach
outperforms state-of-the-art splicing detection methods by a large margin in
terms of both AUC score and speed.
Comment: 9 pages, 10 figures
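A minimal sketch of the two-input interface the abstract describes: a shared deep encoder, a matching step between the two feature maps, and heads producing a splice probability plus masks for the query and donor. The layer sizes and the concatenation-based matching are placeholders, not the DMVN architecture.

```python
import torch
import torch.nn as nn

class MatchAndValidate(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.encoder = nn.Sequential(                    # shared deep features, no handcrafting
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.mask_head = nn.Conv2d(2 * channels, 1, 1)   # per-image splicing mask
        self.prob_head = nn.Linear(2 * channels, 1)      # image-pair splice probability

    def forward(self, query, donor):
        fq, fd = self.encoder(query), self.encoder(donor)
        joint_q = torch.cat([fq, fd], dim=1)             # crude matching by concatenation
        joint_d = torch.cat([fd, fq], dim=1)
        mask_q = torch.sigmoid(self.mask_head(joint_q))
        mask_d = torch.sigmoid(self.mask_head(joint_d))
        pooled = joint_q.mean(dim=(2, 3))                # global summary for validation
        prob = torch.sigmoid(self.prob_head(pooled))
        return prob, mask_q, mask_d
```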
What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis
Many new proposals for scene text recognition (STR) models have been
introduced in recent years. While each claims to have pushed the boundary of the
technology, a holistic and fair comparison has been largely missing in the
field due to the inconsistent choices of training and evaluation datasets. This
paper addresses this difficulty with three major contributions. First, we
examine the inconsistencies of training and evaluation datasets, and the
performance gap that results from these inconsistencies. Second, we introduce a unified
four-stage STR framework that most existing STR models fit into. Using this
framework allows for the extensive evaluation of previously proposed STR
modules and the discovery of previously unexplored module combinations. Third,
we analyze the module-wise contributions to performance in terms of accuracy,
speed, and memory demand, under one consistent set of training and evaluation
datasets. These analyses clear away the obstacles that inconsistent settings
have placed in the way of understanding the performance gains of existing modules.
Comment: Oral paper at ICCV'19. Our code is publicly available
(https://github.com/clovaai/deep-text-recognition-benchmark).
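A minimal sketch of the unified four-stage framework (transformation, feature extraction, sequence modeling, prediction) as a pipeline of pluggable modules; the concrete choices (e.g. TPS, ResNet, BiLSTM, CTC or attention) live in the benchmark repository, not in this sketch.

```python
import torch.nn as nn

class FourStageSTR(nn.Module):
    def __init__(self, transform, extractor, sequence, predictor):
        super().__init__()
        self.transform = transform   # stage 1: rectify the input image (e.g. TPS)
        self.extractor = extractor   # stage 2: visual features (e.g. VGG/ResNet)
        self.sequence = sequence     # stage 3: contextual modeling (e.g. BiLSTM)
        self.predictor = predictor   # stage 4: character decoding (e.g. CTC/attention)

    def forward(self, image):
        rectified = self.transform(image)
        features = self.extractor(rectified)
        context = self.sequence(features)
        return self.predictor(context)
```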
Temporal Recurrent Networks for Online Action Detection
Most work on temporal action detection is formulated as an offline problem,
in which the start and end times of actions are determined after the entire
video is fully observed. However, important real-time applications including
surveillance and driver assistance systems require identifying actions as soon
as each video frame arrives, based only on current and historical observations.
In this paper, we propose a novel framework, Temporal Recurrent Network (TRN),
to model greater temporal context of a video frame by simultaneously performing
online action detection and anticipation of the immediate future. At each
moment in time, our approach makes use of both accumulated historical evidence
and predicted future information to better recognize the action that is
currently occurring, and integrates both of these into a unified end-to-end
architecture. We evaluate our approach on two popular online action detection
datasets, HDD and TVSeries, as well as another widely used dataset, THUMOS'14.
The results show that TRN significantly outperforms the state-of-the-art.
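A minimal sketch of the core idea of fusing accumulated history with anticipated future evidence, assuming a GRU over past frames and a small recurrent rollout of future features; the cell choices, rollout length, and fusion by concatenation are illustrative, not the exact TRN cell.

```python
import torch
import torch.nn as nn

class TemporalRecurrentStep(nn.Module):
    def __init__(self, feat_dim=512, hidden=512, n_classes=22, future_steps=4):
        super().__init__()
        self.history = nn.GRUCell(feat_dim, hidden)       # accumulate past evidence
        self.future = nn.GRUCell(hidden, hidden)          # anticipate upcoming features
        self.future_steps = future_steps
        self.classify = nn.Linear(2 * hidden, n_classes)  # fuse history + future

    def forward(self, frame_feat, h):
        h = self.history(frame_feat, h)                   # online update with current frame
        f, future_sum = h, 0
        for _ in range(self.future_steps):                # roll out anticipated steps
            f = self.future(f, f)
            future_sum = future_sum + f
        fused = torch.cat([h, future_sum / self.future_steps], dim=1)
        return self.classify(fused), h                    # action scores + new state
```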
Fast detection of multiple objects in traffic scenes with a common detection framework
Traffic scene perception (TSP) aims to extract accurate on-road environment
information in real time, which involves three phases: detection of objects of
interest, recognition of detected objects, and tracking of objects in motion.
Since recognition and tracking often rely on the results from detection, the
ability to detect objects of interest effectively plays a crucial role in TSP.
In this paper, we focus on three important classes of objects: traffic signs,
cars, and cyclists. We propose to detect all three of these important objects in a
single learning based detection framework. The proposed framework consists of a
dense feature extractor and detectors of three important classes. Once the
dense features have been extracted, these features are shared with all
detectors. The advantage of using one common framework is that the detection
speed is much faster, since the dense features need to be evaluated only once
in the testing phase. In contrast, most previous works have designed specific
detectors using different features for each of these objects. To enhance the
feature robustness to noise and image deformations, we introduce spatially
pooled features as a part of aggregated channel features. In order to further
improve the generalization performance, we propose an object subcategorization
method as a means of capturing intra-class variation of objects. We
experimentally demonstrate the effectiveness and efficiency of the proposed
framework in three detection applications: traffic sign detection, car
detection, and cyclist detection. The proposed framework achieves competitive
performance with state-of-the-art approaches on several benchmark datasets.
Comment: Appearing in IEEE Transactions on Intelligent Transportation Systems
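A minimal sketch of the shared-feature design: the dense features are computed once per image and reused by every class-specific detector. The gradient-based placeholder features stand in for the paper's spatially pooled aggregated channel features, and the detector callables are assumptions.

```python
import numpy as np

def dense_features(image):
    """Placeholder channel features: intensity plus gradient magnitude."""
    gy, gx = np.gradient(image.astype(np.float64))
    return np.stack([image, np.hypot(gx, gy)])   # the expensive step, done once

def detect_all(image, detectors):
    """Run every class-specific detector on the same shared features."""
    feats = dense_features(image)                # shared across all detectors
    return {name: det(feats) for name, det in detectors.items()}

# usage sketch: detect_all(frame, {"sign": sign_det, "car": car_det, "cyclist": cyc_det})
```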
Learning Deep Feature Representations with Domain Guided Dropout for Person Re-identification
Learning generic and robust feature representations with data from multiple
domains for the same problem is of great value, especially for problems that
have multiple datasets, none of which is large enough to provide abundant data
variations. In this work, we present a pipeline for learning deep
feature representations from multiple domains with Convolutional Neural
Networks (CNNs). When training a CNN with data from all the domains, some
neurons learn representations shared across several domains, while some others
are effective only for a specific one. Based on this important observation, we
propose a Domain Guided Dropout algorithm to improve the feature learning
procedure. Experiments show the effectiveness of our pipeline and the proposed
algorithm. Our methods on the person re-identification problem outperform
state-of-the-art methods on multiple datasets by large margins.
Comment: To appear in CVPR201
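A minimal sketch of the Domain Guided Dropout idea: estimate each neuron's impact on a domain as the loss increase when that neuron is muted, then keep neurons with probability increasing in their impact. The sigmoid-with-temperature keep probability follows the spirit of the abstract; the paper's exact scheduling may differ.

```python
import torch

def neuron_impact(model_loss, features, labels):
    """Impact of neuron i = loss increase when feature i is zeroed on a domain's batch.

    model_loss is a user-supplied callable returning a scalar loss tensor.
    """
    base = model_loss(features, labels)
    impacts = []
    for i in range(features.shape[1]):
        muted = features.clone()
        muted[:, i] = 0.0                          # mute one neuron at a time
        impacts.append(model_loss(muted, labels) - base)
    return torch.stack(impacts)

def domain_guided_mask(impacts, temperature=1.0):
    """Keep each neuron with probability increasing in its domain-specific impact."""
    keep_prob = torch.sigmoid(impacts / temperature)
    return torch.bernoulli(keep_prob)              # per-domain dropout mask
```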