Rethinking ImageNet Pre-training
We report competitive results on object detection and instance segmentation
on the COCO dataset using standard models trained from random initialization.
The results are no worse than their ImageNet pre-training counterparts even
when using the hyper-parameters of the baseline system (Mask R-CNN) that were
optimized for fine-tuning pre-trained models, with the sole exception of
increasing the number of training iterations so the randomly initialized models
may converge. Training from random initialization is surprisingly robust; our
results hold even when: (i) using only 10% of the training data, (ii) using
deeper and wider models, and (iii) evaluating multiple tasks and metrics. Experiments
show that ImageNet pre-training speeds up convergence early in training, but
does not necessarily provide regularization or improve final target task
accuracy. To push the envelope we demonstrate 50.9 AP on COCO object detection
without using any external data---a result on par with the top COCO 2017
competition results that used ImageNet pre-training. These observations
challenge the conventional wisdom of ImageNet pre-training for dependent tasks
and we expect these discoveries will encourage people to rethink the current de
facto paradigm of `pre-training and fine-tuning' in computer vision.
Comment: Technical report
A Pursuit of Temporal Accuracy in General Activity Detection
Detecting activities in untrimmed videos is an important but challenging
task. The performance of existing methods remains unsatisfactory; e.g., they
often have difficulty locating the beginning and end of a long, complex
action. In this paper, we propose a generic framework that can accurately
detect a wide variety of activities from untrimmed videos. Our first
contribution is a novel proposal scheme that can efficiently generate
candidates with accurate temporal boundaries. The other contribution is a
cascaded classification pipeline that explicitly distinguishes between
relevance and completeness of a candidate instance. On two challenging temporal
activity detection datasets, THUMOS14 and ActivityNet, the proposed framework
significantly outperforms the existing state-of-the-art methods, demonstrating
superior accuracy and strong adaptivity in handling activities with various
temporal structures.
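The cascaded classification the abstract describes, judging relevance and completeness separately, might be sketched as below. The classifier callables, thresholds, and return conventions here are illustrative assumptions, not the paper's actual models:

```python
def cascaded_classify(proposals, relevance_fn, completeness_fn,
                      rel_thresh=0.5, com_thresh=0.5):
    """Two-stage cascade (a sketch): a proposal must first be relevant to
    some action class, then separately be judged complete (covering the
    whole action rather than a fragment). The two criteria use separate
    classifiers, as the abstract above describes."""
    detections = []
    for p in proposals:
        cls, rel = relevance_fn(p)  # predicted class and relevance score
        if rel < rel_thresh:
            continue                # irrelevant background: discard early
        if completeness_fn(p, cls) >= com_thresh:
            detections.append((p, cls))
    return detections
```

Separating the two decisions lets an incomplete fragment of a true action be rejected even when it scores highly on class relevance.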
Bounding Box Regression with Uncertainty for Accurate Object Detection
Large-scale object detection datasets (e.g., MS-COCO) try to define the
ground-truth bounding boxes as clearly as possible. However, we observe that
ambiguities are still introduced when labeling the bounding boxes. In this
paper, we propose a novel bounding box regression loss for learning bounding
box transformation and localization variance together. Our loss greatly
improves the localization accuracies of various architectures with nearly no
additional computation. The learned localization variance allows us to merge
neighboring bounding boxes during non-maximum suppression (NMS), which further
improves the localization performance. On MS-COCO, we boost the Average
Precision (AP) of VGG-16 Faster R-CNN from 23.6% to 29.1%. More importantly,
for ResNet-50-FPN Mask R-CNN, our method improves the AP and AP90 by 1.8% and
6.2% respectively, which significantly outperforms previous state-of-the-art
bounding box refinement methods. Our code and models are available at:
github.com/yihui-he/KL-Loss
Comment: CVPR 2019
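The variance-weighted merging of neighboring boxes during NMS can be sketched roughly as follows. The scalar per-box variance and the `sigma_t` value are simplifications of the paper's per-coordinate formulation:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def variance_voting(selected, boxes, variances, iou_thresh=0.5, sigma_t=0.02):
    """Merge boxes overlapping `selected`, down-weighting uncertain ones.

    Each neighbor contributes with weight exp(-(1 - IoU)^2 / sigma_t) / var,
    so nearby, low-variance boxes dominate the merged coordinates."""
    ious = iou(selected, boxes)
    keep = ious > iou_thresh
    w = np.exp(-(1.0 - ious[keep]) ** 2 / sigma_t) / variances[keep]
    return (w[:, None] * boxes[keep]).sum(axis=0) / w.sum()
```

A confidently localized box then barely moves when merged with a high-variance neighbor, which is the mechanism behind the localization gains claimed above.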
High-Resolution Representations for Labeling Pixels and Regions
High-resolution representation learning plays an essential role in many
vision problems, e.g., pose estimation and semantic segmentation. The
high-resolution network (HRNet)~\cite{SunXLW19}, recently developed for human
pose estimation, maintains high-resolution representations through the whole
process by connecting high-to-low resolution convolutions in \emph{parallel}
and produces strong high-resolution representations by repeatedly conducting
fusions across parallel convolutions.
In this paper, we conduct a further study on high-resolution representations
by introducing a simple yet effective modification and apply it to a wide range
of vision tasks. We augment the high-resolution representation by aggregating
the (upsampled) representations from all the parallel convolutions rather than
only the representation from the high-resolution convolution as done
in~\cite{SunXLW19}. This simple modification leads to stronger representations,
evidenced by superior results. We show top results in semantic segmentation on
Cityscapes, LIP, and PASCAL Context, and facial landmark detection on AFLW,
COFW, 300W, and WFLW. In addition, we build a multi-level representation from
the high-resolution representation and apply it to the Faster R-CNN object
detection framework and the extended frameworks. The proposed approach achieves
superior results to existing single-model networks on COCO object detection.
The code and models are publicly available at
\url{https://github.com/HRNet}
Learning with Rethinking: Recurrently Improving Convolutional Neural Networks through Feedback
Recent years have witnessed the great success of convolutional neural network
(CNN) based models in the field of computer vision. CNN is able to learn
hierarchically abstracted features from images in an end-to-end training
manner. However, most of the existing CNN models only learn features through a
feedforward structure and no feedback information from top to bottom layers is
exploited to enable the networks to refine themselves. In this paper, we
propose a "Learning with Rethinking" algorithm. By adding a feedback layer and
producing the emphasis vector, the model is able to recurrently boost the
performance based on previous prediction. Particularly, it can be employed to
boost any pre-trained models. This algorithm is tested on four object
classification benchmark datasets: CIFAR-100, CIFAR-10, MNIST-background-image
and the ILSVRC-2012 dataset. The results demonstrate the advantage of training
CNN models with the proposed feedback mechanism.
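The feedback loop the abstract describes, where the previous prediction produces an emphasis vector that re-weights features, might look roughly like the sketch below. All names, shapes, and the tanh parameterization are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rethink_forward(features, w_cls, w_fb, steps=2):
    """Recurrent refinement with a feedback 'emphasis' vector (a sketch).

    `features`: (C,) pooled feature vector; `w_cls`: (K, C) classifier;
    `w_fb`: (C, K) feedback layer mapping the previous class posterior
    to a channel-wise emphasis vector."""
    emphasis = np.ones_like(features)
    probs = None
    for _ in range(steps):
        probs = softmax(w_cls @ (features * emphasis))
        # Feedback layer: the previous prediction re-weights the channels,
        # so the next pass emphasizes class-discriminative features.
        emphasis = 1.0 + np.tanh(w_fb @ probs)
    return probs
```

Because the loop only re-weights existing features, the same mechanism can wrap an already pre-trained model, matching the claim that any pre-trained model can be boosted.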
S^3FD: Single Shot Scale-invariant Face Detector
This paper presents a real-time face detector, named Single Shot
Scale-invariant Face Detector (S^3FD), which performs superiorly on various
scales of faces with a single deep neural network, especially for small faces.
Specifically, we try to solve the common problem that anchor-based detectors
deteriorate dramatically as the objects become smaller. We make contributions
in the following three aspects: 1) proposing a scale-equitable face detection
framework to handle different scales of faces well. We tile anchors on a wide
range of layers to ensure that all scales of faces have enough features for
detection. Besides, we design anchor scales based on the effective receptive
field and a proposed equal proportion interval principle; 2) improving the
recall rate of small faces by a scale compensation anchor matching strategy; 3)
reducing the false positive rate of small faces via a max-out background label.
As a consequence, our method achieves state-of-the-art detection performance on
all the common face detection benchmarks, including the AFW, PASCAL face, FDDB
and WIDER FACE datasets, and can run at 36 FPS on an Nvidia Titan X (Pascal)
for VGA-resolution images.
Comment: Accepted by ICCV 2017 with supplementary materials; updated with the
latest results on WIDER FACE
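The scale-equitable tiling of anchors across layers might be illustrated as below. The 4x-stride rule follows the effective-receptive-field motivation; the single square anchor per location and the hard-coded strides are simplifications:

```python
import numpy as np

def tile_anchors(image_size, strides):
    """Tile one square anchor per location, with scale = 4x the layer stride.

    This follows the equal-proportion-interval idea: each layer's anchor
    size is a fixed multiple of its stride, so faces of every scale are
    matched by some detection layer. Returns (x_center, y_center, size)."""
    anchors = []
    for s in strides:
        size = 4 * s          # anchor side length for this layer
        n = image_size // s   # feature-map width/height at this stride
        for i in range(n):
            for j in range(n):
                anchors.append(((j + 0.5) * s, (i + 0.5) * s, size))
    return np.array(anchors)
```

Tying the anchor size to the stride keeps the density of anchors per face roughly constant across scales, which is what "scale-equitable" refers to.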
Object Detection in 20 Years: A Survey
Object detection, as one of the most fundamental and challenging problems in
computer vision, has received great attention in recent years. Its development
in the past two decades can be regarded as an epitome of computer vision
history. If we think of today's object detection as a technical aesthetics
under the power of deep learning, then turning back the clock 20 years we would
witness the wisdom of the cold-weapon era. This paper extensively reviews 400+
papers of object detection in the light of its technical evolution, spanning
over a quarter-century's time (from the 1990s to 2019). A number of topics have
been covered in this paper, including the milestone detectors in history,
detection datasets, metrics, fundamental building blocks of the detection
system, speed up techniques, and the recent state of the art detection methods.
This paper also reviews some important detection applications, such as
pedestrian detection, face detection, text detection, etc., and makes an
in-depth analysis of their challenges as well as technical improvements in
recent years.
Comment: This work has been submitted to the IEEE TPAMI for possible
publication
Rethinking the Faster R-CNN Architecture for Temporal Action Localization
We propose TAL-Net, an improved approach to temporal action localization in
video that is inspired by the Faster R-CNN object detection framework. TAL-Net
addresses three key shortcomings of existing approaches: (1) we improve
receptive field alignment using a multi-scale architecture that can accommodate
extreme variation in action durations; (2) we better exploit the temporal
context of actions for both proposal generation and action classification by
appropriately extending receptive fields; and (3) we explicitly consider
multi-stream feature fusion and demonstrate that fusing motion late is
important. We achieve state-of-the-art performance for both action proposal and
localization on the THUMOS'14 detection benchmark and competitive performance on
the ActivityNet challenge.
Comment: Accepted to CVPR 2018
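The late fusion of motion that the abstract highlights, combining per-stream scores rather than features, can be reduced to a very small sketch. The equal default weight is an illustrative choice, not the paper's tuned setting:

```python
import numpy as np

def late_fuse(rgb_scores, flow_scores, w_flow=0.5):
    """Late fusion (a sketch): run the detector separately on the RGB and
    optical-flow streams, then combine their per-proposal class scores,
    rather than fusing the stream features early in the network."""
    return (1.0 - w_flow) * rgb_scores + w_flow * flow_scores
```

Fusing at the score level lets each stream keep its own receptive-field and proposal behavior, which the abstract reports matters for localization quality.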
Revisiting the Sibling Head in Object Detector
The ``shared head for classification and localization'' (sibling head),
first introduced in Fast RCNN~\cite{girshick2015fast}, has led the fashion
of the object detection community for the past five years. This paper
provides the observation that the spatial misalignment between the two object
functions in the sibling head can considerably hurt the training process, but
this misalignment can be resolved by a very simple operator called task-aware
spatial disentanglement (TSD). Considering the classification and regression,
TSD decouples them from the spatial dimension by generating two disentangled
proposals for them, which are estimated by the shared proposal. This is
inspired by the natural insight that for one instance, the features in some
salient area may have rich information for classification while these around
the boundary may be good at bounding box regression. Surprisingly, this simple
design can boost all backbones and models on both MS COCO and Google OpenImage
consistently by ~3% mAP. Further, we propose a progressive constraint to
enlarge the performance margin between the disentangled and the shared
proposals, and gain ~1% more mAP. We show that TSD breaks through the upper
bound of today's single-model detectors by a large margin (mAP 49.4 with
ResNet-101, 51.2 with SENet154), and is the core model of our 1st place
solution on the Google OpenImage Challenge 2019.
Comment: Accepted to CVPR 2020; champion method of the OpenImage Challenge 2019,
detection track
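The disentanglement of the shared proposal into separate classification and regression proposals might be sketched as below. In the paper the deltas are predicted from RoI features; here they are passed in directly, so this is a simplified illustration:

```python
import numpy as np

def disentangle(proposal, delta_cls, delta_reg):
    """Derive separate classification/regression proposals from one shared
    proposal (a simplified TSD-style sketch). Each delta is a (dx, dy)
    offset scaled by the shared proposal's width and height, shifting the
    box toward the salient area (classification) or the object boundary
    (regression)."""
    x1, y1, x2, y2 = proposal
    w, h = x2 - x1, y2 - y1
    shift = lambda d: np.array([x1 + d[0] * w, y1 + d[1] * h,
                                x2 + d[0] * w, y2 + d[1] * h])
    return shift(delta_cls), shift(delta_reg)
```

Each head then pools RoI features from its own shifted box, so the two tasks stop competing for the same spatial evidence.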
Object Detection from Scratch with Deep Supervision
We propose Deeply Supervised Object Detectors (DSOD), an object detection
framework that can be trained from scratch. Recent advances in object detection
heavily depend on the off-the-shelf models pre-trained on large-scale
classification datasets like ImageNet and OpenImage. However, one problem is
that adopting pre-trained models from classification to the detection task may
incur learning bias due to the different objective functions and the diverse
distributions of object categories. Techniques like fine-tuning on the detection
task can alleviate this issue to some extent but do not address it fundamentally.
Furthermore, transferring these pre-trained models across discrepant domains
will be more difficult (e.g., from RGB to depth images). Thus, a better
solution to handle these critical problems is to train object detectors from
scratch, which motivates our proposed method. Previous efforts in this
direction mainly failed because of limited training data and naive backbone
network structures for object detection. In DSOD, we contribute a set of
design principles for learning object detectors from scratch. One of the key
principles is deep supervision, enabled by layer-wise dense connections in
both backbone networks and prediction layers, which plays a critical role in
learning good detectors from scratch. After incorporating several other principles, we build
our DSOD based on the single-shot detection framework (SSD). We evaluate our
method on PASCAL VOC 2007, 2012 and COCO datasets. DSOD achieves consistently
better results than the state-of-the-art methods with much more compact models.
Specifically, DSOD outperforms the baseline SSD on all three benchmarks
while requiring only half the parameters. We also observe that DSOD can achieve
comparable/slightly better results than Mask RCNN + FPN (under similar input
size) with only a third of the parameters, using no extra data or pre-trained models.
Comment: More results and analysis in this version. This is an extension of
our previous conference paper: arXiv:1708.0124
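The role of deep supervision, giving intermediate layers their own loss so gradients reach early layers directly, can be sketched at the loss level as below. The auxiliary weight is an illustrative choice, not DSOD's exact recipe (which realizes supervision implicitly through dense connections):

```python
import numpy as np

def cross_entropy(logits, label):
    """Numerically stable cross-entropy for a single example."""
    z = logits - logits.max()
    return -(z[label] - np.log(np.exp(z).sum()))

def deeply_supervised_loss(stage_logits, label, aux_weight=0.5):
    """Deep supervision (a sketch): every intermediate stage gets its own
    loss term, so early layers receive direct gradient signal instead of
    relying only on the final prediction. The last stage is weighted 1.0,
    auxiliary stages by `aux_weight`."""
    *aux, final = stage_logits
    loss = cross_entropy(final, label)
    loss += aux_weight * sum(cross_entropy(l, label) for l in aux)
    return loss
```

This direct gradient path to early layers is what makes training from scratch viable where a purely end-supervised network would struggle to converge.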