Frame-to-Frame Aggregation of Active Regions in Web Videos for Weakly Supervised Semantic Segmentation
When a deep neural network is trained on data with only image-level labeling,
the regions activated in each image tend to identify only a small region of the
target object. We propose a method of using videos automatically harvested from
the web to identify a larger region of the target object by using temporal
information, which is not present in the static image. The temporal variations
in a video allow different regions of the target object to be activated. We
obtain an activated region in each frame of a video, and then aggregate the
regions from successive frames into a single image, using a warping technique
based on optical flow. The resulting localization maps cover more of the target
object, and can then be used as proxy ground-truth to train a segmentation
network. This simple approach outperforms existing methods under the same level
of supervision, and even methods that rely on extra annotations. Based on
VGG-16 and ResNet-101 backbones, our method achieves mIoU of 65.0 and 67.4,
respectively, on PASCAL VOC 2012 test images, which represents a new
state-of-the-art.
Comment: ICCV 201
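
The aggregation step described in this abstract (obtain an activated region per frame, then warp the regions from successive frames into a single image using optical flow) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `warp_map` and `aggregate`, the nearest-neighbour sampling, and the element-wise maximum used to merge maps are all assumptions made for illustration, and the flow fields are assumed to be supplied by some external optical-flow estimator.

```python
import numpy as np


def warp_map(act, flow):
    """Backward-warp an activation map into the reference frame.

    act:  (H, W) activation map of a neighbouring frame.
    flow: (H, W, 2) flow from the reference frame to that frame, where
          flow[y, x] = (dx, dy) means reference pixel (x, y) corresponds
          to pixel (x + dx, y + dy) in the neighbouring frame.
    Nearest-neighbour sampling is an assumption chosen for brevity.
    """
    h, w = act.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)
    # Pixels whose source location falls outside the frame stay at zero.
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    warped = np.zeros_like(act)
    warped[valid] = act[src_y[valid], src_x[valid]]
    return warped


def aggregate(ref_act, neighbour_acts, flows):
    """Merge warped neighbour maps into the reference map.

    The element-wise maximum is a hypothetical merge rule, not one
    stated in the abstract.
    """
    agg = ref_act.copy()
    for act, flow in zip(neighbour_acts, flows):
        agg = np.maximum(agg, warp_map(act, flow))
    return agg
```

In this sketch, a region that is activated only in a neighbouring frame is carried back to its corresponding location in the reference frame, so the aggregated map can cover more of the object than any single-frame map.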
Exploring Bottom-up and Top-down Cues with Attentive Learning for Webly Supervised Object Detection
Fully supervised object detection has achieved great success in recent years.
However, abundant bounding-box annotations are needed to train a detector
for novel classes. To reduce the human labeling effort, we propose a novel
webly supervised object detection (WebSOD) method for novel classes that
requires only web images, without further annotations. Our proposed method
combines bottom-up and top-down cues for novel class detection. Within our
approach, we introduce a bottom-up mechanism based on a well-trained fully
supervised object detector (i.e., Faster RCNN) as an object region estimator for
web images by recognizing the common objectness shared by base and novel
classes. With the estimated regions on the web images, we then utilize the
top-down attention cues as the guidance for region classification. Furthermore,
we propose a residual feature refinement (RFR) block to tackle the domain
mismatch between the web domain and the target domain. We evaluate our proposed
method on the PASCAL VOC dataset with three different novel/base splits. Without
any target-domain novel-class images and annotations, our proposed webly
supervised object detection model is able to achieve promising performance for
novel classes. We also conduct transfer learning experiments on the
large-scale ILSVRC 2013 detection dataset and achieve state-of-the-art performance
- …