Frame-to-Frame Aggregation of Active Regions in Web Videos for Weakly Supervised Semantic Segmentation
When a deep neural network is trained on data with only image-level labeling,
the regions activated in each image tend to identify only a small region of the
target object. We propose a method of using videos automatically harvested from
the web to identify a larger region of the target object by using temporal
information, which is not present in the static image. The temporal variations
in a video allow different regions of the target object to be activated. We
obtain an activated region in each frame of a video, and then aggregate the
regions from successive frames into a single image, using a warping technique
based on optical flow. The resulting localization maps cover more of the target
object, and can then be used as proxy ground-truth to train a segmentation
network. This simple approach outperforms existing methods under the same level
of supervision, and even methods that rely on extra annotations. With
VGG-16 and ResNet-101 backbones, our method achieves mIoU of 65.0 and 67.4,
respectively, on PASCAL VOC 2012 test images, which represents a new
state-of-the-art.
Comment: ICCV 201
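
The aggregation step described in the abstract (warping per-frame activated regions into one frame using optical flow, then combining them) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the flow fields are assumed to be precomputed by some external estimator, warping uses nearest-neighbour sampling, and the maps are combined with an element-wise maximum, a rule the abstract does not specify. The function names `warp_to_reference` and `aggregate` are hypothetical.

```python
def warp_to_reference(act_map, flow):
    """Backward-warp an activation map into the reference frame.

    act_map: 2D list (H x W) of activation scores for a neighbouring frame.
    flow:    2D list (H x W) of (dx, dy) tuples; flow[y][x] is the assumed
             displacement from reference pixel (x, y) to the matching
             location in the neighbouring frame.
    """
    h, w = len(act_map), len(act_map[0])
    warped = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Nearest-neighbour sampling of the source location.
            sx, sy = int(round(x + dx)), int(round(y + dy))
            if 0 <= sx < w and 0 <= sy < h:
                warped[y][x] = act_map[sy][sx]
            # Pixels that map outside the frame keep a score of 0.
    return warped


def aggregate(ref_map, other_maps, flows):
    """Combine the reference map with warped maps from other frames.

    Assumption: an element-wise maximum is used as the combination rule.
    """
    agg = [row[:] for row in ref_map]
    for act_map, flow in zip(other_maps, flows):
        warped = warp_to_reference(act_map, flow)
        for y in range(len(agg)):
            for x in range(len(agg[0])):
                agg[y][x] = max(agg[y][x], warped[y][x])
    return agg


# Toy 1 x 4 example: the reference frame activates pixel 0; the next frame
# activates pixel 2, and everything has shifted one pixel to the right.
ref = [[1.0, 0.0, 0.0, 0.0]]
nxt = [[0.0, 0.0, 1.0, 0.0]]
flow = [[(1, 0)] * 4]
combined = aggregate(ref, [nxt], [flow])
```

In the toy example the activation from the second frame lands on reference pixel 1, so the combined map covers two pixels of the object instead of one, which is the effect the abstract attributes to temporal aggregation.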