2,725 research outputs found
Deep Interactive Region Segmentation and Captioning
With recent innovations in dense image captioning, it is now possible to
describe every object of the scene with a caption while objects are determined
by bounding boxes. However, interpretation of such an output is not trivial due
to the existence of many overlapping bounding boxes. Furthermore, in current
captioning frameworks, the user is not able to involve personal preferences to
exclude out of interest areas. In this paper, we propose a novel hybrid deep
learning architecture for interactive region segmentation and captioning where
the user is able to specify an arbitrary region of the image that should be
processed. To this end, a dedicated Fully Convolutional Network (FCN) named
Lyncean FCN (LFCN) is trained using our special training data to isolate the
User Intention Region (UIR) as the output of an efficient segmentation. In
parallel, a dense image captioning model is utilized to provide a wide variety
of captions for that region. Then, the UIR will be explained with the caption
of the best match bounding box. To the best of our knowledge, this is the first
work that provides such a comprehensive output. Our experiments show the
superiority of the proposed approach over state-of-the-art interactive
segmentation methods on several well-known datasets. In addition, replacement
of the bounding boxes with the result of the interactive segmentation leads to
a better understanding of the dense image captioning output as well as accuracy
enhancement for the object detection in terms of Intersection over Union (IoU).Comment: 17, pages, 9 figure
Deep Extreme Cut: From Extreme Points to Object Segmentation
This paper explores the use of extreme points in an object (left-most,
right-most, top, bottom pixels) as input to obtain precise object segmentation
for images and videos. We do so by adding an extra channel to the image in the
input of a convolutional neural network (CNN), which contains a Gaussian
centered in each of the extreme points. The CNN learns to transform this
information into a segmentation of an object that matches those extreme points.
We demonstrate the usefulness of this approach for guided segmentation
(grabcut-style), interactive segmentation, video object segmentation, and dense
segmentation annotation. We show that we obtain the most precise results to
date, also with less user input, in an extensive and varied selection of
benchmarks and datasets. All our models and code are publicly available on
http://www.vision.ee.ethz.ch/~cvlsegmentation/dextr/.Comment: CVPR 2018 camera ready. Project webpage and code:
http://www.vision.ee.ethz.ch/~cvlsegmentation/dextr
Lucid Data Dreaming for Video Object Segmentation
Convolutional networks reach top quality in pixel-level video object
segmentation but require a large amount of training data (1k~100k) to deliver
such results. We propose a new training strategy which achieves
state-of-the-art results across three evaluation datasets while using 20x~1000x
less annotated data than competing methods. Our approach is suitable for both
single and multiple object segmentation. Instead of using large training sets
hoping to generalize across domains, we generate in-domain training data using
the provided annotation on the first frame of each video to synthesize ("lucid
dream") plausible future video frames. In-domain per-video training data allows
us to train high quality appearance- and motion-based models, as well as tune
the post-processing stage. This approach allows to reach competitive results
even when training from only a single annotated frame, without ImageNet
pre-training. Our results indicate that using a larger training set is not
automatically better, and that for the video object segmentation task a smaller
training set that is closer to the target domain is more effective. This
changes the mindset regarding how many training samples and general
"objectness" knowledge are required for the video object segmentation task.Comment: Accepted in International Journal of Computer Vision (IJCV
- …