We present a novel form of interactive video object segmentation where a few
clicks by the user helps the system produce a full spatio-temporal segmentation
of the object of interest. Whereas conventional interactive pipelines take the
user's initialization as a starting point, we show the value in the system
taking the lead even in initialization. In particular, for a given video frame,
the system precomputes a ranked list of thousands of possible segmentation
hypotheses (also referred to as object region proposals) using image and motion
cues. Then, the user looks at the top ranked proposals, and clicks on the
object boundary to carve away erroneous ones. This process iterates (typically
2-3 times), and each time the system revises the top ranked proposal set, until
the user is satisfied with a resulting segmentation mask. Finally, the mask is
propagated across the video to produce a spatio-temporal object tube. On three
challenging datasets, we provide extensive comparisons with both existing work
and simpler alternative methods. In all, the proposed Click Carving approach
strikes an excellent balance of accuracy and human effort. It outperforms all
similarly fast methods, and is competitive or better than those requiring 2 to
12 times the effort.Comment: A preliminary version of the material in this document was filed as
University of Texas technical report no. UT AI16-0