Detecting and removing visual distractors for video aesthetic enhancement
Personal videos often contain visual distractors: objects accidentally captured in the frame that draw viewers' attention away from the main subjects. We propose a method to automatically detect and localize these distractors by learning from a manually labeled dataset. To achieve spatially and temporally coherent detection, we extract features at the Temporal-Superpixel (TSP) level within a traditional SVM-based learning framework. We also experiment with end-to-end learning using Convolutional Neural Networks (CNNs), which achieves slightly higher performance. The classification result is further refined in a post-processing step based on graph-cut optimization. Experimental results show that our method achieves an accuracy of 81% and a recall of 86%. We demonstrate several ways of removing the detected distractors to improve video quality, including video hole filling, video frame replacement, and camera path re-planning. A user study shows that our method significantly improves the aesthetic quality of videos.
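The two-stage pipeline in this abstract (per-superpixel SVM classification followed by a spatio-temporal smoothing pass) can be sketched roughly as below. This is a toy illustration, not the authors' code: the features are synthetic, and a simple neighborhood majority vote stands in for the paper's graph-cut refinement.

```python
# Hypothetical sketch of the described pipeline:
# (1) classify per-superpixel features with an SVM,
# (2) smooth the predicted labels over neighboring superpixels
#     (a crude stand-in for graph-cut optimization).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic features for 20 temporal superpixels in sequence order:
# distractors (label 1) have higher mean feature values.
labels_true = np.array([0] * 10 + [1] * 10)
features = rng.normal(loc=labels_true[:, None], scale=0.5, size=(20, 4))

clf = SVC(kernel="rbf").fit(features, labels_true)
raw = clf.predict(features)

def smooth(pred):
    # Majority vote over each superpixel and its two neighbors,
    # enforcing temporal coherence of the labeling.
    out = pred.copy()
    for i in range(1, len(pred) - 1):
        out[i] = round((pred[i - 1] + pred[i] + pred[i + 1]) / 3)
    return out

refined = smooth(raw)
```

In the paper the smoothing step is a graph cut over a spatio-temporal superpixel graph, which also accounts for boundary costs between adjacent segments; the vote above only captures the coherence idea.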
Breaking the "Object" in Video Object Segmentation
The appearance of an object can be fleeting when it transforms. As eggs are
broken or paper is torn, their color, shape and texture can change
dramatically, preserving virtually nothing of the original except for the
identity itself. Yet, this important phenomenon is largely absent from existing
video object segmentation (VOS) benchmarks. In this work, we close the gap by
collecting a new dataset for Video Object Segmentation under Transformations
(VOST). It consists of more than 700 high-resolution videos, captured in
diverse environments, which are 20 seconds long on average and densely labeled
with instance masks. A careful, multi-step approach is adopted to ensure that
these videos focus on complex object transformations, capturing their full
temporal extent. We then extensively evaluate state-of-the-art VOS methods and
make a number of important discoveries. In particular, we show that existing
methods struggle when applied to this novel task and that their main limitation
lies in over-reliance on static appearance cues. This motivates us to propose a
few modifications for the top-performing baseline that improve its capabilities
by better modeling spatio-temporal information. But more broadly, the hope is
to stimulate discussion on learning more robust video object representations
…