92 research outputs found

    Video Salient Object Detection via Fully Convolutional Networks

    This paper proposes a deep learning model to efficiently detect salient regions in videos. It addresses two important issues: 1) training a deep video saliency model in the absence of sufficiently large, pixel-wise annotated video data, and 2) fast video saliency training and detection. The proposed deep video saliency network consists of two modules that capture spatial and temporal saliency information, respectively. The dynamic saliency model, which explicitly incorporates saliency estimates from the static saliency model, directly produces spatiotemporal saliency inference without time-consuming optical flow computation. We further propose a novel data augmentation technique that simulates video training data from existing annotated image data sets, which enables our network to learn diverse saliency information and prevents overfitting given the limited number of training videos. Leveraging our synthetic video data (150K video sequences) and real videos, our deep video saliency model successfully learns both spatial and temporal saliency cues, thus producing accurate spatiotemporal saliency estimates. We advance the state of the art on the densely annotated video segmentation data set (MAE of .06) and the Freiburg-Berkeley Motion Segmentation data set (MAE of .07), and do so with much improved speed (2 fps with all steps).
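    The two-module design described in this abstract can be illustrated with a minimal sketch. The class names, layer widths, and the way the static saliency map is concatenated with the frame pair are assumptions for illustration, not the authors' exact architecture; the sketch only mirrors the idea of feeding the static (spatial) estimate into the dynamic (temporal) branch so that no optical flow is needed.

```python
# Minimal sketch (assumed PyTorch API); StaticSaliencyNet / DynamicSaliencyNet
# and their layer widths are hypothetical, not the paper's exact architecture.
import torch
import torch.nn as nn

class StaticSaliencyNet(nn.Module):
    """Per-frame (spatial) branch: RGB frame -> single-channel saliency map."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1), nn.Sigmoid(),
        )

    def forward(self, frame):
        return self.net(frame)

class DynamicSaliencyNet(nn.Module):
    """Temporal branch: consumes a frame pair plus the static saliency map,
    so no explicit optical flow is computed."""
    def __init__(self):
        super().__init__()
        # 3 + 3 + 1 channels: previous frame, current frame, static saliency.
        self.net = nn.Sequential(
            nn.Conv2d(7, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1), nn.Sigmoid(),
        )

    def forward(self, prev_frame, cur_frame, static_map):
        x = torch.cat([prev_frame, cur_frame, static_map], dim=1)
        return self.net(x)

# Usage: spatiotemporal saliency for one frame pair.
static_net, dynamic_net = StaticSaliencyNet(), DynamicSaliencyNet()
prev_f, cur_f = torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)
spatial = static_net(cur_f)
spatiotemporal = dynamic_net(prev_f, cur_f, spatial)
```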

    Consistent Video Saliency Using Local Gradient Flow Optimization and Global Refinement

    We present a novel spatiotemporal saliency detection method that estimates salient regions in videos based on a gradient flow field and energy optimization. The proposed gradient flow field incorporates two distinctive features: 1) intra-frame boundary information and 2) inter-frame motion information, which together indicate the salient regions. By effectively using both intra-frame and inter-frame information in the gradient flow field, our algorithm is robust enough to estimate the object and background in complex scenes with various motion patterns and appearances. We then introduce local as well as global contrast saliency measures using the foreground and background information estimated from the gradient flow field. These enhanced contrast saliency cues uniformly highlight an entire object. We further propose a new energy function to encourage spatiotemporal consistency of the output saliency maps, which is seldom explored in previous video saliency methods. Experimental results show that the proposed algorithm outperforms state-of-the-art video saliency detection methods.
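    As a rough illustration of how an intra-frame boundary cue and an inter-frame motion cue might be fused into a single per-pixel field, here is a minimal sketch using OpenCV and NumPy. The weighted-sum fusion, the normalization, and the `gradient_flow_field` function name are assumptions, not the paper's exact gradient flow formulation or energy optimization.

```python
# Minimal sketch: fuse intra-frame boundary gradients with inter-frame motion
# into one per-pixel field. The fusion rule below is an assumption, not the
# paper's exact gradient flow field.
import cv2
import numpy as np

def gradient_flow_field(prev_gray, cur_gray, alpha=0.5):
    """Combine spatial gradient magnitude (boundary cue) with absolute
    frame difference (motion cue), each normalized to [0, 1]."""
    gx = cv2.Sobel(cur_gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(cur_gray, cv2.CV_32F, 0, 1, ksize=3)
    boundary = cv2.magnitude(gx, gy)
    motion = np.abs(cur_gray.astype(np.float32) - prev_gray.astype(np.float32))

    def norm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-8)

    return alpha * norm(boundary) + (1 - alpha) * norm(motion)

# Usage with two synthetic grayscale frames (stand-ins for consecutive frames).
prev_gray = np.random.randint(0, 256, (240, 320), dtype=np.uint8)
cur_gray = np.random.randint(0, 256, (240, 320), dtype=np.uint8)
field = gradient_flow_field(prev_gray, cur_gray)
```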

    Bird's-Eye-View Scene Graph for Vision-Language Navigation

    Vision-language navigation (VLN), which requires an agent to navigate 3D environments following human instructions, has shown great advances. However, current agents are built upon panoramic observations, which hinders their ability to perceive 3D scene geometry and easily leads to ambiguous selection of panoramic views. To address these limitations, we present a BEV Scene Graph (BSG), which leverages multi-step BEV representations to encode scene layouts and geometric cues of the indoor environment under the supervision of 3D detection. During navigation, BSG builds a local BEV representation at each step and maintains a BEV-based global scene map, which stores and organizes all the online-collected local BEV representations according to their topological relations. Based on BSG, the agent predicts a local BEV grid-level decision score and a global graph-level decision score, which are combined with a sub-view selection score on panoramic views for more accurate action prediction. Our approach significantly outperforms state-of-the-art methods on REVERIE, R2R, and R4R, showing the potential of BEV perception in VLN. Comment: Accepted at ICCV 2023; Project page: https://github.com/DefaultRui/BEV-Scene-Grap
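    The abstract's combination of three decision scores can be illustrated with a small sketch. The weighted-sum fusion, the softmax, and the function name `fuse_action_scores` are assumptions about how such scores could be combined over a shared candidate action set, not the authors' exact mechanism.

```python
# Minimal sketch: combine the three decision scores named in the abstract
# (local BEV grid-level, global graph-level, panoramic sub-view). The weighted
# fusion below is an assumption, not the authors' exact rule.
import torch
import torch.nn.functional as F

def fuse_action_scores(bev_grid_score, graph_score, subview_score,
                       weights=(1.0, 1.0, 1.0)):
    """Each input is a (num_candidates,) tensor of logits over the same
    candidate action set; return a distribution over candidate actions."""
    w1, w2, w3 = weights
    fused = w1 * bev_grid_score + w2 * graph_score + w3 * subview_score
    return F.softmax(fused, dim=-1)

# Usage: 5 candidate navigable viewpoints at the current step.
probs = fuse_action_scores(torch.randn(5), torch.randn(5), torch.randn(5))
next_action = probs.argmax().item()
```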