98 research outputs found
Examining CNN Representations with respect to Dataset Bias
Given a pre-trained CNN without any testing samples, this paper proposes a
simple yet effective method to diagnose feature representations of the CNN. We
aim to discover representation flaws caused by potential dataset bias. More
specifically, when the CNN is trained to estimate image attributes, we mine
latent relationships between representations of different attributes inside the
CNN. Then, we compare the mined attribute relationships with ground-truth
attribute relationships to discover the CNN's blind spots and failure modes due
to dataset bias. In fact, representation flaws caused by dataset bias cannot be
examined by conventional evaluation strategies based on testing images, because
testing images may also have a similar bias. Experiments have demonstrated the
effectiveness of our method.Comment: in AAAI 201
Consistent Video Saliency Using Local Gradient Flow Optimization and Global Refinement
We present a novel spatiotemporal saliency detection method to estimate salient regions in videos based on the gradient flow field and energy optimization. The proposed gradient flow field incorporates two distinctive features: 1) intra-frame boundary information and 2) inter-frame motion information together for indicating the salient regions. Based on the effective utilization of both intra-frame and inter-frame information in the gradient flow field, our algorithm is robust enough to estimate the object and background in complex scenes with various motion patterns and appearances. Then, we introduce local as well as global contrast saliency measures using the foreground and background information estimated from the gradient flow field. These enhanced contrast saliency cues uniformly highlight an entire object. We further propose a new energy function to encourage the spatiotemporal consistency of the output saliency maps, which is seldom explored in previous video saliency methods. The experimental results show that the proposed algorithm outperforms state-of-the-art video saliency detection methods
Video Salient Object Detection via Fully Convolutional Networks
This paper proposes a deep learning model to efficiently detect salient regions in videos. It addresses two important issues: 1) deep video saliency model training with the absence of sufficiently large and pixel-wise annotated video data and 2) fast video saliency training and detection. The proposed deep video saliency network consists of two modules, for capturing the spatial and temporal saliency information, respectively. The dynamic saliency model, explicitly incorporating saliency estimates from the static saliency model, directly produces spatiotemporal saliency inference without time-consuming optical flow computation. We further propose a novel data augmentation technique that simulates video training data from existing annotated image data sets, which enables our network to learn diverse saliency information and prevents overfitting with the limited number of training videos. Leveraging our synthetic video data (150K video sequences) and real videos, our deep video saliency model successfully learns both spatial and temporal saliency cues, thus producing accurate spatiotemporal saliency estimate. We advance the state-of-the-art on the densely annotated video segmentation data set (MAE of .06) and the Freiburg-Berkeley Motion Segmentation data set (MAE of .07), and do so with much improved speed (2 fps with all steps)
Bird's-Eye-View Scene Graph for Vision-Language Navigation
Vision-language navigation (VLN), which entails an agent to navigate 3D
environments following human instructions, has shown great advances. However,
current agents are built upon panoramic observations, which hinders their
ability to perceive 3D scene geometry and easily leads to ambiguous selection
of panoramic view. To address these limitations, we present a BEV Scene Graph
(BSG), which leverages multi-step BEV representations to encode scene layouts
and geometric cues of indoor environment under the supervision of 3D detection.
During navigation, BSG builds a local BEV representation at each step and
maintains a BEV-based global scene map, which stores and organizes all the
online collected local BEV representations according to their topological
relations. Based on BSG, the agent predicts a local BEV grid-level decision
score and a global graph-level decision score, combined with a sub-view
selection score on panoramic views, for more accurate action prediction. Our
approach significantly outperforms state-of-the-art methods on REVERIE, R2R,
and R4R, showing the potential of BEV perception in VLN.Comment: Accepted at ICCV 2023; Project page:
https://github.com/DefaultRui/BEV-Scene-Grap
- …