Panoramic Vision Transformer for Saliency Detection in 360° Videos
360° video saliency detection is a challenging benchmark for 360° video
understanding, since non-negligible distortion and discontinuity occur in
every projection format of 360° videos, and a capture-worthy viewpoint on
the omnidirectional sphere is inherently ambiguous.
We present a new framework named Panoramic Vision Transformer (PAVER). We
design the encoder using a Vision Transformer with deformable convolution, which
enables us not only to plug pretrained models from normal videos into our
architecture without additional modules or finetuning, but also to perform
the geometric approximation only once, unlike previous deep CNN-based approaches.
Thanks to its powerful encoder, PAVER can learn saliency from three simple
relative relations among local patch features, outperforming state-of-the-art
models on the Wild360 benchmark by large margins without supervision or
auxiliary information such as class activation. We demonstrate the utility of our
saliency prediction model on the omnidirectional video quality assessment
task in VQA-ODV, where we consistently improve performance without any form of
supervision, including head movement.
Comment: Published at ECCV 2022
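The latitude-dependent distortion that such an encoder must absorb can be
illustrated with a toy computation. The sketch below (all names are
hypothetical, not PAVER's actual code) widens the horizontal sample spacing
by 1/cos(latitude), a crude stand-in for geometric deformable-convolution
offsets; since they depend only on geometry, they can be computed once and
reused, echoing the one-time geometric approximation described above.

```python
import numpy as np

def equirect_patch_offsets(h_patches, kernel=3):
    """Horizontal sampling offsets for patch rows of an equirectangular frame.

    Sample spacing is widened by 1/cos(latitude) to approximate the
    projection distortion; the offsets depend only on geometry, so they
    are computed once and reused for every video.
    """
    # latitude of each patch-row center, in (-pi/2, pi/2)
    lats = (np.arange(h_patches) + 0.5) / h_patches * np.pi - np.pi / 2
    base = np.arange(kernel) - kernel // 2       # regular kernel taps
    offsets = np.zeros((h_patches, kernel))
    for i, lat in enumerate(lats):
        stretch = 1.0 / max(np.cos(lat), 1e-3)   # wider sampling near the poles
        offsets[i] = base * (stretch - 1.0)      # extra shift vs. a regular conv
    return offsets

offsets = equirect_patch_offsets(h_patches=8)
# rows near the equator need almost no correction; polar rows need large shifts
```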
Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation
Sound can convey significant information for spatial reasoning in our daily
lives. To endow deep networks with such ability, we address the challenge of
dense indoor prediction with sound in both 2D and 3D via cross-modal knowledge
distillation. In this work, we propose a Spatial Alignment via Matching (SAM)
distillation framework that elicits local correspondence between the two
modalities in vision-to-audio knowledge transfer. SAM integrates audio features
with visually coherent learnable spatial embeddings to resolve inconsistencies
in multiple layers of a student model. Our approach does not rely on a specific
input representation, allowing for flexibility in the input shapes or
dimensions without performance degradation. With a newly curated benchmark
named Dense Auditory Prediction of Surroundings (DAPS), we are the first to
tackle dense indoor prediction of omnidirectional surroundings in both 2D and
3D with audio observations. Specifically, for audio-based depth estimation,
semantic segmentation, and challenging 3D scene reconstruction, the proposed
distillation framework consistently achieves state-of-the-art performance
across various metrics and backbone architectures.
Comment: Published at ICCV 2023
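The core matching idea can be illustrated with a toy loss. The function
below is only in the spirit of SAM; names, shapes, and the plain L2
objective are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def sam_distill_loss(student_audio, teacher_visual, spatial_emb):
    """Toy cross-modal distillation loss in the spirit of SAM.

    `student_audio` and `teacher_visual` are (H, W, C) feature maps from one
    layer of the audio student and the frozen visual teacher; `spatial_emb`
    is a learnable (H, W, C) embedding that lets the audio features absorb
    spatially varying misalignment before matching.
    """
    aligned = student_audio + spatial_emb    # inject learnable spatial cues
    diff = aligned - teacher_visual
    return float(np.mean(diff ** 2))         # per-layer matching loss

rng = np.random.default_rng(0)
t = rng.normal(size=(4, 8, 16))
s = rng.normal(size=(4, 8, 16))
# a perfectly chosen embedding closes the cross-modal gap entirely
assert sam_distill_loss(s, t, t - s) < 1e-12
```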
A Mobile Robot Generating Video Summaries of Seniors' Indoor Activities
We develop a system that generates summaries from seniors' indoor-activity
videos captured by a social robot, to help remote family members know their
seniors' daily activities at home. Unlike traditional video summarization
datasets, indoor videos captured from a moving robot pose additional
challenges: (i) the video sequences are very long; (ii) a significant
number of video frames contain no subject, or contain subjects at ill-posed
locations and scales; and (iii) most of the well-posed frames contain highly
redundant information. To address these problems, we propose to exploit pose
estimation for detecting people in frames; this guides the robot to
follow the user and capture effective videos. We use person identification to
distinguish a target senior from other people. We also make use of action
recognition to analyze seniors' major activities at different moments, and
develop a video summarization method to select diverse and representative
keyframes as summaries.
Comment: accepted by MobileHCI'19
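Diverse-and-representative keyframe selection can be sketched with greedy
farthest-point sampling over per-frame descriptors. This is a generic
illustration of the idea, not the paper's exact method; the function name
and inputs are hypothetical.

```python
import numpy as np

def select_keyframes(features, k):
    """Greedily pick k diverse keyframes by farthest-point selection.

    `features` is an (N, D) array of per-frame descriptors (e.g. pose or
    appearance features of the well-posed frames). Each step adds the frame
    farthest from every keyframe chosen so far, so redundant frames are
    skipped and the summary stays diverse.
    """
    chosen = [0]                                 # seed with the first frame
    while len(chosen) < k:
        # distance of every frame to its nearest already-chosen keyframe
        d = np.min(
            [np.linalg.norm(features - features[c], axis=1) for c in chosen],
            axis=0,
        )
        chosen.append(int(np.argmax(d)))         # most novel frame next
    return chosen
```

With two tight clusters of frames, the first two picks land in different
clusters, which is exactly the redundancy-avoiding behavior a summary needs.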
Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos
© 2021 IEEE. 360° videos convey holistic views of the surroundings of a scene. They provide audio-visual cues beyond a predetermined normal field of view and display distinctive spatial relations on a sphere. However, previous benchmark tasks for panoramic videos are still limited in evaluating the semantic understanding of audio-visual relationships or the spherical spatial properties of surroundings. We propose a novel benchmark named Pano-AVQA, a large-scale grounded audio-visual question answering dataset on panoramic videos. Using 5.4K 360° video clips harvested online, we collect two types of novel question-answer pairs with bounding-box grounding: spherical spatial relation QAs and audio-visual relation QAs. We train several transformer-based models on Pano-AVQA, where the results suggest that our proposed spherical spatial embeddings and multimodal training objectives contribute to a better semantic understanding of the panoramic surroundings on the dataset.
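One simple ingredient of spherical spatial reasoning can be sketched as
follows: mapping a grounded box center from normalized equirectangular
coordinates onto the unit sphere, so that spatial relations respect the
spherical geometry rather than raw pixel distances. This is an illustrative
simplification, not Pano-AVQA's actual embedding.

```python
import numpy as np

def equirect_to_sphere(u, v):
    """Map normalized equirectangular coordinates to a unit-sphere point.

    u in [0, 1) is the longitude position, v in [0, 1] the latitude
    position (v = 0 at the top of the frame). The returned 3D point makes
    left/right wrap-around and polar distortion explicit, unlike raw
    pixel coordinates.
    """
    lon = u * 2 * np.pi - np.pi      # longitude in [-pi, pi)
    lat = np.pi / 2 - v * np.pi      # latitude in [pi/2, -pi/2]
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    return np.array([x, y, z])

center = equirect_to_sphere(0.5, 0.5)   # frame center -> forward direction
```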
Transitional Adaptation of Pretrained Models for Visual Storytelling
© 2021 IEEE. Previous models for vision-to-language generation tasks usually pretrain a visual encoder and a language generator in their respective domains and jointly finetune them on the target task. However, this direct transfer practice may suffer from a discord between visual specificity and language fluency, since the two modules are often trained separately on large corpora of visual and text data with no common ground. In this work, we claim that a transitional adaptation task is required between pretraining and finetuning to harmonize the visual encoder and the language model for challenging downstream target tasks like visual storytelling. We propose a novel approach named Transitional Adaptation of Pretrained Model (TAPM) that adapts the multi-modal modules to each other with a simpler alignment task between visual inputs only, with no need for text labels. Through extensive experiments, we show that the adaptation step significantly improves the performance of multiple language models on sequential video and image captioning tasks. We achieve new state-of-the-art performance on both language metrics and human evaluation in the multi-sentence description task of LSMDC 2019 [50] and the image storytelling task of VIST [18]. Our experiments reveal that this improvement in caption quality does not depend on the specific choice of language models.
Trapped Gravitational Waves in Jackiw–Teitelboim Gravity
We discuss the possibility that gravitational fluctuations (“gravitational waves”) are trapped in space by gravitational interactions in two-dimensional Jackiw–Teitelboim gravity. In the standard geon (gravitational electromagnetic entity) approach, the effective energy is entirely deposited in a thin layer, the active region, that achieves spatial self-confinement and raises doubts about the geon’s stability. In this paper we relinquish the “active region” approach and obtain self-confinement of “gravitational waves” that are trapped by the vacuum geometry and can be stable against the backreaction due to metric fluctuations.
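For background, Jackiw–Teitelboim gravity couples a dilaton field $\phi$ to two-dimensional gravity; a commonly quoted form of the bulk action (signs, units, and boundary terms vary across the literature) is

```latex
S_{\mathrm{JT}} \;=\; \frac{1}{2}\int_{\mathcal{M}} d^2x\,\sqrt{-g}\;\phi\,\bigl(R + 2\bigr)
```

Varying with respect to $\phi$ enforces $R = -2$, so the metric is locally AdS$_2$; the trapped fluctuations discussed above therefore propagate on this rigid vacuum geometry, with the dilaton sector carrying the nontrivial dynamics.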
AWMC: Abnormal-Weather Monitoring and Curation Service Based on Dynamic Graph Embedding
This paper presents a system, namely, the abnormal-weather monitoring and curation service (AWMC), which provides people with a better understanding of abnormal weather conditions. The service can analyze a set of multivariate weather datasets (i.e., 7 meteorological datasets from 18 cities in Korea) and show (i) which dates are most abnormal in a certain city, and (ii) which cities are most abnormal on a certain date. In particular, a dynamic graph-embedding-based anomaly detection method was employed to measure anomaly scores. We implemented the service and conducted evaluations. Regarding the results of monitoring abnormal weather, AWMC shows that the average precision was approximately 90.9%, recall was 93.2%, and the F1 score was 92.1% across all the cities.
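The reported metrics combine in the usual way. A quick sanity check with
illustrative counts (the tp/fp/fn values below are hypothetical, chosen
only so the rates match the reported figures):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from detection counts.

    tp/fp/fn are true-positive, false-positive, and false-negative counts
    over flagged (date, city) anomalies.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 90.9% precision and 93.2% recall yield an F1 close to the reported 92.1%
p, r, f = precision_recall_f1(tp=909, fp=91, fn=66)
```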