Panoramic Vision Transformer for Saliency Detection in 360° Videos
360° video saliency detection is a challenging benchmark for 360° video
understanding, since non-negligible distortion and discontinuity occur in the
projection of any format of 360° videos, and the capture-worthy viewpoint in the
omnidirectional sphere is ambiguous by nature.
We present a new framework named Panoramic Vision Transformer (PAVER). We
design the encoder using Vision Transformer with deformable convolution, which
enables us not only to plug pretrained models from normal videos into our
architecture without additional modules or finetuning but also to perform
geometric approximation only once, unlike previous deep CNN-based approaches.
Thanks to its powerful encoder, PAVER can learn saliency from three simple
relative relations among local patch features, outperforming state-of-the-art
models on the Wild360 benchmark by large margins without supervision or
auxiliary information like class activation. We demonstrate the utility of our
saliency prediction model with the omnidirectional video quality assessment
task in VQA-ODV, where we consistently improve performance without any form of
supervision, including head movement.
Comment: Published at ECCV 2022.
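A minimal sketch of the idea described above, under my own assumptions about shapes and offsets (this is not the authors' released code): the ViT patch embedding is realized as a deformable convolution whose offsets approximate the equirectangular distortion and are computed only once, after which a simple relative relation among patch features yields a saliency score. Helper names such as sphere_offsets and patch_saliency are hypothetical.

import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

def sphere_offsets(h, w, k=16):
    # Hypothetical geometric approximation, computed once: stretch the sampling
    # locations as the patch row moves toward the poles, where an equirectangular
    # frame over-represents the sphere. PAVER derives its offsets from proper
    # spherical geometry; this is only a stand-in to show the data flow.
    lat = torch.linspace(-torch.pi / 2, torch.pi / 2, h // k)
    stretch = 1.0 / torch.cos(lat).clamp(min=0.2) - 1.0          # grows toward the poles
    off = torch.zeros(1, 2 * k * k, h // k, w // k)
    off[:, 1::2] = stretch.view(1, 1, -1, 1)                     # shift one offset coordinate per sample
    return off

class DeformablePatchEmbed(nn.Module):
    # ViT-style patch embedding realized as a deformable convolution, so pretrained
    # patch-projection weights from ordinary planar video models could be reused
    # while the fixed offsets absorb the projection distortion.
    def __init__(self, dim=768, k=16, h=224, w=448):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.randn(dim, 3, k, k) * 0.02)
        self.bias = nn.Parameter(torch.zeros(dim))
        self.register_buffer("offset", sphere_offsets(h, w, k))  # geometric approximation, done once
    def forward(self, x):
        off = self.offset.repeat(x.shape[0], 1, 1, 1)
        feat = deform_conv2d(x, off, self.weight, self.bias, stride=(self.k, self.k))
        return feat.flatten(2).transpose(1, 2)                   # (batch, num_patches, dim)

def patch_saliency(tokens):
    # One simple "relative relation": score each patch by how far its feature
    # deviates from the mean patch feature of the frame.
    center = tokens.mean(dim=1, keepdim=True)
    return 1.0 - torch.cosine_similarity(tokens, center, dim=-1)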
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
Recent research on Large Language Models (LLMs) has led to remarkable
advancements in general NLP AI assistants. Some studies have further explored
the use of LLMs for planning and invoking models or APIs to address more
general multi-modal user queries. Despite this progress, complex vision-based
tasks remain challenging due to the diverse nature of visual tasks. This
diversity is reflected in two aspects: 1) Reasoning paths. For many real-life
applications, it is hard to accurately decompose a query simply by examining
the query itself. Planning based on the specific visual content and the results
of each step is usually required. 2) Flexible inputs and intermediate results.
Input forms can be flexible in in-the-wild cases, involving not only a
single image or video but a mixture of videos and images, e.g., a user-view
image with some reference videos. Besides, a complex reasoning process will
also generate diverse multimodal intermediate results, e.g., video narrations,
segmented video clips, etc. To address such general cases, we propose a
multi-modal AI assistant, AssistGPT, with an interleaved code and language
reasoning approach called Plan, Execute, Inspect, and Learn (PEIL) to integrate
LLMs with various tools. Specifically, the Planner uses natural language to
plan which tool in the Executor should be invoked next based on the current
reasoning progress. The Inspector is an efficient memory manager that assists the
Planner in feeding the proper visual information into a specific tool. Finally, since
the entire reasoning process is complex and flexible, a Learner is designed to
enable the model to autonomously explore and discover the optimal solution. We
conducted experiments on the A-OKVQA and NExT-QA benchmarks, achieving
state-of-the-art results. Moreover, qualitative showcases demonstrate the ability of our
system to handle questions far more complex than those found in the benchmarks.
Comment: Project page: https://showlab.github.io/assistgpt
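Read literally, the abstract describes a tool-use control loop with four roles. Below is a minimal, hypothetical sketch of such a loop; it is my own illustration of the roles as described, not AssistGPT's implementation, and call_llm, parse_plan, and the tool interface are placeholders.

from dataclasses import dataclass, field

@dataclass
class Inspector:
    # Memory manager: records each visual input or intermediate result together
    # with a short textual summary, so the Planner can refer to artifacts by name.
    memory: dict = field(default_factory=dict)
    def add(self, name, artifact, summary):
        self.memory[name] = {"artifact": artifact, "summary": summary}
    def describe(self):
        return "\n".join(f"{k}: {v['summary']}" for k, v in self.memory.items())

def call_llm(prompt: str) -> str:
    # Placeholder for an LLM call; here it simply terminates the loop.
    return "TOOL: FINISH | ARGS: (stub answer)"

def parse_plan(plan_text: str):
    # Placeholder parser expecting "TOOL: name | ARGS: ...".
    tool, _, args = plan_text.partition("|")
    return tool.replace("TOOL:", "").strip(), args.replace("ARGS:", "").strip()

def peil_loop(query, tools, inspector, max_steps=8):
    trace = []
    for step in range(max_steps):
        # Plan: choose the next tool in natural language, conditioned on the query,
        # the memory summaries, and the reasoning progress so far.
        plan = call_llm(f"Query: {query}\nMemory:\n{inspector.describe()}\n"
                        f"Tools: {list(tools)}\nHistory: {trace}\nNext step?")
        tool_name, args = parse_plan(plan)
        if tool_name == "FINISH":
            return args                                   # final answer
        # Execute: run the chosen tool on artifacts fetched from the Inspector.
        result, summary = tools[tool_name](args, inspector)
        # Inspect: store the intermediate result so later steps can reuse it.
        inspector.add(f"step_{step}", result, summary)
        trace.append((tool_name, args, summary))
        # Learn: on failure, ask for a revised plan instead of blindly continuing.
        if summary.startswith("ERROR"):
            plan = call_llm(f"Step failed: {summary}. Revise the plan.")
    return None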
Evaluating User Experience in Multisensory Meditative Virtual Reality: A Pilot Study
Virtual Reality (VR) is known for its ability to immerse users in a parallel universe. Accordingly, VR offers great potential for mindfulness therapy, especially in a post-pandemic world. However, the extent to which our senses should be recruited to yield an optimal feeling of presence in the Virtual Environment (VE) remains unclear. This study investigates lived and perceived effects of adding auditory and motor components to VR experiences, through narration and head movements respectively. Twelve participants experienced four nature-based VR videos in a within-subjects research design. The study employed a mixed-method approach of psychometric and neurophysiological measures. Results support a significant relationship between positive affect and presence. While statistical support was not obtained for the remaining relationships, this study provides a feasibility assessment of utilizing NeuroIS methods in evaluating immersive user experiences, along with qualitative insights that extend our understanding towards optimized VE designs.
Video Question Answering: Datasets, Algorithms and Challenges
Video Question Answering (VideoQA) aims to answer natural language questions
according to the given videos. It has earned increasing attention with recent
research trends in joint vision and language understanding. Yet, compared with
ImageQA, VideoQA is largely underexplored and progresses slowly. Although
different algorithms have continually been proposed and shown success on
different VideoQA datasets, we find that the field lacks a meaningful survey to
categorize them, which seriously impedes its advancement. This paper thus
provides a clear taxonomy and comprehensive analyses of VideoQA, focusing on
the datasets, algorithms, and unique challenges. We then point out the research
trend of studying beyond factoid QA to inference QA towards the cognition of
video content. Finally, we conclude with some promising directions for future
exploration.
Comment: Accepted by EMNLP 2022.
WinDB: HMD-free and Distortion-free Panoptic Video Fixation Learning
To date, the widely-adopted way to perform fixation collection in panoptic
video is based on a head-mounted display (HMD), where participants' fixations
are collected while wearing an HMD to explore the given panoptic scene freely.
However, this widely-used data collection method is insufficient for training
deep models to accurately predict which regions in a given panoptic video are most
important when it contains intermittent salient events. The main reason is that
there always exist "blind zooms" when using HMD to collect fixations since the
participants cannot keep spinning their heads to explore the entire panoptic
scene all the time. Consequently, the collected fixations tend to be trapped in
some local views, leaving the remaining areas to be the "blind zooms".
Therefore, fixation data collected using HMD-based methods that accumulate
local views cannot accurately represent the overall global importance of
complex panoramic scenes. This paper introduces the auxiliary Window with a
Dynamic Blurring (WinDB) fixation collection approach for panoptic video, which
does not require an HMD and is blind-zoom-free. Thus, the collected fixations can
faithfully reflect the region-wise degree of importance. Using our WinDB approach, we have
released a new PanopticVideo-300 dataset, containing 300 panoptic clips
covering over 225 categories. Besides, we present a simple baseline
design that takes full advantage of PanopticVideo-300 to handle the fixation
shifting problem induced by the blind-zoom-free attribute.
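As a rough illustration of the presentation idea implied by the name (a clear window with dynamic blurring shown on an ordinary monitor instead of an HMD), the sketch below blurs an equirectangular frame everywhere except inside a window around the current region of interest. Window size, blur strength, and how the window moves are my own assumptions, not the released WinDB protocol.

import cv2
import numpy as np

def windb_frame(equirect_bgr, center_xy, win_size=(400, 300), blur_ksize=51):
    # Blur the whole frame, then paste back the sharp content inside a window
    # centred on `center_xy` (pixel coordinates in the equirectangular image).
    h, w = equirect_bgr.shape[:2]
    blurred = cv2.GaussianBlur(equirect_bgr, (blur_ksize, blur_ksize), 0)
    cx, cy = center_xy
    x0 = int(np.clip(cx - win_size[0] // 2, 0, w - win_size[0]))
    y0 = int(np.clip(cy - win_size[1] // 2, 0, h - win_size[1]))
    y1, x1 = y0 + win_size[1], x0 + win_size[0]
    out = blurred.copy()
    out[y0:y1, x0:x1] = equirect_bgr[y0:y1, x0:x1]
    return out

# Example usage: sweep the clear window across the frame over time so that every
# part of the panorama is eventually presented sharply, avoiding "blind zooms".
frame = np.zeros((960, 1920, 3), dtype=np.uint8)   # stand-in equirectangular frame
for t in range(0, 1920, 160):
    stimulus = windb_frame(frame, center_xy=(t, 480))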
UniVTG: Towards Unified Video-Language Temporal Grounding
Video Temporal Grounding (VTG), which aims to ground target clips from videos
(such as consecutive intervals or disjoint shots) according to custom language
queries (e.g., sentences or words), is key for video browsing on social media.
Most methods in this direction develop task-specific models that are trained
with type-specific labels, such as moment retrieval (time interval) and
highlight detection (worthiness curve), which limits their abilities to
generalize to various VTG tasks and labels. In this paper, we propose to Unify
the diverse VTG labels and tasks, dubbed UniVTG, along three directions:
Firstly, we revisit a wide range of VTG labels and tasks and define a unified
formulation. Based on this, we develop data annotation schemes to create
scalable pseudo supervision. Secondly, we develop an effective and flexible
grounding model capable of addressing each task and making full use of each
label. Lastly, thanks to the unified framework, we are able to unlock temporal
grounding pretraining from large-scale diverse labels and develop stronger
grounding abilities, e.g., zero-shot grounding. Extensive experiments on three
tasks (moment retrieval, highlight detection and video summarization) across
seven datasets (QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights,
TVSum, and QFVS) demonstrate the effectiveness and flexibility of our proposed
framework. The code is available at https://github.com/showlab/UniVTG.
Comment: Accepted by ICCV 2023. 16 pages, 10 figures, 13 tables. Code: https://github.com/showlab/UniVTG
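As a hedged illustration of what a unified per-clip label could look like for the three tasks listed above, the sketch below attaches a foreground flag, boundary offsets, and a saliency score to every clip, and shows how interval and curve predictions could be decoded from it. The field names and decoding rules are my own guesses, not necessarily the exact UniVTG formulation.

from dataclasses import dataclass
from typing import List

@dataclass
class ClipLabel:
    is_foreground: bool   # does this clip belong to the queried target span?
    offset_start: float   # seconds from this clip back to the span's start (moment retrieval)
    offset_end: float     # seconds from this clip forward to the span's end
    saliency: float       # query relevance in [0, 1] (highlight detection / summarization)

def to_moments(labels: List[ClipLabel], clip_len: float):
    # Decode interval predictions: every foreground clip proposes one candidate span
    # from its boundary offsets; in practice overlapping spans would be merged by NMS.
    spans = []
    for i, lab in enumerate(labels):
        if lab.is_foreground:
            center = (i + 0.5) * clip_len
            spans.append((center - lab.offset_start, center + lab.offset_end, lab.saliency))
    return spans

def to_highlight_curve(labels: List[ClipLabel]):
    # Decode a worthiness curve for highlight detection: simply the saliency sequence.
    return [lab.saliency for lab in labels]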