716 research outputs found
Learning Temporal Sentence Grounding From Narrated EgoVideos
The onset of long-form egocentric datasets such as Ego4D and EPIC-Kitchens
presents a new challenge for the task of Temporal Sentence Grounding (TSG).
Compared to traditional benchmarks on which this task is evaluated, these
datasets offer finer-grained sentences to ground in notably longer videos. In
this paper, we develop an approach for learning to ground sentences in these
datasets using only narrations and their corresponding rough narration
timestamps. We propose to artificially merge clips to train for temporal
grounding in a contrastive manner using text-conditioning attention. This Clip
Merging (CliMer) approach is shown to be effective when compared with a high
performing TSG method -- e.g. mean R@1 improves from 3.9 to 5.7 on Ego4D and
from 10.7 to 13.0 on EPIC-Kitchens. Code and data splits available from:
https://github.com/keflanagan/CliMerComment: Accepted in BMVC 202
Video Referring Expression Comprehension via Transformer with Content-conditioned Query
Video Referring Expression Comprehension (REC) aims to localize a target
object in videos based on the queried natural language. Recent improvements in
video REC have been made using Transformer-based methods with learnable
queries. However, we contend that this naive query design is not ideal given
the open-world nature of video REC brought by text supervision. With numerous
potential semantic categories, relying on only a few slow-updated queries is
insufficient to characterize them. Our solution to this problem is to create
dynamic queries that are conditioned on both the input video and language to
model the diverse objects referred to. Specifically, we place a fixed number of
learnable bounding boxes throughout the frame and use corresponding region
features to provide prior information. Also, we noticed that current query
features overlook the importance of cross-modal alignment. To address this, we
align specific phrases in the sentence with semantically relevant visual areas,
annotating them in existing video datasets (VID-Sentence and VidSTG). By
incorporating these two designs, our proposed model (called ConFormer)
outperforms other models on widely benchmarked datasets. For example, in the
testing split of VID-Sentence dataset, ConFormer achieves 8.75% absolute
improvement on [email protected] compared to the previous state-of-the-art model.Comment: Accepted to ACM International Conference on Multimedia Workshop (ACM
MM), 2023. arXiv admin note: substantial text overlap with arXiv:2210.0295
Shatter and Gather: Learning Referring Image Segmentation with Text Supervision
Referring image segmentation, the task of segmenting any arbitrary entities
described in free-form texts, opens up a variety of vision applications.
However, manual labeling of training data for this task is prohibitively
costly, leading to lack of labeled data for training. We address this issue by
a weakly supervised learning approach using text descriptions of training
images as the only source of supervision. To this end, we first present a new
model that discovers semantic entities in input image and then combines such
entities relevant to text query to predict the mask of the referent. We also
present a new loss function that allows the model to be trained without any
further supervision. Our method was evaluated on four public benchmarks for
referring image segmentation, where it clearly outperformed the existing method
for the same task and recent open-vocabulary segmentation models on all the
benchmarks.Comment: Accepted to ICCV 2023, Project page:
https://southflame.github.io/sag
Language-Driven Video Understanding
Video understanding has advanced quite a long way in the past decade, accomplishing tasks including low-level segmentation and tracking that study objects as pixel-level segments or bounding boxes to more high-level activity recognition or classification tasks that classify a video scene to a categorical action label. Despite the progress that has been made, much of this work remains a proxy for an eventual task or application that requires a holistic view of the video, such as objects, actions, attributes, and other semantic components.
In this dissertation, we argue that language could deliver the required holistic representation. It plays a significant role in video understanding by allowing machines to communicate with humans and to understand our requests, as shown in tasks such as text-to-video search engine, voice-guided robot manipulation, to name a few. Our language-driven video understanding focuses on two specific problems: video description and visual grounding. What marks our viewpoint different from prior literature is twofold. First, we propose a bottom-up structured learning scheme by decomposing a long video into individual procedure steps and representing each step with a description. Second, we propose to have both explicit (i.e., supervised) and implicit (i.e., weakly-supervised and self-supervised) grounding between words and visual concepts which enables interpretable modeling of the two spaces.
We start by drawing attention to the shortage of large benchmarks on long video-language and propose the largest-of-its-kind YouCook2 dataset and ActivityNet-Entities dataset in Chap. II and III. The rest of the chapters circle around two main problems: video description and visual grounding. For video description, we first address the problem of decomposing a long video into compact and self-contained event segments in Chap. IV. Given an event segment or short video clip in general, we propose a non-recurrent approach (i.e., Transformer) for video description generation in Chap. V as opposed to prior RNN-based methods and demonstrate superior performance. Moving forward, we notice one potential issue in end-to-end video description generation, i.e., lack of visual grounding ability and model interpretability that would allow humans to directly interact with machine vision models. To address this issue, we transition our focus from end-to-end, video-to-text systems to systems that could explicitly capture the grounding between the two modalities, with a novel grounded video description framework in Chap. VI. So far, all the methods are fully-supervised, i.e., the model training signal comes directly from heavy & expensive human annotations. In the following chapter, we answer the question "Can we perform visual grounding without explicit supervision?" with a weakly-supervised framework where models learn grounding from (weak) description signal. Finally, in Chap. VIII, we conclude the technical work by exploring a self-supervised grounding approach—vision-language pre-training—that implicitly learns visual grounding from web multi-modal data. This mimics how humans obtain their commonsense from the environment through multi-modal interactions.PHDRoboticsUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/155174/1/luozhou_1.pd
- …