66 research outputs found
Temporal Sentence Grounding in Videos: A Survey and Future Directions
Temporal sentence grounding in videos (TSGV), \aka natural language video
localization (NLVL) or video moment retrieval (VMR), aims to retrieve a
temporal moment that semantically corresponds to a language query from an
untrimmed video. Connecting computer vision and natural language, TSGV has
drawn significant attention from researchers in both communities. This survey
attempts to provide a summary of fundamental concepts in TSGV and current
research status, as well as future research directions. As the background, we
present a common structure of functional components in TSGV, in a tutorial
style: from feature extraction from raw video and language query, to answer
prediction of the target moment. Then we review the techniques for multimodal
understanding and interaction, which is the key focus of TSGV for effective
alignment between the two modalities. We construct a taxonomy of TSGV
techniques and elaborate the methods in different categories with their
strengths and weaknesses. Lastly, we discuss issues with the current TSGV
research and share our insights about promising research directions.Comment: 29 pages, 32 figures, 9 table
Visual-Semantic Learning
Visual-semantic learning is an attractive and challenging research direction aiming to understand complex semantics of heterogeneous data from two domains, i.e., visual signals (i.e., images and videos) and natural language (i.e., captions and questions). It requires memorizing the rich information in a single modality and a joint comprehension of multiple modalities. Artificial intelligence (AI) systems with human-level intelligence are claimed to learn like humans, such as efficiently leveraging brain memory for better comprehension, rationally incorporating common-sense knowledge into reasoning, quickly gaining in-depth understanding given a few samples, and analyzing relationships among abundant and informative events. However, these intelligence capacities are effortless for humans but challenging for machines. To bridge the discrepancy between human-level intelligence and present-day visual-semantic learning, we start from its basic understanding ability by studying the visual question answering (e.g., Image-QA and Video-QA) tasks from the perspectives of memory augmentation and common-sense knowledge incorporation. Furthermore, we stretch it to a more challenging situation with limited and partially unlabeled training data (i.e., Few-shot Visual-Semantic Learning) to imitate the fast learning ability of humans. Finally, to further enhance visual-semantic performance in natural videos with numerous spatio-temporal dynamics, we investigate exploiting event-correlated information for a comprehensive understanding of cross-modal semantics.
To study the essential visual-semantic understanding ability of the human brain with memory, we first propose a novel Memory Augmented Deep Recurrent Neural Network (i.e., MA-DRNN) model for Video-QA, which features a new method for encoding videos and questions, and memory augmentation using the emerging Differentiable Neural Computer (i.e., DNC). Specifically, we encode semantic (i.e., questions) information before visual (i.e., videos) information, which leads to better visual-semantic representations. Moreover, we leverage Differentiable Neural Computer (with external memory) to store and retrieve valuable information in questions and videos and model the long-term visual-semantic dependency.
In addition to basic understanding, to tackle visual-semantic reasoning that requires external knowledge beyond visible contents (e.g., KB-Image-QA), we propose a novel framework that endows the model with capabilities of answering more general questions and achieves better exploitation of external knowledge through generating Multiple Clues for Reasoning with Memory Neural Networks (i.e., MCR-MemNN). Specifically, a well-defined detector is adopted to predict image-question-related relation phrases, each delivering two complementary clues to retrieve the supporting facts from an external knowledge base (i.e., KB). These facts are encoded into a continuous embedding space using a content-addressable memory. Afterward, mutual interactions between visual-semantic representation and the supporting facts stored in memory are captured to distill the most relevant information in three modalities (i.e., image, question, and KB). Finally, the optimal answer is predicted by choosing the supporting fact with the highest score.
Furthermore, to enable a fast, in-depth understanding given a small number of samples, especially with heterogeneity in the multi-modal scenarios such as image question answering (i.e., Image-QA) and image captioning (i.e., IC), we study the few-shot visual-semantic learning and present the Hierarchical Graph ATtention Network (i.e., HGAT). This two-stage network models the intra- and inter-modal relationships with limited image-text samples. The main contributions of HGAT can be summarized as follows: 1) it sheds light on tackling few-shot multi-modal learning problems, which focuses primarily, but not exclusively, on visual and semantic modalities, through better exploitation of the intra-relationship of each modality and an attention-based co-learning framework between modalities using a hierarchical graph-based architecture; 2) it achieves superior performance on both visual question answering and image captioning in the few-shot setting; 3) it can be easily extended to the semi-supervised setting where image-text samples are partially unlabeled.
Although various attention mechanisms have been utilized to manage contextualized representations by modeling intra- and inter-modal relationships of the two modalities, one limitation of the predominant visual-semantic methods is the lack of reasoning with event correlation, sensing, and analyzing relationships among abundant and informative events contained in the video. To this end, we introduce the dense caption modality as a new auxiliary and distill event-correlated information to infer the correct answer. We propose a novel end-to-end trainable model, Event-Correlated Graph Neural Networks (EC-GNNs), to perform cross-modal reasoning over information from the three modalities (i.e., caption, video, and question). Besides exploiting a new modality, we employ cross-modal reasoning modules to explicitly model inter-modal relationships and aggregate relevant information across different modalities. We propose a question-guided self-adaptive multi-modal fusion module to collect the question-oriented and event-correlated evidence through multi-step reasoning.
To evaluate our proposed models, we conduct extensive experiments on VTW, MSVD-QA, and TGIF-QA datasets for Video-QA task, Toronto COCO-QA, Visual Genome-QA datasets for few-shot Image-QA task, COCO-FITB dataset for few-shot IC task, and FVQA, Visual7W + ConceptNet datasets for KB-Image-QA task. The experimental results justify these models’ effectiveness and superiority over baseline methods
2D Object Detection with Transformers: A Review
Astounding performance of Transformers in natural language processing (NLP)
has delighted researchers to explore their utilization in computer vision
tasks. Like other computer vision tasks, DEtection TRansformer (DETR)
introduces transformers for object detection tasks by considering the detection
as a set prediction problem without needing proposal generation and
post-processing steps. It is a state-of-the-art (SOTA) method for object
detection, particularly in scenarios where the number of objects in an image is
relatively small. Despite the success of DETR, it suffers from slow training
convergence and performance drops for small objects. Therefore, many
improvements are proposed to address these issues, leading to immense
refinement in DETR. Since 2020, transformer-based object detection has
attracted increasing interest and demonstrated impressive performance. Although
numerous surveys have been conducted on transformers in vision in general, a
review regarding advancements made in 2D object detection using transformers is
still missing. This paper gives a detailed review of twenty-one papers about
recent developments in DETR. We begin with the basic modules of Transformers,
such as self-attention, object queries and input features encoding. Then, we
cover the latest advancements in DETR, including backbone modification, query
design and attention refinement. We also compare all detection transformers in
terms of performance and network design. We hope this study will increase the
researcher's interest in solving existing challenges towards applying
transformers in the object detection domain. Researchers can follow newer
improvements in detection transformers on this webpage available at:
https://github.com/mindgarage-shan/trans_object_detection_surve
Multimodal Character Representation for Visual Story Understanding
Stories are one of the main tools that humans use to make sense of the world around them. This ability is conjectured to be uniquely human, and concepts of agency and interaction have been found to develop during childhood. However, state-of-the-art artificial intelligence models still find it very challenging to represent or understand such information about the world. Over the past few years, there has been a lot of research into building systems that can understand the contents of images, videos, and text. Despite several advances made, computers still struggle to understand high-level discourse structures or how visuals and language are organized to tell a coherent story.
Recently, several efforts have been made towards building story understanding benchmarks. As characters are the key component around which the story events unfold, character representations are crucial for deep story understanding such as their names, appearances, and relations to other characters. As a step towards endowing systems with a richer understanding of characters in a given narrative, this thesis develops new techniques that rely on the vision, audio and language channels to address three important challenges: i) speaker recognition and identification, ii) character representation and embedding, and iii) temporal modeling of character relations. We propose a multi-modal unsupervised model for speaker naming in movies, a novel way to represent movie character names in dialogues, and a multi-modal supervised character relation classification model. We also show that our approach improves systems ability to understand narratives, which is measured using several tasks such as their ability to answer questions about stories on several benchmarks.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/153444/1/mazab_1.pd
3DSAM-adapter: Holistic Adaptation of SAM from 2D to 3D for Promptable Medical Image Segmentation
Despite that the segment anything model (SAM) achieved impressive results on
general-purpose semantic segmentation with strong generalization ability on
daily images, its demonstrated performance on medical image segmentation is
less precise and not stable, especially when dealing with tumor segmentation
tasks that involve objects of small sizes, irregular shapes, and low contrast.
Notably, the original SAM architecture is designed for 2D natural images,
therefore would not be able to extract the 3D spatial information from
volumetric medical data effectively. In this paper, we propose a novel
adaptation method for transferring SAM from 2D to 3D for promptable medical
image segmentation. Through a holistically designed scheme for architecture
modification, we transfer the SAM to support volumetric inputs while retaining
the majority of its pre-trained parameters for reuse. The fine-tuning process
is conducted in a parameter-efficient manner, wherein most of the pre-trained
parameters remain frozen, and only a few lightweight spatial adapters are
introduced and tuned. Regardless of the domain gap between natural and medical
data and the disparity in the spatial arrangement between 2D and 3D, the
transformer trained on natural images can effectively capture the spatial
patterns present in volumetric medical images with only lightweight
adaptations. We conduct experiments on four open-source tumor segmentation
datasets, and with a single click prompt, our model can outperform domain
state-of-the-art medical image segmentation models on 3 out of 4 tasks,
specifically by 8.25%, 29.87%, and 10.11% for kidney tumor, pancreas tumor,
colon cancer segmentation, and achieve similar performance for liver tumor
segmentation. We also compare our adaptation method with existing popular
adapters, and observed significant performance improvement on most datasets.Comment: 14 pages, 6 figures, 5 table
Describing UI Screenshots in Natural Language
Funding Information: We acknowledge the computational resources provided by the Aalto Science-IT project. We thank Homayun Afrabandpey, Daniel Buschek, Jussi Jokinen, and Jörg Tiedemann for reviewing an earlier draft of this article. This work has been supported by the Horizon 2020 FET program of the European Union through the ERA-NET Cofund funding (grant CHIST-ERA-20-BCI-001), the European Innovation Council Pathfinder program (SYMBIOTIK project), and the Academy of Finland (grants 291556, 318559, 310947). Publisher Copyright: © 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM.Being able to describe any user interface (UI) screenshot in natural language can promote understanding of the main purpose of the UI, yet currently it cannot be accomplished with state-of-the-art captioning systems. We introduce XUI, a novel method inspired by the global precedence effect to create informative descriptions of UIs, starting with an overview and then providing fine-grained descriptions about the most salient elements. XUI builds upon computational models for topic classification, visual saliency prediction, and natural language generation (NLG). XUI provides descriptions with up to three different granularity levels that, together, describe what is in the interface and what the user can do with it. We found that XUI descriptions are highly readable, are perceived to accurately describe the UI, and score similarly to human-generated UI descriptions. XUI is available as open-source software.Peer reviewe
From Vision-Language Multimodal Learning Towards Embodied Agents
To build machine agents with intelligent capabilities mimicking human perception and cognition, vision and language stand out as two essential modalities and foster computer vision and natural language processing. Advances in such realms stimulate research in vision-language multimodal learning that allows optical and linguistic inputs and outputs. Due to the innate difference between the two modalities and the lack of large-scale fine-grained annotations, multimodal agents tend to inherit unimodal shortcuts. In this thesis, we develop various solutions to intervene unimodal shortcuts for multimodal generation and reasoning. For visual shortcuts, we introduce a linguistic prior and devise a syntax-aware action targeting module for dynamic description to rectify the correlation between subject and object in a sentence. We apply concept hierarchy and propose a visual superordinate abstraction framework for unbiased concept learning to reduce the correlation among different attributes of an object. For linguistic shortcuts, we disentangle the topic and syntax to reduce the repetition in generated paragraph descriptions for a given image. With the ubiquity of large-scale pre-trained models, we leverage self-supervised learning in finetuning process to increase the robustness of multimodal reasoning.
The rapid development in multimodal learning promises embodied agents capable of interacting with physical environments. This thesis studies the typical embodied task vision-and-language navigation in discrete scenarios and proposes an episodic scene memory (ESceme) mechanism to balance generalization and efficiency. We figure out one desirable instantiation of the mechanism, namely candidate enhancing, and validate its superiority in various settings. Without extra time and computational cost before inference, ESceme improves performance in unseen environments by a large margin. We hope our findings can inspire more practical explorations on episodic memory in embodied AI
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
As the most fundamental tasks of computer vision, object detection and
segmentation have made tremendous progress in the deep learning era. Due to the
expensive manual labeling, the annotated categories in existing datasets are
often small-scale and pre-defined, i.e., state-of-the-art detectors and
segmentors fail to generalize beyond the closed-vocabulary. To resolve this
limitation, the last few years have witnessed increasing attention toward
Open-Vocabulary Detection (OVD) and Segmentation (OVS). In this survey, we
provide a comprehensive review on the past and recent development of OVD and
OVS. To this end, we develop a taxonomy according to the type of task and
methodology. We find that the permission and usage of weak supervision signals
can well discriminate different methodologies, including: visual-semantic space
mapping, novel visual feature synthesis, region-aware training,
pseudo-labeling, knowledge distillation-based, and transfer learning-based. The
proposed taxonomy is universal across different tasks, covering object
detection, semantic/instance/panoptic segmentation, 3D scene and video
understanding. In each category, its main principles, key challenges,
development routes, strengths, and weaknesses are thoroughly discussed. In
addition, we benchmark each task along with the vital components of each
method. Finally, several promising directions are provided to stimulate future
research
- …