Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Video captioning aims to convey dynamic scenes from videos using natural
language, facilitating the understanding of spatiotemporal information within
our environment. Although there have been recent advances, generating detailed
and enriched video descriptions continues to be a substantial challenge. In
this work, we introduce Video ChatCaptioner, an innovative approach for
creating more comprehensive spatiotemporal video descriptions. Our method
employs a ChatGPT model as a controller, specifically designed to select frames
for posing video content-driven questions. Subsequently, a robust algorithm is
utilized to answer these visual queries. This question-answer framework
effectively uncovers intricate video details and shows promise as a method for
enhancing video content. Following multiple conversational rounds, ChatGPT can
summarize enriched video content based on previous conversations. We
qualitatively demonstrate that our Video ChatCaptioner can generate captions
containing more visual details about the videos. The code is publicly available
at https://github.com/Vision-CAIR/ChatCaptioner.
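The abstract describes a controller-answerer dialogue loop; the sketch below is a minimal, hypothetical illustration of that loop, with placeholder functions standing in for the ChatGPT controller, the visual question-answering model, and the summarizer (none of these stubs reflect the paper's actual implementation).

```python
# Minimal sketch of the question-answer captioning loop described above.
# `chat_controller`, `vqa_model`, and `summarize` are hypothetical stand-ins.

def chat_controller(history, num_frames):
    """Hypothetical controller: picks a frame index and a question from the dialogue so far."""
    frame_id = len(history) % num_frames          # toy frame-selection policy
    question = f"What is happening in frame {frame_id}?"
    return frame_id, question

def vqa_model(frame, question):
    """Hypothetical visual question answerer for a single frame."""
    return f"(answer about {frame} for: {question})"

def summarize(history):
    """Hypothetical summarizer; in the paper this role is played by the controller LLM."""
    return " ".join(answer for _, _, answer in history)

def caption_video(frames, rounds=5):
    history = []
    for _ in range(rounds):                       # multiple conversational rounds
        frame_id, question = chat_controller(history, len(frames))
        answer = vqa_model(frames[frame_id], question)
        history.append((frame_id, question, answer))
    return summarize(history)                     # enriched caption from the dialogue

print(caption_video(["frame0.jpg", "frame1.jpg", "frame2.jpg"]))
```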
JourneyDB: A Benchmark for Generative Image Understanding
While recent advancements in vision-language models have had a transformative
impact on multi-modal comprehension, the extent to which these models possess
the ability to comprehend generated images remains uncertain. Synthetic images,
in comparison to real data, encompass a higher level of diversity in terms of
both content and style, thereby presenting significant challenges for the
models to fully grasp. In light of this challenge, we introduce a comprehensive
dataset, referred to as JourneyDB, that caters to the domain of generative
images within the context of multi-modal visual understanding. Our meticulously
curated dataset comprises 4 million distinct and high-quality generated images,
each paired with the corresponding text prompts that were employed in their
creation. Furthermore, we introduce an external subset with results from
another 22 text-to-image generative models, which makes JourneyDB a
comprehensive benchmark for evaluating the comprehension of generated images.
On our dataset, we have devised four benchmarks to assess the performance of
generated image comprehension in relation to both content and style
interpretation. These benchmarks encompass prompt inversion, style retrieval,
image captioning, and visual question answering. Lastly, we evaluate the
performance of state-of-the-art multi-modal models when applied to the
JourneyDB dataset, providing a comprehensive analysis of their strengths and
limitations in comprehending generated content. We anticipate that the proposed
dataset and benchmarks will facilitate further research in the field of
generative content understanding. The dataset is publicly available at
https://journeydb.github.io.
Comment: Accepted to the Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023).
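As a rough illustration of how image-prompt pairs might be consumed for the prompt-inversion benchmark, the sketch below assumes hypothetical record fields ("image", "prompt") and a simple token-overlap F1; JourneyDB's real schema and official metrics may differ.

```python
# Illustrative only: score a prompt-inversion model on hypothetical
# (image, prompt) records with a simple token-overlap F1 metric.

def invert_prompt(image_path):
    """Hypothetical prompt-inversion model: predicts the generation prompt from an image."""
    return "a watercolor fox in a forest"

def token_f1(pred, ref):
    p, r = set(pred.lower().split()), set(ref.lower().split())
    overlap = len(p & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

records = [{"image": "img_000001.png", "prompt": "a watercolor painting of a fox in a forest"}]
scores = [token_f1(invert_prompt(rec["image"]), rec["prompt"]) for rec in records]
print(sum(scores) / len(scores))
```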
Simple Baselines for Interactive Video Retrieval with Questions and Answers
To date, the majority of video retrieval systems have been optimized for a
"single-shot" scenario in which the user submits a query in isolation, ignoring
previous interactions with the system. Recently, there has been renewed
interest in interactive systems to enhance retrieval, but existing approaches
are complex and deliver limited gains in performance. In this work, we revisit
this topic and propose several simple yet effective baselines for interactive
video retrieval via question-answering. We employ a VideoQA model to simulate
user interactions and show that this enables the productive study of the
interactive retrieval task without access to ground truth dialogue data.
Experiments on MSR-VTT, MSVD, and AVSD show that our framework using
question-based interaction significantly improves the performance of text-based
video retrieval systems.
Comment: ICCV 2023, project page: https://github.com/kevinliang888/IVR-QA-baseline
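A toy sketch of the question-based interaction idea follows, assuming a hypothetical retrieval scorer and a VideoQA model that simulates the user's answers about the target video; the names and signatures are illustrative, not the paper's code.

```python
# Sketch: re-rank candidate videos by appending simulated question-answer rounds
# to the text query. All components below are hypothetical placeholders.

def retrieve(query, videos, top_k=3):
    """Hypothetical retriever: ranks videos by naive word overlap with the query."""
    scored = sorted(videos, key=lambda v: -len(set(query.split()) & set(v["caption"].split())))
    return scored[:top_k]

def ask_question(query, round_idx):
    """Hypothetical question generator conditioned on the query and round number."""
    return f"Round {round_idx}: can you describe another detail related to '{query}'?"

def videoqa_answer(target_video, question):
    """Hypothetical VideoQA model standing in for the user, answering about the target video."""
    return target_video["caption"]

def interactive_retrieval(query, videos, target_video, rounds=2):
    for i in range(rounds):
        question = ask_question(query, i)
        answer = videoqa_answer(target_video, question)
        query = f"{query} {answer}"              # enrich the query with the simulated dialogue
    return retrieve(query, videos)

videos = [{"caption": "a dog catches a frisbee in a park"},
          {"caption": "a chef chops onions in a kitchen"}]
print(interactive_retrieval("dog playing outside", videos, videos[0]))
```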
TaleCrafter: Interactive Story Visualization with Multiple Characters
Accurate story visualization requires several necessary elements, such as
identity consistency across frames, the alignment between plain text and visual
content, and a reasonable layout of objects in images. Most previous works
endeavor to meet these requirements by fitting a text-to-image (T2I) model on a
set of videos in the same style and with the same characters, e.g., the
FlintstonesSV dataset. However, the learned T2I models typically struggle to
adapt to new characters, scenes, and styles, and often lack the flexibility to
revise the layout of the synthesized images. This paper proposes a system for
generic interactive story visualization, capable of handling multiple novel
characters and supporting the editing of layout and local structure. It is
developed by leveraging the prior knowledge of large language and T2I models,
trained on massive corpora. The system comprises four interconnected
components: story-to-prompt generation (S2P), text-to-layout generation (T2L),
controllable text-to-image generation (C-T2I), and image-to-video animation
(I2V). First, the S2P module converts concise story information into detailed
prompts required for subsequent stages. Next, T2L generates diverse and
reasonable layouts based on the prompts, offering users the ability to adjust
and refine the layout to their preference. The core component, C-T2I, enables
the creation of images guided by layouts, sketches, and actor-specific
identifiers to maintain consistency and detail across visualizations. Finally,
I2V enriches the visualization process by animating the generated images.
Extensive experiments and a user study are conducted to validate the
effectiveness and flexibility of interactive editing of the proposed system.
Comment: GitHub repository: https://github.com/VideoCrafter/TaleCrafter
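The four-stage pipeline reads naturally as a chain of components; below is a minimal structural sketch with hypothetical stand-ins for S2P, T2L, C-T2I, and I2V. Only the data flow mirrors the abstract; the real modules are large generative models not shown here.

```python
# Structural sketch of the S2P -> T2L -> C-T2I -> I2V chain described above.
# Each stage is a hypothetical placeholder.

def story_to_prompts(story):           # S2P: expand a concise story into per-scene prompts
    return [f"Scene {i + 1}: {s.strip()}" for i, s in enumerate(story.split(".")) if s.strip()]

def text_to_layout(prompt):            # T2L: propose object boxes the user may later adjust
    return [{"object": "character", "box": (0.1, 0.1, 0.5, 0.9)}]

def controllable_t2i(prompt, layout, identity_tokens):   # C-T2I: layout/sketch/identity-guided image
    return {"prompt": prompt, "layout": layout, "ids": identity_tokens, "image": "<generated image>"}

def image_to_video(image):             # I2V: animate the generated still image
    return [image, image]              # placeholder "clip" of repeated frames

def visualize_story(story, identity_tokens):
    pages = []
    for prompt in story_to_prompts(story):
        layout = text_to_layout(prompt)             # the user could edit this layout here
        image = controllable_t2i(prompt, layout, identity_tokens)
        pages.append({"image": image, "clip": image_to_video(image)})
    return pages

pages = visualize_story("A fox finds a lantern. The fox lights the forest.", ["<fox>"])
print(len(pages), "scenes generated")
```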
Advancing Medical Imaging with Language Models: A Journey from N-grams to ChatGPT
In this paper, we aimed to provide a review and tutorial for researchers in
the field of medical imaging using language models to improve their tasks at
hand. We began by providing an overview of the history and concepts of language
models, with a special focus on large language models. We then reviewed the
current literature on how language models are being used to improve medical
imaging, emphasizing different applications such as image captioning, report
generation, report classification, finding extraction, visual question
answering, interpretable diagnosis, and more for various modalities and organs.
ChatGPT was specifically highlighted for researchers to explore further potential
applications. We covered the potential benefits of accurate and efficient
language models for medical imaging analysis, including improving clinical
workflow efficiency, reducing diagnostic errors, and assisting healthcare
professionals in providing timely and accurate diagnoses. Overall, our goal was
to bridge the gap between language models and medical imaging and inspire new
ideas and innovations in this exciting area of research. We hope that this
review paper will serve as a useful resource for researchers in this field and
encourage further exploration of the possibilities of language models in
medical imaging.
AutoAD: Movie Description in Context
The objective of this paper is an automatic Audio Description (AD) model that
ingests movies and outputs AD in text form. Generating high-quality movie AD is
challenging due to the dependency of the descriptions on context, and the
limited amount of training data available. In this work, we leverage the power
of pretrained foundation models, such as GPT and CLIP, and only train a mapping
network that bridges the two models for visually-conditioned text generation.
In order to obtain high-quality AD, we make the following four contributions:
(i) we incorporate context from the movie clip, AD from previous clips, as well
as the subtitles; (ii) we address the lack of training data by pretraining on
large-scale datasets, where visual or contextual information is unavailable,
e.g. text-only AD without movies or visual captioning datasets without context;
(iii) we improve on the currently available AD datasets, by removing label
noise in the MAD dataset, and adding character naming information; and (iv) we
obtain strong results on the movie AD task compared with previous methods.
Comment: CVPR 2023 Highlight. Project page:
https://www.robots.ox.ac.uk/~vgg/research/autoad
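The key architectural idea, a small trainable mapping network bridging frozen visual and language models, can be sketched as follows; the module shape, prefix length, and dimensions are assumptions for illustration and do not reproduce the paper's released code.

```python
# Sketch of a visually-conditioned prefix-mapping network in the spirit of the
# abstract: frozen visual features are projected into a sequence of "prefix"
# embeddings for a frozen language model. Dimensions are illustrative guesses.
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    def __init__(self, visual_dim=512, lm_dim=768, prefix_len=8):
        super().__init__()
        self.prefix_len = prefix_len
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, lm_dim * prefix_len),
            nn.Tanh(),
        )

    def forward(self, visual_features):
        # visual_features: (batch, visual_dim) pooled CLIP-style features (assumed)
        batch = visual_features.shape[0]
        prefix = self.proj(visual_features)
        return prefix.view(batch, self.prefix_len, -1)   # (batch, prefix_len, lm_dim)

# Toy usage: the prefix would be concatenated with context/subtitle token
# embeddings before being fed to the frozen language model (not shown here).
mapper = MappingNetwork()
dummy_clip_features = torch.randn(2, 512)
print(mapper(dummy_clip_features).shape)   # torch.Size([2, 8, 768])
```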
HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving
Autonomous driving systems generally employ separate models for different
tasks, resulting in intricate designs. For the first time, we leverage a single
multimodal large language model (MLLM) to consolidate multiple autonomous
driving tasks from videos, i.e., the Risk Object Localization and Intention and
Suggestion Prediction (ROLISP) task. ROLISP uses natural language to
simultaneously identify and interpret risk objects, understand ego-vehicle
intentions, and provide motion suggestions, eliminating the necessity for
task-specific architectures. However, lacking high-resolution (HR) information,
existing MLLMs often miss small objects (e.g., traffic cones) and overly focus
on salient ones (e.g., large trucks) when applied to ROLISP. We propose HiLM-D
(Towards High-Resolution Understanding in MLLMs for Autonomous Driving), an
efficient method to incorporate HR information into MLLMs for the ROLISP task.
Specifically, HiLM-D integrates two branches: (i) a low-resolution reasoning
branch, which can be any MLLM, processes low-resolution videos to caption risk
objects and discern ego-vehicle intentions/suggestions; (ii) a
high-resolution perception branch (HR-PB), unique to HiLM-D, ingests HR
images to enhance detection by capturing vision-specific HR feature maps and
prioritizing all potential risks over merely salient objects. Our HR-PB serves
as a plug-and-play module, seamlessly fitting into current MLLMs. Experiments
on the ROLISP benchmark reveal HiLM-D's notable advantage over leading MLLMs,
with improvements of 4.8% in BLEU-4 for captioning and 17.2% in mIoU for
detection.
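A shape-level sketch of the two-branch idea is given below, assuming hypothetical low-resolution and high-resolution encoders whose pooled features are fused before downstream captioning and detection heads; the layer choices and dimensions are illustrative only, not the paper's architecture.

```python
# Shape-level sketch of a dual-branch design: a low-resolution reasoning stream
# plus a high-resolution perception stream whose features are fused. All modules
# and dimensions below are hypothetical placeholders.
import torch
import torch.nn as nn

class DualResolutionEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.low_res_branch = nn.Sequential(      # stand-in for an MLLM vision stream
            nn.Conv2d(3, dim, kernel_size=16, stride=16), nn.AdaptiveAvgPool2d(1))
        self.high_res_branch = nn.Sequential(     # stand-in for the HR perception branch
            nn.Conv2d(3, dim, kernel_size=32, stride=32), nn.AdaptiveAvgPool2d(1))
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, low_res_frame, high_res_image):
        low = self.low_res_branch(low_res_frame).flatten(1)
        high = self.high_res_branch(high_res_image).flatten(1)
        return self.fuse(torch.cat([low, high], dim=1))  # fused feature for downstream heads

model = DualResolutionEncoder()
fused = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 896, 896))
print(fused.shape)   # torch.Size([1, 256])
```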
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Large foundation models can exhibit unique capabilities depending on the
domain of data they are trained on. While these domains are generic, they may
only barely overlap. For example, visual-language models (VLMs) are trained on
Internet-scale image captions, but large language models (LMs) are further
trained on Internet-scale text with no images (e.g. from spreadsheets, to SAT
questions). As a result, these models store different forms of commonsense
knowledge across different domains. In this work, we show that this model
diversity is symbiotic, and can be leveraged to build AI systems with
structured Socratic dialogue -- in which new multimodal tasks are formulated as
a guided language-based exchange between different pre-existing foundation
models, without additional finetuning. In the context of egocentric perception,
we present a case study of Socratic Models (SMs) that can provide meaningful
results for complex tasks such as generating free-form answers to contextual
questions about egocentric video, by formulating video Q&A as short story Q&A,
i.e. summarizing the video into a short story, then answering questions about
it. Additionally, SMs can generate captions for Internet images, and are
competitive with state-of-the-art on zero-shot video-to-text retrieval with
42.8 R@1 on MSR-VTT 1k-A. SMs demonstrate how to compose foundation models
zero-shot to capture new multimodal functionalities, without domain-specific
data collection. Prototypes are available at socraticmodels.github.io.
Comment: https://socraticmodels.github.io
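The "video Q&A as short-story Q&A" recipe composes a vision-language model and a language model purely through text; the sketch below uses hypothetical placeholder calls for both models and only mirrors the described data flow.

```python
# Sketch: caption frames with a VLM, stitch the captions into a short story with
# an LM, then answer questions over the story. The three model calls below are
# hypothetical placeholders.

def vlm_caption(frame):
    """Hypothetical vision-language model call: one caption per frame."""
    return f"caption of {frame}"

def lm_generate(prompt):
    """Hypothetical language model call."""
    return f"(LM output for: {prompt[:60]}...)"

def video_qa(frames, question):
    captions = [vlm_caption(f) for f in frames]               # per-frame visual evidence
    story = lm_generate("Summarize into a short story: " + " ".join(captions))
    return lm_generate(f"Story: {story}\nQuestion: {question}\nAnswer:")

print(video_qa(["t=0s", "t=5s", "t=10s"], "What was the person cooking?"))
```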
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
This paper introduces InternVid, a large-scale video-centric multimodal
dataset that enables learning powerful and transferable video-text
representations for multimodal understanding and generation. The InternVid
dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M
video clips accompanied by detailed descriptions totaling 4.1B words. Our core
contribution is to develop a scalable approach to autonomously build a
high-quality video-text dataset with large language models (LLM), thereby
showcasing its efficacy in learning video-language representation at scale.
Specifically, we utilize a multi-scale approach to generate video-related
descriptions. Furthermore, we introduce ViCLIP, a video-text representation
learning model based on ViT-L. Learned on InternVid via contrastive learning,
this model demonstrates leading zero-shot action recognition and competitive
video retrieval performance. Beyond basic video understanding tasks like
recognition and retrieval, our dataset and model have broad applications. They
are particularly beneficial for generating interleaved video-text data for
learning a video-centric dialogue system, advancing video-to-text and
text-to-video generation research. These proposed resources provide a tool for
researchers and practitioners interested in multimodal video understanding and
generation.
Comment: Data and code: https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid
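The video-text representation learning mentioned above follows a CLIP-style contrastive setup; the sketch below shows a generic symmetric InfoNCE objective over clip and caption embeddings. It is an assumption-level illustration of that family of losses, not ViCLIP's code, and the encoders are omitted.

```python
# Sketch of a CLIP-style video-text contrastive objective: matched clip/caption
# embeddings are pulled together with a symmetric InfoNCE loss. The embeddings
# below are random placeholders standing in for encoder outputs.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature       # (batch, batch) similarity matrix
    targets = torch.arange(video_emb.shape[0])            # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = video_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```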