64 research outputs found
Long Story Short: a Summarize-then-Search Method for Long Video Question Answering
Large language models such as GPT-3 have demonstrated an impressive
capability to adapt to new tasks without requiring task-specific training data.
This capability has been particularly effective in settings such as narrative
question answering, where the diversity of tasks is immense but available
supervision data is scarce. In this work, we investigate whether such language
models can extend their zero-shot reasoning abilities to long multimodal narratives in
multimedia content such as drama, movies, and animation, where the story plays
an essential role. We propose Long Story Short, a framework for narrative video
QA that first summarizes the narrative of the video to a short plot and then
searches parts of the video relevant to the question. We also propose to
enhance visual matching with CLIPCheck. Our model outperforms state-of-the-art
supervised models by a large margin, highlighting the potential of zero-shot QA
for long videos.
Comment: Published in BMVC 202
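As an illustration of the summarize-then-search idea, the sketch below wires together a plot-summarization step, an LLM-guided search over segments, and a CLIPCheck-style visual re-ranking of candidate answers. The `call_llm` and `clip_text_encoder` interfaces, the three-candidate sampling, and the cosine-similarity check are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of a summarize-then-search QA loop; `call_llm`,
# `clip_text_encoder`, and the re-ranking heuristic are stand-ins.
from typing import Callable
import numpy as np

def answer_question(
    clip_captions: list[str],                         # per-segment captions of the long video
    question: str,
    call_llm: Callable[[str], str],                   # assumed zero-shot LLM interface
    clip_image_feats: np.ndarray,                     # (num_segments, d) CLIP image features
    clip_text_encoder: Callable[[str], np.ndarray],   # returns a (d,) CLIP text feature
) -> str:
    # 1) Summarize: compress per-segment captions into a short plot.
    plot = call_llm("Summarize this video as a short plot:\n" + "\n".join(clip_captions))

    # 2) Search: ask the LLM which segments are relevant to the question.
    idx_str = call_llm(
        f"Plot:\n{plot}\nQuestion: {question}\n"
        "List the indices of the segments needed to answer (comma-separated):"
    )
    relevant = [int(i) for i in idx_str.split(",") if i.strip().isdigit()]

    # 3) Answer from the retrieved segments only, sampling a few candidates.
    context = "\n".join(clip_captions[i] for i in relevant if i < len(clip_captions))
    candidates = [call_llm(f"Context:\n{context}\nQuestion: {question}\nAnswer:")
                  for _ in range(3)]

    # 4) CLIPCheck-style re-ranking: prefer the answer most visually consistent
    #    with the retrieved frames (cosine similarity of CLIP features).
    def visual_score(answer: str) -> float:
        t = clip_text_encoder(answer)
        t = t / np.linalg.norm(t)
        imgs = clip_image_feats[relevant] if relevant else clip_image_feats
        imgs = imgs / np.linalg.norm(imgs, axis=1, keepdims=True)
        return float((imgs @ t).max())

    return max(candidates, key=visual_score)
```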
A Deep Ranking Model for Spatio-Temporal Highlight Detection from a 360 Video
We address the problem of highlight detection from a 360 degree video by
summarizing it both spatially and temporally. Given a long 360 degree video, we
spatially select pleasantly-looking normal field-of-view (NFOV) segments from
the unlimited fields of view (FOV) of the 360 degree video, and temporally
summarize them into a concise and informative highlight as a selected subset of
subshots. We propose a novel deep ranking model, the Composition View Score
(CVS) model, which produces a spherical composition score map per video
segment and determines which view is suitable for a highlight via a sliding
window kernel at inference. To evaluate the proposed framework, we perform
experiments on the Pano2Vid benchmark dataset and our newly collected 360
degree video highlight dataset from YouTube and Vimeo. Through evaluation using
both quantitative summarization metrics and user studies via Amazon Mechanical
Turk, we demonstrate that our approach outperforms several state-of-the-art
highlight detection methods. We also show that our model is 16 times faster at
inference than AutoCam, one of the first summarization algorithms for 360 degree videos.
Comment: In AAAI 2018, 9 page
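The sketch below illustrates, under simplifying assumptions, the kind of sliding-window inference described above: given a per-segment composition score map on an equirectangular grid, it picks the best NFOV window per segment and keeps the top-k segments as the highlight. The grid layout, window size, and top-k rule are placeholders rather than the CVS model's exact procedure.

```python
# Simplified sliding-window view selection over per-segment composition
# score maps; grid size, window, and top-k are illustrative assumptions.
import numpy as np

def select_views_and_highlight(score_maps: np.ndarray, win: int = 5, k: int = 3):
    """score_maps: (T, H, W) composition scores on an equirectangular grid."""
    T, H, W = score_maps.shape
    best_scores = np.empty(T)
    best_views = []
    for t in range(T):
        # Wrap horizontally, since longitude is periodic in a 360-degree frame.
        padded = np.concatenate([score_maps[t], score_maps[t, :, :win]], axis=1)
        # Mean score of every win x win window via an integral image.
        cs = np.pad(padded.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
        sums = cs[win:, win:] - cs[:-win, win:] - cs[win:, :-win] + cs[:-win, :-win]
        means = sums / (win * win)
        y, x = np.unravel_index(np.argmax(means), means.shape)
        best_views.append((int(y), int(x)))            # NFOV window position for segment t
        best_scores[t] = means[y, x]
    # Temporal summarization: keep the k best-composed segments as subshots.
    highlight = sorted(np.argsort(best_scores)[::-1][:k].tolist())
    return best_views, highlight

views, highlight = select_views_and_highlight(np.random.rand(10, 16, 32))
```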
CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos
Visual information is central to conversation: body gestures and physical
behaviour, for example, contribute to meaning that transcends words alone. To
date, however, most neural conversational models are limited to just text. We
introduce CHAMPAGNE, a generative model of conversations that can account for
visual contexts. To train CHAMPAGNE, we collect and release YTD-18M, a
large-scale corpus of 18M video-based dialogues. YTD-18M is constructed from
web videos: crucial to our data collection pipeline is a pretrained language
model that converts error-prone automatic transcripts to a cleaner dialogue
format while maintaining meaning. Human evaluation reveals that YTD-18M is more
sensible and specific than prior resources (MMDialog, 1M dialogues), while
maintaining visual-groundedness. Experiments demonstrate that 1) CHAMPAGNE
learns to conduct conversation from YTD-18M; and 2) when fine-tuned, it
achieves state-of-the-art results on four vision-language tasks focused on
real-world conversations. We release data, models, and code.
Comment: ICCV 2023, Project page: https://seungjuhan.me/champagn
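A hypothetical sketch of the transcript-cleanup stage mentioned above is given below: noisy ASR segments are chunked and rewritten into clean dialogue turns by a language model. The `clean_with_lm` interface, chunk size, and prompt wording are assumptions, not the released YTD-18M pipeline.

```python
# Hypothetical transcript-to-dialogue cleanup; `clean_with_lm`, chunking,
# and the prompt are assumptions, not the released pipeline.
from typing import Callable

def transcript_to_dialogues(asr_segments: list[str],
                            clean_with_lm: Callable[[str], str],
                            turns_per_chunk: int = 6) -> list[list[str]]:
    """Group noisy ASR segments into chunks and rewrite each as clean dialogue turns."""
    dialogues = []
    for i in range(0, len(asr_segments), turns_per_chunk):
        chunk = asr_segments[i:i + turns_per_chunk]
        prompt = ("Rewrite this noisy video transcript as a dialogue, one clean "
                  "utterance per line, preserving the original meaning:\n"
                  + "\n".join(chunk))
        cleaned = clean_with_lm(prompt)
        turns = [line.strip() for line in cleaned.splitlines() if line.strip()]
        if turns:
            dialogues.append(turns)
    return dialogues
```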
Scalp Diagnostic System With Label-Free Segmentation and Training-Free Image Translation
Scalp diseases and alopecia affect millions of people around the world,
underscoring the urgent need for early diagnosis and management of these conditions.
However, the development of a comprehensive AI-based diagnosis system
encompassing these conditions remains an underexplored domain due to the
challenges associated with data imbalance and the costly nature of labeling. To
address these issues, we propose ScalpVision, an AI-driven system for the
holistic diagnosis of scalp diseases and alopecia. In ScalpVision, effective
hair segmentation is achieved using pseudo image-label pairs and an innovative
prompting method in the absence of traditional hair masking labels. This
approach is crucial for extracting key features such as hair thickness and
count, which are then used to assess alopecia severity. Additionally,
ScalpVision introduces DiffuseIT-M, a generative model adept at dataset
augmentation while maintaining hair information, facilitating improved
predictions of scalp disease severity. Our experimental results affirm
ScalpVision's efficiency in diagnosing a variety of scalp conditions and
alopecia, showcasing its potential as a valuable tool in dermatological care.
Comment: IEEE Transactions on Medical Imaging (Under Review
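To make the feature-extraction step concrete, the sketch below derives a hair count and an approximate strand thickness from a binary hair mask, as might feed an alopecia-severity assessment. The connected-component counting, the 20-pixel speck threshold, and the distance-transform thickness proxy are illustrative choices, not ScalpVision's exact measurements.

```python
# Illustrative hair features from a binary hair mask; thresholds and the
# thickness proxy are assumptions, not ScalpVision's exact measurements.
import numpy as np
from scipy import ndimage

def hair_features(hair_mask: np.ndarray) -> dict:
    """hair_mask: boolean (H, W) array, True where hair was segmented."""
    # Hair count: connected components in the mask, ignoring tiny specks.
    labels, num = ndimage.label(hair_mask)
    sizes = np.asarray(ndimage.sum(hair_mask, labels, index=np.arange(1, num + 1)))
    hair_count = int((sizes >= 20).sum())              # 20-px threshold is arbitrary

    # Thickness: twice the mean distance from a hair pixel to the background,
    # a rough proxy for the average strand width in pixels.
    dist = ndimage.distance_transform_edt(hair_mask)
    thickness = float(2.0 * dist[hair_mask].mean()) if hair_mask.any() else 0.0
    return {"hair_count": hair_count, "mean_thickness_px": thickness}
```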
Learning Joint Representation of Human Motion and Language
In this work, we present MoLang (a Motion-Language connecting model) for
learning a joint representation of human motion and language, leveraging both
unpaired and paired datasets of motion and language modalities. To this end, we
propose a motion-language model with contrastive learning, empowering our model
to learn better generalizable representations of the human motion domain.
Empirical results show that our model learns strong representations of human
motion data by leveraging the language modality. Our proposed method performs
both action recognition and motion retrieval tasks with a single model,
outperforming state-of-the-art approaches on a number of action recognition
benchmarks.
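A minimal sketch of the contrastive objective implied by the abstract is shown below: paired motion and language embeddings are aligned with a symmetric InfoNCE loss, as in CLIP-style training. The encoders, batch pairing, and temperature are placeholders rather than MoLang's actual architecture.

```python
# Minimal symmetric InfoNCE loss between paired motion and text embeddings;
# encoders and temperature are placeholders, not MoLang's actual design.
import torch
import torch.nn.functional as F

def motion_language_contrastive_loss(motion_emb: torch.Tensor,
                                     text_emb: torch.Tensor,
                                     temperature: float = 0.07) -> torch.Tensor:
    """motion_emb, text_emb: (B, d) embeddings of paired motion/text samples."""
    motion_emb = F.normalize(motion_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = motion_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Pull each motion toward its paired text and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```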
Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback
Recent advancements in large language models have influenced the development
of video large multimodal models (VLMMs). Previous approaches to VLMMs have
involved Supervised Fine-Tuning (SFT) with instruction-tuned datasets,
integrating LLMs with visual encoders, and adding learnable modules. Video and
text multimodal alignment remains challenging, primarily due to the limited
volume and quality of multimodal instruction-tuning data compared to text-only
data. We present a novel alignment strategy, Reinforcement Learning from AI
Feedback (RLAIF), in which a multimodal AI system oversees itself by providing
self-preference feedback to refine its own outputs, thereby facilitating the
alignment of video and text modalities. Specifically, we propose
context-aware reward modeling by providing detailed video descriptions as
context during the generation of preference feedback in order to enrich the
understanding of video content. Demonstrating enhanced performance across
diverse video benchmarks, our multimodal RLAIF approach, VLM-RLAIF, outperforms
existing approaches, including the SFT model. We commit to open-sourcing our
code, models, and datasets to foster further research in this area.
Comment: ACL 202
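As a rough illustration of context-aware preference labeling, the sketch below places a detailed video description in the labeler's prompt so it can judge which of two candidate responses is better grounded, yielding a (chosen, rejected) pair for reward-model training. The `ai_labeler` interface and prompt template are assumptions, not the VLM-RLAIF release.

```python
# Hypothetical context-aware preference labeling; `ai_labeler` and the
# prompt template are assumptions, not the VLM-RLAIF implementation.
from typing import Callable

def label_preference(video_description: str,
                     instruction: str,
                     response_a: str,
                     response_b: str,
                     ai_labeler: Callable[[str], str]) -> dict:
    # The detailed video description is given as context so the labeler can
    # judge which candidate answer is better grounded in the video.
    prompt = (
        "Video description (context):\n" + video_description + "\n\n"
        "Instruction: " + instruction + "\n"
        "Response A: " + response_a + "\n"
        "Response B: " + response_b + "\n"
        "Which response is better grounded in the video? Answer 'A' or 'B'."
    )
    verdict = ai_labeler(prompt).strip().upper()
    chosen, rejected = ((response_a, response_b) if verdict.startswith("A")
                        else (response_b, response_a))
    # The (chosen, rejected) pair would feed reward-model training and, in
    # turn, the RL fine-tuning of the video-language model.
    return {"prompt": instruction, "chosen": chosen, "rejected": rejected}
```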
- …