1,236 research outputs found
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Vision and text have been fully explored in contemporary video-text
foundational models, while other modalities such as audio and subtitles in
videos have not received sufficient attention. In this paper, we establish
connections between the multi-modality video tracks (Vision, Audio, and
Subtitle) and Text by exploring an automatically generated large-scale
omni-modality video caption dataset called VAST-27M. Specifically, we first
collect 27 million open-domain video clips and separately train a vision
captioner and an audio captioner to generate vision and audio captions. Then,
we employ an off-the-shelf Large Language Model (LLM) to integrate the
generated captions, together with subtitles and instructional prompts, into
omni-modality captions. Based on the proposed VAST-27M dataset, we train an
omni-modality video-text foundational model named VAST, which can perceive
and process the vision, audio, and subtitle modalities of a video and better
support various tasks, including vision-text, audio-text, and multi-modal
video-text tasks (retrieval, captioning, and QA). Extensive experiments
demonstrate the effectiveness of the proposed VAST-27M corpus and VAST
foundation model. VAST achieves 22 new state-of-the-art results on various
cross-modality benchmarks. Code, model, and dataset will be released at
https://github.com/TXH-mercury/VAST.
Comment: 23 pages, 5 figures
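The abstract describes feeding per-modality captions and subtitles to an LLM
with an instructional prompt. Below is a minimal sketch of what that
integration step could look like; the prompt wording and the call_llm
placeholder are assumptions for illustration, not the authors' implementation.

```python
def build_omni_caption_prompt(vision_caption: str, audio_caption: str, subtitle: str) -> str:
    """Compose an instructional prompt asking an LLM to merge per-modality captions."""
    return (
        "Combine the following descriptions of one video clip into a single "
        "natural caption that covers what is seen, heard, and said.\n"
        f"Vision caption: {vision_caption}\n"
        f"Audio caption: {audio_caption}\n"
        f"Subtitle: {subtitle}\n"
        "Omni-modality caption:"
    )

def call_llm(prompt: str) -> str:
    """Placeholder for an off-the-shelf LLM call."""
    raise NotImplementedError("plug in your LLM client here")

if __name__ == "__main__":
    prompt = build_omni_caption_prompt(
        vision_caption="A man chops onions on a wooden cutting board.",
        audio_caption="A knife taps rhythmically while soft music plays.",
        subtitle="First, dice the onions as finely as you can.",
    )
    print(prompt)  # inspect the prompt; pass it to call_llm() in a real pipeline
```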
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
The ability to perceive how objects change over time is a crucial ingredient
in human intelligence. However, current benchmarks cannot faithfully reflect
the temporal understanding abilities of video-language models (VidLMs) due to
the existence of static visual shortcuts. To remedy this issue, we present
VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal
Concept underStanding. Specifically, we first introduce a fine-grained taxonomy
of temporal concepts in natural language in order to diagnose the capability of
VidLMs to comprehend different temporal aspects. Furthermore, to disentangle
the correlation between static and temporal information, we generate
counterfactual video descriptions that differ from the original one only in the
specified temporal aspect. We employ a semi-automatic data collection framework
using large language models and human-in-the-loop annotation to obtain
high-quality counterfactual descriptions efficiently. Evaluation of
representative video-language understanding models confirms their deficiency in
temporal understanding, revealing the need for greater emphasis on the temporal
elements in video-language research.
Comment: 23 pages, 6 figures, 18 tables, data is available at
https://github.com/lscpku/VITATEC
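A dataset of original/counterfactual caption pairs that differ only in a
temporal aspect naturally supports a pairwise diagnostic: a video-language
model should score the original caption higher. The sketch below illustrates
that kind of evaluation; the similarity function and protocol are assumptions
for illustration, not the paper's exact metric.

```python
from typing import Callable, Sequence, Tuple

def temporal_preference_accuracy(
    pairs: Sequence[Tuple[str, str, str]],    # (video_id, original_caption, counterfactual_caption)
    similarity: Callable[[str, str], float],  # similarity(video_id, caption) -> score
) -> float:
    """Fraction of pairs where the model prefers the original over the counterfactual."""
    correct = sum(
        1 for video, orig, counter in pairs
        if similarity(video, orig) > similarity(video, counter)
    )
    return correct / max(len(pairs), 1)

if __name__ == "__main__":
    # Dummy scores standing in for a real video-language model.
    fake_scores = {("v1", "opens then closes the door"): 0.8,
                   ("v1", "closes then opens the door"): 0.6}
    acc = temporal_preference_accuracy(
        [("v1", "opens then closes the door", "closes then opens the door")],
        lambda v, c: fake_scores[(v, c)],
    )
    print(f"temporal preference accuracy: {acc:.2f}")
```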
Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications
Multimodality Representation Learning, as a technique of learning to embed
information from different modalities and their correlations, has achieved
remarkable success on a variety of applications, such as Visual Question
Answering (VQA), Natural Language for Visual Reasoning (NLVR), and Vision
Language Retrieval (VLR). Among these applications, cross-modal interaction and
complementary information from different modalities are crucial for advanced
models to perform any multimodal task, e.g., understand, recognize, retrieve,
or generate optimally. Researchers have proposed diverse methods to address
these tasks, and different variants of transformer-based architectures have
performed extraordinarily well across multiple modalities. This survey
presents a comprehensive review of the literature on the evolution and
enhancement of deep learning multimodal architectures that handle textual,
visual, and audio features for diverse cross-modal and modern multimodal
tasks. This study summarizes (i) recent task-specific deep learning
methodologies, (ii) pretraining types and multimodal pretraining objectives,
(iii) the progression from state-of-the-art pretrained multimodal approaches
to unifying architectures, and (iv) multimodal task categories and possible
future improvements that can be devised for better multimodal learning.
Moreover, we include a dataset section for new researchers
that covers most of the benchmarks for pretraining and finetuning. Finally,
major challenges, gaps, and potential research topics are explored. A
constantly updated paper list related to our survey is maintained at
https://github.com/marslanm/multimodality-representation-learning
Making Short-Form Videos Accessible with Hierarchical Video Summaries
Short videos on platforms such as TikTok, Instagram Reels, and YouTube Shorts
(i.e. short-form videos) have become a primary source of information and
entertainment. Many short-form videos are inaccessible to blind and low vision
(BLV) viewers due to their rapid visual changes, on-screen text, and music or
meme-audio overlays. In our formative study, 7 BLV viewers who regularly
watched short-form videos reported frequently skipping such inaccessible
content. We present ShortScribe, a system that provides hierarchical visual
summaries of short-form videos at three levels of detail to support BLV viewers
in selecting and understanding short-form videos. ShortScribe allows BLV users
to navigate between video descriptions based on their level of interest. To
evaluate ShortScribe, we assessed description accuracy and conducted a user
study with 10 BLV participants comparing ShortScribe to a baseline interface.
When using ShortScribe, participants reported higher comprehension and provided
more accurate summaries of video content.
Comment: To appear at CHI 202
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Large foundation models can exhibit unique capabilities depending on the
domain of data they are trained on. While these domains are generic, they may
only barely overlap. For example, visual-language models (VLMs) are trained on
Internet-scale image captions, but large language models (LMs) are further
trained on Internet-scale text with no images (e.g., from spreadsheets to SAT
questions). As a result, these models store different forms of commonsense
knowledge across different domains. In this work, we show that this model
diversity is symbiotic, and can be leveraged to build AI systems with
structured Socratic dialogue -- in which new multimodal tasks are formulated as
a guided language-based exchange between different pre-existing foundation
models, without additional finetuning. In the context of egocentric perception,
we present a case study of Socratic Models (SMs) that can provide meaningful
results for complex tasks such as generating free-form answers to contextual
questions about egocentric video, by formulating video Q&A as short story Q&A,
i.e. summarizing the video into a short story, then answering questions about
it. Additionally, SMs can generate captions for Internet images, and are
competitive with state-of-the-art on zero-shot video-to-text retrieval with
42.8 R@1 on MSR-VTT 1k-A. SMs demonstrate how to compose foundation models
zero-shot to capture new multimodal functionalities, without domain-specific
data collection. Prototypes are available at socraticmodels.github.io.
Comment: https://socraticmodels.github.io
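The abstract frames video Q&A as short-story Q&A: summarize frame-level
captions into a story with a language model, then answer questions about that
story. Below is a minimal sketch of that composition; the prompts and the stub
language model are illustrative assumptions, not the authors' exact prompts.

```python
from typing import Callable, List

def video_qa_as_story_qa(
    frame_captions: List[str],
    question: str,
    lm: Callable[[str], str],  # any text-completion model
) -> str:
    """Compose captioner output and a language model without any finetuning."""
    story_prompt = (
        "Summarize these frame-by-frame captions into a short first-person story:\n"
        + "\n".join(f"- {c}" for c in frame_captions)
    )
    story = lm(story_prompt)
    answer_prompt = f"Story: {story}\nQuestion: {question}\nAnswer:"
    return lm(answer_prompt)

if __name__ == "__main__":
    # Stub LM so the sketch runs end to end; swap in a real language model.
    def stub_lm(prompt: str) -> str:
        return "I was slicing bread in the kitchen." if "Summarize" in prompt else "In the kitchen."

    captions = ["a hand holds a knife", "bread on a cutting board", "slices fall onto a plate"]
    print(video_qa_as_story_qa(captions, "Where was I?", stub_lm))
```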
Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos
To realize human-robot collaboration, robots need to execute actions for new
tasks according to human instructions, given only finite prior knowledge. Human
experts can share their knowledge of how to perform a task with a robot through
multi-modal instructions in their demonstrations, showing a sequence of
short-horizon steps to achieve a long-horizon goal. This paper introduces a
method for robot action sequence generation from instruction videos using (1)
an audio-visual Transformer that converts audio-visual features and instruction
speech to a sequence of robot actions called dynamic movement primitives (DMPs)
and (2) style-transfer-based training that employs multi-task learning with
video captioning and weakly-supervised learning with a semantic classifier to
exploit unpaired video-action data. We built a system that accomplishes various
cooking actions, where an arm robot executes a DMP sequence acquired from a
cooking video using the audio-visual Transformer. Experiments with
Epic-Kitchen-100, YouCookII, QuerYD, and in-house instruction video datasets
show that the proposed method improves the quality of DMP sequences, achieving
2.3 times the METEOR score of a baseline video-to-action Transformer. The
model achieved a 32% task success rate with task knowledge of the object.
Comment: Accepted to Interspeech202
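The abstract's action representation is the dynamic movement primitive (DMP).
The sketch below rolls out one discrete DMP in the standard formulation (a
canonical phase system plus a spring-damper transformation system with a
weighted forcing term); the gains and basis-function choices are generic
defaults, not the paper's parameters.

```python
import numpy as np

def rollout_dmp(x0, g, weights, tau=1.0, dt=0.01, alpha_z=25.0, beta_z=6.25, alpha_s=4.0):
    """Integrate one discrete DMP and return the position trajectory."""
    n_basis = len(weights)
    centers = np.exp(-alpha_s * np.linspace(0, 1, n_basis))  # basis centers in phase space
    widths = n_basis ** 1.5 / centers                        # heuristic basis widths
    x, v, s = x0, 0.0, 1.0
    traj = [x]
    for _ in range(int(tau / dt)):
        psi = np.exp(-widths * (s - centers) ** 2)
        forcing = (psi @ weights) / (psi.sum() + 1e-10) * s * (g - x0)
        dv = (alpha_z * (beta_z * (g - x) - v) + forcing) / tau
        v += dv * dt
        x += v / tau * dt
        s += -alpha_s * s / tau * dt                         # canonical system decay
        traj.append(x)
    return np.array(traj)

if __name__ == "__main__":
    path = rollout_dmp(x0=0.0, g=1.0, weights=np.zeros(10))
    print(f"start {path[0]:.2f} -> end {path[-1]:.2f}")      # converges toward the goal
```

With zero forcing weights the primitive simply converges to the goal; learned
weights shape the transient into the demonstrated motion.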
Large scale datasets for Image and Video Captioning in Italian
The application of Attention-based Deep Neural architectures to the automatic captioning of images and videos is enabling the development of increasingly high-performing systems. Unfortunately, while image processing is language independent, this does not hold for caption generation. Training such architectures requires the availability of (possibly large-scale) language-specific resources, which are not available for many languages, such as Italian. In this paper, we present MSCOCO-it and MSR-VTT-it, two large-scale resources for image and video captioning. They have been derived by applying automatic machine translation to existing resources. Even though this approach is naive and exposed to the gathering of noisy information (depending on the quality of the automatic translator), we experimentally show that robust deep learning is enabled and is rather tolerant of such noise. In particular, we improve the state-of-the-art results for image captioning in Italian. Moreover, in the paper we discuss the training of a system that, to the best of our knowledge, is the first video captioning system for Italian.
Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data
Scaling up weakly-supervised datasets has been shown to be highly effective in the
image-text domain and has contributed to most of the recent state-of-the-art
computer vision and multimodal neural networks. However, existing large-scale
video-text datasets and mining techniques suffer from several limitations, such
as the scarcity of aligned data, the lack of diversity in the data, and the
difficulty of collecting aligned data. The currently popular video-text data
mining approach via automatic speech recognition (ASR), used in HowTo100M,
provides
low-quality captions that often do not refer to the video content. Other mining
approaches do not provide proper language descriptions (video tags) and are
biased toward short clips (alt text). In this work, we show how recent advances
in image captioning allow us to pre-train high-quality video models without any
parallel video-text data. We pre-train several video captioning models that are
based on an OPT language model and a TimeSformer visual backbone. We fine-tune
these networks on several video captioning datasets. First, we demonstrate that
image captioning pseudolabels work better for pre-training than the existing
HowTo100M ASR captions. Second, we show that pre-training on both images and
videos produces a significantly better network (+4 CIDEr on MSR-VTT) than
pre-training on a single modality. Our methods are complementary to the
existing pre-training or data mining approaches and can be used in a variety of
settings. Given the efficacy of the pseudolabeling method, we are planning to
publicly release the generated captions.
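The core idea above is to pseudolabel unlabeled clips with an off-the-shelf
image captioner and pretrain on the resulting video-text pairs. The sketch
below illustrates that loop; the frame-sampling strategy and the stub
captioner are assumptions for illustration, not the paper's recipe.

```python
from typing import Any, Callable, Dict, List

def pseudolabel_clips(
    clip_frames: Dict[str, List[Any]],      # clip_id -> list of decoded frames
    caption_image: Callable[[Any], str],    # any pretrained image captioner
) -> List[Dict[str, str]]:
    """Caption a representative frame of each clip to build (video, pseudo-caption) pairs."""
    pairs = []
    for clip_id, frames in clip_frames.items():
        if not frames:
            continue
        mid_frame = frames[len(frames) // 2]  # single middle frame as the representative
        pairs.append({"clip": clip_id, "caption": caption_image(mid_frame)})
    return pairs

if __name__ == "__main__":
    # Stub captioner so the sketch runs; replace with a real image captioning model.
    pairs = pseudolabel_clips({"clip_001": ["f0", "f1", "f2"]}, lambda frame: f"a caption for {frame}")
    print(pairs)
```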
Developing Accessible Collection and Presentation Methods for Observational Data
The processes of collecting, cleaning, and presenting data are critical in ensuring the proper analysis of data at a later date. An opportunity exists to enhance the data collection and presentation process for those who are not data scientists, such as healthcare professionals and businesspeople interested in using data to help them make decisions. In this work, the creation of an observational data collection and presentation tool is investigated, with a focus on developing a tool that prioritizes user-friendliness and preservation of the context of the data collected. This aim is achieved via the integration of three approaches to data collection and presentation. In the first approach, the collection of observational data is structured and carried out via a trichotomous, tailored, sub-branching scoring (TTSS) system. The system allows for deep levels of data collection while enabling data to be summarized quickly by a user via collapsing details. The system is evaluated against the stated requirements of usability and extensibility, proving the latter by providing examples of various evaluations created using the TTSS framework. Next, this approach is integrated with automated data collection via mobile device sensors to facilitate efficient completion of the assessment. Results are presented from a system used to capture complex data about the built environment, and the results of the data collection are compared, including how the system specifically uses quantitative measures. This approach is evaluated against other solutions for obtaining data about the accessibility of a built environment, and several assessments taken in the field are compared to illustrate the system's flexibility. The extension of the system for automated data capture is also discussed. Finally, the use of accessibility information for preserving data context is integrated. This approach is evaluated by investigating how accessible media entries improve the quality of search for an archival website. Human-generated accessibility information is compared to computer-generated accessibility information, as well as to simple reliance on titles and metadata. This is followed by a discussion of how improved accessibility can benefit the understanding of the context of gathered observational data.
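The abstract describes a trichotomous, sub-branching scoring structure whose
details can be collapsed for a quick summary. The sketch below is a
hypothetical illustration of such a structure (yes / no / not-applicable
scores with optional child items), not the thesis's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TTSSItem:
    question: str
    score: Optional[str] = None             # "yes", "no", or "na" (trichotomous)
    children: List["TTSSItem"] = field(default_factory=list)

    def summarize(self, depth: int = 0, max_depth: int = 1) -> str:
        """Collapse the sub-branching tree to max_depth levels for a quick summary view."""
        line = "  " * depth + f"{self.question}: {self.score or 'unscored'}"
        if depth >= max_depth:
            return line
        return "\n".join([line] + [c.summarize(depth + 1, max_depth) for c in self.children])

if __name__ == "__main__":
    entrance = TTSSItem("Entrance accessible?", "no", [
        TTSSItem("Step-free route available?", "no"),
        TTSSItem("Door width adequate?", "yes"),
    ])
    print(entrance.summarize(max_depth=0))  # collapsed summary
    print(entrance.summarize(max_depth=1))  # expanded detail
```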
Towards Interaction-level Video Action Understanding
A huge number of videos are created, spread, and viewed daily, and among these massive volumes of video, the actions and activities of humans account for a large part. We want machines to understand human actions in videos, as this is essential to various applications, including but not limited to autonomous driving, security systems, human-robot interaction, and healthcare. Towards a truly intelligent system able to interact with humans, video understanding must go beyond simply answering "what is the action in the video" and become more aware of what those actions mean to humans and more in line with human thinking, which we call interaction-level action understanding. This thesis identifies three main challenges in approaching interaction-level video action understanding: 1) understanding actions given human consensus; 2) understanding actions based on specific human rules; 3) directly understanding actions in videos via human natural language. For the first challenge, we select video summarization as a representative task, which aims to select informative frames that retain high-level information based on human annotators' experience. Through a self-attention architecture and meta-learning, which jointly process dual representations of visual and sequential information for video summarization, the proposed model is capable of understanding video from human consensus (e.g., how humans decide which parts of an action sequence are essential). For the second challenge, our work on action quality assessment uses transformer decoders to parse the input action into several sub-actions and assess the more fine-grained qualities of the given action, yielding the capability of action understanding given specific human rules (e.g., how well a diving action is performed, or how well a robot performs surgery). The third key idea explored in this thesis is to use graph neural networks in an adversarial fashion to understand actions through natural language. We demonstrate the utility of this technique on the video captioning task, which takes an action video as input and outputs natural language, and achieves state-of-the-art performance. It can be concluded that the research directions and methods introduced in this thesis provide fundamental components toward interaction-level action understanding.