Deep Learning for Dense Interpretation of Video: Survey of Various Approach, Challenges, Datasets and Metrics
Video interpretation has garnered considerable attention in the computer vision and natural language processing fields due to the rapid expansion of video data and the increasing demand for applications such as intelligent video search, automated video subtitling, and assistance for visually impaired individuals. However, video interpretation presents greater challenges because videos contain both temporal and spatial information. While deep learning models for images, text, and audio have made significant progress, efforts have recently been focused on developing deep networks for video interpretation. A thorough evaluation of current research is necessary to provide insights for future endeavors, considering the myriad techniques, datasets, features, and evaluation criteria available in the video domain. This study offers a survey of recent advancements in deep learning for dense video interpretation, addressing various datasets and the challenges they present, as well as key features in video interpretation. Additionally, it provides a comprehensive overview of the latest deep learning models in video interpretation, which have been instrumental in activity identification and video description or captioning. The paper compares the performance of several deep learning models in this field based on specific metrics. Finally, the study summarizes future trends and directions in video interpretation.
Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning
Although promising results have been achieved in video captioning, existing models are limited to the fixed inventory of activities in the training corpus and do not generalize to open-vocabulary scenarios. Here we introduce a novel task, zero-shot video captioning, that aims at describing out-of-domain videos of unseen activities. Videos of different activities usually require different captioning strategies in many aspects, e.g., word selection, semantic construction, and style expression, which poses a great challenge for depicting novel activities without paired training data. Meanwhile, however, similar activities share some of those aspects in common. We therefore propose a principled Topic-Aware Mixture of Experts (TAMoE) model for zero-shot video captioning, which learns to compose different experts based on different topic embeddings, implicitly transferring the knowledge learned from seen activities to unseen ones. In addition, we leverage an external topic-related text corpus to construct the topic embedding for each activity, which embodies the most relevant semantic vectors within the topic. Empirical results not only validate the effectiveness of our method in utilizing semantic knowledge for video captioning, but also show its strong generalization ability when describing novel activities.

Comment: Accepted to AAAI 201
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
A short video clip may contain a progression of multiple events and an interesting story line. A human needs to capture the event in every shot and associate the shots to understand the story behind them. In this work, we present a new multi-shot video understanding benchmark, Shot2Story20K, with detailed shot-level captions and comprehensive video summaries. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video and narration captioning, multi-shot video summarization, and video retrieval with shot descriptions. Preliminary experiments show the challenges of generating long and comprehensive video summaries. Nevertheless, the generated imperfect summaries can already significantly boost the performance of existing video understanding tasks such as video question answering, promoting an under-explored setting of video understanding with detailed summaries.

Comment: See https://mingfei.info/shot2story for updates and more information.