Convolutional Hierarchical Attention Network for Query-Focused Video Summarization
Previous approaches for video summarization mainly concentrate on finding the
most diverse and representative visual contents as video summary without
considering the user's preference. This paper addresses the task of
query-focused video summarization, which takes a user's query and a long video as
inputs and aims to generate a query-focused video summary. In this paper, we
consider the task as a problem of computing similarity between video shots and
the query. To this end, we propose a method, named Convolutional Hierarchical
Attention Network (CHAN), which consists of two parts: a feature encoding network
and a query-relevance computing module. In the encoding network, we employ a
convolutional network with a local self-attention mechanism and a query-aware
global attention mechanism to learn the visual information of each shot. The
encoded features are then sent to the query-relevance computing module to generate
the query-focused video summary. Extensive experiments on the benchmark dataset
demonstrate the competitive performance and show the effectiveness of our
approach.
Comment: Accepted by AAAI 2020 Conference
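The abstract frames query-focused summarization as computing similarity between video shots and a query. The following is a minimal sketch of that framing only, not the CHAN architecture: it assumes each shot and the query have already been encoded as fixed-length vectors, scores shots by cosine similarity, and keeps the top-scoring ones. The function names (`cosine`, `summarize`), the parameter `k`, and the toy feature values are hypothetical and chosen purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def summarize(shot_features, query_embedding, k):
    """Return the indices of the k shots most similar to the query, in temporal order."""
    # Score every shot against the query (assumed stand-in for a learned relevance module).
    scores = [cosine(f, query_embedding) for f in shot_features]
    # Rank shot indices by descending score and keep the best k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order so the summary plays back chronologically.
    return sorted(ranked[:k])


# Toy, made-up 3-dimensional features for four shots and one query.
shots = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.0, 0.0]
selected = summarize(shots, query, k=2)
print(selected)
```

In a learned system the fixed cosine score would be replaced by a trained relevance module; the sketch only shows the score-then-select structure.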
Query-controllable Video Summarization
When video collections become huge, how to explore both within and across
videos efficiently is challenging. Video summarization is one of the ways to
tackle this issue. Traditional summarization approaches limit the effectiveness
of video exploration because they only generate one fixed video summary for a
given input video independent of the information need of the user. In this
work, we introduce a method which takes a text-based query as input and
generates a video summary corresponding to it. We do so by modeling video
summarization as a supervised learning problem and propose an end-to-end deep
learning based method for query-controllable video summarization to generate a
query-dependent video summary. Our proposed method consists of a video summary
controller, video summary generator, and video summary output module. To foster
the research of query-controllable video summarization and conduct our
experiments, we introduce a dataset that contains frame-based relevance score
labels. Our experimental results show that the text-based query
helps control the video summary and improves our model's performance.
Our code and dataset:
https://github.com/Jhhuangkay/Query-controllable-Video-Summarization
Comment: This paper is accepted by ACM International Conference on Multimedia
Retrieval (ICMR), 202
Towards Interaction-level Video Action Understanding
A huge number of videos are created, shared, and viewed daily. Among these massive video collections, human actions and activities account for a large part. We want machines to understand human actions in videos, as this is essential to various applications, including but not limited to autonomous driving, security systems, human-robot interaction, and healthcare. Toward a truly intelligent system that is able to interact with humans, video understanding must go beyond simply answering "what is the action in the video" and be more aware of what those actions mean to humans and more in line with human thinking, which we call interactive-level action understanding. This thesis identifies three main challenges in approaching interactive-level video action understanding: 1) understanding actions given human consensus; 2) understanding actions based on specific human rules; 3) directly understanding actions in videos via human natural language. For the first challenge, we select video summarization as a representative task that aims to select informative frames to retain high-level information based on human annotators' experience. Through a self-attention architecture and meta-learning, which jointly process dual representations of visual and sequential information for video summarization, the proposed model is capable of understanding video from human consensus (e.g., which parts of an action sequence humans consider essential). For the second challenge, our work on action quality assessment uses transformer decoders to parse the input action into several sub-actions and assess the more fine-grained qualities of the given action, yielding the capability of action understanding given specific human rules (e.g., how well a diving action is performed, or how well a robot performs surgery). The third key idea explored in this thesis is to use graph neural networks in an adversarial fashion to understand actions through natural language.
We demonstrate the utility of this technique for the video captioning task, which takes an action video as input and outputs natural language, and show that it yields state-of-the-art performance. It can be concluded that the research directions and methods introduced in this thesis provide fundamental components toward interactive-level action understanding.
Large-scale interactive exploratory visual search
Large-scale visual search has been one of the challenging issues in the era of big data. It demands techniques that are not only highly effective and efficient but also allow users to conveniently express their information needs and refine their intents. In this thesis, we focus on developing an exploratory framework for large-scale visual search. We also develop a number of enabling techniques, including compact visual content representation for scalable search, near-duplicate video shot detection, and action-based event detection. We propose a novel scheme for extremely low bit rate visual search, which sends compressed visual words consisting of a vocabulary tree histogram and descriptor orientations rather than descriptors. Compact representation of video data is achieved by identifying the keyframes of a video, which can also help users comprehend visual content efficiently. We propose a novel Bag-of-Importance model for static video summarization. Near-duplicate detection is one of the key issues for large-scale visual search, since there exist a large number of nearly identical images and videos. We propose an improved near-duplicate video shot detection approach for more effective shot representation. Event detection has been one of the solutions for bridging the semantic gap in visual search. We particularly focus on human-action-centred event detection. We propose an enhanced sparse coding scheme to model human actions. Our proposed approach is able to significantly reduce computational cost while achieving recognition accuracy highly comparable to state-of-the-art methods. Finally, we propose an integrated solution for addressing the prime challenges raised by large-scale interactive visual search. The proposed system is also one of the first attempts at exploratory visual search. It provides users with more robust results to satisfy their exploration needs.