Hierarchical Recurrent Neural Network for Video Summarization
Exploiting the temporal dependency among video frames or subshots is very important for the task of video summarization. In practice, RNNs are good at modeling temporal dependencies and have achieved strong performance in many video-based tasks, such as video captioning and classification. However, traditional RNNs, including LSTM, are not well suited to video summarization, since they can only deal with short videos, while the videos in the summarization task are usually of much longer duration. To address this problem, we
propose a hierarchical recurrent neural network for video summarization, called
H-RNN in this paper. Specifically, it has two layers, where the first layer is
utilized to encode short video subshots cut from the original video, and the
final hidden state of each subshot is input to the second layer for calculating
its confidence of being a key subshot. Compared to traditional RNNs, H-RNN is better suited to video summarization, since it can exploit long-range temporal dependencies among frames while significantly reducing the number of computation operations.
The results on two popular datasets, the Combined dataset and the VTW dataset, demonstrate that the proposed H-RNN outperforms the state of the art.
Comment: Published at the ACM Conference on Multimedia
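A minimal sketch of the two-layer design described above, assuming PyTorch; the LSTM variant, layer sizes, and the sigmoid scoring head are illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class HierarchicalRNN(nn.Module):
    def __init__(self, feat_dim=1024, hid_dim=256):
        super().__init__()
        # Layer 1 encodes the frames of each short subshot.
        self.subshot_rnn = nn.LSTM(feat_dim, hid_dim, batch_first=True)
        # Layer 2 runs over the subshot codes of the whole video.
        self.video_rnn = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        # Confidence of each subshot being a key subshot.
        self.scorer = nn.Linear(hid_dim, 1)

    def forward(self, subshots):
        # subshots: (num_subshots, frames_per_subshot, feat_dim) for one video
        _, (h, _) = self.subshot_rnn(subshots)   # final hidden state per subshot
        codes = h[-1].unsqueeze(0)               # (1, num_subshots, hid_dim)
        out, _ = self.video_rnn(codes)
        return torch.sigmoid(self.scorer(out)).squeeze(-1)

scores = HierarchicalRNN()(torch.randn(12, 30, 1024))  # 12 subshots of 30 frames
```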
Towards Abstraction from Extraction: Multiple Timescale Gated Recurrent Unit for Summarization
In this work, we introduce temporal hierarchies to the sequence to sequence
(seq2seq) model to tackle the problem of abstractive summarization of
scientific articles. The proposed Multiple Timescale model of the Gated
Recurrent Unit (MTGRU) is implemented in the encoder-decoder setting to better
deal with the presence of multiple compositionalities in larger texts. The
proposed model is compared to the conventional RNN encoder-decoder, and the
results demonstrate that our model trains faster and shows significant
performance gains. The results also show that the temporal hierarchies help seq2seq models capture compositionalities better without resorting to highly complex architectural hierarchies.
Comment: To appear in RepL4NLP at ACL 201
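A hedged sketch of the multiple-timescale update: a standard GRU produces the candidate state, and a timescale constant tau slows how quickly each layer's state changes. The exact interpolation form below is our assumption based on the abstract:

```python
import torch
import torch.nn as nn

class MTGRUCell(nn.Module):
    def __init__(self, in_dim, hid_dim, tau=2.0):
        super().__init__()
        self.cell = nn.GRUCell(in_dim, hid_dim)  # standard GRU computes the candidate state
        self.tau = tau                           # larger tau -> slower, more abstract layer

    def forward(self, x, h_prev):
        h_tilde = self.cell(x, h_prev)
        # Interpolate between the old state and the candidate at rate 1/tau.
        return (1.0 - 1.0 / self.tau) * h_prev + (1.0 / self.tau) * h_tilde

cell = MTGRUCell(16, 32, tau=4.0)
h = cell(torch.randn(8, 16), torch.zeros(8, 32))
```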
Video Summarization via Actionness Ranking
To automatically produce a brief yet expressive summary of a long video, an automatic algorithm should start by resembling the human process of summary generation. Prior work proposed supervised and unsupervised algorithms that learn the underlying behavior of humans, either by increasing modeling complexity or by hand-crafting better heuristics to simulate the human summary generation process. In this work, we take a different approach by analyzing a major cue that humans exploit for summary generation: the nature and intensity of actions.
We empirically observed that a frame is more likely to be included in
human-generated summaries if it contains a substantial amount of deliberate
motion performed by an agent, which is referred to as actionness. Therefore, we
hypothesize that learning to automatically generate summaries involves an
implicit knowledge of actionness estimation and ranking. We validate our
hypothesis by running a user study that explores the correlation between
human-generated summaries and actionness ranks. We also run a consensus and behavioral analysis between human subjects to ensure reliable and consistent results. The analysis exhibits a considerable degree of agreement among subjects within the obtained data, verifying our initial hypothesis.
Based on the study findings, we develop a method to incorporate actionness
data to explicitly regulate a learning algorithm that is trained for summary
generation. We assess the performance of our approach on four summarization benchmark datasets and demonstrate a clear advantage over state-of-the-art summarization methods.
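The abstract does not spell out the regulation mechanism, but one plausible reading, sketched here purely as an assumption, is to penalize disagreement between the predicted frame importance and the actionness ranks alongside the usual supervised loss:

```python
import torch
import torch.nn.functional as F

def actionness_regulated_loss(pred_importance, gt_importance, actionness, alpha=0.5):
    # Base supervised loss on human importance labels.
    base = F.mse_loss(pred_importance, gt_importance)
    # Regulation term: encourage predictions to agree with actionness scores.
    rank = F.mse_loss(pred_importance, actionness)
    return base + alpha * rank
```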
Latent Network Summarization: Bridging Network Embedding and Summarization
Motivated by the computational and storage challenges that dense embeddings
pose, we introduce the problem of latent network summarization that aims to
learn a compact, latent representation of the graph structure with
dimensionality that is independent of the input graph size (i.e., #nodes and
#edges), while retaining the ability to derive node representations on the fly.
We propose Multi-LENS, an inductive multi-level latent network summarization
approach that leverages a set of relational operators and relational functions
(compositions of operators) to capture the structure of egonets and
higher-order subgraphs, respectively. The structure is stored in low-rank,
size-independent structural feature matrices, which along with the relational
functions comprise our latent network summary. Multi-LENS is general and
naturally supports both homogeneous and heterogeneous graphs with or without
directionality, weights, attributes or labels. Extensive experiments on real
graphs show a 3.5–34.3% improvement in AUC for link prediction, while requiring 80–2152x less output storage space than baseline embedding methods on large datasets. As application areas, we show the effectiveness of Multi-LENS in detecting anomalies and events in the Enron email communication graph and the Twitter co-mention graph.
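A rough sketch of the relational-function idea, under our own assumptions about the operator set and the compression step; the actual Multi-LENS pipeline also uses binned histograms and other details omitted here:

```python
import numpy as np
import networkx as nx

def relational_features(G, levels=2):
    ops = [np.mean, np.max, np.sum]        # relational operators over neighborhoods
    feats = {n: [G.degree(n)] for n in G}  # base feature: node degree
    for _ in range(levels):                # compose operators -> relational functions
        new = {}
        for n in G:
            nbr = [feats[m] for m in G[n]] or [feats[n]]  # isolated node: use self
            nbr = np.array(nbr)
            new[n] = feats[n] + [op(nbr[:, j]) for op in ops for j in range(nbr.shape[1])]
        feats = new
    return np.array([feats[n] for n in G])

M = relational_features(nx.karate_club_graph())
U, s, Vt = np.linalg.svd(M, full_matrices=False)  # low-rank factorization
summary = (s[:8], Vt[:8])  # summary size independent of the node count
```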
CNN-Based Prediction of Frame-Level Shot Importance for Video Summarization
The ubiquitous presence of redundant, unedited, raw videos on the Internet has made video summarization an important problem. Traditional methods of video summarization employ a heuristic set of hand-crafted features, which in many cases fail to capture the subtle abstraction of a scene. This paper presents a deep learning method that maps the context of a video to the importance of a scene, similar to that perceived by humans. In particular, a convolutional neural network (CNN)-based architecture is proposed to predict the frame-level shot importance for user-oriented video summarization. The weights and biases of the CNN are trained extensively through off-line processing, so that it can provide the importance of a frame of an unseen video almost instantaneously.
Experiments on estimating the shot importance are carried out using the publicly available TVSum50 database. The proposed network performs substantially better than commonly used feature-based methods for estimating shot importance in terms of mean absolute error, absolute error variance, and relative F-measure.
Comment: Accepted at the International Conference on New Trends in Computer Sciences (ICTCS), Amman, Jordan, 201
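A toy version of the frame-to-importance mapping, with an assumed tiny backbone and a sigmoid regression head; the paper's actual architecture and training details are not reproduced here:

```python
import torch
import torch.nn as nn

class FrameImportanceCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, frames):  # frames: (batch, 3, H, W)
        return self.head(self.features(frames)).squeeze(-1)  # importance in [0, 1]

model = FrameImportanceCNN()  # trained off-line, e.g. regressing TVSum50 annotations
importance = model(torch.randn(8, 3, 224, 224))  # scores for 8 frames of an unseen video
```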
Video Summarisation by Classification with Deep Reinforcement Learning
Most existing video summarisation methods are based on either supervised or
unsupervised learning. In this paper, we propose a reinforcement learning-based
weakly supervised method that exploits easy-to-obtain, video-level category
labels and encourages summaries to contain category-related information and
maintain category recognisability. Specifically, we formulate video
summarisation as a sequential decision-making process and train a summarisation
network with deep Q-learning (DQSN). A companion classification network is also
trained to provide rewards for training the DQSN. With the classification
network, we develop a global recognisability reward based on the classification
result. Critically, a novel dense ranking-based reward is also proposed in
order to cope with the temporally delayed and sparse reward problems for long
sequence reinforcement learning. Extensive experiments on two benchmark
datasets show that the proposed approach achieves state-of-the-art performance.
Comment: In Proc. of BMVC 201
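A sketch of how the companion classifier could supply a recognisability reward for Q-learning; the reward form and the Q-update below are our own simplified assumptions, and the proposed dense ranking-based reward is omitted:

```python
import torch
import torch.nn.functional as F

def recognisability_reward(classifier, summary_feats, category):
    # Global reward: log-probability of the true video-level category
    # when the classifier sees only the selected summary.
    logits = classifier(summary_feats.mean(dim=0, keepdim=True))
    return F.log_softmax(logits, dim=-1)[0, category]

def q_learning_step(q_net, state, action, reward, next_state, gamma=0.99):
    with torch.no_grad():
        target = reward + gamma * q_net(next_state).max()  # bootstrapped target
    loss = F.mse_loss(q_net(state)[action], target)
    return loss  # backpropagate through the summarisation network (DQSN)
```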
Unsupervised Object-Level Video Summarization with Online Motion Auto-Encoder
Unsupervised video summarization plays an important role in digesting, browsing, and searching the ever-growing volume of videos produced every day, yet the underlying fine-grained semantic and motion information (i.e., objects of interest and their key motions) in online videos has been barely touched. In this paper, we investigate a pioneering research direction towards fine-grained unsupervised object-level video summarization. It can be distinguished from existing pipelines in two aspects: extracting key motions of participating objects, and learning to summarize in an unsupervised and online manner. To achieve this
goal, we propose a novel online motion Auto-Encoder (online motion-AE)
framework that functions on the super-segmented object motion clips.
Comprehensive experiments on a newly collected surveillance dataset and on public datasets have demonstrated the effectiveness of our proposed method.
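A minimal sketch of the auto-encoder idea on motion clips, assuming an LSTM encoder/decoder and using reconstruction error as a saliency signal; the online update and super-segmentation steps are omitted:

```python
import torch
import torch.nn as nn

class MotionAE(nn.Module):
    def __init__(self, feat_dim=512, hid_dim=128):
        super().__init__()
        self.enc = nn.LSTM(feat_dim, hid_dim, batch_first=True)
        self.dec = nn.LSTM(hid_dim, feat_dim, batch_first=True)

    def forward(self, clip):  # clip: (1, T, feat_dim) features of one motion clip
        _, (h, _) = self.enc(clip)
        z = h[-1].unsqueeze(1).repeat(1, clip.size(1), 1)  # repeat code across time
        recon, _ = self.dec(z)
        return recon

ae = MotionAE()
clip = torch.randn(1, 20, 512)
saliency = ((ae(clip) - clip) ** 2).mean().item()  # high error -> candidate key motion
```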
A Blended Deep Learning Approach for Predicting User Intended Actions
User intended actions are widely seen in many areas. Forecasting these actions and taking proactive measures to optimize business outcomes is a crucial step towards sustaining steady business growth. In this work, we focus on predicting attrition, a typical user intended action. Conventional attrition predictive modeling strategies suffer from a few inherent drawbacks. To overcome these limitations, we propose a novel end-to-end learning scheme that keeps track of the evolution of attrition patterns for predictive modeling. It integrates user activity logs with dynamic and static user profiles via multi-path learning, exploits historical user records through a decaying multi-snapshot technique, and feeds precedent user intentions into the subsequent learning procedure. As a result, it addresses the aforementioned drawbacks of conventional methods.
We evaluate our methodology on two public data repositories and one private user usage dataset provided by Adobe Creative Cloud. Extensive experiments demonstrate that it offers appealing performance in comparison with several existing approaches, as measured by different popular metrics. Furthermore, we introduce an advanced interpretation and visualization strategy to effectively characterize the periodicity of user activity logs. It can help pinpoint important factors that are critical to user attrition and retention, and thus suggests actionable improvement targets for business practice. Our work will provide useful insights into the prediction and elucidation of other user intended actions as well.
Comment: 10 pages; International Conference on Data Mining 201
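A schematic sketch of the multi-path fusion described above, with assumed branch sizes and an exponential form for the decaying multi-snapshot weighting; both are our assumptions:

```python
import torch
import torch.nn as nn

class MultiPathAttrition(nn.Module):
    def __init__(self, log_dim=64, dyn_dim=16, stat_dim=8, hid=32):
        super().__init__()
        self.log_rnn = nn.GRU(log_dim, hid, batch_first=True)  # path 1: activity logs
        self.dyn_mlp = nn.Linear(dyn_dim, hid)                 # path 2: dynamic profile
        self.stat_mlp = nn.Linear(stat_dim, hid)               # path 3: static profile
        self.out = nn.Linear(3 * hid, 1)

    def forward(self, logs, dyn, stat, decay=0.9):
        # Decaying multi-snapshot: down-weight older log snapshots.
        T = logs.size(1)
        w = decay ** torch.arange(T - 1, -1, -1, dtype=logs.dtype)
        h, _ = self.log_rnn(logs * w.view(1, T, 1))
        fused = torch.cat([h[:, -1], torch.relu(self.dyn_mlp(dyn)),
                           torch.relu(self.stat_mlp(stat))], dim=-1)
        return torch.sigmoid(self.out(fused))  # attrition probability

model = MultiPathAttrition()
p = model(torch.randn(4, 10, 64), torch.randn(4, 16), torch.randn(4, 8))
```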
A Survey on Content-Aware Video Analysis for Sports
Sports data analysis is becoming increasingly large-scale, diversified, and
shared, but difficulty persists in rapidly accessing the most crucial
information. Previous surveys have focused on the methodologies of sports video
analysis from the spatiotemporal viewpoint instead of a content-based
viewpoint, and few of these studies have considered semantics. This study
develops a deeper interpretation of content-aware sports video analysis by
examining the insight offered by research into the structure of content under
different scenarios. On the basis of this insight, we provide an overview of
the themes particularly relevant to the research on content-aware systems for
broadcast sports. Specifically, we focus on the video content analysis
techniques applied in sportscasts over the past decade from the perspectives of
fundamentals and general review, a content hierarchical model, and trends and
challenges. Content-aware analysis methods are discussed with respect to
object-, event-, and context-oriented groups. In each group, the gap between
sensation and content excitement must be bridged using proper strategies. In
this regard, a content-aware approach is required to determine user demands.
Finally, the paper summarizes the future trends and challenges for sports video
analysis. We believe that our findings can advance the field of research on
content-aware video analysis for broadcast sports.
Comment: Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
What I See Is What You See: Joint Attention Learning for First and Third Person Video Co-analysis
In recent years, more and more videos are captured from the first-person
viewpoint by wearable cameras. Such first-person video provides additional
information besides the traditional third-person video, and thus has a wide
range of applications. However, techniques for analyzing the first-person video
can be fundamentally different from those for the third-person video, and it is
even more difficult to explore the shared information from both viewpoints. In
this paper, we propose a novel method for first- and third-person video
co-analysis. At the core of our method is the notion of "joint attention",
indicating the learnable representation that corresponds to the shared
attention regions in different viewpoints and thus links the two viewpoints. To
this end, we develop a multi-branch deep network with a triplet loss to extract
the joint attention from the first- and third-person videos via self-supervised
learning. We evaluate our method on a public dataset with cross-viewpoint video matching tasks. Our method outperforms the state of the art both qualitatively and quantitatively. We also demonstrate how the learned joint attention can benefit various applications through a set of additional experiments.
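A compact sketch of the triplet objective implied by the abstract: temporally aligned first-/third-person frames act as anchor-positive pairs and mismatched frames as negatives; the margin and distance choices are assumptions:

```python
import torch
import torch.nn.functional as F

def joint_attention_triplet(embed, first, third_pos, third_neg, margin=0.5):
    # embed: network branch mapping frames to the joint-attention embedding
    a, p, n = embed(first), embed(third_pos), embed(third_neg)
    d_pos = F.pairwise_distance(a, p)  # aligned pair across the two viewpoints
    d_neg = F.pairwise_distance(a, n)  # mismatched pair
    return F.relu(d_pos - d_neg + margin).mean()
```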