Hierarchical Video Understanding
We introduce a hierarchical architecture for video understanding that
exploits the structure of real-world actions by capturing targets at
different levels of granularity. We design the model so that it first
learns simpler coarse-grained tasks and then moves on to learn more
fine-grained targets. The model is trained with a joint loss over the
different granularity levels. We demonstrate empirical results on the
recent release of the Something-Something dataset, which provides a
hierarchy of targets, namely coarse-grained action groups, fine-grained
action categories, and captions. Experiments suggest that models that
exploit targets at different levels of granularity achieve better
performance on all levels.
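
To make the setup concrete, here is a minimal PyTorch sketch of a shared
encoder feeding one classification head per granularity level, trained with
a joint loss summed over levels. The head sizes (50 groups, 174 categories)
follow the Something-Something hierarchy above; the encoder shape, the loss
weights, and all names are illustrative assumptions, not the authors' code,
and the caption decoder is omitted.

    import torch.nn as nn

    class HierarchicalClassifier(nn.Module):
        def __init__(self, in_dim=2048, feat_dim=512, n_groups=50, n_categories=174):
            super().__init__()
            # Shared encoder feeding one head per granularity level.
            self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
            self.group_head = nn.Linear(feat_dim, n_groups)         # coarse level
            self.category_head = nn.Linear(feat_dim, n_categories)  # fine level

        def forward(self, clip_features):
            h = self.encoder(clip_features)
            return self.group_head(h), self.category_head(h)

    def joint_loss(group_logits, cat_logits, group_y, cat_y, w=(1.0, 1.0)):
        # Joint loss: weighted sum of per-level cross-entropies (weights assumed).
        ce = nn.functional.cross_entropy
        return w[0] * ce(group_logits, group_y) + w[1] * ce(cat_logits, cat_y)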
On the effectiveness of task granularity for transfer learning
We describe a DNN for video classification and captioning, trained
end-to-end with shared features, that solves tasks at different levels of
granularity, exploring the link between granularity in a source task and
the quality of the learned features for transfer learning. To solve the
new task domain in transfer learning, we freeze the trained encoder and
fine-tune a neural net on the target domain. We train on the
Something-Something dataset, with over 220,000 videos and multiple levels
of target granularity, including 50 action groups, 174 fine-grained action
categories, and captions. Classification and captioning on
Something-Something are challenging because of the subtle differences
between actions, applied to thousands of different object classes, and the
diversity of captions penned by crowd actors. Our model performs better
than existing classification baselines for Something-Something, with
impressive fine-grained results, and it yields a strong baseline on the
new Something-Something captioning task. Experiments reveal that training
with more fine-grained tasks tends to produce better features for transfer
learning.
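
The transfer protocol described above is straightforward to express in code.
The sketch below (PyTorch, hypothetical names) freezes a pretrained encoder
and trains only a new linear head on the target domain; the feature size,
class count, and optimizer settings are assumptions, not values from the
paper.

    import torch
    import torch.nn as nn

    def build_transfer_model(pretrained_encoder: nn.Module, feat_dim=512, n_classes=101):
        # Freeze the source-task encoder so only the new head is trained.
        for p in pretrained_encoder.parameters():
            p.requires_grad = False
        pretrained_encoder.eval()
        head = nn.Linear(feat_dim, n_classes)   # fine-tuned on the target domain
        model = nn.Sequential(pretrained_encoder, head)
        optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)  # head params only
        return model, optimizer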
Generating Adjacency Matrix for Video-Query based Video Moment Retrieval
In this paper, we continue our work on the Video-Query based Video Moment
Retrieval task. Building on the use of graph convolution to extract
intra-video and inter-video frame features, we improve the method with a
similarity-metric based graph convolution, whose weighted adjacency matrix
is obtained by computing a similarity metric between the features of any
two different timesteps in the graph. Experiments on the ActivityNet v1.2
and Thumos14 datasets show the effectiveness of this improvement, and it
outperforms state-of-the-art methods.
Comment: arXiv admin note: substantial text overlap with arXiv:2007.0987
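
The core of the improvement is the weighted adjacency matrix itself. Below
is a minimal sketch, assuming cosine similarity with row-wise softmax
normalization; the paper's exact similarity metric may differ.

    import torch
    import torch.nn.functional as F

    def similarity_adjacency(node_feats):
        # node_feats: (T, D) tensor, one row per timestep in the graph.
        x = F.normalize(node_feats, dim=1)
        sim = x @ x.t()                   # (T, T) pairwise similarity metric
        sim.fill_diagonal_(0.0)           # drop self-similarity before normalizing
        return torch.softmax(sim, dim=1)  # row-normalized weighted adjacency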
A Joint Sequence Fusion Model for Video Question Answering and Retrieval
We present an approach named JSFusion (Joint Sequence Fusion) that can
measure semantic similarity between any pair of multimodal sequences
(e.g., a video clip and a language sentence). Our multimodal matching
network consists of two key components. First, the Joint Semantic Tensor
composes a dense pairwise representation of the two sequences into a 3D
tensor. Then, the Convolutional Hierarchical Decoder computes their
similarity score by discovering hidden hierarchical matches between the
two sequence modalities. Both modules leverage hierarchical attention
mechanisms that learn to promote well-matched representation patterns
while pruning out misaligned ones in a bottom-up manner. Although JSFusion
is a universal model applicable to any multimodal sequence data, this work
focuses on video-language tasks, including multimodal retrieval and video
QA. We evaluate the JSFusion model on three retrieval and VQA tasks in
LSMDC, for which our model achieves the best performance reported so far.
We also perform multiple-choice and movie retrieval tasks on the MSR-VTT
dataset, on which our approach outperforms many state-of-the-art methods.
Comment: To appear in ECCV 2018. 17 pages
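
As an illustration of the first component, the sketch below builds a dense
pairwise 3D tensor in the spirit of the Joint Semantic Tensor: every
(frame, word) pair receives a joint feature, yielding a T x N x D tensor.
The Hadamard-product fusion is an assumed instantiation, not necessarily
the paper's exact composition.

    import torch

    def joint_semantic_tensor(video_feats, text_feats):
        # video_feats: (T, D) frame features; text_feats: (N, D) word features.
        v = video_feats.unsqueeze(1)  # (T, 1, D)
        t = text_feats.unsqueeze(0)   # (1, N, D)
        return v * t                  # (T, N, D): one joint feature per pair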
Mining YouTube - A dataset for learning fine-grained action concepts from webly supervised video data
Action recognition has so far mainly focused on classifying hand-selected,
pre-clipped actions, reaching impressive results in this field. But with
performance plateauing on current datasets, it appears that the next steps
in the field will have to go beyond this fully supervised classification.
One way to overcome these problems is to move towards less restricted
scenarios. In this context we present a large-scale real-world dataset
designed to evaluate learning techniques for human action recognition
beyond hand-crafted datasets. To this end we turn the usual collection
process on its head and start with the annotation of a test set of 250
cooking videos. The training data is then gathered by searching for the
annotated classes within the subtitles of freely available videos. The
dataset is unique in that the whole process of collecting the data and
training does not involve any human intervention. To address the semantic
inconsistencies that arise with this kind of training data, we further
propose a semantic hierarchical structure for the mined classes.
Comment: 9 pages
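
The mining step can be pictured as a simple subtitle search. The toy sketch
below (plain Python, hypothetical data structures) keeps any video whose
subtitles mention an annotated class name as a weakly labelled training
example; the real pipeline is more involved, but the principle is the same.

    def mine_training_clips(videos, class_names):
        # videos: iterable of (video_id, subtitle_text) pairs.
        # Returns weakly labelled (video_id, class_name) pairs, no human checks.
        mined = []
        for video_id, subtitles in videos:
            text = subtitles.lower()
            for name in class_names:
                if name.lower() in text:  # naive keyword match in subtitles
                    mined.append((video_id, name))
        return mined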
Graph Neural Network for Video-Query based Video Moment Retrieval
In this paper, we focus on the Video-Query based Video Moment Retrieval
(VQ-VMR) task, which uses a query video clip as input to retrieve a
semantically related video clip in another untrimmed long video. We find
that in VQ-VMR datasets there is no consistent relationship between
frame-level feature similarity and video-level feature similarity, which
affects feature fusion across frames. However, existing VQ-VMR methods do
not fully account for this. Taking this phenomenon into account, we treat
the video features as a graph by concatenating the query video features
and proposal video features along the time dimension, where each timestep
is treated as a node and each row of the feature matrix as that node's
feature. Then, drawing on the power of graph neural networks, we propose a
Multi-Graph Feature Fusion Module to fuse the relational features of this
graph. Evaluating our method on the ActivityNet v1.2 and Thumos14
datasets, we find that our proposed method outperforms state-of-the-art
methods.
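
The graph construction above can be sketched as follows: query and proposal
frame features are concatenated along time, each timestep becomes a node,
and one message-passing step fuses features over a similarity-weighted
adjacency. The propagation rule here is a generic GCN-style update for
illustration, not the exact Multi-Graph Feature Fusion Module.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def fuse_as_graph(query_feats, proposal_feats, weight: nn.Linear):
        # query_feats: (Tq, D); proposal_feats: (Tp, D).
        nodes = torch.cat([query_feats, proposal_feats], dim=0)  # (Tq+Tp, D)
        x = F.normalize(nodes, dim=1)
        adj = torch.softmax(x @ x.t(), dim=1)   # similarity-weighted adjacency
        return torch.relu(adj @ weight(nodes))  # one GCN-style fusion step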