Hierarchical Attention Network for Action Segmentation
The temporal segmentation of events is an essential task and a precursor for
the automatic recognition of human actions in video. Several attempts have
been made to capture frame-level salient aspects through attention, but they
lack the capacity to effectively map the temporal relationships between
frames, as they only capture a limited span of temporal dependencies. To this
end, we propose a complete end-to-end supervised learning approach that can
better learn relationships between actions over time, thus improving the
overall segmentation performance. The proposed hierarchical recurrent attention
framework analyses the input video at multiple temporal scales, to form
embeddings at frame level and segment level, and perform fine-grained action
segmentation. This generates a simple, lightweight, yet extremely effective
architecture for segmenting continuous video streams and has multiple
application domains. We evaluate our system on multiple challenging public
benchmark datasets, including the MERL Shopping, 50 Salads, and Georgia Tech
Egocentric datasets, and achieve state-of-the-art performance. The evaluated
datasets encompass numerous video capture settings, including
static overhead camera views and dynamic, ego-centric head-mounted camera
views, demonstrating the direct applicability of the proposed framework in a
variety of settings. Comment: Published in Pattern Recognition Letter
Long-Term Anticipation of Activities with Cycle Consistency
With the success of deep learning methods in analyzing activities in videos,
more attention has recently been focused towards anticipating future
activities. However, most of the work on anticipation either analyzes a
partially observed activity or predicts the next action class. Recently, new
approaches have been proposed that extend the prediction horizon up to several
minutes into the future and anticipate a sequence of future activities,
including their durations. While these works decouple the semantic
interpretation of the observed sequence from the anticipation task, we propose
a framework for anticipating future activities directly from the features of
the observed frames and train it in an end-to-end fashion. Furthermore, we
introduce a cycle consistency loss over time by predicting the past activities
given the predicted future. Our framework achieves state-of-the-art results on
two datasets: the Breakfast dataset and 50Salads. Comment: GCPR 202
VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation
Egocentric action anticipation is a challenging task that aims to make
advanced predictions of future actions from current and historical observations
in the first-person view. Most existing methods focus on improving the model
architecture and loss function based on the visual input and recurrent neural
network to boost the anticipation performance. However, these methods, which
merely consider visual information and rely on a single network architecture,
gradually reach a performance plateau. In order to fully understand what has
been observed and capture the dependencies between current observations and
future actions well enough, we propose a novel Transformer-GRU-based action
anticipation framework enhanced by visual-semantic fusion. Firstly,
high-level semantic information is introduced to improve the performance of
action anticipation for the first time. We propose to use the semantic features
generated based on the class labels or directly from the visual observations to
augment the original visual features. Secondly, an effective visual-semantic
fusion module is proposed to make up for the semantic gap and fully utilize the
complementarity of different modalities. Thirdly, to take advantage of both the
parallel and autoregressive models, we design a Transformer-based encoder for
long-term sequential modeling and a GRU-based decoder for flexible iterative
decoding. Extensive experiments on two large-scale first-person view datasets,
i.e., EPIC-Kitchens and EGTEA Gaze+, validate the effectiveness of our proposed
method, which achieves new state-of-the-art performance, outperforming previous
approaches by a large margin. Comment: 12 pages, 7 figure
Modeling Events and Interactions through Temporal Processes -- A Survey
In real-world scenarios, many phenomena produce a collection of events that
occur in continuous time. Point processes provide a natural mathematical
framework for modeling these sequences of events. In this survey, we
investigate probabilistic models for modeling event sequences through temporal
processes. We revise the notion of event modeling and provide the mathematical
foundations that characterize the literature on the topic. We define an
ontology to categorize the existing approaches in terms of three families:
simple, marked, and spatio-temporal point processes. For each family, we
systematically review the existing approaches based on deep learning.
Finally, we analyze the scenarios where the proposed techniques can be used for
addressing prediction and modeling aspects. Comment: Image replacement
Forecasting Future Action Sequences with Neural Memory Networks
We propose a novel neural memory network-based framework for future action sequence forecasting. This is a challenging task where we have to consider short-term, within-sequence relationships as well as relationships between sequences, to understand how sequences of actions evolve over time. To capture these relationships effectively, we introduce neural memory networks to our modelling scheme. We show the significance of using two input streams, the observed frames and the corresponding action labels, which provide different information cues for our prediction task. Furthermore, through the proposed method we effectively map the long-term relationships among individual input sequences through separate memory modules, which enables better fusion of the salient features. Our method outperforms the state-of-the-art approaches by a large margin on two publicly available datasets: Breakfast and 50 Salads.