Convolutional Long Short-Term Memory Networks for Recognizing First Person Interactions
We present a novel deep learning approach to the problem of interaction recognition from a first-person perspective. The approach uses a pair of convolutional neural networks with shared parameters to extract frame-level features from successive frames of the video. The frame-level features are then aggregated using a convolutional long short-term memory. The final hidden state of the convolutional long short-term memory is used for classification into the respective categories. In our network the spatio-temporal structure of the input is preserved until the very final processing stage. Experimental results show that our method outperforms the state of the art on the most recent first-person interaction datasets that involve complex ego-motion. On UTKinect, it competes with methods that use depth images and skeletal joint information along with RGB images, while it surpasses previous methods that use only RGB images by more than 20% in recognition accuracy.
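The architecture described above is simple enough to outline in code. Below is a minimal sketch, assuming PyTorch: a frame encoder with shared weights (the siamese pair amounts to applying the same encoder to each frame), a ConvLSTM cell whose gates are convolutions so the spatial layout of the features is preserved over time, and a classifier on the final hidden state. Layer sizes, `num_classes`, and the pooling before the classifier are illustrative assumptions, not the authors' configuration.

```python
# Sketch: shared frame encoder + ConvLSTM aggregation, classifying the final hidden state.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """ConvLSTM cell: gates are computed with convolutions, so the spatial
    structure of the feature maps is kept while aggregating over time."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

class InteractionNet(nn.Module):
    def __init__(self, num_classes=6):
        super().__init__()
        # Shared frame encoder: the same weights are applied to every frame.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.convlstm = ConvLSTMCell(64, 64)
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, video):                              # video: (B, T, 3, H, W)
        B, T = video.shape[:2]
        feats = self.encoder(video.flatten(0, 1))          # frame-level features
        feats = feats.view(B, T, *feats.shape[1:])
        h = torch.zeros(B, 64, *feats.shape[3:])
        c = torch.zeros_like(h)
        for t in range(T):                                 # temporal aggregation
            h, c = self.convlstm(feats[:, t], (h, c))
        # Only the final hidden state is used for classification.
        return self.classifier(h.mean(dim=(2, 3)))

logits = InteractionNet()(torch.randn(2, 8, 3, 112, 112))  # (2, num_classes)
```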
Top-down Attention Recurrent VLAD Encoding for Action Recognition in Videos
Most recent approaches for action recognition from video leverage deep
architectures to encode the video clip into a fixed-length representation
vector that is then used for classification. For this to be successful, the
network must be capable of suppressing the irrelevant scene background and
extracting the representation from the most discriminative part of the video. Our
contribution builds on the observation that spatio-temporal patterns
characterizing actions in videos are highly correlated with objects and their
location in the video. We propose Top-down Attention Action VLAD (TA-VLAD), a
deep recurrent architecture with built-in spatial attention that performs
temporally aggregated VLAD encoding for action recognition from videos. We
adopt a top-down approach to attention, using class-specific activation maps
obtained from a deep CNN pre-trained for image classification to weight
appearance features before encoding them into a fixed-length video descriptor
using Gated Recurrent Units. Our method achieves state-of-the-art recognition
accuracy on the HMDB51 and UCF101 benchmarks.
Comment: Accepted to the 17th International Conference of the Italian Association for Artificial Intelligence.
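The two ingredients of TA-VLAD can be illustrated compactly: a class activation map (CAM) taken from a classification CNN serves as a top-down spatial attention weight on the appearance features, and the attended features are encoded as VLAD residuals against a set of cluster centers. The sketch below, assuming PyTorch, uses hard assignment and a single frame for brevity; the cluster count, feature size, and the GRU-based temporal aggregation of per-frame descriptors are illustrative assumptions or left out entirely.

```python
# Sketch: class-activation-map attention weighting followed by VLAD encoding of one frame.
import torch
import torch.nn.functional as F

def class_activation_map(fmap, fc_weight, cls):
    """fmap: (C, H, W) final conv features; fc_weight: (num_classes, C) weights of the
    classification layer after global average pooling (the standard CAM construction).
    Returns an (H, W) attention map normalised to [0, 1]."""
    cam = torch.einsum('c,chw->hw', fc_weight[cls], fmap)
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-6)

def attended_vlad(fmap, attention, centers):
    """Weight spatial appearance features by the attention map, then encode them as
    VLAD residuals with respect to cluster centers of shape (K, C)."""
    C, H, W = fmap.shape
    feats = (fmap * attention).reshape(C, H * W).t()     # (N, C) attended descriptors
    assign = torch.cdist(feats, centers).argmin(dim=1)   # hard assignment to nearest center
    vlad = torch.zeros_like(centers)                     # (K, C) residual accumulators
    for k in range(centers.shape[0]):
        members = feats[assign == k]
        if len(members):
            vlad[k] = (members - centers[k]).sum(dim=0)
    vlad = F.normalize(vlad, dim=1)                      # intra-normalisation
    return F.normalize(vlad.flatten(), dim=0)            # flattened frame descriptor

# Illustrative shapes: 512-dim features on a 7x7 grid, 10 classes, 16 VLAD clusters.
fmap = torch.randn(512, 7, 7)
fc_weight = torch.randn(10, 512)
centers = torch.randn(16, 512)
cam = class_activation_map(fmap, fc_weight, cls=3)
frame_descriptor = attended_vlad(fmap, cam, centers)     # (16 * 512,)
```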
Attention is All We Need: Nailing Down Object-centric Attention for Egocentric Activity Recognition
In this paper we propose an end-to-end trainable deep neural network model
for egocentric activity recognition. Our model is built on the observation that
egocentric activities are highly characterized by the objects and their
locations in the video. Based on this, we develop a spatial attention mechanism
that enables the network to attend to regions containing objects that are
correlated with the activity under consideration. We learn highly specialized
attention maps for each frame using class-specific activations from a CNN
pre-trained for generic image recognition, and use them for spatio-temporal
encoding of the video with a convolutional LSTM. Our model is trained in a
weakly supervised setting using raw video-level activity-class labels.
Nonetheless, on standard egocentric activity benchmarks our model surpasses the
currently best-performing method, which relies on strong supervision from hand
segmentation and object locations during training, by up to 6 percentage points
in recognition accuracy. We visually analyze the attention maps generated by the
network, revealing that it successfully identifies the relevant objects present
in the video frames, which may explain the strong recognition performance. We
also present an extensive ablation analysis of the design choices.
Comment: Accepted to BMVC 2018.
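The weak supervision mentioned above means the only training signal is the video-level activity label. A minimal sketch of such a training step is shown below, assuming PyTorch; `model` stands for any network mapping a clip to class logits (for instance the attention + ConvLSTM encoder described in the abstract), and the commented-out `EgoActivityNet` and optimiser settings are hypothetical placeholders.

```python
# Sketch: weakly supervised training step using only video-level activity labels.
import torch
import torch.nn as nn

def train_step(model, optimizer, clips, labels):
    """clips: (B, T, 3, H, W) raw frames; labels: (B,) video-level class indices.
    No per-frame, hand-segmentation, or object-location annotation is used; any
    spatial attention inside `model` is learned implicitly from this single loss."""
    model.train()
    optimizer.zero_grad()
    logits = model(clips)                                 # (B, num_classes)
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative usage (names are hypothetical placeholders):
# model = EgoActivityNet(num_classes=61)
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# loss = train_step(model, optimizer, clips, labels)
```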
Multi-modal Egocentric Activity Recognition using Audio-Visual Features
Egocentric activity recognition in first-person videos is of increasing
importance, with a variety of applications such as lifelogging, summarization,
assisted living and activity tracking. Existing methods for this task are based
on interpreting various types of sensor information using pre-determined weights
for each feature. In this work, we propose a new framework for the egocentric
activity recognition problem based on combining audio-visual features with
multi-kernel learning (MKL) and multi-kernel boosting (MKBoost). For that
purpose, grid optical-flow, virtual-inertia, log-covariance, and cuboid features
are first extracted from the video. The audio signal is characterized using a
"supervector", obtained based on Gaussian mixture modelling of frame-level
features, followed by a maximum a-posteriori adaptation. Then, the extracted
multi-modal features are adaptively fused by MKL classifiers, in which the
feature and kernel selection/weighting and the recognition task are performed
jointly. The proposed framework was evaluated on a number of egocentric
datasets, and the results showed that using multi-modal features with MKL
outperforms the existing methods.
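The audio "supervector" mentioned above can be sketched concretely: a background GMM is fit on frame-level features, its means are MAP-adapted towards the clip, and the adapted means are stacked into one vector. The sketch below, assuming Python with scikit-learn, follows the standard relevance-MAP recipe; the feature dimensionality, relevance factor, and the fixed-weight kernel combination standing in for the learned MKL fusion are illustrative assumptions.

```python
# Sketch: GMM supervector from frame-level audio features via MAP mean adaptation.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_supervector(ubm, frames, relevance=16.0):
    """ubm: GaussianMixture fit on background data; frames: (N, D) frame-level
    features of one clip. Returns the concatenation of the MAP-adapted means."""
    post = ubm.predict_proba(frames)                             # (N, K) component posteriors
    n_k = post.sum(axis=0)                                       # soft counts per component
    e_k = post.T @ frames / np.maximum(n_k, 1e-8)[:, None]       # posterior-weighted means
    alpha = n_k / (n_k + relevance)                              # adaptation coefficients
    adapted = alpha[:, None] * e_k + (1 - alpha)[:, None] * ubm.means_
    return adapted.ravel()                                       # (K * D,) supervector

# Illustrative usage: 39-dim frame features, 8-component background model.
background = np.random.randn(5000, 39)
clip = np.random.randn(300, 39)
ubm = GaussianMixture(n_components=8, covariance_type='diag').fit(background)
audio_descriptor = gmm_supervector(ubm, clip)

# MKL learns the kernel weights jointly with the classifier; a fixed-weight stand-in:
def combined_kernel(K_audio, K_visual, w=0.5):
    return w * K_audio + (1 - w) * K_visual                      # audio-visual kernel fusion
```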
LSTA: Long Short-Term Attention for Egocentric Action Recognition
Egocentric activity recognition is one of the most challenging tasks in video
analysis. It requires a fine-grained discrimination of small objects and their
manipulation. While some methods rely on strong supervision and attention
mechanisms, they either require costly annotations or do not take spatio-temporal
patterns into account. In this paper we propose LSTA as a mechanism to focus on
features from spatially relevant parts while attention is tracked smoothly
across the video sequence. We demonstrate the effectiveness of LSTA on
egocentric activity recognition with an end-to-end trainable two-stream
architecture, achieving state-of-the-art performance on four standard
benchmarks.
Comment: Accepted to CVPR 2019.
DeepDynamicHand: A Deep Neural Architecture for Labeling Hand Manipulation Strategies in Video Sources Exploiting Temporal Information
Humans are capable of complex manipulation interactions with the environment, relying on the intrinsic adaptability and compliance of their hands. Recently, soft robotic manipulation has attempted to reproduce such extraordinary behavior through the design of deformable yet robust end-effectors. To this end, the investigation of human behavior has become crucial to correctly inform the technological development of robotic hands that can successfully exploit environmental constraints as humans actually do. Among the different tools robotics can leverage to achieve this objective, deep learning has emerged as a promising approach for studying and then implementing neuro-scientific observations on the artificial side. However, current approaches tend to neglect the dynamic nature of hand pose recognition problems, limiting the effectiveness of these techniques in identifying sequences of manipulation primitives underpinning action generation, e.g., during purposeful interaction with the environment. In this work, we propose a vision-based supervised Hand Pose Recognition method which, for the first time, takes temporal information into account to identify meaningful sequences of actions in grasping and manipulation tasks. More specifically, we apply Deep Neural Networks to automatically learn features from hand posture images consisting of frames extracted from videos of grasping and manipulation tasks with objects and external environmental constraints. For training purposes, videos are divided into intervals, each associated with a specific action by a human supervisor. The proposed algorithm combines a Convolutional Neural Network to detect the hand within each video frame and a Recurrent Neural Network to predict the hand action in the current frame while taking into account the history of actions performed in the previous frames. Experimental validation has been performed on two datasets of dynamic hand-centric strategies, in which subjects regularly interact with objects and the environment. The proposed architecture achieved very good classification accuracy on both datasets, reaching performance of up to 94% and outperforming state-of-the-art techniques. The outcomes of this study can be successfully applied to robotics, e.g., for the planning and control of soft anthropomorphic manipulators.
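The CNN + RNN pipeline outlined above can be sketched as follows, assuming PyTorch and torchvision; the hand-detection stage is abstracted away (the sketch assumes hand crops are already available), and the resnet18 backbone, hidden size, and action count are illustrative stand-ins rather than the configuration used in the paper.

```python
# Sketch: per-frame CNN features fed to an LSTM that labels the action at each frame.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class HandActionLabeler(nn.Module):
    def __init__(self, num_actions=9, hidden=256):
        super().__init__()
        backbone = resnet18(weights=None)                        # illustrative frame encoder
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the FC head
        self.rnn = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, hand_crops):                               # (B, T, 3, 224, 224) hand regions
        B, T = hand_crops.shape[:2]
        feats = self.cnn(hand_crops.flatten(0, 1)).flatten(1)    # (B*T, 512) per-frame features
        feats = feats.view(B, T, -1)
        out, _ = self.rnn(feats)                                 # hidden state carries action history
        return self.head(out)                                    # (B, T, num_actions) per-frame logits

clip = torch.randn(2, 16, 3, 224, 224)
per_frame_logits = HandActionLabeler()(clip)                     # label each frame given its past
```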