Boosted Multiple Kernel Learning for First-Person Activity Recognition
Activity recognition from first-person (egocentric) videos has recently
gained attention due to the increasing ubiquity of wearable cameras. There
has been a surge of efforts adapting existing feature descriptors and designing
new descriptors for first-person videos. An effective activity recognition
system requires selection and use of complementary features and appropriate
kernels for each feature. In this study, we propose a data-driven framework for
first-person activity recognition which effectively selects and combines
features and their respective kernels during training. Our experimental
results show that the use of Multiple Kernel Learning (MKL) and Boosted MKL
for first-person activity recognition improves results over the
state of the art. In addition, these techniques enable the expansion of the
framework with new features in an efficient and convenient way.

Comment: First published in the Proceedings of the 25th European Signal
Processing Conference (EUSIPCO 2017), published by EURASIP.
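
The core ingredient of MKL-style approaches like this one is a weighted
combination of per-feature kernels feeding a kernel classifier. Below is a
minimal Python sketch of that idea using scikit-learn's precomputed-kernel
SVM; the feature views, kernel weights, and gamma values are illustrative
stand-ins, not the quantities the paper learns during boosted training.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def combined_kernel(views_a, views_b, weights, gammas):
    # Weighted sum of one RBF kernel per feature view.
    K = np.zeros((views_a[0].shape[0], views_b[0].shape[0]))
    for Xa, Xb, w, g in zip(views_a, views_b, weights, gammas):
        K += w * rbf_kernel(Xa, Xb, gamma=g)
    return K

rng = np.random.default_rng(0)
# Two synthetic "feature views" standing in for video descriptors
# (e.g., a motion descriptor and an appearance descriptor).
train_views = [rng.normal(size=(60, 16)), rng.normal(size=(60, 32))]
test_views = [rng.normal(size=(20, 16)), rng.normal(size=(20, 32))]
y_train = rng.integers(0, 2, size=60)

weights, gammas = [0.6, 0.4], [0.1, 0.05]   # illustrative, not learned
K_train = combined_kernel(train_views, train_views, weights, gammas)
K_test = combined_kernel(test_views, train_views, weights, gammas)

clf = SVC(kernel="precomputed").fit(K_train, y_train)
print(clf.predict(K_test).shape)            # (20,)

In the boosted variant, the per-kernel weights would be chosen round by
round according to classification error rather than fixed up front.
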
Semi-Supervised First-Person Activity Recognition in Body-Worn Video
Body-worn cameras are now commonly used for logging daily life, sports, and
law enforcement activities, creating a large volume of archived footage. This
paper studies the problem of classifying frames of footage according to the
activity of the camera-wearer with an emphasis on application to real-world
police body-worn video. Real-world datasets pose a different set of challenges
from existing egocentric vision datasets: the amount of footage of different
activities is unbalanced, the data contains personally identifiable
information, and in practice it is difficult to provide substantial training
footage for a supervised approach. We address these challenges by extracting
features based exclusively on motion information and then segmenting the video
footage with a semi-supervised classification algorithm. On publicly available
datasets, our method achieves results comparable to, if not better than,
supervised and/or deep learning methods using a fraction of the training data.
It also shows promising results on real-world police body-worn video.
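
The semi-supervised setting described here labels every frame from only a
handful of annotated ones. A minimal sketch of that workflow in Python,
using scikit-learn's LabelSpreading as a stand-in for the paper's
classifier and synthetic per-frame motion descriptors in place of real
optical-flow features:

import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(1)
# Synthetic per-frame motion descriptors for two activities.
X = np.vstack([rng.normal(0, 1, (100, 12)), rng.normal(3, 1, (100, 12))])
y = np.array([0] * 100 + [1] * 100)

# Keep only a handful of labels; -1 marks unlabelled frames.
y_partial = np.full_like(y, -1)
labelled = rng.choice(200, size=10, replace=False)
y_partial[labelled] = y[labelled]

model = LabelSpreading(kernel="rbf", gamma=0.5).fit(X, y_partial)
print((model.transduction_ == y).mean())  # fraction of frames recovered

With well-separated motion features, even ten labels are enough for the
label information to propagate across the similarity graph to all frames.
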
Multi-modal Egocentric Activity Recognition using Audio-Visual Features
Egocentric activity recognition in first-person videos is of increasing
importance in a variety of applications such as lifelogging, summarization,
assisted living, and activity tracking. Existing methods for this task
interpret various sensor signals using pre-determined weights for each
feature. In this work, we propose a new framework for egocentric activity
recognition based on combining audio-visual features with multi-kernel
learning (MKL) and multi-kernel boosting (MKBoost). For that purpose, grid
optical-flow, virtual-inertia, log-covariance, and cuboid features are first
extracted from the video. The audio signal is characterized using a
"supervector", obtained based on Gaussian mixture modelling of frame-level
features, followed by maximum a posteriori (MAP) adaptation. The extracted
multi-modal features are then adaptively fused by MKL classifiers, in which
both the feature and kernel selection/weighting and the recognition tasks are
performed jointly. The proposed framework was evaluated on a number of
egocentric datasets. The results show that using multi-modal features with
MKL outperforms the existing methods.
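
The GMM supervector used for the audio channel can be sketched compactly:
fit a background GMM on pooled frame-level features, MAP-adapt its means to
one clip via relevance MAP, and stack the adapted means. The sketch below
uses synthetic features in place of real frame-level audio descriptors, and
the relevance factor of 16 is a conventional illustrative choice, not a
value taken from the paper.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
background = rng.normal(size=(2000, 13))     # e.g., pooled MFCC frames
clip = rng.normal(0.5, 1.0, size=(300, 13))  # frames of one clip

# Universal background model (UBM) fit on pooled frames.
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(background)

def map_adapted_supervector(frames, ubm, relevance=16.0):
    # Soft counts and first-order statistics under the UBM.
    post = ubm.predict_proba(frames)         # (n_frames, n_components)
    n_k = post.sum(axis=0)                   # soft count per component
    f_k = post.T @ frames                    # (n_components, dim)
    # Relevance MAP: interpolate clip statistics with UBM means.
    alpha = (n_k / (n_k + relevance))[:, None]
    clip_means = f_k / np.maximum(n_k, 1e-8)[:, None]
    means = alpha * clip_means + (1 - alpha) * ubm.means_
    return means.ravel()                     # stacked adapted means

print(map_adapted_supervector(clip, ubm).shape)  # (104,) = 8 * 13

The resulting fixed-length vector summarizes a variable-length clip and can
be fed to the kernel classifiers alongside the visual features.
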
Convolutional Long Short-Term Memory Networks for Recognizing First Person Interactions
We present a novel deep learning approach to interaction recognition from a first-person perspective. The approach uses a pair of convolutional neural networks, whose parameters are shared, to extract frame-level features from successive frames of the video. The frame-level features are then aggregated using a convolutional long short-term memory (ConvLSTM). The final hidden state of the ConvLSTM is used for classification into the respective categories. In our network, the spatio-temporal structure of the input is preserved until the very final processing stage. Experimental results show that our method outperforms the state of the art on the most recent first-person interaction datasets involving complex ego-motion. On UTKinect, it competes with methods that use depth images and skeletal-joint information along with RGB images, and it surpasses previous methods that use only RGB images by more than 20% in recognition accuracy.
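
A minimal PyTorch sketch of the described architecture: one weight-shared
CNN encodes each frame, a hand-rolled ConvLSTM cell aggregates the feature
maps over time, and the final hidden state is pooled and classified. Layer
sizes, class count, and the cell itself are illustrative assumptions, not
the paper's exact configuration.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution produces all four gates from [input, hidden].
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)
        self.hid_ch = hid_ch

    def forward(self, x, h, c):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        c = f * c + i * g
        h = o * c.tanh()
        return h, c

class InteractionNet(nn.Module):
    def __init__(self, n_classes=6, hid_ch=32):
        super().__init__()
        # Shared (weight-tied) frame encoder applied to every frame.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.cell = ConvLSTMCell(32, hid_ch)
        self.head = nn.Linear(hid_ch, n_classes)

    def forward(self, clip):                 # (batch, time, 3, H, W)
        b, t = clip.shape[:2]
        h = c = None
        for step in range(t):
            x = self.encoder(clip[:, step])  # same weights for each frame
            if h is None:
                h = x.new_zeros(b, self.cell.hid_ch, *x.shape[2:])
                c = torch.zeros_like(h)
            h, c = self.cell(x, h, c)
        # Spatial structure survives until this final pooled classification.
        return self.head(h.mean(dim=(2, 3)))

clip = torch.randn(2, 8, 3, 64, 64)          # two clips of 8 frames
print(InteractionNet()(clip).shape)          # torch.Size([2, 6])

Because the recurrent state is itself a feature map, spatial layout is only
collapsed at the final pooling step, mirroring the abstract's claim that the
spatio-temporal structure is preserved until the last stage.
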