Part Affinity Field based Activity Recognition
This report presents work and results on activity recognition using Part Affinity Fields for real-time surveillance applications. Starting with a short introduction to the motivation, the report gives a detailed overview of the key idea of the pursued approach and explains the basic concepts. In addition, a variety of experiments on various subjects is presented, such as i) the impact of the number of input frames, ii) the impact of different simple dimensionality reduction approaches, and iii) a comparison of how multi-class and binary problem formulations influence performance.
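One experiment above concerns simple dimensionality reduction of the pose features extracted over multiple input frames. As a hedged illustration (the report does not specify its reduction method; the clip shapes, counts, and the use of PCA here are assumptions), per-frame keypoints can be flattened into one feature vector per clip and projected onto a few principal components before classification:

```python
import numpy as np

# Hypothetical setup: 100 clips, each with T frames of K 2D keypoints,
# flattened into one feature vector per clip.
rng = np.random.default_rng(0)
T, K = 8, 18
clips = rng.normal(size=(100, T * K * 2))

def pca_reduce(X, n_components):
    """Project X onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)           # center the features
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T   # keep the leading components

reduced = pca_reduce(clips, 32)
print(reduced.shape)  # (100, 32)
```

The reduced vectors would then feed whatever classifier the report evaluates; the component count (32 here) is a tunable choice, not a value from the report.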
Improved Actor Relation Graph based Group Activity Recognition
Video understanding aims to recognize and classify the different actions or activities appearing in a video. Much previous work, such as video captioning, has shown promising performance in producing general video understanding. However, it remains challenging to generate fine-grained descriptions of human actions and their interactions using state-of-the-art video captioning techniques. Detailed descriptions of human actions and group activities are essential information that can be used in real-time CCTV video surveillance, health care, sports video analysis, etc. This study proposes a video understanding method that focuses mainly on group activity recognition by learning pair-wise actor appearance similarity and actor positions. We propose to use normalized cross-correlation (NCC) and the sum of absolute differences (SAD) to calculate the pair-wise appearance similarity and build the actor relation graph, allowing a graph convolutional network to learn how to classify group activities. We also propose to use MobileNet as the backbone to extract features from each video frame. A visualization model is further introduced that renders each input video frame with predicted bounding boxes on each human object and predicts individual actions and the collective activity.
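The two similarity measures named above, NCC and SAD, are standard patch-comparison functions. A minimal sketch of both, applied to toy actor crops (the patch size and data here are illustrative, not from the paper):

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation: 1.0 for identical patches,
    near 0 for unrelated ones."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def sad(a, b):
    """Sum of absolute differences: lower means more similar."""
    return float(np.abs(a - b).sum())

rng = np.random.default_rng(1)
patch = rng.random((16, 16))                       # toy actor crop
noisy = patch + rng.normal(scale=0.05, size=patch.shape)

print(ncc(patch, patch))  # 1.0
print(sad(patch, patch))  # 0.0
print(ncc(patch, noisy) > ncc(patch, rng.random((16, 16))))
```

In the paper's pipeline, such pair-wise scores between actor feature crops would populate the edges of the actor relation graph consumed by the graph convolutional network.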
Self-supervised Keypoint Correspondences for Multi-Person Pose Estimation and Tracking in Videos
Video annotation is expensive and time-consuming. Consequently, datasets for multi-person pose estimation and tracking are less diverse and have sparser annotations than large-scale image datasets for human pose estimation. This makes it challenging to learn deep-learning-based models for associating keypoints across frames that are robust to nuisance factors such as motion blur and occlusion in the task of multi-person pose tracking. To address this issue, we propose an approach that relies on keypoint correspondences for associating persons in videos. Instead of training the network for estimating keypoint correspondences on video data, it is trained on a large-scale image dataset for human pose estimation using self-supervision. Combined with a top-down framework for human pose estimation, we use keypoint correspondences to (i) recover missed pose detections and (ii) associate pose detections across video frames. Our approach achieves state-of-the-art results for multi-frame pose estimation and multi-person pose tracking on the PoseTrack data sets.
Comment: Submitted to ECCV 202