4,733 research outputs found
DAP3D-Net: Where, What and How Actions Occur in Videos?
Action parsing in videos with complex scenes is an interesting but
challenging task in computer vision. In this paper, we propose a generic 3D
convolutional neural network in a multi-task learning manner for effective Deep
Action Parsing (DAP3D-Net) in videos. Particularly, in the training phase,
action localization, classification and attributes learning can be jointly
optimized on our appearancemotion data via DAP3D-Net. For an upcoming test
video, we can describe each individual action in the video simultaneously as:
Where the action occurs, What the action is and How the action is performed. To
well demonstrate the effectiveness of the proposed DAP3D-Net, we also
contribute a new Numerous-category Aligned Synthetic Action dataset, i.e.,
NASA, which consists of 200; 000 action clips of more than 300 categories and
with 33 pre-defined action attributes in two hierarchical levels (i.e.,
low-level attributes of basic body part movements and high-level attributes
related to action motion). We learn DAP3D-Net using the NASA dataset and then
evaluate it on our collected Human Action Understanding (HAU) dataset.
Experimental results show that our approach can accurately localize, categorize
and describe multiple actions in realistic videos
Social Scene Understanding: End-to-End Multi-Person Action Localization and Collective Activity Recognition
We present a unified framework for understanding human social behaviors in
raw image sequences. Our model jointly detects multiple individuals, infers
their social actions, and estimates the collective actions with a single
feed-forward pass through a neural network. We propose a single architecture
that does not rely on external detection algorithms but rather is trained
end-to-end to generate dense proposal maps that are refined via a novel
inference scheme. The temporal consistency is handled via a person-level
matching Recurrent Neural Network. The complete model takes as input a sequence
of frames and outputs detections along with the estimates of individual actions
and collective activities. We demonstrate state-of-the-art performance of our
algorithm on multiple publicly available benchmarks
- …