
    DAP3D-Net: Where, What and How Actions Occur in Videos?

    Action parsing in videos with complex scenes is an interesting but challenging task in computer vision. In this paper, we propose a generic 3D convolutional neural network trained in a multi-task learning manner for effective Deep Action Parsing (DAP3D-Net) in videos. In particular, during the training phase, action localization, classification, and attribute learning are jointly optimized on our appearance-motion data via DAP3D-Net. For an unseen test video, we can simultaneously describe each individual action in the video as: Where the action occurs, What the action is, and How the action is performed. To demonstrate the effectiveness of the proposed DAP3D-Net, we also contribute a new Numerous-category Aligned Synthetic Action (NASA) dataset, which consists of 200,000 action clips spanning more than 300 categories, with 33 pre-defined action attributes at two hierarchical levels (low-level attributes of basic body-part movements and high-level attributes related to action motion). We train DAP3D-Net on the NASA dataset and then evaluate it on our collected Human Action Understanding (HAU) dataset. Experimental results show that our approach can accurately localize, categorize, and describe multiple actions in realistic videos.
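Although the abstract does not give the paper's losses, a joint "Where / What / How" objective of this kind is typically a weighted sum of a localization term, a classification term, and an attribute term. The NumPy sketch below is a minimal illustration under that assumption; the loss forms, weights, and parameter names are illustrative choices, not DAP3D-Net's actual formulation.

```python
import numpy as np

def multitask_loss(pred_box, true_box, cls_logits, cls_label,
                   attr_logits, attr_labels, w_cls=1.0, w_attr=0.5):
    """Joint objective in the spirit of DAP3D-Net: localization (where),
    classification (what) and attribute prediction (how) optimized
    together. Loss forms and weights are illustrative assumptions."""
    # Where: squared-error regression on the action's bounding box.
    loc = np.mean((pred_box - true_box) ** 2)
    # What: softmax cross-entropy over action categories.
    p = np.exp(cls_logits - cls_logits.max())
    p /= p.sum()
    cls = -np.log(p[cls_label] + 1e-12)
    # How: independent sigmoid cross-entropy per binary attribute.
    s = 1.0 / (1.0 + np.exp(-attr_logits))
    attr = -np.mean(attr_labels * np.log(s + 1e-12)
                    + (1 - attr_labels) * np.log(1 - s + 1e-12))
    # Multi-task learning: one scalar objective, so all three heads
    # share gradients through the same 3D CNN backbone.
    return loc + w_cls * cls + w_attr * attr
```

Training all three heads against one scalar lets the shared backbone trade off the tasks, which is the usual motivation for joint optimization of this kind.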

    Unsupervised Video Understanding by Reconciliation of Posture Similarities

    Understanding human activity and being able to explain it in detail surpasses mere action classification by far in both complexity and value. The challenge is thus to describe an activity on the basis of its most fundamental constituents: the individual postures and their distinctive transitions. Supervised learning of such a fine-grained representation based on elementary poses is very tedious and does not scale. We therefore propose a completely unsupervised deep learning procedure based solely on video sequences, which starts from scratch without requiring pre-trained networks, predefined body models, or keypoints. A combinatorial sequence matching algorithm proposes relations between frames from subsets of the training data, while a CNN reconciles the transitivity conflicts of the different subsets to learn a single concerted pose embedding despite changes in appearance across sequences. Without any manual annotation, the model learns a structured representation of postures and their temporal development. The model not only enables retrieval of similar postures but also temporal super-resolution. Additionally, based on a recurrent formulation, next frames can be synthesized. Comment: Accepted by ICCV 2017.
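A common way to turn such proposed frame relations into a single shared embedding is a contrastive objective: similar postures are pulled together, dissimilar ones pushed apart, and because every relation updates the same parameters, conflicting proposals from different subsets are forced to reconcile. The sketch below is a toy stand-in under that assumption, with a plain embedding table in place of the CNN; it is not the paper's training procedure.

```python
import numpy as np

def contrastive_step(emb, pairs, lr=0.1, margin=1.0):
    """One gradient step on a contrastive objective over frame pairs
    proposed by an (assumed) sequence matcher. `emb` is an
    (n_frames, d) embedding table standing in for the CNN; `pairs` is
    a list of (i, j, is_similar) relations. All relations update the
    same embedding, so conflicting proposals from different subsets
    must be reconciled by one set of parameters."""
    g = np.zeros_like(emb)
    for i, j, sim in pairs:
        d = emb[i] - emb[j]
        dist = np.linalg.norm(d) + 1e-12
        if sim:                      # pull similar postures together
            g[i] += d
            g[j] -= d
        elif dist < margin:          # push dissimilar ones apart, up to margin
            g[i] -= (margin - dist) * d / dist
            g[j] += (margin - dist) * d / dist
    return emb - lr * g
```

Iterating such steps over many proposed pairs yields an embedding in which distance reflects posture similarity, which is what enables retrieval of similar postures.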

    Latent Structured Models for Video Understanding

    The proliferation of videos in recent years has spurred a surge of interest in developing efficient techniques for automatic video interpretation. This thesis improves the understanding of videos by building structured models that use latent information to detect and recognize instances of actions or abnormalities in videos. It also proposes efficient algorithms for inference in, and learning of, the proposed latent structured models that are appropriate for learning with weak supervision. An important class of latent variable models is multiple instance learning, where training labels are provided only for bags of instances, not for the instances themselves. Since inference of latent instance labels is performed jointly with training of a classifier on the same data, multiple instance learning is very susceptible to overfitting. To increase the robustness of popular methods for multiple instance learning, the thesis introduces the novel concept of superbags (ensembles of bags of bags), which decouples the classifier-training and latent-label-inference steps. The thesis further proposes a novel latent structured representation to discover instances of action classes in videos and jointly train an action classifier on them. Action class instances typically occupy only a part of the whole video, a part that is not annotated in weakly labeled training videos. Multiple instance learning is therefore used to find these latent action instances in training videos while jointly training the action classifier, and a sequential approach to multiple instance learning is proposed to increase the robustness of training. For the interpretation of crowded scenes, it is important to detect all irregular objects or actions in a video. Abnormality detection, however, is hindered by the fact that the training set does not contain any abnormal samples, so abnormalities must be found in a test video without actually knowing what they are.
To address this problem, the thesis proposes a probabilistic graphical model for video parsing that searches for latent object hypotheses which jointly explain all the foreground pixels while, at the same time, matching the normal training samples well. By inferring all latent normal hypotheses in a video, the model indirectly identifies abnormalities as those hypotheses that are not supported by normal samples but are still needed to explain the foreground. Video parsing is applied sequentially to individual video frames, where hypotheses are jointly inferred by a local search in a graphical model. The thesis then proposes a spatio-temporal extension of video parsing, in which an efficient inference method based on convex optimization finds abnormal and normal spatio-temporal hypotheses in the video.
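The superbag idea described above can be sketched in a few lines: alternate between fitting a scorer on the latent instances selected in the *other* superbags and re-inferring latent instances only inside the current one, so that a bag's labels are never inferred by a model trained on that same bag. The prototype scorer, round-robin split, and norm-based initialization below are simplifying assumptions for illustration, not the thesis's actual classifier.

```python
import numpy as np

def superbag_mil(bags, labels, n_super=2, n_rounds=3):
    """Toy superbag-style multiple instance learning. `bags` is a list
    of (n_instances, d) arrays, `labels` the 0/1 bag labels. Each round
    fits a prototype scorer on instances selected in superbags other
    than g, then re-infers the latent positive instance of each positive
    bag inside superbag g, decoupling inference from training.
    Returns {positive bag index: inferred instance index}."""
    groups = np.arange(len(bags)) % n_super      # round-robin superbags
    # Initialise each positive bag's latent instance to its strongest one.
    sel = {b: int(np.argmax(np.linalg.norm(bags[b], axis=1)))
           for b, y in enumerate(labels) if y == 1}
    for _ in range(n_rounds):
        for g in range(n_super):
            # Fit a prototype difference vector on the *other* superbags...
            pos = [bags[b][sel[b]] for b in sel if groups[b] != g]
            neg = [x for b, y in enumerate(labels)
                   if y == 0 and groups[b] != g for x in bags[b]]
            if not pos or not neg:
                continue
            w = np.mean(pos, axis=0) - np.mean(neg, axis=0)
            # ...and re-infer latent labels only inside superbag g.
            for b in sel:
                if groups[b] == g:
                    sel[b] = int(np.argmax(bags[b] @ w))
    return sel
```

Because the scorer used on superbag g never saw superbag g's latent selections, a wrong early selection cannot reinforce itself on its own bag, which is the overfitting mechanism the superbag construction is meant to break.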