6,738 research outputs found
Learning Latent Super-Events to Detect Multiple Activities in Videos
In this paper, we introduce the concept of learning latent super-events from
activity videos, and present how it benefits activity detection in continuous
videos. We define a super-event as a set of multiple events occurring together
in videos with a particular temporal organization; it is the opposite concept
of sub-events. Real-world videos contain multiple activities and are rarely
segmented (e.g., surveillance videos), and learning latent super-events allows
the model to capture how the events are temporally related in videos. We design
temporal structure filters that enable the model to focus on particular
sub-intervals of the videos, and use them together with a soft attention
mechanism to learn representations of latent super-events. Super-event
representations are combined with per-frame or per-segment CNNs to provide
frame-level annotations. Our approach is designed to be fully differentiable,
enabling end-to-end learning of latent super-event representations jointly with
the activity detector using them. Our experiments with multiple public video
datasets confirm that the proposed concept of latent super-event learning
significantly benefits activity detection, advancing the state-of-the-arts.Comment: CVPR 201
Learning Multimodal Latent Attributes
Abstract—The rapid development of social media sharing has created a huge demand for automatic media classification and annotation techniques. Attribute learning has emerged as a promising paradigm for bridging the semantic gap and addressing data sparsity via transferring attribute knowledge in object recognition and relatively simple action classification. In this paper, we address the task of attribute learning for understanding multimedia data with sparse and incomplete labels. In particular we focus on videos of social group activities, which are particularly challenging and topical examples of this task because of their multi-modal content and complex and unstructured nature relative to the density of annotations. To solve this problem, we (1) introduce a concept of semi-latent attribute space, expressing user-defined and latent attributes in a unified framework, and (2) propose a novel scalable probabilistic topic model for learning multi-modal semi-latent attributes, which dramatically reduces requirements for an exhaustive accurate attribute ontology and expensive annotation effort. We show that our framework is able to exploit latent attributes to outperform contemporary approaches for addressing a variety of realistic multimedia sparse data learning tasks including: multi-task learning, learning with label noise, N-shot transfer learning and importantly zero-shot learning
Learning Social Affordance Grammar from Videos: Transferring Human Interactions to Human-Robot Interactions
In this paper, we present a general framework for learning social affordance
grammar as a spatiotemporal AND-OR graph (ST-AOG) from RGB-D videos of human
interactions, and transfer the grammar to humanoids to enable a real-time
motion inference for human-robot interaction (HRI). Based on Gibbs sampling,
our weakly supervised grammar learning can automatically construct a
hierarchical representation of an interaction with long-term joint sub-tasks of
both agents and short term atomic actions of individual agents. Based on a new
RGB-D video dataset with rich instances of human interactions, our experiments
of Baxter simulation, human evaluation, and real Baxter test demonstrate that
the model learned from limited training data successfully generates human-like
behaviors in unseen scenarios and outperforms both baselines.Comment: The 2017 IEEE International Conference on Robotics and Automation
(ICRA
- …