1,623 research outputs found
Latent Embeddings for Collective Activity Recognition
Rather than simply recognizing the action of a person individually,
collective activity recognition aims to find out what a group of people is
acting in a collective scene. Previ- ous state-of-the-art methods using
hand-crafted potentials in conventional graphical model which can only define a
limited range of relations. Thus, the complex structural de- pendencies among
individuals involved in a collective sce- nario cannot be fully modeled. In
this paper, we overcome these limitations by embedding latent variables into
feature space and learning the feature mapping functions in a deep learning
framework. The embeddings of latent variables build a global relation
containing person-group interac- tions and richer contextual information by
jointly modeling broader range of individuals. Besides, we assemble atten- tion
mechanism during embedding for achieving more com- pact representations. We
evaluate our method on three col- lective activity datasets, where we
contribute a much larger dataset in this work. The proposed model has achieved
clearly better performance as compared to the state-of-the- art methods in our
experiments.Comment: 6pages, accepted by IEEE-AVSS201
Multi-Label Zero-Shot Human Action Recognition via Joint Latent Ranking Embedding
Human action recognition refers to automatic recognizing human actions from a
video clip. In reality, there often exist multiple human actions in a video
stream. Such a video stream is often weakly-annotated with a set of relevant
human action labels at a global level rather than assigning each label to a
specific video episode corresponding to a single action, which leads to a
multi-label learning problem. Furthermore, there are many meaningful human
actions in reality but it would be extremely difficult to collect/annotate
video clips regarding all of various human actions, which leads to a zero-shot
learning scenario. To the best of our knowledge, there is no work that has
addressed all the above issues together in human action recognition. In this
paper, we formulate a real-world human action recognition task as a multi-label
zero-shot learning problem and propose a framework to tackle this problem in a
holistic way. Our framework holistically tackles the issue of unknown temporal
boundaries between different actions for multi-label learning and exploits the
side information regarding the semantic relationship between different human
actions for knowledge transfer. Consequently, our framework leads to a joint
latent ranking embedding for multi-label zero-shot human action recognition. A
novel neural architecture of two component models and an alternate learning
algorithm are proposed to carry out the joint latent ranking embedding
learning. Thus, multi-label zero-shot recognition is done by measuring
relatedness scores of action labels to a test video clip in the joint latent
visual and semantic embedding spaces. We evaluate our framework with different
settings, including a novel data split scheme designed especially for
evaluating multi-label zero-shot learning, on two datasets: Breakfast and
Charades. The experimental results demonstrate the effectiveness of our
framework.Comment: 27 pages, 10 figures and 7 tables. Technical report submitted to a
journal. More experimental results/references were added and typos were
correcte
Multimodal Human Group Behavior Analysis
Human behaviors in a group setting involve a complex mixture of multiple modalities: audio, visual, linguistic, and human interactions. With the rapid progress of AI, automatic prediction and understanding of these behaviors is no longer a dream. In a negotiation, discovering human relationships and identifying the dominant person can be useful for decision making. In security settings, detecting nervous behaviors can help law enforcement agents spot suspicious people. In adversarial settings such as national elections and court defense, identifying persuasive speakers is a critical task. It is beneficial to build accurate machine learning (ML) models to predict such human group behaviors. There are two elements for successful prediction of group behaviors. The first is to design domain-specific features for each modality. Social and Psychological studies have uncovered various factors including both individual cues and group interactions, which inspire us to extract relevant features computationally. In particular, the group interaction modality plays an important role, since human behaviors influence each other through interactions in a group. Second, effective multimodal ML models are needed to align and integrate the different modalities for accurate predictions. However, most previous work ignored the group interaction modality. Moreover, they only adopt early fusion or late fusion to combine different modalities, which is not optimal. This thesis presents methods to train models taking multimodal inputs in group interaction videos, and to predict human group behaviors. First, we develop an ML algorithm to automatically predict human interactions from videos, which is the basis to extract interaction features and model group behaviors. Second, we propose a multimodal method to identify dominant people in videos from multiple modalities. Third, we study the nervousness in human behavior by a developing hybrid method: group interaction feature engineering combined with individual facial embedding learning. Last, we introduce a multimodal fusion framework that enables us to predict how persuasive speakers are.
Overall, we develop one algorithm to extract group interactions and build three multimodal models to identify three kinds of human behavior in videos: dominance, nervousness and persuasion. The experiments demonstrate the efficacy of the methods and analyze the modality-wise contributions
- …