Learning human activities and poses with interconnected data sources
Understanding human actions and poses in images or videos is a challenging problem in computer vision. It spans several related topics, including action recognition, pose estimation, human-object interaction, and activity detection, and knowledge of actions and poses could benefit many applications such as video search, surveillance, auto-tagging, event detection, and human-computer interfaces. Understanding humans' actions and poses requires addressing several challenges. First, humans can perform an enormous number of poses: simply to move forward, we can crawl, walk, run, or sprint. These poses all look different, and we need examples to cover the variations. Second, the appearance of a person's pose changes with the viewing angle, so the learned action model must cover variations across views. Third, many actions involve interactions between people and other objects, so we must also account for the appearance changes of those objects. Fourth, collecting such data for learning is difficult and expensive. Last, even with a good model for an action, localizing when and where the action happens in a long video remains difficult due to the large search space.

My key idea for alleviating these obstacles is to discover the underlying patterns that connect information from different data sources. Why should such patterns exist? The intuition is that all people share the same articulated physical structure. Although we can change our pose, common constraints limit what our pose can be and how it can change over time. All types of human data follow these rules, which can therefore serve as prior knowledge or regularization in a learning framework. If we exploit these tendencies, we can extract additional information from data and use it to improve the learning of humans' actions and poses. In particular, we can find patterns for how our pose varies over time, how our appearance looks from a specific view, how our pose is configured when we interact with objects with certain properties, and how parts of our body configuration are shared across different poses. Once learned, these patterns can interconnect and extrapolate knowledge between different data sources.

To this end, I propose several new ways to connect human activity data. First, I show how to connect snapshot images and videos by exploring the patterns of how our pose changes over time. Building on this idea, I explore how to connect humans' poses across multiple views by discovering the correlations between different poses and the latent factors behind viewpoint variation. In addition, I consider whether there are also patterns connecting our poses and nearby objects while we interact with them. Furthermore, I explore how the predicted interaction can serve as a cue for existing recognition problems, including image re-targeting and image description generation. Finally, after learning models that effectively incorporate these patterns, I propose a robust approach to efficiently localize when and where a complex action happens in a video sequence; variants of the approach offer a good trade-off between computational cost and detection accuracy. My thesis exploits various types of underlying patterns in human data.
The discovered structure is used to enhance the understanding of humans' actions and poses. With the proposed methods, we are able to 1) learn an action from very few snapshots by connecting them to a pool of label-free videos, 2) infer the pose for some views without any examples by connecting the latent factors between different views, 3) predict the location of an object a person is interacting with, independent of the object's type and appearance, and then use the inferred interaction as a cue to improve recognition, and 4) localize an action in a complex, long video. These approaches improve existing frameworks for understanding humans' actions and poses without extra data collection cost and broaden the problems that we can tackle.
Turbo Learning Framework for Human-Object Interactions Recognition and Human Pose Estimation
Human-object interaction (HOI) recognition and pose estimation are two closely related tasks. Human pose is an essential cue for recognizing actions and localizing interacted objects; in turn, the action and the locations of the interacted objects provide guidance for pose estimation. In this paper, we propose a turbo learning framework that performs HOI recognition and pose estimation simultaneously. First, two modules are designed to enforce message passing between the tasks: a pose-aware HOI recognition module and an HOI-guided pose estimation module. These two modules then form a closed loop that exploits the complementary information iteratively and can be trained in an end-to-end manner. The proposed method achieves state-of-the-art performance on two public benchmarks, the Verbs in COCO (V-COCO) and HICO-DET datasets.

Comment: AAAI201
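The closed-loop design can be sketched compactly. Below is a minimal, hypothetical PyTorch illustration of the turbo loop; the module internals, feature dimensions, and joint counts are illustrative placeholders, not the paper's actual CNN-based architecture:

import torch
import torch.nn as nn

class PoseAwareHOI(nn.Module):
    """Refines HOI scores using the current pose estimate."""
    def __init__(self, feat_dim, num_hoi, num_joints):
        super().__init__()
        self.fc = nn.Linear(feat_dim + num_joints * 2, num_hoi)
    def forward(self, feat, pose):                   # pose: (B, J, 2) keypoints
        return self.fc(torch.cat([feat, pose.flatten(1)], dim=1))

class HOIGuidedPose(nn.Module):
    """Refines the pose estimate using the current HOI scores."""
    def __init__(self, feat_dim, num_hoi, num_joints):
        super().__init__()
        self.num_joints = num_joints
        self.fc = nn.Linear(feat_dim + num_hoi, num_joints * 2)
    def forward(self, feat, hoi):
        return self.fc(torch.cat([feat, hoi], dim=1)).view(-1, self.num_joints, 2)

class TurboLoop(nn.Module):
    def __init__(self, feat_dim=256, num_hoi=26, num_joints=17, steps=2):
        super().__init__()
        self.hoi_module = PoseAwareHOI(feat_dim, num_hoi, num_joints)
        self.pose_module = HOIGuidedPose(feat_dim, num_hoi, num_joints)
        self.steps = steps
    def forward(self, feat, pose):
        outputs = []
        for _ in range(self.steps):                  # iterative message passing
            hoi = self.hoi_module(feat, pose)        # pose -> HOI
            pose = self.pose_module(feat, hoi)       # HOI -> pose
            outputs.append((hoi, pose))
        return outputs                               # supervise every iteration

Because every step is differentiable, attaching losses to each iteration's outputs trains the whole loop end-to-end.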
Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video
We address the challenging task of anticipating human-object interaction in
first person videos. Most existing methods ignore how the camera wearer
interacts with the objects, or simply consider body motion as a separate
modality. In contrast, we observe that intentional hand movement reveals
critical information about the future activity. Motivated by this, we adopt
intentional hand movement as a representation of the future and propose a novel deep
network that jointly models and predicts the egocentric hand motion,
interaction hotspots and future action. Specifically, we consider the future
hand motion as the motor attention, and model this attention using latent
variables in our deep model. The predicted motor attention is further used to
characterise the discriminative spatial-temporal visual features for predicting
actions and interaction hotspots. We present extensive experiments
demonstrating the benefit of the proposed joint model. Importantly, our model
produces new state-of-the-art results for action anticipation on both EGTEA
Gaze+ and the EPIC-Kitchens datasets. Our project page is available at
https://aptx4869lm.github.io/ForecastingHOI
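The paper models motor attention with latent variables in a probabilistic deep model; the deterministic sketch below shows only the core gating idea, with illustrative names, channel sizes, and class counts:

import torch
import torch.nn as nn

class MotorAttentionNet(nn.Module):
    def __init__(self, in_ch=512, num_actions=106):
        super().__init__()
        self.attn_head = nn.Conv2d(in_ch, 1, 1)      # future hand motion ("motor attention")
        self.hotspot_head = nn.Conv2d(in_ch, 1, 1)   # interaction hotspots
        self.action_head = nn.Linear(in_ch, num_actions)
    def forward(self, feat):                         # feat: (B, C, H, W) clip features
        attn = torch.sigmoid(self.attn_head(feat))   # predict where the hands will move
        gated = feat * attn                          # attention-weighted features
        action = self.action_head(gated.mean(dim=(2, 3)))
        hotspot = torch.sigmoid(self.hotspot_head(gated))
        return attn, hotspot, action

The key design choice the abstract describes is that the predicted attention map is reused to weight the visual features before the action and hotspot heads, so all three predictions are made jointly.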
Crowdsourcing in Computer Vision
Computer vision systems require large amounts of manually annotated data to
properly learn challenging visual concepts. Crowdsourcing platforms offer an
inexpensive method to capture human knowledge and understanding, for a vast
number of visual perception tasks. In this survey, we describe the types of
annotations computer vision researchers have collected using crowdsourcing, and
how they have ensured that this data is of high quality while annotation effort
is minimized. We begin by discussing data collection on both classic (e.g.,
object recognition) and recent (e.g., visual story-telling) vision tasks. We
then summarize key design decisions for creating effective data collection
interfaces and workflows, and present strategies for intelligently selecting
the most important data instances to annotate. Finally, we conclude with some
thoughts on the future of crowdsourcing in computer vision.

Comment: A 69-page meta review of the field, Foundations and Trends in Computer Graphics and Vision, 201
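One family of strategies for intelligently selecting the most important instances to annotate is active learning. The sketch below shows a common variant, uncertainty sampling (annotate the examples the current model is least confident about); the data and function name are made up for illustration:

import numpy as np

def select_for_annotation(probs: np.ndarray, budget: int) -> np.ndarray:
    """probs: (N, K) predicted class probabilities for N unlabeled items.
    Returns the indices of the `budget` most uncertain items (highest entropy)."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(-entropy)[:budget]

probs = np.array([[0.90, 0.05, 0.05],   # confident -> skip
                  [0.34, 0.33, 0.33],   # near-uniform -> annotate
                  [0.50, 0.50, 0.00],   # two-way tie -> annotate
                  [0.98, 0.01, 0.01]])  # confident -> skip
print(select_for_annotation(probs, budget=2))   # -> [1 2]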
4D Human Body Capture from Egocentric Video via 3D Scene Grounding
We introduce a novel task of reconstructing a time series of second-person 3D
human body meshes from monocular egocentric videos. The unique viewpoint and
rapid embodied camera motion of egocentric videos raise additional technical
barriers for human body capture. To address those challenges, we propose a
simple yet effective optimization-based approach that leverages 2D observations
of the entire video sequence and human-scene interaction constraints to estimate
second-person human poses, shapes, and global motion grounded in the
3D environment captured from the egocentric view. We conduct detailed ablation
studies to validate our design choices. Moreover, we compare our method with the
previous state-of-the-art method on human motion capture from monocular video,
and show that our method estimates more accurate human-body poses and shapes
under the challenging egocentric setting. In addition, we demonstrate that our
approach produces more realistic human-scene interactions.
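At a high level, such an optimization-based approach minimizes an energy over the whole sequence that balances 2D evidence, scene contact, and temporal coherence. A hedged sketch with illustrative terms and weights (the paper's actual energy, body model, and weighting differ):

import torch

def project(points, K):
    """Pinhole projection of (T, J, 3) joints with camera intrinsics K (3, 3)."""
    p = points @ K.T
    return p[..., :2] / p[..., 2:3].clamp(min=1e-6)

def capture_energy(joints3d, keypoints2d, K, scene_sdf,
                   w_reproj=1.0, w_scene=0.1, w_smooth=0.05):
    # 2D reprojection: projected 3D joints should match detected 2D keypoints.
    e_reproj = ((project(joints3d, K) - keypoints2d) ** 2).sum()
    # Human-scene interaction: penalize body points penetrating the scene
    # (negative signed distance), grounding the body in the 3D environment.
    e_scene = torch.relu(-scene_sdf(joints3d.reshape(-1, 3))).sum()
    # Temporal smoothness of the global motion across frames.
    e_smooth = ((joints3d[1:] - joints3d[:-1]) ** 2).sum()
    return w_reproj * e_reproj + w_scene * e_scene + w_smooth * e_smooth

# Toy usage: optimize 10 frames of 17 joints against a flat ground plane.
joints = torch.randn(10, 17, 3, requires_grad=True)
ground_sdf = lambda p: p[:, 2]                 # signed distance to the z=0 plane
loss = capture_energy(joints, torch.randn(10, 17, 2), torch.eye(3), ground_sdf)
loss.backward()                                # step with e.g. torch.optim.Adam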
Role of opinion sharing on the emergency evacuation dynamics
Emergency evacuation is a critical research topic, and any improvement to existing evacuation models helps improve the safety of evacuees. Current evacuation models tend to have either an accurate movement model or a sophisticated decision model. Individuals in a crowd tend to share and propagate their opinions, yet this opinion sharing is either implicitly modeled or entirely overlooked in most existing models. Thus, one of the overarching goals of this research is to study the effect of opinion evolution in an evacuating crowd. First, the opinion evolution in a crowd was modeled mathematically. Next, the results from the analytical model were validated with a simulation model having a simple motion model. To improve the fidelity of the evacuation model, more realistic movement and decision models were incorporated, and the effect of opinion sharing on the evacuation dynamics was studied extensively. Further, individuals with a strong inclination toward a particular route were introduced, and their effect on overall efficiency was studied. Current evacuation guidance algorithms focus on efficient crowd evacuation, while the method of guidance delivery is generally overlooked; this important gap in guidance delivery is addressed next. Additionally, a virtual-reality-based immersive experiment is designed to study factors affecting individuals' decision making during emergency evacuation.
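The abstract does not name the mathematical model, but a classic way to model opinion evolution in a crowd is DeGroot-style weighted averaging over an influence network. A minimal sketch with made-up weights:

import numpy as np

def degroot_step(opinions, W):
    """One opinion-sharing round: each agent adopts the weighted average of
    its neighbors' opinions. W is a row-stochastic (N, N) influence matrix."""
    return W @ opinions

# Three evacuees sharing their preference for exit A (1.0 = strongly prefer).
W = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
x = np.array([1.0, 0.0, 0.5])
for _ in range(20):
    x = degroot_step(x, W)
print(x)   # opinions converge toward a shared consensus value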
Egocentric Action Understanding by Learning Embodied Attention
Videos captured from wearable cameras, known as egocentric videos, create a continuous record of human daily visual experience, and thereby offer a new perspective for human activity understanding. Importantly, egocentric video aligns gaze, embodied movement, and action in the same "first-person" coordinate system. The rich egocentric cues reflect the attended scene context of an action, and thereby provide novel means for reasoning about human daily routines.
In my thesis work, I describe my efforts in developing novel computational models that learn the embodied egocentric attention for the automatic analysis of egocentric actions. First, I introduce a probabilistic model for learning gaze and actions in egocentric video and further demonstrate that attention can serve as a robust tool for learning motion-aware video representation. Second, I develop a novel deep model to address the challenging problem of jointly recognizing and localizing actions of a mobile user on a known 3D map from egocentric videos. Third, I present a novel deep latent variable model that makes use of human intentional body movement (motor attention) as a key representation for forecasting human-object interaction in egocentric video. Finally, I propose a novel task of future hand segmentation from egocentric videos, and show how explicitly modeling the future head motion can facilitate future hand movement forecasting.