10 research outputs found
Deep Affordance-grounded Sensorimotor Object Recognition
It is well-established by cognitive neuroscience that human perception of
objects constitutes a complex process, where object appearance information is
combined with evidence about the so-called object "affordances", namely the
types of actions that humans typically perform when interacting with them. This
fact has recently motivated the "sensorimotor" approach to the challenging task
of automatic object recognition, where both information sources are fused to
improve robustness. In this work, the aforementioned paradigm is adopted,
surpassing current limitations of sensorimotor object recognition research.
Specifically, the deep learning paradigm is introduced to the problem for the
first time, developing a number of novel neuro-biologically and
neuro-physiologically inspired architectures that utilize state-of-the-art
neural networks for fusing the available information sources in multiple ways.
The proposed methods are evaluated using a large RGB-D corpus, which is
specifically collected for the task of sensorimotor object recognition and is
made publicly available. Experimental results demonstrate the utility of
affordance information to object recognition, achieving an up to 29% relative
error reduction by its inclusion.
Comment: 9 pages, 7 figures, dataset link included, accepted to CVPR 2017
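To make the fusion idea concrete, the following is a minimal PyTorch-style sketch of late fusion between an appearance stream and an affordance stream. All module names, layer sizes, and the fusion-by-concatenation choice are illustrative assumptions; they do not reproduce the paper's neuro-biologically and neuro-physiologically inspired architectures, which fuse the two sources in multiple ways.

import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Illustrative late fusion of an appearance stream and an affordance stream."""
    def __init__(self, num_classes: int, feat_dim: int = 256):
        super().__init__()
        # Appearance stream: encodes RGB-D object crops (4 input channels).
        self.appearance = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        # Affordance stream: encodes a feature vector summarising the observed
        # hand-object interaction (e.g., pooled motion descriptors; assumed 128-d).
        self.affordance = nn.Sequential(nn.Linear(128, feat_dim), nn.ReLU())
        # Late fusion by concatenation followed by a classifier.
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgbd, afford_feat):
        a = self.appearance(rgbd)
        b = self.affordance(afford_feat)
        return self.classifier(torch.cat([a, b], dim=1))

# Toy usage with placeholder sizes (14 object classes is arbitrary).
logits = TwoStreamFusion(num_classes=14)(torch.randn(2, 4, 64, 64), torch.randn(2, 128))

Other fusion points (early or intermediate) are equally plausible; the abstract notes that the available information sources are fused in multiple ways.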
A Deep Learning Approach to Object Affordance Segmentation
Learning to understand and infer object functionalities is an important step
towards robust visual intelligence. Significant research efforts have recently
focused on segmenting the object parts that enable specific types of
human-object interaction, the so-called "object affordances". However, most
works treat it as a static semantic segmentation problem, focusing solely on
object appearance and relying on strong supervision and object detection. In
this paper, we propose a novel approach that exploits the spatio-temporal
nature of human-object interaction for affordance segmentation. In particular,
we design an autoencoder that is trained using ground-truth labels of only the
last frame of the sequence, and is able to infer pixel-wise affordance labels
in both videos and static images. Our model removes the need for object
labels and bounding boxes by using a soft-attention mechanism that enables
implicit localization of the interaction hotspot. For evaluation purposes, we
introduce the SOR3D-AFF corpus, which consists of human-object interaction
sequences and provides pixel-wise annotations for 9 types of affordances,
covering typical manipulations of tool-like objects. We show that
our model achieves competitive results compared to strongly supervised methods
on SOR3D-AFF, while being able to predict affordances for similar unseen
objects in two affordance image-only datasets.
Comment: 5 pages, 4 figures, ICASSP 2020
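To illustrate the soft-attention mechanism described above, here is a minimal encoder-decoder sketch in which a spatial softmax produces an attention map that softly re-weights features before pixel-wise affordance prediction. The layer sizes and the re-weighting scheme are assumptions for illustration only; they do not reproduce the paper's autoencoder, which is trained on interaction sequences using labels of only the last frame.

import torch
import torch.nn as nn

class SoftAttentionSegmenter(nn.Module):
    """Encoder-decoder with a soft spatial attention map that implicitly
    localises the interaction region before per-pixel affordance prediction."""
    def __init__(self, num_affordances: int = 9):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # 1x1 conv -> per-location score -> softmax over all spatial positions.
        self.att_score = nn.Conv2d(64, 1, 1)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_affordances, 4, stride=2, padding=1),
        )

    def forward(self, x):
        f = self.encoder(x)                               # (B, 64, H/4, W/4)
        b, c, h, w = f.shape
        att = torch.softmax(self.att_score(f).view(b, -1), dim=1).view(b, 1, h, w)
        f = f * (1.0 + att * h * w)                       # softly re-weight features
        return self.decoder(f)                            # per-pixel affordance logits

logits = SoftAttentionSegmenter()(torch.randn(1, 3, 128, 128))  # (1, 9, 128, 128)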
Learning Scene Flow With Skeleton Guidance For 3D Action Recognition
Among the existing modalities for 3D action recognition, 3D flow has been
little examined, although it conveys rich motion cues for human actions.
Presumably, its susceptibility to noise makes it hard to work with,
complicating the learning process within deep models. This work demonstrates
the use of 3D flow sequences in a deep spatiotemporal model and further
proposes an incremental two-level spatial attention mechanism, guided by the
skeleton domain, that emphasizes motion features close to the body joint areas
according to their informativeness. To this end, an extended deep skeleton
model is also introduced to learn the most discriminative action motion
dynamics and thereby estimate an informativeness score for each joint.
Subsequently, a late fusion scheme is adopted between the two models to learn
high-level cross-modal correlations. Experimental results on NTU RGB+D,
currently the largest and most challenging dataset, demonstrate the
effectiveness of the proposed approach, achieving state-of-the-art results.
Comment: 18 pages, 3 figures, 3 tables, conference
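A hedged sketch of the two ingredients named above: a small learned late-fusion layer over the per-class scores of the flow and skeleton models, and a helper that turns per-joint informativeness scores into a spatial attention mask around the joints. Both are illustrative assumptions (names, sigma, and the Gaussian mask are mine) rather than the paper's exact two-level attention formulation.

import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Combines class scores from a flow model and a skeleton model with a
    learned fusion layer, capturing cross-modal correlations at the score level."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.fuse = nn.Linear(2 * num_classes, num_classes)

    def forward(self, flow_logits, skel_logits):
        return self.fuse(torch.cat([flow_logits, skel_logits], dim=1))

def joint_attention_mask(joint_xy, joint_scores, height, width, sigma=8.0):
    """Builds a spatial mask that emphasises regions around informative joints.
    joint_xy: (J, 2) pixel coordinates; joint_scores: (J,) informativeness in [0, 1]."""
    ys = torch.arange(height).view(-1, 1).float()
    xs = torch.arange(width).view(1, -1).float()
    mask = torch.zeros(height, width)
    for (x, y), s in zip(joint_xy, joint_scores):
        mask += s * torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return mask / mask.max().clamp(min=1e-6)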
AffordPose: A Large-scale Dataset of Hand-Object Interactions with Affordance-driven Hand Pose
How humans interact with objects depends on the functional roles of the target
objects, which introduces the problem of affordance-aware hand-object
interaction. It requires a large number of human demonstrations for the
learning and understanding of plausible and appropriate hand-object
interactions. In this work, we present AffordPose, a large-scale dataset of
hand-object interactions with affordance-driven hand pose. We first annotate
the specific part-level affordance labels for each object, e.g., twist, pull,
handle-grasp, etc., instead of general intents such as use or handover, to
indicate the purpose and guide the localization of the hand-object
interactions. The fine-grained hand-object interactions reveal the influence of
hand-centered affordances on the detailed arrangement of the hand poses, yet
also exhibit a certain degree of diversity. We collect a total of 26.7K
hand-object interactions, each including the 3D object shape, the part-level
affordance label, and the manually adjusted hand poses. The comprehensive data
analysis shows the common characteristics and diversity of hand-object
interactions per affordance via parameter statistics and contact
computation. We also conduct experiments on the tasks of hand-object affordance
understanding and affordance-oriented hand-object interaction generation, to
validate the effectiveness of our dataset in learning the fine-grained
hand-object interactions. Project page:
https://github.com/GentlesJan/AffordPose
Comment: Accepted by ICCV 2023
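A possible way to picture a single AffordPose record, based only on the fields listed in the abstract (3D object shape, part-level affordance label, manually adjusted hand pose). The class layout, field names, and array shapes are hypothetical and do not reflect the dataset's actual file format.

from dataclasses import dataclass
import numpy as np

@dataclass
class AffordPoseSample:
    """One hand-object interaction record; fields follow the abstract, names are assumed."""
    object_mesh_path: str        # 3D object shape
    affordance_label: str        # part-level affordance, e.g. "twist", "pull", "handle-grasp"
    hand_pose: np.ndarray        # manually adjusted hand pose parameters

def affordance_histogram(samples):
    """Count interactions per affordance, e.g. for per-affordance statistics."""
    counts = {}
    for s in samples:
        counts[s.affordance_label] = counts.get(s.affordance_label, 0) + 1
    return counts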
Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video
We address the challenging task of anticipating human-object interaction in
first person videos. Most existing methods ignore how the camera wearer
interacts with the objects, or simply consider body motion as a separate
modality. In contrast, we observe that the intentional hand movement reveals
critical information about the future activity. Motivated by this, we adopt
intentional hand movement as a future representation and propose a novel deep
network that jointly models and predicts the egocentric hand motion,
interaction hotspots and future action. Specifically, we consider the future
hand motion as the motor attention, and model this attention using latent
variables in our deep model. The predicted motor attention is further used to
characterise the discriminative spatial-temporal visual features for predicting
actions and interaction hotspots. We present extensive experiments
demonstrating the benefit of the proposed joint model. Importantly, our model
produces new state-of-the-art results for action anticipation on both EGTEA
Gaze+ and the EPIC-Kitchens datasets. Our project page is available at
https://aptx4869lm.github.io/ForecastingHOI
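The following is a rough sketch of the motor-attention idea: a latent spatial attention map is sampled (via the reparameterisation trick) and used to modulate visual features before the action and hotspot heads. Names, sizes, and this specific parameterisation are assumptions for illustration; they are not the authors' model.

import torch
import torch.nn as nn

class MotorAttentionHead(nn.Module):
    """Samples a latent 'motor attention' map, re-weights spatio-temporal
    features with it, then predicts the action and the interaction hotspot."""
    def __init__(self, feat_dim=64, num_actions=10):
        super().__init__()
        self.to_mu = nn.Conv2d(feat_dim, 1, 1)
        self.to_logvar = nn.Conv2d(feat_dim, 1, 1)
        self.action_head = nn.Linear(feat_dim, num_actions)
        self.hotspot_head = nn.Conv2d(feat_dim, 1, 1)

    def forward(self, feats):                    # feats: (B, C, H, W)
        mu, logvar = self.to_mu(feats), self.to_logvar(feats)
        att = torch.sigmoid(mu + torch.randn_like(mu) * (0.5 * logvar).exp())
        weighted = feats * att                   # motor attention modulates features
        action_logits = self.action_head(weighted.mean(dim=(2, 3)))
        hotspot_logits = self.hotspot_head(weighted)
        return att, action_logits, hotspot_logits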
Beyond Object Recognition: A New Benchmark towards Object Concept Learning
Understanding objects is a central building block of artificial intelligence,
especially for embodied AI. Even though object recognition excels with deep
learning, current machines still struggle to learn higher-level knowledge,
e.g., what attributes an object has, and what we can do with an object. In this
work, we propose a challenging Object Concept Learning (OCL) task to push the
envelope of object understanding. It requires machines to reason out object
affordances and simultaneously give the reason: what attributes make an object
possess these affordances. To support OCL, we build a densely annotated
knowledge base including extensive labels for three levels of object concept
(category, attribute, affordance), and the causal relations among the three levels. By
analyzing the causal structure of OCL, we present a baseline, Object Concept
Reasoning Network (OCRN). It leverages causal intervention and concept
instantiation to infer the three levels following their causal relations. In
experiments, OCRN effectively infers the object knowledge while following the
causalities well. Our data and code are available at https://mvig-rhos.com/ocl.
Comment: ICCV 2023. Webpage: https://mvig-rhos.com/ocl
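To illustrate the category, attribute, affordance dependency that OCL formalises, here is a toy reasoning head in which affordance logits are conditioned on the inferred attribute probabilities. The sizes are placeholders and the module is not OCRN; in particular it omits the causal intervention and concept instantiation mechanisms entirely.

import torch
import torch.nn as nn

class ConceptReasoner(nn.Module):
    """Toy category -> attribute -> affordance chain: each later level is
    predicted from visual features plus the previous level's predictions."""
    def __init__(self, feat_dim=512, n_cats=10, n_attrs=20, n_affs=15):  # placeholder sizes
        super().__init__()
        self.cat_head = nn.Linear(feat_dim, n_cats)
        self.attr_head = nn.Linear(feat_dim + n_cats, n_attrs)
        self.aff_head = nn.Linear(feat_dim + n_attrs, n_affs)

    def forward(self, feats):
        cat_logits = self.cat_head(feats)
        attr_logits = self.attr_head(torch.cat([feats, cat_logits.softmax(-1)], dim=-1))
        aff_logits = self.aff_head(torch.cat([feats, attr_logits.sigmoid()], dim=-1))
        return cat_logits, attr_logits, aff_logits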
Egocentric Action Understanding by Learning Embodied Attention
Videos captured from wearable cameras, known as egocentric videos, create a continuous record of human daily visual experience, and thereby offer a new perspective for human activity understanding. Importantly, egocentric video aligns gaze, embodied movement, and action in the same “first-person” coordinate system. The rich egocentric cues reflect the attended scene context of an action, and thereby provide novel means for reasoning human daily routines.
In my thesis work, I describe my efforts in developing novel computational models that learn the embodied egocentric attention for the automatic analysis of egocentric actions. First, I introduce a probabilistic model for learning gaze and actions in egocentric video and further demonstrate that attention can serve as a robust tool for learning motion-aware video representation. Second, I develop a novel deep model to address the challenging problem of jointly recognizing and localizing actions of a mobile user on a known 3D map from egocentric videos. Third, I present a novel deep latent variable model that makes use of human intentional body movement (motor attention) as a key representation for forecasting human-object interaction in egocentric video. Finally, I propose a novel task of future hand segmentation from egocentric videos, and show how explicitly modeling the future head motion can facilitate future hand movement forecasting.
Ph.D. thesis
Deep affordance-grounded sensorimotor object recognition
It is well-established by cognitive neuroscience that human perception of objects constitutes a complex process, where object appearance information is combined with evidence about the so-called object “affordances”, namely the types of actions that humans typically perform when interacting with them. This fact has recently motivated the “sensorimotor” approach to the challenging task of automatic object recognition, where both information sources are fused to improve robustness. In this work, the aforementioned paradigm is adopted, surpassing current limitations of sensorimotor object recognition research. Specifically, the deep learning paradigm is introduced to the problem for the first time, developing a number of novel neuro-biologically and neuro-physiologically inspired architectures that utilize state-of-the-art neural networks for fusing the available information sources in multiple ways. The proposed methods are evaluated using a large RGB-D corpus, which is specifically collected for the task of sensorimotor object recognition and is made publicly available. Experimental results demonstrate the utility of affordance information to object recognition, achieving an up to 29% relative error reduction by its inclusion. © 2017 IEEE
Reasoning and understanding grasp affordances for robot manipulation
This doctoral research focuses on developing new methods that enable an artificial agent
to grasp and manipulate objects autonomously. More specifically, we are using the concept
of affordances to learn and generalise robot grasping and manipulation techniques. [75] defined
affordances as the ability of an agent to perform a certain action with an object in a
given environment. In robotics, affordances define the possibility for an agent to perform
actions with an object. Therefore, by understanding the relation between actions, objects,
and the effects of these actions, the agent understands the task at hand, which gives the robot
the potential to bridge perception to action. The significance of affordances in robotics
has been studied from varied perspectives, such as psychology and the cognitive sciences.
Many efforts have been made to pragmatically employ the concept of affordances, as it
provides the potential for an artificial agent to perform tasks autonomously. We start by reviewing and finding common ground amongst different strategies that use affordances for
robotic tasks. We build on the identified common ground to provide guidance on including the concept of affordances as a medium to boost autonomy for an artificial agent. To this end, we
outline common design choices for building an affordance relation and their implications for
the generalisation capabilities of the agent when facing previously unseen scenarios. Based
on our exhaustive review, we conclude that prior research on object affordance detection
is effective; however, among other limitations, it has the following technical gaps: (i) the methods are
limited to a single object ↔ affordance hypothesis, and (ii) they cannot guarantee task completion or any level of performance for the manipulation task when the robot acts alone, nor (iii) when it collaborates
with other agents. In this research thesis, we propose solutions to these technical challenges.
In an incremental fashion, we start by addressing the limited generalisation capabilities
of the then state-of-the-art methods by strengthening the perception-to-action connection through the construction of a Knowledge Base (KB). We then leverage the information
encapsulated in the KB to design and implement a reasoning and understanding method
based on a statistical relational learner (SRL) that allows us to cope with uncertainty in testing
environments and thus improve generalisation capabilities in affordance-aware manipulation tasks. The KB, in conjunction with our SRL, forms the basis of our designed solutions
that guarantee task completion when the robot performs a task alone as well as in
collaboration with other agents. We finally expose and discuss a range of interesting avenues
that have the potential to advance the capabilities of a robotic agent through the use of
affordances for manipulation tasks. A summary of the contributions of this thesis
can be found at: https://bit.ly/grasp_affordance_reasonin
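A toy illustration of the object-action-effect relations that such a Knowledge Base could encode, with a simple query for actions that achieve a desired effect. This dictionary lookup is only a conceptual stand-in; it does not attempt to reproduce the thesis's KB or its statistical relational learner, and all names and entries are invented for the example.

from collections import defaultdict

class AffordanceKB:
    """Minimal object -> (action -> effect) store with an effect-driven query."""
    def __init__(self):
        self.relations = defaultdict(dict)

    def add(self, obj, action, effect):
        self.relations[obj][action] = effect

    def actions_for(self, obj, desired_effect):
        """Actions on `obj` whose recorded effect matches the desired one."""
        return [a for a, e in self.relations[obj].items() if e == desired_effect]

kb = AffordanceKB()
kb.add("mug", "handle-grasp", "lifted")
kb.add("mug", "push", "displaced")
print(kb.actions_for("mug", "lifted"))   # ['handle-grasp']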