10 research outputs found

    Deep Affordance-grounded Sensorimotor Object Recognition

    It is well established by cognitive neuroscience that human perception of objects constitutes a complex process, where object appearance information is combined with evidence about the so-called object "affordances", namely the types of actions that humans typically perform when interacting with them. This fact has recently motivated the "sensorimotor" approach to the challenging task of automatic object recognition, where both information sources are fused to improve robustness. In this work, the aforementioned paradigm is adopted, surpassing current limitations of sensorimotor object recognition research. Specifically, the deep learning paradigm is introduced to the problem for the first time, developing a number of novel neuro-biologically and neuro-physiologically inspired architectures that utilize state-of-the-art neural networks for fusing the available information sources in multiple ways. The proposed methods are evaluated using a large RGB-D corpus, which is specifically collected for the task of sensorimotor object recognition and is made publicly available. Experimental results demonstrate the utility of affordance information to object recognition, achieving up to a 29% relative error reduction by its inclusion. Comment: 9 pages, 7 figures, dataset link included, accepted to CVPR 201

    A Deep Learning Approach to Object Affordance Segmentation

    Learning to understand and infer object functionalities is an important step towards robust visual intelligence. Significant research efforts have recently focused on segmenting the object parts that enable specific types of human-object interaction, the so-called "object affordances". However, most works treat it as a static semantic segmentation problem, focusing solely on object appearance and relying on strong supervision and object detection. In this paper, we propose a novel approach that exploits the spatio-temporal nature of human-object interaction for affordance segmentation. In particular, we design an autoencoder that is trained using ground-truth labels of only the last frame of the sequence, and is able to infer pixel-wise affordance labels in both videos and static images. Our model removes the need for object labels and bounding boxes by using a soft-attention mechanism that enables the implicit localization of the interaction hotspot. For evaluation purposes, we introduce the SOR3D-AFF corpus, which consists of human-object interaction sequences and provides pixel-wise annotations for 9 types of affordances, covering typical manipulations of tool-like objects. We show that our model achieves competitive results compared to strongly supervised methods on SOR3D-AFF, while being able to predict affordances for similar unseen objects in two affordance image-only datasets. Comment: 5 pages, 4 figures, ICASSP 202

    Learning Scene Flow With Skeleton Guidance For 3D Action Recognition

    Among the existing modalities for 3D action recognition, 3D flow has been little examined, although it conveys rich motion cues for human actions. Presumably, its susceptibility to noise renders it intractable, thus challenging the learning process within deep models. This work demonstrates the use of 3D flow sequences by a deep spatiotemporal model and further proposes an incremental two-level spatial attention mechanism, guided by the skeleton domain, for emphasizing motion features close to the body joint areas according to their informativeness. Towards this end, an extended deep skeleton model is also introduced to learn the most discriminant action motion dynamics, so as to estimate an informativeness score for each joint. Subsequently, a late fusion scheme is adopted between the two models for learning the high-level cross-modal correlations. Experimental results on the currently largest and most challenging dataset, NTU RGB+D, demonstrate the effectiveness of the proposed approach, achieving state-of-the-art results. Comment: 18 pages, 3 figures, 3 tables, conferenc

    AffordPose: A Large-scale Dataset of Hand-Object Interactions with Affordance-driven Hand Pose

    How humans interact with objects depends on the functional roles of the target objects, which introduces the problem of affordance-aware hand-object interaction. It requires a large number of human demonstrations for the learning and understanding of plausible and appropriate hand-object interactions. In this work, we present AffordPose, a large-scale dataset of hand-object interactions with affordance-driven hand pose. We first annotate the specific part-level affordance labels for each object, e.g. twist, pull, handle-grasp, etc., instead of the general intents such as use or handover, to indicate the purpose and guide the localization of the hand-object interactions. The fine-grained hand-object interactions reveal the influence of hand-centered affordances on the detailed arrangement of the hand poses, yet also exhibit a certain degree of diversity. We collect a total of 26.7K hand-object interactions, each including the 3D object shape, the part-level affordance label, and the manually adjusted hand poses. The comprehensive data analysis shows the common characteristics and diversity of hand-object interactions per affordance via the parameter statistics and contacting computation. We also conduct experiments on the tasks of hand-object affordance understanding and affordance-oriented hand-object interaction generation, to validate the effectiveness of our dataset in learning the fine-grained hand-object interactions. Project page: https://github.com/GentlesJan/AffordPose. Comment: Accepted by ICCV 202

    Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video

    We address the challenging task of anticipating human-object interaction in first person videos. Most existing methods ignore how the camera wearer interacts with the objects, or simply consider body motion as a separate modality. In contrast, we observe that the intentional hand movement reveals critical information about the future activity. Motivated by this, we adopt intentional hand movement as a future representation and propose a novel deep network that jointly models and predicts the egocentric hand motion, interaction hotspots and future action. Specifically, we consider the future hand motion as the motor attention, and model this attention using latent variables in our deep model. The predicted motor attention is further used to characterise the discriminative spatial-temporal visual features for predicting actions and interaction hotspots. We present extensive experiments demonstrating the benefit of the proposed joint model. Importantly, our model produces new state-of-the-art results for action anticipation on both EGTEA Gaze+ and the EPIC-Kitchens datasets. Our project page is available at https://aptx4869lm.github.io/ForecastingHOI

    Beyond Object Recognition: A New Benchmark towards Object Concept Learning

    Understanding objects is a central building block of artificial intelligence, especially for embodied AI. Even though object recognition excels with deep learning, current machines still struggle to learn higher-level knowledge, e.g., what attributes an object has, and what we can do with an object. In this work, we propose a challenging Object Concept Learning (OCL) task to push the envelope of object understanding. It requires machines to reason out object affordances and simultaneously give the reason: what attributes make an object possess these affordances. To support OCL, we build a densely annotated knowledge base including extensive labels for three levels of object concept (category, attribute, affordance), and the causal relations among the three levels. By analyzing the causal structure of OCL, we present a baseline, Object Concept Reasoning Network (OCRN). It leverages causal intervention and concept instantiation to infer the three levels following their causal relations. In experiments, OCRN effectively infers the object knowledge while following the causalities well. Our data and code are available at https://mvig-rhos.com/ocl. Comment: ICCV 2023. Webpage: https://mvig-rhos.com/oc

    Egocentric Action Understanding by Learning Embodied Attention

    Videos captured from wearable cameras, known as egocentric videos, create a continuous record of human daily visual experience, and thereby offer a new perspective for human activity understanding. Importantly, egocentric video aligns gaze, embodied movement, and action in the same “first-person” coordinate system. The rich egocentric cues reflect the attended scene context of an action, and thereby provide novel means for reasoning about human daily routines. In my thesis work, I describe my efforts on developing novel computational models that learn the embodied egocentric attention for the automatic analysis of egocentric actions. First, I introduce a probabilistic model for learning gaze and actions in egocentric video and further demonstrate that attention can serve as a robust tool for learning motion-aware video representation. Second, I develop a novel deep model to address the challenging problem of jointly recognizing and localizing actions of a mobile user on a known 3D map from egocentric videos. Third, I present a novel deep latent variable model that makes use of human intentional body movement (motor attention) as a key representation for forecasting human-object interaction in egocentric video. Finally, I propose a novel task of future hand segmentation from egocentric videos, and show how explicitly modeling the future head motion can facilitate future hand movement forecasting. Ph.D

    Deep affordance-grounded sensorimotor object recognition

    It is well established by cognitive neuroscience that human perception of objects constitutes a complex process, where object appearance information is combined with evidence about the so-called object “affordances”, namely the types of actions that humans typically perform when interacting with them. This fact has recently motivated the “sensorimotor” approach to the challenging task of automatic object recognition, where both information sources are fused to improve robustness. In this work, the aforementioned paradigm is adopted, surpassing current limitations of sensorimotor object recognition research. Specifically, the deep learning paradigm is introduced to the problem for the first time, developing a number of novel neuro-biologically and neuro-physiologically inspired architectures that utilize state-of-the-art neural networks for fusing the available information sources in multiple ways. The proposed methods are evaluated using a large RGB-D corpus, which is specifically collected for the task of sensorimotor object recognition and is made publicly available. Experimental results demonstrate the utility of affordance information to object recognition, achieving up to a 29% relative error reduction by its inclusion. © 2017 IEEE

    Reasoning and understanding grasp affordances for robot manipulation

    This doctoral research focuses on developing new methods that enable an artificial agent to grasp and manipulate objects autonomously. More specifically, we use the concept of affordances to learn and generalise robot grasping and manipulation techniques. [75] defined affordances as the ability of an agent to perform a certain action with an object in a given environment. In robotics, affordances define the possibility of an agent to perform actions with an object. Therefore, by understanding the relation between actions, objects and the effect of these actions, the agent understands the task at hand, providing the robot with the potential to bridge perception to action. The significance of affordances in robotics has been studied from varied perspectives, such as psychology and cognitive sciences. Many efforts have been made to pragmatically employ the concept of affordances, as it provides the potential for an artificial agent to perform tasks autonomously. We start by reviewing and finding common ground amongst different strategies that use affordances for robotic tasks. We build on the identified grounds to provide guidance on including the concept of affordances as a medium to boost autonomy for an artificial agent. To this end, we outline common design choices to build an affordance relation, and their implications for the generalisation capabilities of the agent when facing previously unseen scenarios. Based on our exhaustive review, we conclude that prior research on object affordance detection is effective; however, it has, among others, the following technical gaps: (i) the methods are limited to a single object ↔ affordance hypothesis, and (ii) they cannot guarantee task completion or any level of performance for the manipulation task alone nor (iii) in collaboration with other agents. In this research thesis, we propose solutions to these technical challenges.
In an incremental fashion, we start by addressing the limited generalisation capabilities of the then state-of-the-art methods by strengthening the perception-to-action connection through the construction of a Knowledge Base (KB). We then leverage the information encapsulated in the KB to design and implement a reasoning and understanding method based on a statistical relational learner (SRL) that allows us to cope with uncertainty in testing environments and thus improve generalisation capabilities in affordance-aware manipulation tasks. The KB in conjunction with our SRL is the basis for our designed solutions that guarantee task completion when the robot is performing a task alone as well as when in collaboration with other agents. We finally expose and discuss a range of interesting avenues that have the potential to enhance the capabilities of a robotic agent through the use of the concept of affordances for manipulation tasks. A summary of the contributions of this thesis can be found at: https://bit.ly/grasp_affordance_reasonin