Deep Affordance-grounded Sensorimotor Object Recognition
It is well-established by cognitive neuroscience that human perception of
objects constitutes a complex process, where object appearance information is
combined with evidence about the so-called object "affordances", namely the
types of actions that humans typically perform when interacting with them. This
fact has recently motivated the "sensorimotor" approach to the challenging task
of automatic object recognition, where both information sources are fused to
improve robustness. In this work, the aforementioned paradigm is adopted,
surpassing current limitations of sensorimotor object recognition research.
Specifically, the deep learning paradigm is introduced to the problem for the
first time, developing a number of novel neuro-biologically and
neuro-physiologically inspired architectures that utilize state-of-the-art
neural networks for fusing the available information sources in multiple ways.
The proposed methods are evaluated using a large RGB-D corpus, which is
specifically collected for the task of sensorimotor object recognition and is
made publicly available. Experimental results demonstrate the utility of
affordance information for object recognition, achieving up to a 29% relative
error reduction through its inclusion.
Comment: 9 pages, 7 figures, dataset link included, accepted to CVPR 2017
Open-Vocabulary Affordance Detection using Knowledge Distillation and Text-Point Correlation
Affordance detection presents intricate challenges and has a wide range of
robotic applications. Previous works have faced limitations such as the
complexities of 3D object shapes, the wide range of potential affordances on
real-world objects, and the lack of open-vocabulary support for affordance
understanding. In this paper, we introduce a new open-vocabulary affordance
detection method in 3D point clouds, leveraging knowledge distillation and
text-point correlation. Our approach employs pre-trained 3D models through
knowledge distillation to enhance feature extraction and semantic understanding
in 3D point clouds. We further introduce a new text-point correlation method to
learn the semantic links between point cloud features and open-vocabulary
labels. Extensive experiments show that our approach outperforms previous
works and adapts to new affordance labels and unseen objects. Notably, our
method achieves a 7.96% mIOU improvement over the baselines. Furthermore, it
offers real-time inference, making it well-suited for robotic manipulation
applications.
Comment: 8 pages
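A minimal sketch of the text-point correlation idea under assumed feature dimensions: per-point features (e.g., from a distilled 3D backbone) are matched against open-vocabulary label embeddings by cosine similarity, yielding per-point affordance logits. The function name, shapes, and temperature are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def text_point_correlation(point_feats, text_feats, temperature=0.07):
    """Correlate per-point features with open-vocabulary label
    embeddings via cosine similarity; returns per-point label logits."""
    p = F.normalize(point_feats, dim=-1)   # (N, D) point features
    t = F.normalize(text_feats, dim=-1)    # (K, D) label embeddings
    return (p @ t.T) / temperature         # (N, K) correlation logits

# Hypothetical shapes: 2048 points, 512-dim features, 10 affordance labels.
logits = text_point_correlation(torch.randn(2048, 512), torch.randn(10, 512))
pred = logits.argmax(dim=-1)               # per-point affordance index
print(pred.shape)                          # torch.Size([2048])
```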
Human Activity Recognition and Prediction using RGBD Data
Being able to predict and recognize human activities is an essential element of effective communication with other humans during our day-to-day activities. A system that can do this has a number of appealing applications, from assistive robotics to health care and preventative medicine. Previous work in supervised video-based human activity prediction and detection fails to capture the richness of the spatiotemporal data that these activities generate. Convolutional long short-term memory (Convolutional LSTM) networks are a useful tool for analyzing this type of data, showing good results in many other areas. This thesis focuses on utilizing RGB-D data to improve human activity prediction and recognition, introducing a modified Convolutional LSTM network for this purpose. Experiments are performed on the network, and it is compared to other models in use as well as the current state-of-the-art system. We show that our proposed model for human activity prediction and recognition outperforms the current state-of-the-art models on the CAD-120 dataset without being given bounding frames or ground truths about objects.
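To make the Convolutional LSTM component concrete, below is a minimal ConvLSTM cell in PyTorch: the standard LSTM gates are computed with a convolution so the hidden state keeps its spatial layout, which is what makes the cell suitable for spatiotemporal RGB-D data. Channel counts, kernel size, and the unrolling loop are illustrative assumptions, not the thesis' modified network.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: all four LSTM gates are computed with a
    single convolution over the concatenated input and hidden state."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        c = f * c + i * g          # cell state keeps spatial structure
        h = o * c.tanh()
        return h, c

cell = ConvLSTMCell(in_ch=4, hid_ch=16)   # 4 channels: an RGB-D frame
h = c = torch.zeros(1, 16, 32, 32)
for t in range(8):                        # unroll over 8 video frames
    h, c = cell(torch.randn(1, 4, 32, 32), (h, c))
print(h.shape)  # torch.Size([1, 16, 32, 32])
```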
Open-Vocabulary Affordance Detection in 3D Point Clouds
Affordance detection is a challenging problem with a wide variety of robotic
applications. Traditional affordance detection methods are limited to a
predefined set of affordance labels, hence potentially restricting the
adaptability of intelligent robots in complex and dynamic environments. In this
paper, we present the Open-Vocabulary Affordance Detection (OpenAD) method,
which is capable of detecting an unbounded number of affordances in 3D point
clouds. By jointly learning the affordance text and the point features,
OpenAD successfully exploits the semantic relationships between affordances.
Therefore, our proposed method enables zero-shot detection and can detect
previously unseen affordances without a single annotation example.
Intensive experimental results show that OpenAD works effectively on a wide
range of affordance detection setups and outperforms other baselines by a large
margin. Additionally, we demonstrate the practicality of the proposed OpenAD in
real-world robotic applications with a fast inference speed (~100ms). Our
project is available at https://openad2023.github.io.
Comment: Accepted to the 2023 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS 2023)
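The zero-shot behavior can be sketched as follows: because points are classified by similarity to label text embeddings rather than with a fixed classifier head, a previously unseen affordance can be added at inference time just by embedding its name. The encoders are stubbed out with random tensors here, and all dimensions and label sets are assumptions.

```python
import torch
import torch.nn.functional as F

def zero_shot_affordance(point_feats, label_embeds):
    """Assign each point the label whose text embedding it is most
    similar to; new labels only require a new embedding row."""
    sim = F.normalize(point_feats, dim=-1) @ F.normalize(label_embeds, dim=-1).T
    return sim.argmax(dim=-1)             # per-point label index

train_labels = torch.randn(5, 512)        # e.g., "grasp", "cut", ... (stubs)
unseen_label = torch.randn(1, 512)        # a label never seen in training
points = torch.randn(1024, 512)           # stub for encoded point features
pred = zero_shot_affordance(points, torch.cat([train_labels, unseen_label]))
print((pred == 5).sum())                  # points assigned the unseen label
```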
Conditional Affordance Learning for Driving in Urban Environments
Most existing approaches to autonomous driving fall into one of two
categories: modular pipelines, which build an extensive model of the
environment, and imitation learning approaches, which map images directly to
control outputs. A recently proposed third paradigm, direct perception, aims to
combine the advantages of both by using a neural network to learn appropriate
low-dimensional intermediate representations. However, existing direct
perception approaches are restricted to simple highway situations, lacking the
ability to navigate intersections, stop at traffic lights or respect speed
limits. In this work, we propose a direct perception approach which maps video
input to intermediate representations suitable for autonomous navigation in
complex urban environments given high-level directional inputs. Compared to
state-of-the-art reinforcement and conditional imitation learning approaches,
we achieve an improvement of up to 68% in goal-directed navigation on the
challenging CARLA simulation benchmark. In addition, our approach is the first
to handle traffic lights and speed signs by using image-level labels only, as
well as smooth car-following, resulting in a significant reduction of traffic
accidents in simulation.
Comment: Accepted for the Conference on Robot Learning (CoRL) 2018
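At a high level, the direct perception recipe is: predict a small set of interpretable affordances from the camera input, then map them to controls with a comparatively simple controller. The toy longitudinal controller below illustrates this split; the affordance fields, thresholds, and control law are hypothetical stand-ins, not the paper's actual representation.

```python
from dataclasses import dataclass

@dataclass
class Affordances:
    """Hypothetical low-dimensional intermediate representation that a
    perception network would predict from video."""
    red_light: bool           # is the relevant traffic light red?
    speed_limit: float        # current speed limit (m/s)
    dist_to_vehicle: float    # gap to the car ahead (m)
    centerline_offset: float  # lateral deviation from lane center (m)

def longitudinal_control(a: Affordances, speed: float) -> float:
    """Return throttle in [0, 1]; 0 means brake/coast. Thresholds are
    illustrative assumptions."""
    if a.red_light or a.dist_to_vehicle < 5.0:
        return 0.0                         # stop for lights / keep the gap
    if speed >= a.speed_limit:
        return 0.0                         # respect the speed limit
    return min(1.0, (a.speed_limit - speed) / a.speed_limit)

print(longitudinal_control(Affordances(False, 8.3, 20.0, 0.1), speed=5.0))
```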
Self-Supervised Learning of Action Affordances as Interaction Modes
When humans perform a task with an articulated object, they interact with the
object only in a handful of ways, while the space of all possible interactions
is nearly endless. This is because humans have prior knowledge about what
interactions are likely to be successful, i.e., to open a new door we first try
the handle. While learning such priors without supervision is easy for humans,
it is notoriously hard for machines. In this work, we tackle unsupervised
learning of priors of useful interactions with articulated objects, which we
call interaction modes. In contrast to the prior art, we use no supervision or
privileged information; we only assume access to the depth sensor in the
simulator to learn the interaction modes. More precisely, we define a
successful interaction as one that substantially changes the visual
environment, and we learn a generative model of such interactions that can be
conditioned on
the desired goal state of the object. In our experiments, we show that our
model covers most of the human interaction modes, outperforms existing
state-of-the-art methods for affordance learning, and can generalize to objects
never seen during training. Additionally, we show promising results in the
goal-conditional setup, where our model can be quickly fine-tuned to perform a
given task. Supplementary material: https://actaim.github.io
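A minimal sketch of the "substantial visual change" success criterion described above, using the mean absolute change of a depth image against a threshold; both the metric and the threshold are assumptions, and the paper's full method additionally learns a generative model over such interactions.

```python
import torch

def is_successful_interaction(depth_before, depth_after, tau=0.05):
    """Label an interaction successful if it changes the observed scene
    substantially: mean absolute depth change above a threshold tau.
    Metric and threshold are illustrative, not the paper's."""
    change = (depth_after - depth_before).abs().mean()
    return change.item() > tau

before = torch.rand(1, 128, 128)          # stub for a depth observation
after = before.clone()
after[:, 40:90, 40:90] += 0.5             # e.g., a door swung open
print(is_successful_interaction(before, after))  # True
```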
A Deep Learning Approach to Object Affordance Segmentation
Learning to understand and infer object functionalities is an important step
towards robust visual intelligence. Significant research efforts have recently
focused on segmenting the object parts that enable specific types of
human-object interaction, the so-called "object affordances". However, most
works treat it as a static semantic segmentation problem, focusing solely on
object appearance and relying on strong supervision and object detection. In
this paper, we propose a novel approach that exploits the spatio-temporal
nature of human-object interaction for affordance segmentation. In particular,
we design an autoencoder that is trained using ground-truth labels of only the
last frame of the sequence, and is able to infer pixel-wise affordance labels
in both videos and static images. Our model obviates the need for object
labels and bounding boxes by using a soft-attention mechanism that enables the
implicit localization of the interaction hotspot. For evaluation purposes, we
introduce the SOR3D-AFF corpus, which consists of human-object interaction
sequences with pixel-wise annotations for 9 affordance types, covering typical
manipulations of tool-like objects. We show that
our model achieves competitive results compared to strongly supervised methods
on SOR3D-AFF, while being able to predict affordances for similar unseen
objects in two image-only affordance datasets.
Comment: 5 pages, 4 figures, ICASSP 2020
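As a loose illustration of the soft-attention mechanism (not the paper's architecture), the toy encoder-decoder below predicts a single-channel spatial attention map that re-weights the encoder features before pixel-wise affordance classification; only the 9 affordance types come from the abstract, while the layers, sizes, and extra background class are assumptions.

```python
import torch
import torch.nn as nn

class SoftAttentionSeg(nn.Module):
    """Toy encoder-decoder with a soft spatial attention map that
    re-weights features, loosely mirroring implicit localization of
    the interaction hotspot. Sizes and depth are arbitrary."""
    def __init__(self, num_affordances: int = 9):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.attn = nn.Conv2d(64, 1, 1)    # 1-channel spatial attention
        self.decoder = nn.Conv2d(64, num_affordances + 1, 1)  # + background

    def forward(self, x):
        f = self.encoder(x)
        a = torch.sigmoid(self.attn(f))    # soft attention in [0, 1]
        return self.decoder(f * a)         # pixel-wise affordance logits

out = SoftAttentionSeg()(torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 10, 64, 64])
```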