Multimodal Deep Learning for Robust RGB-D Object Recognition
Robust object recognition is a crucial ingredient of many, if not all,
real-world robotics applications. This paper leverages recent progress on
Convolutional Neural Networks (CNNs) and proposes a novel RGB-D architecture
for object recognition. Our architecture is composed of two separate CNN processing streams - one for each modality - which are subsequently combined by a late fusion network. We focus on learning with imperfect sensor data, a
typical problem in real-world robotics tasks. For accurate learning, we
introduce a multi-stage training methodology and two crucial ingredients for
handling depth data with CNNs. The first is an effective encoding of depth information that enables learning without the need for large depth datasets; the second is a data augmentation scheme for robust learning that corrupts depth images with realistic noise patterns. We present
state-of-the-art results on the RGB-D object dataset and show recognition in
challenging, noisy real-world RGB-D settings.
Comment: Final version submitted to IROS'2015; results unchanged, some text passages in the abstract and introduction reformulated
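The depth encoding the authors describe renders a single-channel depth map as a three-channel color image so that CNNs pretrained on RGB data can be reused. Below is a minimal sketch of that idea together with a simple noise-corruption augmentation; the function names, the jet colormap choice, and the patch-dropout noise model are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np
import matplotlib.cm as cm

def colorize_depth(depth, d_min=None, d_max=None):
    """Encode a single-channel depth map as a 3-channel color image.

    Normalizes depth to [0, 1] and applies a jet colormap so that a CNN
    pretrained on RGB images can consume depth without training from
    scratch. Hypothetical helper; the paper describes the general idea,
    the exact implementation here is an assumption.
    """
    d_min = depth.min() if d_min is None else d_min
    d_max = depth.max() if d_max is None else d_max
    normalized = np.clip((depth - d_min) / (d_max - d_min + 1e-8), 0.0, 1.0)
    colored = cm.jet(normalized)[..., :3]  # drop the alpha channel
    return (colored * 255).astype(np.uint8)

def corrupt_depth(depth, n_holes=10, hole_size=8, rng=None):
    """Crude stand-in for realistic depth-noise augmentation: zero out
    random patches, mimicking missing measurements from consumer sensors.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = depth.copy()
    h, w = depth.shape
    for _ in range(n_holes):
        y = rng.integers(0, h - hole_size)
        x = rng.integers(0, w - hole_size)
        noisy[y:y + hole_size, x:x + hole_size] = 0
    return noisy
```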
Am I Done? Predicting Action Progress in Videos
In this paper we deal with the problem of predicting action progress in
videos. We argue that this is an extremely important task since it can be
valuable for a wide range of interaction applications. To this end we introduce
a novel approach, named ProgressNet, capable of predicting when an action takes
place in a video, where it is located within the frames, and how far it has
progressed during its execution. To provide a general definition of action
progress, we ground our work in the linguistics literature, borrowing terms and
concepts to understand which actions can be the subject of progress estimation.
As a result, we define a categorization of actions and their phases. Motivated
by the recent success obtained from the interaction of Convolutional and
Recurrent Neural Networks, our model is based on a combination of the Faster
R-CNN framework, to make frame-wise predictions, and LSTM networks, to estimate
action progress through time. After introducing two evaluation protocols for
the task at hand, we demonstrate the capability of our model to effectively
predict action progress on the UCF-101 and J-HMDB datasets.
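As a rough illustration of the architecture described above, the sketch below puts an LSTM on top of per-frame features and regresses a progress value in [0, 1] for each frame. The layer sizes, the sigmoid output, and the linear progress targets are assumptions for demonstration, not ProgressNet's actual design.

```python
import torch
import torch.nn as nn

class ProgressHead(nn.Module):
    """Toy progress-estimation head, loosely inspired by ProgressNet.

    Consumes per-frame feature vectors (e.g., pooled from a detection
    backbone such as Faster R-CNN) and uses an LSTM to output, for each
    frame, how far the action has progressed in [0, 1].
    """
    def __init__(self, feat_dim=512, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.regressor = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, frame_feats):                 # (batch, time, feat_dim)
        hidden, _ = self.lstm(frame_feats)
        return self.regressor(hidden).squeeze(-1)   # (batch, time) in [0, 1]

# Usage: supervise with per-frame progress targets, e.g. linear from 0 to 1.
model = ProgressHead()
feats = torch.randn(2, 30, 512)                 # 2 clips, 30 frames each
targets = torch.linspace(0, 1, 30).expand(2, -1)
loss = nn.functional.mse_loss(model(feats), targets)
```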
A Survey of Embodied AI: From Simulators to Research Tasks
There has been an emerging paradigm shift from the era of "internet AI" to
"embodied AI", where AI algorithms and agents no longer learn from datasets of
images, videos or text curated primarily from the internet. Instead, they learn
through interactions with their environments from an egocentric perception
similar to humans. Consequently, there has been substantial growth in the
demand for embodied AI simulators to support various embodied AI research
tasks. This growing interest in embodied AI is beneficial to the greater
pursuit of Artificial General Intelligence (AGI), but there has not been a
contemporary and comprehensive survey of this field. This paper aims to provide
an encyclopedic survey for the field of embodied AI, from its simulators to its
research. By evaluating nine current embodied AI simulators against seven proposed features, this paper examines what each simulator offers for embodied AI research and where its limitations lie. It then surveys the three main research tasks in embodied AI -- visual exploration, visual navigation and embodied question answering (QA) -- covering the state-of-the-art approaches, evaluation metrics and datasets. Finally, drawing on the insights revealed through surveying the field, the paper provides suggestions for simulator-for-task selection and recommendations for future directions of the field.
Comment: Under review for IEEE TETC
Mixing Deep Networks and Entangled Forests for the Semantic Segmentation of 3D Indoor Scenes
This work focuses on semantic segmentation of indoor 3D data, i.e., assigning a label to every point in point clouds that represent working spaces. After reviewing the current state of the art, traditional approaches such as random forests and deep neural networks based on PointNet are evaluated. The Superpoint Graph architecture and the 3D Entangled Forests algorithm are then selected, and their features are mixed in an attempt to enhance performance.
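A minimal sketch of the feature-mixing idea: per-point features produced by two different models are concatenated and fed to a single classifier. The feature dimensions, class count, and the random-forest classifier are placeholder assumptions, not the work's actual pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

n_points, n_classes = 1000, 13
deep_feats = np.random.randn(n_points, 64)     # e.g. from Superpoint Graph
forest_feats = np.random.randn(n_points, 32)   # e.g. from 3D Entangled Forests
labels = np.random.randint(0, n_classes, n_points)

# Concatenate the two feature sets so one classifier sees both views
# of each point, then predict a per-point semantic label.
mixed = np.concatenate([deep_feats, forest_feats], axis=1)
clf = RandomForestClassifier(n_estimators=100).fit(mixed, labels)
pred = clf.predict(mixed)
```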
PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models
Generalizable 3D part segmentation is important but challenging in vision and
robotics. Training deep models via conventional supervised methods requires
large-scale 3D datasets with fine-grained part annotations, which are costly to
collect. This paper explores an alternative way for low-shot part segmentation
of 3D point clouds by leveraging a pretrained image-language model, GLIP, which
achieves superior performance on open-vocabulary 2D detection. We transfer the
rich knowledge from 2D to 3D through GLIP-based part detection on point cloud
rendering and a novel 2D-to-3D label lifting algorithm. We also utilize
multi-view 3D priors and few-shot prompt tuning to boost performance
significantly. Extensive evaluation on PartNet and PartNet-Mobility datasets
shows that our method enables excellent zero-shot 3D part segmentation. Our
few-shot version not only outperforms existing few-shot approaches by a large
margin but also achieves highly competitive results compared to the fully
supervised counterpart. Furthermore, we demonstrate that our method can be
directly applied to iPhone-scanned point clouds without significant domain
gaps.
Comment: CVPR 2023; project page: https://colin97.github.io/PartSLIP_page
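As a toy illustration of 2D-to-3D label lifting, the sketch below projects each 3D point into every rendered view and lets the 2D part detections that cover it vote for a label. This majority-vote scheme ignores visibility and occlusion and is a simplified stand-in for the paper's algorithm; all names and shapes are assumptions.

```python
import numpy as np

def lift_labels(points, views, detections, n_labels):
    """Toy multi-view label lifting.

    `points` is an (N, 3) array, `views` a list of 3x4 camera projection
    matrices, and `detections` a parallel list of per-view boxes given as
    (label, x0, y0, x1, y1). Each detection votes for every point that
    projects inside its box; the most-voted label wins per point.
    """
    votes = np.zeros((len(points), n_labels))
    homog = np.hstack([points, np.ones((len(points), 1))])    # (N, 4)
    for P, boxes in zip(views, detections):
        proj = homog @ P.T                                    # (N, 3)
        uv = proj[:, :2] / np.clip(proj[:, 2:3], 1e-6, None)  # pixel coords
        for label, x0, y0, x1, y1 in boxes:
            inside = ((uv[:, 0] >= x0) & (uv[:, 0] <= x1) &
                      (uv[:, 1] >= y0) & (uv[:, 1] <= y1))
            votes[inside, label] += 1
    return votes.argmax(axis=1)   # per-point part label
```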
ALET (Automated Labeling of Equipment and Tools): A Dataset, a Baseline and a Usecase for Tool Detection in the Wild
Robots collaborating with humans in realistic environments will need to be
able to detect the tools that can be used and manipulated. However, there is no
available dataset or study that addresses this challenge in real settings. In
this paper, we fill this gap by providing an extensive dataset (METU-ALET) for
detecting farming, gardening, office, stonemasonry, vehicle, woodworking and
workshop tools. The scenes correspond to sophisticated environments with or
without humans using the tools. The scenes we consider introduce several
challenges for object detection, including the small scale of the tools, their
articulated nature, occlusion, inter-class invariance, etc. Moreover, we train and compare several state-of-the-art deep object detectors (including Faster R-CNN, Cascade R-CNN, RepPoints and RetinaNet) on our dataset. We observe that
the detectors have difficulty in detecting especially small-scale tools or
tools that are visually similar to parts of other tools, which underlines the need for our dataset. With the dataset, the code and the
trained models, our work provides a basis for further research into tools and
their use in robotics applications.
Comment: 7 pages, 4 figures
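In the spirit of the paper's detector baselines, the sketch below fine-tunes torchvision's off-the-shelf Faster R-CNN for tool detection. The class count and the dummy batch are placeholders, not the authors' training setup.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_TOOL_CLASSES = 49 + 1  # hypothetical: tool categories + background

# Load a COCO-pretrained detector and swap its box head for one sized
# to the tool-category label set.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_TOOL_CLASSES)

# One dummy training step: in train mode the model returns a loss dict.
images = [torch.rand(3, 600, 800)]
targets = [{"boxes": torch.tensor([[50., 60., 120., 180.]]),
            "labels": torch.tensor([1])}]
model.train()
losses = model(images, targets)
total_loss = sum(losses.values())
```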