
    Multimodal Deep Learning for Robust RGB-D Object Recognition

    Robust object recognition is a crucial ingredient of many, if not all, real-world robotics applications. This paper leverages recent progress on Convolutional Neural Networks (CNNs) and proposes a novel RGB-D architecture for object recognition. Our architecture is composed of two separate CNN processing streams, one for each modality, which are subsequently combined with a late fusion network. We focus on learning with imperfect sensor data, a typical problem in real-world robotics tasks. For accurate learning, we introduce a multi-stage training methodology and two crucial ingredients for handling depth data with CNNs. The first is an effective encoding of depth information for CNNs that enables learning without the need for large depth datasets. The second is a data augmentation scheme for robust learning with depth images that corrupts them with realistic noise patterns. We present state-of-the-art results on the RGB-D object dataset and show recognition in challenging, noisy real-world RGB-D settings. Comment: Final version submitted to IROS'2015, results unchanged, reformulation of some text passages in abstract and introduction
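The late fusion idea in the abstract above can be illustrated with a minimal sketch. It assumes each stream has already produced a feature vector and that fusion is a single linear layer followed by a softmax; the function names, feature sizes, and weights are hypothetical toy values, not the paper's architecture or parameters.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(rgb_features, depth_features, weights, biases):
    """Concatenate per-modality features and classify them jointly.

    Assumption: fusion is one linear layer over the concatenated vector.
    `weights` holds one row per class, each as long as the fused vector.
    """
    fused = list(rgb_features) + list(depth_features)
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Hypothetical toy inputs: two features per stream, two object classes.
rgb_features = [0.2, 0.9]
depth_features = [0.7, 0.1]
weights = [
    [1.0, 0.0, 1.0, 0.0],  # class 0 reads the first feature of each stream
    [0.0, 1.0, 0.0, 1.0],  # class 1 reads the second feature of each stream
]
biases = [0.0, 0.0]

probs = late_fusion(rgb_features, depth_features, weights, biases)
print(probs)
```

Because the streams stay separate until the final layer, each one could in principle be trained on its own modality before the fusion weights are learned, which is one plausible reading of the multi-stage training the abstract mentions.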

    Am I Done? Predicting Action Progress in Videos

    In this paper we deal with the problem of predicting action progress in videos. We argue that this is an extremely important task since it can be valuable for a wide range of interaction applications. To this end we introduce a novel approach, named ProgressNet, capable of predicting when an action takes place in a video, where it is located within the frames, and how far it has progressed during its execution. To provide a general definition of action progress, we ground our work in the linguistics literature, borrowing terms and concepts to understand which actions can be the subject of progress estimation. As a result, we define a categorization of actions and their phases. Motivated by the recent success obtained from the interaction of Convolutional and Recurrent Neural Networks, our model is based on a combination of the Faster R-CNN framework, to make frame-wise predictions, and LSTM networks, to estimate action progress through time. After introducing two evaluation protocols for the task at hand, we demonstrate the capability of our model to effectively predict action progress on the UCF-101 and J-HMDB datasets.
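To make "how far it has progressed" concrete, here is a minimal sketch of one possible progress target. It assumes progress grows linearly from 0 at the first annotated frame of an action to 1 at the last; this linear definition and the name `action_progress` are illustrative assumptions, not taken from the abstract.

```python
def action_progress(frame, start, end):
    """Return the fraction of an action completed at `frame`.

    Assumption: progress is linear over the annotated interval
    [start, end]. Frames outside the interval have no progress value,
    so None is returned for them.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    if frame < start or frame > end:
        return None
    return (frame - start) / (end - start)


# Hypothetical action annotated from frame 10 to frame 20.
print(action_progress(15, 10, 20))
```

A per-frame value like this could serve as a regression target for a recurrent model, with the `None` case corresponding to frames where no action is taking place.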

    A Survey of Embodied AI: From Simulators to Research Tasks

    There has been an emerging paradigm shift from the era of "internet AI" to "embodied AI", where AI algorithms and agents no longer learn from datasets of images, videos or text curated primarily from the internet. Instead, they learn through interactions with their environments from an egocentric perception similar to humans. Consequently, there has been substantial growth in the demand for embodied AI simulators to support various embodied AI research tasks. This growing interest in embodied AI is beneficial to the greater pursuit of Artificial General Intelligence (AGI), but there has not been a contemporary and comprehensive survey of this field. This paper aims to provide an encyclopedic survey for the field of embodied AI, from its simulators to its research. By evaluating nine current embodied AI simulators with our proposed seven features, this paper aims to understand the simulators in their provision for use in embodied AI research and their limitations. This paper then surveys the three main research tasks in embodied AI (visual exploration, visual navigation and embodied question answering (QA)), covering the state-of-the-art approaches, evaluation metrics and datasets. Finally, with the new insights revealed through surveying the field, the paper will provide suggestions for simulator-for-task selections and recommendations for the future directions of the field. Comment: Under Review for IEEE TETC

    Mixing Deep Networks and Entangled Forests for the Semantic Segmentation of 3D Indoor Scenes

    This work focuses on semantic segmentation of indoor 3D data, that is, assigning a label to every point in the point clouds representing working spaces. After reviewing the current state of the art, traditional approaches such as random forests and deep neural networks based on PointNet are evaluated. The Superpoint Graph architecture and the 3D Entangled Forests algorithm are selected, and their features are mixed in an attempt to enhance their performance.

    PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models

    Generalizable 3D part segmentation is important but challenging in vision and robotics. Training deep models via conventional supervised methods requires large-scale 3D datasets with fine-grained part annotations, which are costly to collect. This paper explores an alternative way for low-shot part segmentation of 3D point clouds by leveraging a pretrained image-language model, GLIP, which achieves superior performance on open-vocabulary 2D detection. We transfer the rich knowledge from 2D to 3D through GLIP-based part detection on point cloud rendering and a novel 2D-to-3D label lifting algorithm. We also utilize multi-view 3D priors and few-shot prompt tuning to boost performance significantly. Extensive evaluation on PartNet and PartNet-Mobility datasets shows that our method enables excellent zero-shot 3D part segmentation. Our few-shot version not only outperforms existing few-shot approaches by a large margin but also achieves highly competitive results compared to the fully supervised counterpart. Furthermore, we demonstrate that our method can be directly applied to iPhone-scanned point clouds without significant domain gaps. Comment: CVPR 2023, project page: https://colin97.github.io/PartSLIP_page
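The general idea of lifting 2D detections to 3D point labels can be sketched as a simple multi-view vote. This is a simplified stand-in and not the paper's label lifting algorithm: it assumes each view supplies the pixel coordinates of its visible points and a list of labeled 2D boxes, and every name and value below is hypothetical.

```python
from collections import Counter


def lift_labels(projections, detections, num_points):
    """Assign each 3D point the part label that most views vote for.

    projections: one dict per view, mapping point index -> (u, v) pixel
                 coordinates for the points visible in that view.
    detections:  one list per view of (label, (x_min, y_min, x_max, y_max)).
    Points that never fall inside any box are labeled None.
    """
    votes = [Counter() for _ in range(num_points)]
    for view_points, view_boxes in zip(projections, detections):
        for index, (u, v) in view_points.items():
            for label, (x_min, y_min, x_max, y_max) in view_boxes:
                if x_min <= u <= x_max and y_min <= v <= y_max:
                    votes[index][label] += 1
    return [c.most_common(1)[0][0] if c else None for c in votes]


# Hypothetical scene: three points seen from two rendered views.
projections = [
    {0: (5, 5), 1: (20, 20)},   # view 0 (point 2 is occluded)
    {0: (6, 4), 1: (22, 18)},   # view 1 (point 2 is occluded)
]
detections = [
    [("handle", (0, 0, 10, 10))],
    [("handle", (0, 0, 10, 10)), ("lid", (15, 15, 30, 30))],
]

labels = lift_labels(projections, detections, num_points=3)
print(labels)
```

Voting across views is one straightforward way to reconcile detections that disagree or miss a part in a single rendering; a practical method would also need to handle visibility, box confidence, and grouping of points, which this sketch leaves out.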

    ALET (Automated Labeling of Equipment and Tools): A Dataset, a Baseline and a Usecase for Tool Detection in the Wild

    Robots collaborating with humans in realistic environments will need to be able to detect the tools that can be used and manipulated. However, there is no available dataset or study that addresses this challenge in real settings. In this paper, we fill this gap by providing an extensive dataset (METU-ALET) for detecting farming, gardening, office, stonemasonry, vehicle, woodworking and workshop tools. The scenes correspond to sophisticated environments with or without humans using the tools. The scenes we consider introduce several challenges for object detection, including the small scale of the tools, their articulated nature, occlusion, inter-class invariance, etc. Moreover, we train and compare several state-of-the-art deep object detectors (including Faster R-CNN, Cascade R-CNN, RepPoint and RetinaNet) on our dataset. We observe that the detectors have particular difficulty detecting small-scale tools and tools that are visually similar to parts of other tools. This in turn supports the importance of our dataset and paper. With the dataset, the code and the trained models, our work provides a basis for further research into tools and their use in robotics applications. Comment: 7 pages, 4 figures