626 research outputs found
Multimodal Uncertainty Reduction for Intention Recognition in Human-Robot Interaction
Assistive robots can potentially improve the quality of life and personal independence of elderly people by supporting everyday activities. To guarantee a safe and intuitive interaction between human and robot, human intentions need to be recognized automatically. As humans communicate their intentions multimodally, the use of multiple modalities for intention recognition may not only increase robustness against the failure of individual modalities but, above all, reduce the uncertainty about the intention to be recognized. This is desirable because, particularly in direct interaction between robots and potentially vulnerable humans, both minimal uncertainty about the situation and knowledge of that remaining uncertainty are necessary. Thus, in contrast to existing methods, this work introduces a new approach to multimodal intention recognition that focuses on uncertainty reduction through classifier fusion. For the four considered modalities (speech, gestures, gaze direction, and scene objects), individual intention classifiers are trained, each of which outputs a probability distribution over all possible intentions. By combining these output distributions with the Bayesian method Independent Opinion Pool, the uncertainty about the intention to be recognized can be decreased. The approach is evaluated in a collaborative human-robot interaction task with a 7-DoF robot arm. The results show that fused classifiers combining multiple modalities outperform the respective individual base classifiers, achieving higher accuracy, greater robustness, and lower uncertainty.
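To make the fusion step concrete, the Independent Opinion Pool multiplies the per-modality posterior distributions elementwise and renormalizes the product. The minimal sketch below illustrates this under the usual conditional-independence assumption; the three-intention label set and the classifier outputs are hypothetical.

```python
import numpy as np

def independent_opinion_pool(distributions):
    """Fuse per-modality probability distributions over intentions.

    Each row of `distributions` is one classifier's posterior over the
    same set of intentions. The fused posterior is the renormalized
    elementwise product of the individual posteriors.
    """
    fused = np.prod(np.asarray(distributions), axis=0)
    return fused / fused.sum()

# Hypothetical posteriors from the speech, gesture, gaze, and object classifiers
speech  = [0.60, 0.30, 0.10]
gesture = [0.40, 0.40, 0.20]
gaze    = [0.70, 0.20, 0.10]
objects = [0.50, 0.25, 0.25]

fused = independent_opinion_pool([speech, gesture, gaze, objects])
print(fused)  # the fused distribution is sharper (less uncertain) than any single input
```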
Deep Learning-Based Robotic Perception for Adaptive Facility Disinfection
Hospitals, schools, airports, and other environments built for mass gatherings can become hot spots for microbial pathogen colonization, transmission, and exposure, greatly accelerating the spread of infectious diseases across communities, cities, nations, and the world. Outbreaks of infectious diseases impose huge burdens on our society. Mitigating the spread of infectious pathogens within mass-gathering facilities requires routine cleaning and disinfection, which are primarily performed by cleaning staff under current practice. However, manual disinfection is limited in terms of both effectiveness and efficiency, as it is labor-intensive, time-consuming, and health-undermining. While existing studies have developed a variety of robotic systems for disinfecting contaminated surfaces, those systems are not adequate for intelligent, precise, and environmentally adaptive disinfection. They are also difficult to deploy in mass-gathering infrastructure facilities, given the high volume of occupants. Therefore, there is a critical need to develop an adaptive robot system capable of complete and efficient indoor disinfection.
The overarching goal of this research is to develop an artificial intelligence (AI)-enabled robotic system that adapts to ambient environments and social contexts for precise and efficient disinfection. This would maintain environmental hygiene and health, reduce unnecessary labor costs for cleaning, and mitigate the opportunity costs incurred from infections. To these ends, this dissertation first develops a multi-classifier decision fusion method, which integrates scene graph and visual information, in order to recognize patterns of human activity in infrastructure facilities. Next, a deep-learning-based method is proposed for detecting and classifying indoor objects, and a new mechanism is developed to map detected objects in 3D maps. A novel framework is then developed to detect and segment object affordances and to project them into a 3D semantic map for precise disinfection. Subsequently, a novel deep-learning network, which integrates multi-scale and multi-level features, and an encoder network are developed to recognize the materials of surfaces requiring disinfection. Finally, a novel computational method is developed to link the recognition of object surface information to robot disinfection actions with optimal disinfection parameters.
GraspGPT: Leveraging Semantic Knowledge from a Large Language Model for Task-Oriented Grasping
Task-oriented grasping (TOG) refers to the problem of predicting grasps on an object that enable subsequent manipulation tasks. To model the complex relationships between objects, tasks, and grasps, existing methods incorporate semantic knowledge as priors into TOG pipelines. However, the existing semantic knowledge is typically constructed from closed-world concept sets, restricting generalization to novel concepts outside the pre-defined sets. To address this issue, we propose GraspGPT, a large language model (LLM) based TOG framework that leverages the open-ended semantic knowledge of an LLM to achieve zero-shot generalization to novel concepts. We conduct experiments on the Language Augmented TaskGrasp (LA-TaskGrasp) dataset and demonstrate that GraspGPT outperforms existing TOG methods under different held-out settings when generalizing to novel concepts outside the training set. The effectiveness of GraspGPT is further validated in real-robot experiments. Our code, data, appendix, and video are publicly available at https://sites.google.com/view/graspgpt/.
On the Linguistic and Computational Requirements for Creating Face-to-Face Multimodal Human-Machine Interaction
In this study, conversations between humans and avatars are linguistically,
organizationally, and structurally analyzed, focusing on what is necessary for
creating face-to-face multimodal interfaces for machines. We videorecorded
thirty-four human-avatar interactions, performed complete linguistic
microanalysis on video excerpts, and marked all the occurrences of multimodal
actions and events. Statistical inferences were applied to the data, allowing us to understand not only how often multimodal actions occur but also how multimodal events are distributed between the speaker (emitter) and the listener (recipient). We also observed the distribution of multimodal occurrences for each modality. The data show evidence that double-loop feedback is established during a face-to-face conversation. This led us to propose that knowledge from Conversation Analysis (CA), cognitive science, and Theory of Mind (ToM), among others, should be incorporated into the frameworks used for describing human-machine multimodal interactions. Face-to-face interfaces require a control layer in addition to the multimodal fusion layer. This layer has to organize the flow of the conversation, integrate the social context into the interaction, and plan 'what' and 'how' to progress the interaction. This higher level is best understood if we incorporate insights from CA and ToM into the interface system.
Context-Independent Task Knowledge for Neurosymbolic Reasoning in Cognitive Robotics
One of the current main goals of artificial intelligence and robotics research is the creation of an artificial assistant with flexible, human-like behavior that can accomplish everyday tasks. Much of what constitutes context-independent task knowledge for the human is what enables this flexibility at multiple levels of cognition. Within this scope, the author analyzes how to acquire, represent, and disambiguate symbolic knowledge representing context-independent task knowledge, abstracted from multiple instances: this thesis elaborates the problems encountered, the implementation constraints, current state-of-the-art practices, and ultimately the solutions newly introduced within this scope. The author specifically discusses the acquisition of context-independent task knowledge from large amounts of human-written text and its reusability in the robotics domain; the acquisition of knowledge on human musculoskeletal dependencies constraining motion, which allows a better high-level representation of observed trajectories; and the means of verbalizing partial contextual and instruction knowledge, increasing interaction possibilities with the human as well as contextual adaptation. All the aforementioned points are supported by evaluations in heterogeneous setups, to provide a view on how to make optimal use of statistical and symbolic methods (i.e., neurosymbolic reasoning) in cognitive robotics. This work has been performed to enable context-adaptable artificial assistants by bringing together knowledge on what is usually regarded as context-independent task knowledge.
Action-oriented Scene Understanding
In order to allow robots to act autonomously it is crucial that they do not only describe their environment accurately but also identify how to interact with their surroundings.
While we witnessed tremendous progress in descriptive computer vision, approaches that explicitly target action are scarcer.
This cumulative dissertation approaches the goal of interpreting visual scenes “in the wild” with respect to actions implied by the scene. We call this approach action-oriented scene understanding. It involves identifying and judging opportunities for interaction with constituents of the scene (e.g. objects and their parts) as well as understanding object functions and how interactions will impact the future. All of these aspects are addressed on three levels of abstraction: elements, perception and reasoning.
On the elementary level, we investigate semantic and functional grouping of objects by analyzing annotated natural image scenes. We compare object label-based and visual context definitions with respect to their suitability for generating meaningful object class representations. Our findings suggest that representations generated from visual context are on par, in terms of semantic quality, with those generated from large quantities of text.
The perceptive level concerns action identification. We propose a system to identify possible interactions for robots and humans with the environment (affordances) on a pixel level using state-of-the-art machine learning methods. Pixel-wise part annotations of images are transformed into 12 affordance maps. Using these maps, a convolutional neural network is trained to densely predict affordance maps from unknown RGB images. In contrast to previous work, this approach operates exclusively on RGB images during both training and testing, and yet achieves state-of-the-art performance.
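As a concrete illustration of that dense prediction setup, the sketch below shows a minimal encoder-decoder in PyTorch that maps an RGB image to 12 per-pixel affordance maps with independent sigmoid outputs (a pixel may carry several affordances at once). The architecture, layer sizes, and input resolution are illustrative assumptions, not the network used in the dissertation.

```python
import torch
import torch.nn as nn

class AffordanceNet(nn.Module):
    """Minimal encoder-decoder mapping an RGB image to 12 per-pixel
    affordance maps. Illustrative stand-in, not the dissertation's model."""
    def __init__(self, num_affordances=12):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_affordances, 4, stride=2, padding=1),
        )

    def forward(self, rgb):
        # One sigmoid score per pixel and affordance: multi-label, not softmax,
        # because a single pixel may support several interactions.
        return torch.sigmoid(self.decoder(self.encoder(rgb)))

model = AffordanceNet()
maps = model(torch.randn(1, 3, 128, 128))  # -> (1, 12, 128, 128)
```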
At the reasoning level, we extend the question from asking what actions are possible to what actions are plausible. For this, we gathered a dataset of household images associated with human ratings of the likelihoods of eight different actions. Based on the judgement provided by the human raters, we train convolutional neural networks to generate plausibility scores from unseen images.
Furthermore, having considered only static scenes previously in this thesis, we propose a system that takes video input and predicts plausible future actions. Since this requires careful identification of relevant features in the video sequence, we analyze this particular aspect in detail using a synthetic dataset for several state-of-the-art video models. We identify feature learning as a major obstacle for anticipation in natural video data.
The presented projects analyze the role of action in scene understanding from various angles and in multiple settings while highlighting the advantages of assuming an action-oriented perspective.
We conclude that action-oriented scene understanding can augment classic computer vision in many real-life applications, in particular robotics.
Pave the Way to Grasp Anything: Transferring Foundation Models for Universal Pick-Place Robots
Improving the generalization capabilities of general-purpose robotic agents has long been a significant challenge actively pursued by research communities. Existing approaches often rely on collecting large-scale real-world robotic data, such as the RT-1 dataset. However, these approaches typically suffer from low efficiency, limiting their capability in open-domain scenarios with new objects and diverse backgrounds. In this paper, we propose a novel paradigm that effectively leverages language-grounded segmentation masks generated by state-of-the-art foundation models to address a wide range of pick-and-place robot manipulation tasks in everyday scenarios. By integrating the precise semantics and geometry conveyed by the masks into our multi-view policy model, our approach can perceive accurate object poses and enable sample-efficient learning. Moreover, this design facilitates effective generalization to grasping new objects with shapes similar to those observed during training. Our approach consists of two distinct steps. First, we introduce a series of foundation models to accurately ground natural language demands across multiple tasks. Second, we develop a Multi-modal Multi-view Policy Model that incorporates inputs such as RGB images, semantic masks, and robot proprioception states to jointly predict precise and executable robot actions. Extensive real-world experiments conducted on a Franka Emika robot arm validate the effectiveness of our proposed paradigm. Real-world demos are shown on YouTube (https://www.youtube.com/watch?v=1m9wNzfp_4E) and Bilibili (https://www.bilibili.com/video/BV178411Z7H2/).
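As a rough illustration of how such a policy might fuse its inputs, the sketch below combines one RGB view, its language-grounded segmentation mask, and proprioception into an end-effector action. It is a single-view simplification with assumed dimensions and layers, not the paper's Multi-modal Multi-view Policy Model.

```python
import torch
import torch.nn as nn

class MultiModalPolicy(nn.Module):
    """Illustrative policy fusing an RGB view, its segmentation mask, and
    proprioception into an action. Layers and sizes are assumptions."""
    def __init__(self, proprio_dim=7, action_dim=7):
        super().__init__()
        # RGB (3 channels) and mask (1 channel) are stacked and encoded jointly.
        self.visual = nn.Sequential(
            nn.Conv2d(4, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 + proprio_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, rgb, mask, proprio):
        feat = self.visual(torch.cat([rgb, mask], dim=1))
        return self.head(torch.cat([feat, proprio], dim=1))

policy = MultiModalPolicy()
action = policy(torch.randn(2, 3, 96, 96), torch.randn(2, 1, 96, 96),
                torch.randn(2, 7))  # -> (2, 7) predicted actions
```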
Robotics Dexterous Grasping: The Methods Based on Point Cloud and Deep Learning
Dexterous manipulation, especially dexterous grasping, is a fundamental and crucial ability of robots that allows them to perform human-like behaviors. Deploying this ability on robots enables them to assist and substitute for humans in accomplishing more complex tasks in daily life and industrial production. This paper gives a comprehensive review, from three perspectives, of point-cloud- and deep-learning-based methods for robotic dexterous grasping. The proposed generation-evaluation framework, a new categorization scheme for the mainstream methods, is the core concept of the classification. The other two classifications, based on learning modes and applications, are also briefly described afterwards. This review aims to provide a guideline for researchers and developers of robotic dexterous grasping.
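To illustrate the generation-evaluation idea named above, the sketch below samples grasp candidates from a point cloud and scores them with a scoring function, keeping the best one. Real generators and evaluators are learned, far more sophisticated models; all names and parameters here are hypothetical.

```python
import numpy as np

def sample_grasp_candidates(point_cloud, num_candidates=64, rng=None):
    """Naively sample grasp centers from the point cloud with random yaw
    angles; real grasp generators are far more sophisticated."""
    rng = rng if rng is not None else np.random.default_rng()
    centers = point_cloud[rng.integers(len(point_cloud), size=num_candidates)]
    angles = rng.uniform(0, np.pi, size=(num_candidates, 1))
    return np.hstack([centers, angles])  # (N, 4): x, y, z, yaw

def evaluate_grasps(candidates, score_fn):
    """Score every candidate with a quality model and return the best one
    (the generation-evaluation pattern in its simplest form)."""
    scores = np.array([score_fn(g) for g in candidates])
    return candidates[scores.argmax()], scores.max()

# Hypothetical usage with a dummy point cloud and a dummy scoring function
cloud = np.random.rand(2048, 3)
best, score = evaluate_grasps(sample_grasp_candidates(cloud),
                              score_fn=lambda g: -abs(g[2] - 0.5))
```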