Multimodal Grounding for Embodied AI via Augmented Reality Headsets for Natural Language Driven Task Planning
Recent advances in generative modeling have spurred a resurgence in the field
of Embodied Artificial Intelligence (EAI). EAI systems typically deploy large
language models to physical systems capable of interacting with their
environment. In our exploration of EAI for industrial domains, we successfully
demonstrate the feasibility of co-located, human-robot teaming. Specifically,
we construct an experiment where an Augmented Reality (AR) headset mediates
information exchange between an EAI agent and human operator for a variety of
inspection tasks. To our knowledge, the use of an AR headset for multimodal
grounding and the application of EAI to industrial tasks are novel
contributions within Embodied AI research. In addition, we highlight potential
pitfalls in constructing EAI systems by providing quantitative and qualitative
analysis of prompt robustness.
Comment: 18 pages, 15 figures
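A minimal sketch of how prompt robustness might be quantified, assuming a task-success evaluation over paraphrased instructions; the `run_agent` function, the paraphrase set, and the trial count are hypothetical placeholders rather than the paper's actual protocol.

```python
# Illustrative sketch: quantify prompt robustness as task-success rate
# across paraphrased instructions. `run_agent` is a hypothetical stand-in
# for the EAI agent's planning-and-execution loop.
from statistics import mean

def run_agent(prompt: str) -> bool:
    """Hypothetical: returns True if the agent completes the inspection task."""
    raise NotImplementedError

def prompt_robustness(paraphrases: list[str], trials: int = 5) -> float:
    """Success rate averaged over paraphrases and repeated trials."""
    rates = []
    for prompt in paraphrases:
        successes = sum(run_agent(prompt) for _ in range(trials))
        rates.append(successes / trials)
    return mean(rates)

# Example: nominally equivalent phrasings of one inspection instruction.
paraphrases = [
    "Inspect the valve on the left panel and report its state.",
    "Check whether the left-panel valve is open or closed.",
    "Go to the left panel and tell me the valve's status.",
]
# score = prompt_robustness(paraphrases)
```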
Creating a low resource semantic parser for the unified meaning representation format
This thesis investigates the performance of state-of-the-art neural models on a
low resource semantic parsing task. This task required the models to convert natural
language commands directed at a robot into Unified Meaning Representation Format
(UMRF) structures. UMRF structures are standalone Meaning Representation (MR)
containers that support embedding predicate-argument semantics and graphical MR
formats. The structure was designed for semi-autonomous systems in Human Robot
Interaction (HRI) domains. Because the UMRF formalism is new, annotated UMRF data,
and therefore training data, is scarce. For this project, the Examine in light task
from the ALFRED dataset was selected as the corpus for annotating labeled UMRF
training and validation examples; 1,010 training examples and 100 validation
examples were collected. Thereafter, the following
models were tested on the low resource semantic parsing task: sequence-to-sequence,
CopyNet, and transformer architectures. Of the three designs, the CopyNet model performed the best with a BLEU score of 0.891 and an accuracy of 61.3%. Once
the design was finalized, the CopyNet model was integrated into a ROS2 software
package, allowing the larger robotics community to access the semantic parser.
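A minimal sketch of the kind of evaluation described above, assuming predicted and gold UMRF structures are serialized as strings: BLEU via NLTK and exact-match accuracy as plain string comparison. The function name `evaluate_parser` and the toy UMRF-like schema in the example are illustrative, not the thesis code or the actual UMRF format.

```python
# Illustrative sketch: score a semantic parser's predicted UMRF strings
# against gold annotations with corpus BLEU and exact-match accuracy.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def evaluate_parser(predictions: list[str], references: list[str]) -> dict:
    """predictions/references: serialized UMRF structures, one per command."""
    smooth = SmoothingFunction().method1
    # corpus_bleu expects, per hypothesis, a list of tokenized references.
    refs = [[r.split()] for r in references]
    hyps = [p.split() for p in predictions]
    bleu = corpus_bleu(refs, hyps, smoothing_function=smooth)
    exact = sum(p == r for p, r in zip(predictions, references)) / len(references)
    return {"bleu": bleu, "accuracy": exact}

# Example with toy UMRF-like strings (schema here is hypothetical):
# gold = ['{"action": "examine", "object": "mug", "modifier": "in_light"}']
# pred = ['{"action": "examine", "object": "mug", "modifier": "in_light"}']
# print(evaluate_parser(pred, gold))  # -> {'bleu': ..., 'accuracy': 1.0}
```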
Temporal and Semantic Evaluation Metrics for Foundation Models in Post-Hoc Analysis of Robotic Sub-tasks
Recent works in Task and Motion Planning (TAMP) show that training control
policies on language-supervised robot trajectories with quality labeled data
markedly improves agent task success rates. However, the scarcity of such data
presents a significant hurdle to extending these methods to general use cases.
To address this concern, we present an automated framework to decompose
trajectory data into temporally bounded and natural language-based descriptive
sub-tasks by leveraging recent prompting strategies for Foundation Models (FMs)
including both Large Language Models (LLMs) and Vision Language Models (VLMs).
Our framework provides both time-based and language-based descriptions for
lower-level sub-tasks that comprise full trajectories. To rigorously evaluate
the quality of our automatic labeling framework, we contribute an algorithm
SIMILARITY to produce two novel metrics, temporal similarity and semantic
similarity. The metrics measure the temporal alignment and semantic fidelity of
language descriptions between two sub-task decompositions, namely an FM
sub-task decomposition prediction and a ground-truth sub-task decomposition. We
present scores for temporal similarity and semantic similarity above 90%,
compared to 30% for a randomized baseline, across multiple robotic environments,
demonstrating the effectiveness of our proposed framework. Our results enable
building diverse, large-scale, language-supervised datasets for improved
robotic TAMP.
Comment: 8 pages, 3 figures. IROS 2024 Submission
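A minimal sketch of one plausible reading of the two metrics, assuming each sub-task is a (start, end, description) triple: temporal similarity as interval overlap (IoU) between sub-tasks paired in order, and token-overlap as a crude stand-in for semantic similarity between descriptions. This illustrates the idea only and is not the paper's SIMILARITY algorithm.

```python
# Illustrative sketch: compare a predicted sub-task decomposition to a
# ground-truth one. Each sub-task is (start_time, end_time, description).
# Token-overlap (Jaccard) stands in for a learned semantic similarity.

def interval_iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def token_jaccard(x: str, y: str) -> float:
    """Crude stand-in for semantic similarity between two descriptions."""
    sx, sy = set(x.lower().split()), set(y.lower().split())
    return len(sx & sy) / len(sx | sy) if sx | sy else 0.0

def decomposition_similarity(pred, gold):
    """Average temporal and semantic similarity, pairing sub-tasks in order."""
    pairs = list(zip(pred, gold))
    temporal = sum(interval_iou(p[:2], g[:2]) for p, g in pairs) / len(pairs)
    semantic = sum(token_jaccard(p[2], g[2]) for p, g in pairs) / len(pairs)
    return temporal, semantic

# Example decompositions of one pick-and-place trajectory:
pred = [(0.0, 2.1, "reach toward the red block"), (2.1, 4.0, "grasp the block")]
gold = [(0.0, 2.0, "move gripper to the red block"), (2.0, 4.2, "close gripper on block")]
# print(decomposition_similarity(pred, gold))
```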