327 research outputs found

    Towards Realistic Embodied AI Agents

    Get PDF
    Recent years have witnessed the inception of a growing field of inquiry within the broader AI community termed "Embodied AI". Problems studied under the umbrella of Embodied AI include the introduction of scene datasets and simulators to train AI agents to perform a wide spectrum of tasks requiring a curriculum of capabilities. While progress on this front has been commendable, it is nonetheless important and worthwhile to pause and carefully examine the real-world context in which such AI agents would be expected to operate. In doing so, it is critical to ensure "realism", i.e. that the settings, parameters, and assumptions under which these agents and tasks are investigated in simulation indeed serve as the right test beds and high-fidelity precursors to the real world. Simulation has the advantages of being fast, scalable/distributed, and safe, and it is therefore valuable to strive to make simulations more realistic. Towards that end, this thesis serves as an investigation into realism for Embodied AI agents in simulation. We study realism along 3 different axes. (1) Photorealism: the visual appearance of objects and rooms in indoor scenes, as viewed by the agent in simulation, must closely approximate what the agent would actually see in the real world. (2) Sensing and Actuation Realism: embodied agents in simulation are often equipped with a variety of idealized sensors that provide highly privileged, noise-free sensing signals (depending on the task they are being trained for), and they take deterministic actions. This is in contrast to the messy reality of noisy sensors and actuations in the real world. (3) Task Realism: moving beyond realistic sensors and actuations, we need to ensure that the assumptions made while formulating tasks, and the settings under which these tasks are evaluated in simulation, do indeed match the deployment scenarios and use-cases in the real world. Finally, the thesis also explores connections between these different axes of realism. (Ph.D. thesis)
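    To make the sensing and actuation realism axis concrete, the sketch below corrupts an idealized depth observation and a deterministic "move forward" action with noise. It is a minimal illustration with hypothetical Gaussian noise parameters and a zero-depth dropout model, not the noise models studied in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_depth(depth_m: np.ndarray, sigma: float = 0.01, dropout_p: float = 0.005) -> np.ndarray:
    """Corrupt an idealized depth image with Gaussian noise and random dropout.

    Illustrative only: real depth sensors have range- and material-dependent
    error models, not a single global sigma.
    """
    noisy = depth_m + rng.normal(0.0, sigma, size=depth_m.shape)
    mask = rng.random(depth_m.shape) < dropout_p
    noisy[mask] = 0.0  # missing returns reported as zero depth
    return np.clip(noisy, 0.0, None)

def noisy_forward(position: np.ndarray, heading_rad: float,
                  step_m: float = 0.25, trans_sigma: float = 0.02,
                  rot_sigma: float = 0.02) -> tuple[np.ndarray, float]:
    """Apply a 'move forward' action with translational and rotational slip."""
    actual_step = step_m + rng.normal(0.0, trans_sigma)
    actual_heading = heading_rad + rng.normal(0.0, rot_sigma)
    delta = actual_step * np.array([np.cos(actual_heading), np.sin(actual_heading)])
    return position + delta, actual_heading

# Example: one noisy observation-action cycle.
depth = np.full((4, 4), 2.0)            # idealized 2 m depth everywhere
obs = noisy_depth(depth)
pos, heading = noisy_forward(np.zeros(2), 0.0)
```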

    Applications of Large Scale Foundation Models for Autonomous Driving

    Full text link
    Since the DARPA Grand Challenges (rural) in 2004/05 and the Urban Challenge in 2007, autonomous driving has been one of the most active fields of AI application. Recently, chat systems powered by large language models (LLMs), such as ChatGPT and PaLM, have emerged and rapidly become a promising route toward artificial general intelligence (AGI) in natural language processing (NLP). It is natural to ask whether these capabilities could be employed to reformulate autonomous driving. By combining LLMs with foundation models, it becomes possible to draw on human knowledge, commonsense, and reasoning to rebuild autonomous driving systems and move beyond the current long-tailed AI dilemma. In this paper, we investigate the techniques of foundation models and LLMs applied to autonomous driving, categorized into simulation, world models, data annotation, and planning or end-to-end (E2E) solutions. Comment: 23 pages. A survey paper

    Core Challenges in Embodied Vision-Language Planning

    Full text link
    Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approaches and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly use computer vision and natural language. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the new and current algorithmic approaches, metrics, simulated environments, as well as the datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalizability and furthers real-world deployment. Comment: 35 pages

    Learning a Visually Grounded Memory Assistant

    Full text link
    We introduce a novel interface for large-scale collection of human memory and assistance. Using the 3D Matterport simulator, we create realistic indoor environments in which we have people perform specific embodied memory tasks that mimic household daily activities. This interface was then deployed on Amazon Mechanical Turk, allowing us to test and record human memory, navigation, and needs for assistance at a scale that was previously impossible. Using the interface, we collect the 'Visually Grounded Memory Assistant Dataset', which is aimed at developing our understanding of (1) the information people encode during navigation of 3D environments and (2) the conditions under which people ask for memory assistance. Additionally, we experiment with predicting when people will ask for assistance using models trained on hand-selected visual and semantic features. This provides an opportunity to build stronger ties between the machine-learning and cognitive-science communities through learned models of human perception, memory, and cognition.
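    As a rough illustration of the prediction task described above, the sketch below trains a binary classifier over hand-selected features to predict assistance requests. The feature names, synthetic data, and labels are hypothetical stand-ins, not the dataset's actual features or annotations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score

# Hypothetical hand-selected features per navigation step, e.g.
# [time_since_last_landmark, num_rooms_visited, visual_novelty, task_progress]
rng = np.random.default_rng(0)
X = rng.random((500, 4))
# Synthetic labels: 1 = person asked for memory assistance at this step.
y = (X[:, 0] + 0.5 * X[:, 2] + 0.2 * rng.normal(size=500) > 0.9).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("average precision:", average_precision_score(y_te, clf.predict_proba(X_te)[:, 1]))
```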

    When LLMs step into the 3D world: a survey and meta-analysis of 3D tasks via multi-modal Large Language Models

    Get PDF
    As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed: https://github.com/ActiveVisionLab/Awesome-LLM-3D
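    Many of the surveyed 3D-LLM methods follow a broadly similar recipe: encode the 3D representation, project the features into the LLM's token-embedding space, and feed them alongside text tokens. The toy sketch below illustrates that general pattern with made-up module sizes; it is not the architecture of any specific surveyed system.

```python
import torch
import torch.nn as nn

class PointCloudPrefixEncoder(nn.Module):
    """Toy illustration of a common 3D-LLM recipe: encode a point cloud,
    project the features into the LLM's token-embedding space, and prepend
    them as prefix tokens. Real systems use pretrained 3D encoders and LLMs."""

    def __init__(self, llm_dim: int = 4096, n_prefix: int = 32):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, 256), nn.ReLU(), nn.Linear(256, 256))
        self.project = nn.Linear(256, llm_dim)
        self.n_prefix = n_prefix

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) xyz coordinates
        feats = self.point_mlp(points)                                 # (B, N, 256)
        # Crude pooling of point features into a fixed number of prefix tokens.
        chunks = feats.chunk(self.n_prefix, dim=1)
        pooled = torch.stack([c.mean(dim=1) for c in chunks], dim=1)   # (B, n_prefix, 256)
        return self.project(pooled)                                    # (B, n_prefix, llm_dim)

# The prefix tokens would be concatenated with embedded text tokens
# before being fed to a frozen or adapter-tuned LLM.
encoder = PointCloudPrefixEncoder()
prefix = encoder(torch.randn(2, 1024, 3))   # -> shape (2, 32, 4096)
```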

    Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following

    Full text link
    We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D images, language, audio, and video. Guided by ImageBind, we construct a joint embedding space between 3D and the other modalities, enabling many promising applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D open-world understanding. On top of this, we further present Point-LLM, the first 3D large language model (LLM) that follows 3D multi-modal instructions. Via parameter-efficient fine-tuning techniques, Point-LLM injects the semantics of Point-Bind into pre-trained LLMs, e.g., LLaMA, requiring no 3D instruction data yet exhibiting superior 3D and multi-modal question-answering capacity. We hope our work sheds light on extending 3D point clouds to multi-modality applications. Code is available at https://github.com/ZiyuGuo99/Point-Bind_Point-LLM. Comment: Work in progress. Code is available at https://github.com/ZiyuGuo99/Point-Bind_Point-LLM
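    As a rough sketch of the joint embedding space and the "3D embedding arithmetic" mentioned above, the snippet below composes a point-cloud embedding with a text embedding and retrieves the nearest 3D candidate by cosine similarity. The embeddings here are random stand-ins; in the actual system they would come from the Point-Bind 3D encoder and the ImageBind encoders.

```python
import torch
import torch.nn.functional as F

# Hypothetical unit-normalized embeddings in a shared (ImageBind-style) space.
dim, n_candidates = 1024, 8
torch.manual_seed(0)
e_point = F.normalize(torch.randn(dim), dim=0)                     # e.g. a chair point cloud
e_text  = F.normalize(torch.randn(dim), dim=0)                     # e.g. the prompt "wooden"
candidates = F.normalize(torch.randn(n_candidates, dim), dim=1)    # gallery of 3D shapes

# "Embedding arithmetic": compose modalities by adding their embeddings,
# then retrieve the nearest candidate by cosine similarity.
query = F.normalize(e_point + e_text, dim=0)
scores = candidates @ query
print("best match index:", scores.argmax().item())
```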