
    A Survey of Embodied AI: From Simulators to Research Tasks

    There has been an emerging paradigm shift from the era of "internet AI" to "embodied AI", where AI algorithms and agents no longer learn from datasets of images, videos or text curated primarily from the internet. Instead, they learn through interactions with their environments from an egocentric perception similar to humans. Consequently, demand has grown substantially for embodied AI simulators that support various embodied AI research tasks. This growing interest in embodied AI benefits the greater pursuit of Artificial General Intelligence (AGI), but the field has lacked a contemporary and comprehensive survey. This paper aims to provide an encyclopedic survey of the field of embodied AI, from its simulators to its research. By evaluating nine current embodied AI simulators against seven proposed features, this paper seeks to understand both what the simulators offer for embodied AI research and their limitations. The paper then surveys the three main research tasks in embodied AI -- visual exploration, visual navigation and embodied question answering (QA) -- covering the state-of-the-art approaches, evaluation metrics and datasets. Finally, with the new insights revealed through surveying the field, the paper provides suggestions for simulator-for-task selection and recommendations for future directions of the field. (Comment: Under Review for IEEE TETC)
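The survey's simulator-for-task selection idea can be sketched as a weighted feature matrix: score each simulator on a set of features, then rank by the features a given research task weights most. The simulator names, feature names, and scores below are all illustrative, not taken from the survey.

```python
# Rank simulators by a task-weighted sum of feature scores.
# All names and numbers here are invented for illustration.

def rank_simulators(scores, task_weights):
    """Rank simulators by the weighted sum of their feature scores."""
    def weighted(sim):
        return sum(scores[sim].get(f, 0) * w for f, w in task_weights.items())
    return sorted(scores, key=weighted, reverse=True)

# Illustrative feature scores (0-2) for three hypothetical simulators.
scores = {
    "SimA": {"realism": 2, "physics": 1, "object_interaction": 2},
    "SimB": {"realism": 1, "physics": 2, "object_interaction": 1},
    "SimC": {"realism": 2, "physics": 0, "object_interaction": 0},
}

# A navigation-style task that cares mostly about visual realism.
nav_weights = {"realism": 1.0, "physics": 0.2, "object_interaction": 0.1}
ranking = rank_simulators(scores, nav_weights)
```

For a manipulation-heavy task, one would simply shift the weights toward physics and object interaction, and the ranking changes accordingly.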

    Vision for Social Robots: Human Perception and Pose Estimation

    In order to extract the underlying meaning from a scene captured from the surrounding world in a single still image, social robots will need to learn the human ability to detect different objects, understand their arrangement and relationships relative both to their own parts and to each other, and infer the dynamics under which they are evolving. Furthermore, they will need to develop and hold a notion of context to allow assigning different meanings (semantics) to the same visual configuration (syntax) of a scene. The underlying thread of this Thesis is the investigation of new ways for enabling interactions between social robots and humans, by advancing the visual perception capabilities of robots when they process images and videos in which humans are the main focus of attention. First, we analyze the general problem of scene understanding, as social robots moving through the world need to be able to interpret scenes without having been assigned a specific preset goal. Throughout this line of research, i) we observe that human actions and interactions which can be visually discriminated from an image follow a very heavy-tailed distribution; ii) we develop an algorithm that can obtain a spatial understanding of a scene by only using cues arising from the effect of perspective on a picture of a person’s face; and iii) we define a novel taxonomy of errors for the task of estimating the 2D body pose of people in images to better explain the behavior of algorithms and highlight their underlying causes of error. Second, we focus on the specific task of 3D human pose and motion estimation from monocular 2D images using weakly supervised training data, as accurately predicting human pose will open up the possibility of richer interactions between humans and social robots. 
We show that when 3D ground-truth data is only available in small quantities, or not at all, it is possible to leverage knowledge about the physical properties of the human body, along with additional constraints related to alternative types of supervisory signals, to learn models that can regress the full 3D pose of the human body and predict its motions from monocular 2D images. Taken in its entirety, the intent of this Thesis is to highlight the importance of, and provide novel methodologies for, social robots' ability to interpret their surrounding environment, learn in a way that is robust to low data availability, and generalize previously observed behaviors to unknown situations in a similar way to humans.
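The weak-supervision idea described above — constraining 3D predictions with 2D labels plus physical priors on the body — can be sketched as a combined loss. This is a minimal illustration, not the thesis's actual model: the projection, bone definitions, and loss weight are all invented.

```python
# A toy weakly supervised loss: 2D reprojection error plus a physical
# bone-length prior, standing in for full 3D ground truth.
import numpy as np

def reprojection_loss(pred_3d, target_2d):
    """L2 error between orthographically projected 3D joints and 2D labels."""
    return float(np.mean((pred_3d[:, :2] - target_2d) ** 2))

def bone_length_penalty(pred_3d, bones, lengths):
    """Penalize deviation of predicted bone lengths from a body prior."""
    err = 0.0
    for (i, j), length in zip(bones, lengths):
        err += (np.linalg.norm(pred_3d[i] - pred_3d[j]) - length) ** 2
    return err / len(bones)

def weak_supervision_loss(pred_3d, target_2d, bones, lengths, w=0.1):
    """Total loss: 2D supervision plus a weighted physical-plausibility term."""
    return (reprojection_loss(pred_3d, target_2d)
            + w * bone_length_penalty(pred_3d, bones, lengths))
```

A pose that matches the 2D labels but violates the bone-length prior still incurs a penalty, which is what lets the model learn depth without 3D labels.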

    Generic Object Detection and Segmentation for Real-World Environments


    An affective computing and image retrieval approach to support diversified and emotion-aware reminiscence therapy sessions

    Dementia is one of the major causes of dependency and disability among elderly subjects worldwide. Reminiscence therapy is an inexpensive non-pharmacological therapy commonly used within dementia care due to its therapeutic value for people with dementia. This therapy is useful to create engaging communication between people with dementia and the rest of the world by using the preserved abilities of long-term memory rather than emphasizing the existing impairments to alleviate the experience of failure and social isolation.
Current assistive technological solutions improve reminiscence therapy by providing a more lively and engaging experience to all participants (people with dementia, family members, and clinicians), but they are not free of drawbacks: a) the multimedia data used remains unchanged throughout sessions, and there is a lack of customization for each person with dementia; b) they do not take into account the emotions conveyed by the multimedia data used nor the person with dementia's emotional reactions to the multimedia presented; c) the caregivers' perspective has not yet been fully taken into account. To overcome these challenges, we followed a user-centered design approach through worldwide surveys, follow-up interviews, and focus groups with formal and informal caregivers to inform the design of technological solutions within dementia care. To fulfill the requirements identified, we propose novel methods that facilitate the inclusion of emotions in the loop during reminiscence therapy to personalize and diversify the content of the sessions over time. Contributions from this thesis include: a) a set of validated functional requirements gathered from formal and informal caregivers, the expected outcomes with the fulfillment of each requirement, and an architecture template for the development of assistive technology solutions for dementia care; b) an end-to-end approach to automatically identify multiple emotional information conveyed by images; c) an approach to reduce the number of images that need to be annotated by humans without compromising the recognition models' performance; d) an interpretable late-fusion technique that dynamically combines multiple content-based image retrieval systems to effectively search for similar images to diversify and personalize the pool of images available to be used in sessions.
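The interpretable late-fusion contribution can be sketched as a weighted combination of per-system similarity scores, where the per-system weights stay visible so the fused ranking can be inspected. The system names, scores, and weights below are illustrative, not from the thesis.

```python
# Late fusion of multiple content-based image retrieval systems:
# each system returns per-image similarity scores, and the fusion is
# a transparent weighted sum. All names and numbers are invented.

def late_fusion(system_scores, weights):
    """Combine per-system {image: score} dicts by a visible weighted sum."""
    fused = {}
    for name, scores in system_scores.items():
        w = weights.get(name, 0.0)
        for image, s in scores.items():
            fused[image] = fused.get(image, 0.0) + w * s
    # Return images ranked by fused score, best first.
    return sorted(fused, key=fused.get, reverse=True)

# Two hypothetical retrieval systems scoring the same candidate images.
system_scores = {
    "color_hist": {"img1": 0.9, "img2": 0.4, "img3": 0.2},
    "cnn_embed":  {"img1": 0.3, "img2": 0.8, "img3": 0.7},
}
weights = {"color_hist": 0.3, "cnn_embed": 0.7}
ranking = late_fusion(system_scores, weights)
```

Because the weights are explicit per system, a caregiver-facing tool could explain *why* an image ranked highly (e.g., strong embedding similarity despite a weak color match), which is the interpretability the abstract emphasizes.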

    Simulated role-playing from crowdsourced data

    Thesis (Ph. D.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2013. Cataloged from PDF version of thesis. Includes bibliographical references (p. 173-178). Collective Artificial Intelligence (CAI) simulates human intelligence from data contributed by many humans, mined for inter-related patterns. This thesis applies CAI to social role-playing, introducing an end-to-end process for compositing recorded performances from thousands of humans and simulating open-ended interaction from this data. The CAI process combines crowdsourcing, pattern discovery, and case-based planning. Content creation is crowdsourced by recording role-players online. Browser-based tools allow non-experts to annotate data, organizing content into a hierarchical narrative structure. Patterns discovered from the data power a novel system combining plan recognition with case-based planning. The combination of this process and structure produces a new medium, which exploits a massive corpus to realize characters who interact and converse with humans. This medium enables new experiences in video games, and new classes of training simulations, therapeutic applications, and social robots. While advances in graphics support incredible freedom to interact physically in simulations, current approaches to development restrict simulated social interaction to hand-crafted branches that do not scale to the thousands of possible patterns of actions and utterances observed in actual human interaction. There is a tension between freedom and system comprehension due to two bottlenecks, making open-ended social interaction a challenge. First is the authorial effort entailed in covering all possible inputs. Second, like other cognitive processes, imagination is a bounded resource: any individual author only has so much imagination.
The convergence of advances in connectivity, storage, and processing power is bringing people together in ways never before possible, amplifying the imagination of individuals by harnessing the creativity and productivity of the crowd, and revolutionizing how we create media and what media we can create. By embracing data-driven approaches and capitalizing on the creativity of the crowd, authoring bottlenecks can be overcome, taking a step toward realizing a medium that robustly supports player choice. Doing so requires rethinking both technology and the division of labor in media production. As a proof of concept, a CAI system has been evaluated by recording over 10,000 performances in The Restaurant Game, automating an AI-controlled waitress who interacts in the world and converses with a human via text or speech. Quantitative results demonstrate how CAI supports significantly more open-ended interaction with humans, while focus groups reveal factors for improving engagement. By Jeffrey David Orkin. Ph.D.
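The data-driven idea at the heart of CAI — mining recurring patterns of actions from many recorded sessions and using them to anticipate what comes next — can be illustrated with a toy bigram model. The action strings are invented, and the real system's plan recognition is far richer than this sketch.

```python
# Mine which action tends to follow which across many recorded sessions,
# then predict the most common follow-up to the current action.
# A deliberately minimal stand-in for CAI's plan recognition.
from collections import Counter, defaultdict

def mine_bigrams(sessions):
    """Count follow-up frequencies for each action across all sessions."""
    follows = defaultdict(Counter)
    for session in sessions:
        for a, b in zip(session, session[1:]):
            follows[a][b] += 1
    return follows

def predict_next(follows, action):
    """Most frequently observed follow-up to `action`, or None if unseen."""
    if action not in follows:
        return None
    return follows[action].most_common(1)[0][0]

# Three hypothetical recorded restaurant sessions.
sessions = [
    ["greet", "seat", "order", "serve", "pay"],
    ["greet", "seat", "order", "serve", "pay"],
    ["greet", "order", "serve", "pay"],
]
```

With thousands of crowdsourced sessions instead of three, the counts become robust enough to cover the long tail of behaviors that hand-authored branches cannot.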

    Semantic Robot Programming for Taskable Goal-Directed Manipulation

    Autonomous robots have the potential to assist people to be more productive in factories, homes, hospitals, and similar environments. Unlike traditional industrial robots that are pre-programmed for particular tasks in controlled environments, modern autonomous robots should be able to perform arbitrary user-desired tasks. Thus, it is beneficial to provide pathways to enable users to program an arbitrary robot to perform an arbitrary task in an arbitrary world. Advances in robot Programming by Demonstration (PbD) have made it possible for end-users to program robot behavior for performing desired tasks through demonstrations. However, it remains a challenge for users to program robot behavior in a generalizable, performant, scalable, and intuitive manner. In this dissertation, we address the problem of robot programming by demonstration in a declarative manner by introducing the concept of Semantic Robot Programming (SRP). In SRP, we focus on addressing the following challenges for robot PbD: 1) generalization across robots, tasks, and worlds, 2) robustness under partial observations of cluttered scenes, 3) efficiency in task performance as the workspace scales up, and 4) feasible and intuitive modalities of interaction for end-users to demonstrate tasks to robots. Through SRP, our objective is to enable an end-user to intuitively program a mobile manipulator by providing a workspace demonstration of the desired goal scene. We use a scene graph to semantically represent conditions on the current and goal states of the world. To estimate the scene graph given raw sensor observations, we bring together discriminative object detection and generative state estimation for the inference of object classes and poses. The proposed scene estimation method outperformed the state of the art in cluttered scenes. With SRP, we successfully enabled users to program a Fetch robot to set up a kitchen tray on a cluttered tabletop in 10 different start and goal settings.
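The scene-graph idea above — a semantic representation of conditions on the current and goal states — can be sketched as a set of (object, relation, object) triples, with goal satisfaction as set containment. The object and relation names are illustrative, not the dissertation's actual vocabulary.

```python
# A scene graph as a set of (object, relation, object) triples; a goal
# is satisfied when every required relation holds in the observed scene.
# Names are invented for illustration.

def goal_satisfied(observed, goal):
    """True if every goal relation triple appears in the observed scene graph."""
    return goal.issubset(observed)

# Goal demonstrated by the user: a partially set kitchen tray.
goal = {("cup", "on", "tray"), ("fork", "left_of", "plate")}

# Scene graph estimated from sensor observations (may contain extras).
observed = {
    ("cup", "on", "tray"),
    ("fork", "left_of", "plate"),
    ("bowl", "on", "table"),
}
```

Representing the demonstration this declarative way is what lets the same goal generalize across robots and start configurations: any plan that makes the goal triples true counts as success.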
In order to scale up SRP from tabletop to large scale, we propose Contextual-Temporal Mapping (CT-Map) for semantic mapping of large scale scenes given streaming sensor observations. We model the semantic mapping problem via a Conditional Random Field (CRF), which accounts for spatial dependencies between objects. Over time, object poses and inter-object spatial relations can vary due to human activities. To deal with such dynamics, CT-Map maintains the belief over object classes and poses across an observed environment. We present CT-Map semantically mapping cluttered rooms with robustness to perceptual ambiguities, demonstrating higher accuracy on object detection and 6 DoF pose estimation compared to a state-of-the-art neural-network-based object detector and commonly adopted 3D registration methods. Towards SRP at the building scale, we explore notions of Generalized Object Permanence (GOP) for robots to search for objects efficiently. We state the GOP problem as the prediction of where an object can be located when it is not being directly observed by a robot. We model object permanence via a factor graph inference model, with factors representing long-term memory, short-term memory, and common sense knowledge over inter-object spatial relations. We propose the Semantic Linking Maps (SLiM) model to maintain the belief over object locations while accounting for object permanence through a CRF. Based on the belief maintained by SLiM, we present a hybrid object search strategy that enables the Fetch robot to actively search for objects on a large scale, with a higher search success rate and less search time compared to state-of-the-art search methods. Ph.D. dissertation, Electrical and Computer Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/155073/1/zengzhen_1.pd
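The belief maintenance that drives the object search can be illustrated with a much simpler discrete Bayes update: hold a probability over candidate locations, look somewhere, and redistribute mass when the object is not found. The actual SLiM model is a CRF over inter-object relations; the locations, prior, and miss probability below are invented.

```python
# Maintain a belief over where an object might be and update it after
# searching one location. A toy stand-in for SLiM's CRF-based belief.

def normalize(belief):
    """Rescale probabilities so they sum to 1."""
    total = sum(belief.values())
    return {loc: p / total for loc, p in belief.items()}

def update_belief(belief, searched, found, miss_prob=0.1):
    """Bayes update after searching one location and (not) finding the object."""
    if found:
        return {loc: (1.0 if loc == searched else 0.0) for loc in belief}
    posterior = dict(belief)
    posterior[searched] *= miss_prob  # small chance the search missed it
    return normalize(posterior)

# Invented prior: mugs co-occur with kitchens more than offices.
belief = {"kitchen": 0.6, "office": 0.3, "hallway": 0.1}

# Searched the kitchen, did not find the mug: mass shifts elsewhere.
belief = update_belief(belief, "kitchen", found=False)
```

A greedy search strategy would then visit the highest-belief location next, which is the core of the hybrid search behavior described in the abstract.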

    Virtual Reality

    At present, virtual reality has an impact on information organization and management and is even changing the design principles of information systems to adapt them to application requirements. The book aims to provide a broader perspective on the development and application of virtual reality. The first part of the book, "virtual reality visualization and vision", includes new developments in virtual reality visualization of 3D scenarios, virtual reality and vision, and high-fidelity immersive virtual reality, including tracking, rendering and display subsystems. The second part, "virtual reality in robot technology", presents applications of virtual reality in a remote rehabilitation robot-based rehabilitation evaluation method and in multi-legged robot adaptive walking in unstructured terrains. The third part, "industrial and construction applications", covers product design, the space industry, building information modeling, and construction and maintenance with virtual reality. The last part, "culture and life of human", describes applications in cultural life and multimedia technology.
    • 
