11 research outputs found

    Learning the Semantics of Manipulation Action

    In this paper we present a formal computational framework for modeling manipulation actions. The introduced formalism leads to a semantics of manipulation actions and has applications both to observing and understanding human manipulation actions and to executing them with a robotic mechanism (e.g. a humanoid robot). It is based on a Combinatory Categorial Grammar. The goal of the introduced framework is to: (1) represent manipulation actions with both syntactic and semantic parts, where the semantic part employs λ-calculus; (2) enable a probabilistic semantic parsing schema to learn the λ-calculus representation of manipulation actions from an annotated action corpus of videos; (3) use (1) and (2) to develop a system that visually observes manipulation actions and understands their meaning, while reasoning beyond observations using propositional logic and axiom schemata. The experiments conducted on a publicly available large manipulation action dataset validate the theoretical framework and our implementation.
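
    A minimal sketch of idea (1), assuming a toy lexicon: a manipulation verb is paired with a λ-calculus semantic form that is applied to its syntactic arguments. The categories, predicate names and the "cut" example below are hypothetical illustrations, not the paper's actual grammar; the λ-terms are encoded here as Python closures.

        # Hypothetical CCG-style lexical entries: each verb gets a syntactic
        # category and a λ-calculus semantic form, encoded as a Python closure.
        lexicon = {
            # "cut":   (S\NP)/NP  with semantics  λo.λa. cut(a, o)
            "cut":   lambda obj: lambda agent: f"cut({agent},{obj})",
            # "grasp": (S\NP)/NP  with semantics  λo.λa. grasp(a, o)
            "grasp": lambda obj: lambda agent: f"grasp({agent},{obj})",
        }

        def parse(agent: str, verb: str, obj: str) -> str:
            """Compose the semantics of a simple 'agent verb object' clause."""
            sem = lexicon[verb](obj)   # forward application consumes the object
            return sem(agent)          # backward application consumes the agent

        print(parse("hand", "cut", "cucumber"))   # -> cut(hand,cucumber)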

    Robot Learning and Execution of Collaborative Manipulation Plans from YouTube Cooking Videos

    People often watch videos on the web to learn how to cook new recipes, assemble furniture or repair a computer. We wish to enable robots with the very same capability. This is challenging; there is a large variation in manipulation actions, and some videos even involve multiple persons who collaborate by sharing and exchanging objects and tools. Furthermore, the learned representations need to be general enough to be transferable to robotic systems. At the same time, previous work has shown that the space of human manipulation actions has a linguistic, hierarchical structure that relates actions to manipulated objects and tools. Building upon this theory of language for action, we propose a framework for understanding and executing demonstrated action sequences from full-length, unconstrained cooking videos on the web. The framework takes as input a cooking video annotated with object labels and bounding boxes, and outputs a collaborative manipulation action plan for one or more robotic arms. We demonstrate the performance of the system on a standardized dataset of 100 YouTube cooking videos, as well as on three full-length YouTube videos that include collaborative actions between two participants. We additionally propose an open-source platform for executing the learned plans in a simulation environment as well as with an actual robotic arm.
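
    The output of such a framework can be thought of as an ordered action plan with arm assignments and dependencies between steps. The data structure below is only an illustrative sketch under that assumption; the step names, fields and two-arm example are invented, not the paper's actual plan format.

        # Illustrative plan representation: each step names an action, an optional
        # tool, a target object, the executing arm and its prerequisite steps.
        from dataclasses import dataclass, field
        from typing import Optional

        @dataclass
        class PlanStep:
            action: str                     # e.g. "pour", "stir", "hand_over"
            tool: Optional[str]             # tool used, if any
            target: str                     # object acted upon
            arm: str                        # which robotic arm executes the step
            depends_on: list = field(default_factory=list)   # indices of prerequisite steps

        # Two arms collaborate: arm_1 hands the bowl over, arm_2 stirs its contents.
        plan = [
            PlanStep("grasp",     None,    "bowl", "arm_1"),
            PlanStep("hand_over", None,    "bowl", "arm_1", depends_on=[0]),
            PlanStep("stir",      "spoon", "bowl", "arm_2", depends_on=[1]),
        ]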

    A Survey of Knowledge Representation in Service Robotics

    Within the realm of service robotics, researchers have placed a great amount of effort into learning, understanding, and representing motions as manipulations for task execution by robots. The task of robot learning and problem-solving is very broad, as it integrates a variety of tasks such as object detection, activity recognition, task/motion planning, localization, knowledge representation and retrieval, and the intertwining of perception/vision and machine learning techniques. In this paper, we focus solely on knowledge representations, and notably on how knowledge is typically gathered, represented, and reproduced to solve problems, as done by researchers over the past decades. In accordance with the definition of knowledge representations, we discuss the key distinction between such representations and the useful learning models that have been extensively introduced and studied in recent years, such as machine learning, deep learning, probabilistic modelling, and semantic graphical structures. Along with an overview of such tools, we discuss the problems that have existed in robot learning and the solutions, technologies or developments (if any) that have contributed to solving them. Finally, we discuss key principles that should be considered when designing an effective knowledge representation. Comment: Accepted for the RAS Special Issue on Semantic Policy and Action Representations for Autonomous Robots; 22 pages.

    Modeling and Recognizing Assembly Actions

    We develop the task of assembly understanding by applying concepts from computer vision, robotics, and sequence modeling. Motivated by the need to develop tools for recording and analyzing experimental data for a collaborative study of spatial cognition in humans, we gradually extend an application-specific model into a framework that is broadly applicable across data modalities and application instances. The core of our approach is a sequence model that relates assembly actions to their structural consequences. We combine this sequence model with increasingly general observation models. With each iteration we increase the variety of applications that can be considered by our framework, and decrease the complexity of modeling decisions that designers are required to make. First we present an initial solution for modeling and recognizing assembly activities in our primary application: videos of children performing a block-assembly task. We develop a symbolic model that completely characterizes the fine-grained temporal and geometric structure of assembly sequences, then combine this sequence model with a probabilistic visual observation model that operates by rendering and registering template images of each assembly hypothesis. Then, we extend this perception system by incorporating kinematic sensor-based observations. We use a part-based observation model that compares mid-level attributes derived from sensor streams with their corresponding predictions from assembly hypotheses. We additionally address the joint segmentation and classification of assembly sequences for the first time, resulting in a feature-based segmental CRF framework. Finally, we address the task of learning observation models rather than constructing them by hand. To achieve this we incorporate contemporary, vision-based action recognition models into our segmental CRF framework. In this approach, the only information required from a tool designer is a mapping from human-centric activities to our previously defined task-centric activities. These innovations have culminated in a method for modeling fine-grained assembly actions that can be applied generally to any kinematic structure, along with a set of techniques for recognizing assembly actions and structures from a variety of modalities and sensors.
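
    As a small illustration of a sequence model that maps assembly actions to their structural consequences, the sketch below treats an assembly state as a set of part-to-part connections and each action as adding one connection. This is a deliberately simplified toy, assuming block-like parts with no geometry; it is not the dissertation's actual symbolic model.

        # Toy symbolic assembly model: the state is a set of part-to-part
        # connections and each "connect" action adds one connection, so a
        # sequence of actions determines the resulting structure hypothesis.
        from dataclasses import dataclass

        @dataclass(frozen=True)
        class Connect:
            part_a: str
            part_b: str

        def apply_action(state: frozenset, action: Connect) -> frozenset:
            """Return the assembly state after attaching part_a to part_b."""
            return state | {frozenset({action.part_a, action.part_b})}

        state = frozenset()
        for act in (Connect("block_red", "block_blue"), Connect("block_blue", "block_green")):
            state = apply_action(state, act)

        print(state)   # two pairwise connections describe the assembled structure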

    Action Categorisation in Multimodal Instructions

    We present an exploratory study on the (semi-)automatic categorisation of actions in Dutch multimodal first aid instructions, where the actions needed to successfully execute the procedure in question are presented verbally and in pictures. We start with the categorisation of verbalised actions and expect that this will later facilitate the identification of those actions in the pictures, which is known to be hard. Comparisons of, and user-based experimentation with, the verbal and visual representations will allow us to determine the effectiveness of picture-text combinations and will eventually support the automatic generation of multimodal documents. We used Natural Language Processing tools to identify and categorise 2,388 verbs in a corpus of 78 multimodal instructions (MIs). We show that the main action structure of an instruction can be retrieved through verb identification using the Alpino parser followed by a manual selection operation. The selected main action verbs were subsequently generalised and categorised with the use of Cornetto, a lexical resource that combines a Dutch Wordnet and a Dutch Reference Lexicon. Results show that these tools are useful but also have limitations, which makes human intervention essential to guide an accurate categorisation of actions in multimodal instructions.
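
    The verb-identification step can be sketched roughly as follows. The snippet uses spaCy's Dutch pipeline as a stand-in for the Alpino parser used in the study, and it does not reproduce the Cornetto-based categorisation; the first aid sentence is an invented example ("Press the bandage firmly on the wound and then call 112").

        # Verb identification with spaCy's Dutch model as a stand-in for Alpino;
        # the Cornetto-based generalisation/categorisation step is omitted.
        import spacy

        nlp = spacy.load("nl_core_news_sm")   # small Dutch pipeline, installed separately

        doc = nlp("Druk het verband stevig op de wond en bel daarna 112.")
        verbs = [tok.lemma_ for tok in doc if tok.pos_ == "VERB"]
        print(verbs)   # candidate action verbs, e.g. ['drukken', 'bellen']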

    Robot Learning from Human Demonstrations for Human-Robot Synergy

    Human-robot synergy enables new developments in industrial and assistive robotics research. In recent years, collaborative robots have become able to work together with humans to perform a task while sharing the same workplace. However, the teachability of robots is a crucial factor in establishing their role as human teammates. Robots require certain abilities, such as easily learning diversified tasks and adapting to unpredicted events. The most feasible method that currently utilizes the human teammate to teach robots how to perform a task is Robot Learning from Demonstrations (RLfD). The goal of this method is to allow non-expert users to program a robot by simply guiding it through a task. The focus of this thesis is the development of a novel framework for Robot Learning from Demonstrations that enhances the robot's ability to learn and perform sequences of actions for object manipulation tasks (high-level learning) and, simultaneously, to learn and adapt the necessary trajectories for object manipulation (low-level learning). A method that automatically segments demonstrated tasks into sequences of actions is developed in this thesis. Subsequently, the generated sequences of actions are employed by a Reinforcement Learning (RL) from human demonstration approach to enable high-level robot learning. The low-level robot learning consists of a novel method that selects similar demonstrations (in the case of multiple demonstrations of a task) and the Gaussian Mixture Model (GMM) method. The developed robot learning framework allows learning from single and multiple demonstrations. As soon as the robot has knowledge of a demonstrated task, it can perform the task in cooperation with the human. However, the need for adaptation of the learned knowledge may arise during the human-robot synergy. Firstly, Interactive Reinforcement Learning (IRL) is employed as a decision support method to predict the sequence of actions in real time, to keep the human in the loop and to enable learning the user's preferences. Subsequently, a novel method that modifies the learned Gaussian Mixture Model (m-GMM) is developed in this thesis. This method allows the robot to cope with changes in the environment, such as objects placed in a pose different from the demonstrated one or obstacles introduced by the human teammate. The modified Gaussian Mixture Model is further used by Gaussian Mixture Regression (GMR) to generate a trajectory which can efficiently control the robot. The developed framework for Robot Learning from Demonstrations was evaluated on two different robotic platforms: a dual-arm industrial robot and an assistive robotic manipulator. For both robotic platforms, small studies were performed for industrial and assistive manipulation tasks, respectively. Several Human-Robot Interaction (HRI) methods, such as kinesthetic teaching, a gamepad, or "hands-free" control via head gestures, were used to provide the robot demonstrations. The "hands-free" HRI enables individuals with severe motor impairments to provide a demonstration of an assistive task. The experimental results demonstrate the potential of the developed robot learning framework to enable continuous human-robot synergy in industrial and assistive applications.
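
    For the low-level learning, a GMM is fitted to the demonstrated trajectories and Gaussian Mixture Regression conditions that model on time to produce a reference trajectory. The snippet below is a generic GMM/GMR sketch on synthetic one-dimensional demonstrations, assuming scikit-learn's GaussianMixture; it is not the thesis implementation and omits the demonstration-selection and m-GMM steps.

        # Generic GMM + GMR sketch: fit a mixture over joint (time, position)
        # samples, then regress E[position | time] to obtain a reference trajectory.
        import numpy as np
        from scipy.stats import norm
        from sklearn.mixture import GaussianMixture

        t = np.tile(np.linspace(0.0, 1.0, 100), 3)                 # three demonstrations
        x = np.sin(np.pi * t) + 0.02 * np.random.randn(t.size)     # noisy 1-D positions
        gmm = GaussianMixture(n_components=5, covariance_type="full").fit(np.column_stack([t, x]))

        def gmr(t_query: float) -> float:
            """Gaussian Mixture Regression: E[x | t = t_query] under the fitted GMM."""
            means, covs, priors = gmm.means_, gmm.covariances_, gmm.weights_
            h = priors * norm.pdf(t_query, means[:, 0], np.sqrt(covs[:, 0, 0]))
            h /= h.sum()                                            # per-component responsibilities
            cond = means[:, 1] + covs[:, 1, 0] / covs[:, 0, 0] * (t_query - means[:, 0])
            return float(np.dot(h, cond))

        reference = [gmr(tq) for tq in np.linspace(0.0, 1.0, 50)]   # trajectory for the robot to track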

    Multi-Modal Models for Fine-grained Action Segmentation in Situated Environments

    Automated methods for analyzing human activities from video or sensor data are critical for enabling new applications in human-robot interaction, surgical data modeling, video summarization, and beyond. Despite decades of research in the fields of robotics and computer vision, current approaches are inadequate for modeling complex activities outside of constrained environments or without using heavily instrumented sensor suites. In this dissertation, I address the problem of fine-grained action segmentation by developing solutions that generalize from domain-specific to general-purpose, with applications in surgical workflow, surveillance, and cooking. A key technical challenge, which is central to this dissertation, is how to capture complex temporal patterns from sensor data. For a given task, users may perform the same action at different speeds or styles, and each user may carry out actions in a different order. I present a series of temporal models that address these modes of variability. First, I define the notion of a convolutional action primitive, which captures how low-level sensor signals change as a function of the action a user is performing. Second, I generalize this idea to video with a Spatiotemporal Convolutional Neural Network, which captures relationships between objects in an image and how they change temporally. Lastly, I discuss a hierarchical variant that applies to video or sensor data, called a Temporal Convolutional Network (TCN), which models actions at multiple temporal scales. In certain domains (e.g., surgical training), TCNs can be used to successfully bridge the gap in performance between domain-specific and general-purpose solutions. A key scientific challenge concerns the evaluation of predicted action segmentations. In many applications, action labels may be ill-defined, and if one asks two different annotators when a given action starts and stops, they may give answers that are seconds apart. I argue that the standard action segmentation metrics are insufficient for evaluating real-world segmentation performance and propose two alternatives. Qualitatively, these metrics are better at capturing the efficacy of models in the described applications. I conclude with a case study on surgical workflow analysis, which has the potential to improve surgical education and operating room efficiency. Current work almost exclusively relies on extensive instrumentation, which is difficult and costly to acquire. I show that our spatiotemporal video models are capable of capturing important surgical attributes (e.g., organs, tools) and achieve state-of-the-art performance on two challenging datasets. The models and methodology described have demonstrably improved the ability to temporally segment complex human activities, in many cases without sophisticated instrumentation.
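
    A minimal, generic version of the dilated temporal convolution idea behind a TCN is sketched below in PyTorch: stacking 1-D convolutions with growing dilation lets each frame see an increasingly large temporal window before per-frame classification. The layer sizes and class count are illustrative placeholders, not the dissertation's exact architecture.

        # Tiny dilated temporal convolutional segmenter: per-frame action scores
        # from a sequence of precomputed frame features.
        import torch
        import torch.nn as nn

        class TinyTCN(nn.Module):
            def __init__(self, in_dim: int, n_classes: int, hidden: int = 64, n_layers: int = 4):
                super().__init__()
                layers = []
                for i in range(n_layers):
                    dilation = 2 ** i                  # temporal receptive field grows per layer
                    layers += [nn.Conv1d(in_dim if i == 0 else hidden, hidden,
                                         kernel_size=3, padding=dilation, dilation=dilation),
                               nn.ReLU()]
                self.backbone = nn.Sequential(*layers)
                self.classifier = nn.Conv1d(hidden, n_classes, kernel_size=1)

            def forward(self, feats: torch.Tensor) -> torch.Tensor:
                # feats: (batch, feature_dim, time) -> class scores (batch, n_classes, time)
                return self.classifier(self.backbone(feats))

        scores = TinyTCN(in_dim=128, n_classes=10)(torch.randn(2, 128, 250))
        print(scores.shape)   # torch.Size([2, 10, 250])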