2,707 research outputs found

    Translating Videos to Commands for Robotic Manipulation with Deep Recurrent Neural Networks

    Full text link
    We present a new method to translate videos to commands for robotic manipulation using Deep Recurrent Neural Networks (RNN). Our framework first extracts deep features from the input video frames with a deep Convolutional Neural Network (CNN). Two RNN layers with an encoder-decoder architecture are then used to encode the visual features and sequentially generate the output words as the command. We demonstrate that the translation accuracy can be improved by allowing a smooth transition between the two RNN layers and by using a state-of-the-art feature extractor. The experimental results on our new challenging dataset show that our approach outperforms recent methods by a fair margin. Furthermore, we combine the proposed translation module with a vision and planning system to let a robot perform various manipulation tasks. Finally, we demonstrate the effectiveness of our framework on the full-size humanoid robot WALK-MAN.
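    As a rough illustration of the encoder-decoder pipeline described above (per-frame CNN features encoded by one RNN, a second RNN decoding the command words), here is a minimal PyTorch sketch. The module sizes, the choice of GRUs, and the teacher-forcing interface are assumptions for illustration only, not details taken from the paper.

```python
import torch
import torch.nn as nn

class Video2Command(nn.Module):
    """Illustrative encoder-decoder: per-frame CNN features -> command tokens."""
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=1000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, cmd_tokens):
        # frame_feats: (B, T_frames, feat_dim) features from a pretrained CNN
        # cmd_tokens:  (B, T_words) ground-truth command words (teacher forcing)
        _, h = self.encoder(frame_feats)      # final hidden state summarises the video
        emb = self.embed(cmd_tokens)          # (B, T_words, hidden)
        dec_out, _ = self.decoder(emb, h)     # decoder initialised with encoder state
        return self.out(dec_out)              # (B, T_words, vocab_size) word logits

model = Video2Command()
feats = torch.randn(2, 16, 2048)              # 2 clips, 16 frames each (synthetic)
tokens = torch.randint(0, 1000, (2, 6))       # 2 commands, 6 words each (synthetic)
logits = model(feats, tokens)                 # -> shape (2, 6, 1000)
```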

    Grounding of Human Environments and Activities for Autonomous Robots

    Get PDF
    With the recent proliferation of robotic applications in domestic and industrial scenarios, it is vital for robots to continually learn about their environments and about the humans they share those environments with. In this paper, we present a framework for autonomous, unsupervised learning of useful human ‘concepts’ from various sensory sources, including colours, people’s names, usable objects and simple activities. This is achieved by integrating state-of-the-art object segmentation, pose estimation, activity analysis and language grounding into a continual learning framework. Learned concepts are grounded to natural language if commentary is available, allowing the robot to communicate in a human-understandable way. We show, using a challenging, real-world dataset of human activities, that our framework is able to extract useful concepts, ground natural language descriptions to them, and, as a proof of concept, to generate simple sentences from templates to describe people and activities.
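    As a toy illustration of grounding learned concepts to natural language when commentary is available, the sketch below counts word-concept co-occurrences across observations and grounds each concept to its most frequent co-occurring word. The data, the concept identifiers, and the simple most-frequent-word rule are hypothetical; the framework in the paper is considerably richer.

```python
from collections import defaultdict

# Each observation pairs the concepts detected in a scene with the words of an
# accompanying commentary. Data and concept names are hypothetical.
observations = [
    ({"colour_3", "person_1"}, "Alice wearing red shirt".split()),
    ({"colour_3", "object_7"}, "red cup on table".split()),
]

cooc = defaultdict(lambda: defaultdict(int))
for concepts, words in observations:
    for c in concepts:
        for w in words:
            cooc[c][w] += 1

# Ground each concept to the word it co-occurs with most often.
grounding = {c: max(ws, key=ws.get) for c, ws in cooc.items()}
print(grounding.get("colour_3"))  # -> "red" in this toy example
```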

    Joint Perceptual Learning and Natural Language Acquisition for Autonomous Robots

    Get PDF
    Understanding how children learn the components of their mother tongue and the meanings of individual words has long fascinated linguists and cognitive scientists. Robots face a similar challenge in understanding language and perception to allow for natural and effortless human-robot interaction. Acquiring such knowledge is challenging unless it is preprogrammed, which is no easy task either and does not solve the problem of language differences between individuals or of learning the meanings of new words. In this thesis, the problem of bootstrapping knowledge in language and vision for autonomous robots is addressed through novel techniques in grammar induction and word grounding to the perceptual world. The learning is achieved in a cognitively plausible, loosely supervised manner from raw linguistic and visual data. The visual data is collected using different robotic platforms deployed in real-world and simulated environments and equipped with different sensing modalities, while the linguistic data is collected using online crowdsourcing tools and volunteers. The presented framework does not rely on any particular robot or specific sensors; rather, it is flexible to whatever modalities the robot can support.

    The learning framework is divided into three processes. First, the raw perceptual data is clustered into a number of Gaussian components to learn the ‘visual concepts’. Second, frequent co-occurrence of words and visual concepts is used to learn the language grounding. Finally, the learned language grounding and visual concepts are used to induce probabilistic grammar rules to model the language structure. In this thesis, the visual concepts refer to: (i) people’s faces and the appearance of their garments; (ii) objects and their perceptual properties; (iii) pairwise spatial relations; (iv) robot actions; and (v) human activities. The visual concepts are learned by first processing the raw visual data to find people and objects in the scene using state-of-the-art techniques in human pose estimation, object segmentation and tracking, and activity analysis. Once found, the concepts are learned incrementally using a combination of techniques: Incremental Gaussian Mixture Models and the Bayesian Information Criterion to learn simple visual concepts such as object colours and shapes, and spatio-temporal graphs and topic models to learn more complex visual concepts such as human activities and robot actions. Language grounding is enabled by seeking frequent co-occurrence between words and learned visual concepts; finding the correct grounding is formulated as an integer programming problem that seeks the best many-to-many matches between words and concepts. Grammar induction refers to the process of learning a formal grammar (usually a collection of re-write rules or productions) from a set of observations. Here, Probabilistic Context Free Grammar rules are generated to model the language by mapping natural language sentences to learned visual concepts, as opposed to traditional supervised grammar induction techniques where learning is only made possible by manually annotated training examples on large datasets.

    The learning framework attains its cognitive plausibility from a number of sources. First, learning is achieved by providing the robot with pairs of raw linguistic and visual inputs in a “show-and-tell” procedure akin to how human children learn about their environment. Second, no prior knowledge is assumed about the meanings of words or the structure of the language, except that there are different classes of words (corresponding to observable actions, spatial relations, and objects and their observable properties). Third, the knowledge in both language and vision is obtained incrementally, so that the gained knowledge can evolve to adapt to new observations without the need to revisit previously seen ones. Fourth, the robot learns about the visual world first and then learns how it maps to language, which aligns with findings from cognitive studies on language acquisition in human infants suggesting that children develop considerable cognitive understanding of their environment in the pre-linguistic period of their lives. This work does not claim to model how humans learn about objects in their environments, but rather is inspired by it.

    For validation, four different datasets are used, which contain temporally aligned video clips of people or robots performing activities together with sentences describing these clips. The video clips are collected using four robotic platforms: three robot arms in simple block-world scenarios and a mobile robot deployed in a challenging real-world office environment, observing different people performing complex activities. The linguistic descriptions for these datasets are obtained using Amazon Mechanical Turk and volunteers. The analysis performed on these datasets suggests that the learning framework is suitable for learning from complex real-world scenarios. The experimental results show that the learning framework enables (i) acquiring correct visual concepts from visual data; (ii) learning the word grounding for each of the extracted visual concepts; (iii) inducing correct grammar rules to model the language structure; (iv) using the gained knowledge to understand previously unseen linguistic commands; and (v) using the gained knowledge to generate well-formed natural language descriptions of novel scenes.
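    The abstract describes learning simple visual concepts by clustering perceptual data into Gaussian components and selecting the number of components with the Bayesian Information Criterion. The sketch below illustrates that idea with a batch scikit-learn GaussianMixture on synthetic colour features; the thesis uses an incremental variant, and the data and parameter choices here are illustrative assumptions, not details taken from the thesis.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy colour features (e.g. mean RGB of segmented objects); data is synthetic.
rng = np.random.default_rng(0)
red = rng.normal([0.9, 0.1, 0.1], 0.05, size=(50, 3))
blue = rng.normal([0.1, 0.1, 0.9], 0.05, size=(50, 3))
X = np.vstack([red, blue])

# Fit GMMs with an increasing number of components and keep the one with the
# lowest Bayesian Information Criterion; each surviving component is then
# treated as one 'visual concept' (here, a colour cluster).
models = [GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 6)]
best = min(models, key=lambda m: m.bic(X))
print("selected components:", best.n_components)  # expected: 2 for this toy data
```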

    Structured Learning from Videos and Language (Apprentissage structuré à partir de vidéos et langage)

    Get PDF
    The goal of this thesis is to develop models, representations and structured learning algorithms for the automatic understanding of complex human activities from instructional videos narrated with natural language. We first introduce a model that, given a set of narrated instructional videos describing a task, is able to generate a list of the action steps needed to complete the task and to locate them in the visual and textual streams. To that end, we formulate two assumptions. First, people perform actions when they mention them. Second, we assume that complex tasks are composed of an ordered sequence of action steps. Equipped with these two hypotheses, our model first clusters the textual inputs and then uses this output to refine the location of the action steps in the video. We evaluate our model on a newly collected dataset of instructional videos depicting 5 different complex goal-oriented tasks. We then present an approach to link actions and the manipulated objects. More precisely, we focus on actions that aim at modifying the state of a specific object, such as pouring a coffee cup or opening a door. Such actions are an inherent part of instructional videos. Our method is based on the optimization of a joint cost between actions and object states under constraints. The constraints reflect our assumption that there is a consistent temporal order for the changes in object states and manipulation actions. We demonstrate experimentally that object states help localize actions and, conversely, that action localization improves object state recognition. All our models are based on discriminative clustering, a technique that allows us to leverage the readily available weak supervision contained in instructional videos. In order to deal with the resulting optimization problems, we take advantage of a highly adapted optimization technique: the Frank-Wolfe algorithm. Motivated by the fact that scaling our approaches to thousands of videos is essential in the context of narrated instructional videos, we also present several improvements to make the Frank-Wolfe algorithm faster and more computationally efficient. In particular, we propose three main modifications to the Block-Coordinate Frank-Wolfe algorithm: gap-based sampling, away and pairwise Block Frank-Wolfe steps, and a solution to cache the oracle calls. We show the effectiveness of our improvements on four challenging structured prediction tasks.
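    The abstract above relies on the Frank-Wolfe algorithm, whose appeal is that each iteration only needs a linear minimisation oracle over the feasible set. The sketch below shows vanilla Frank-Wolfe on a toy quadratic over the probability simplex; the objective, data, and stopping tolerance are illustrative assumptions, unrelated to the thesis's discriminative-clustering problems or its Block-Coordinate variants.

```python
import numpy as np

# Minimal Frank-Wolfe sketch: minimise f(x) = 0.5 * ||A x - b||^2 over the
# probability simplex. The linear oracle over the simplex reduces to picking
# the vertex (one-hot vector) with the smallest gradient coordinate.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)

x = np.ones(5) / 5                       # start at the simplex barycentre
for t in range(100):
    grad = A.T @ (A @ x - b)             # gradient of the quadratic objective
    s = np.zeros(5)
    s[np.argmin(grad)] = 1.0             # oracle: best vertex of the simplex
    gap = grad @ (x - s)                 # Frank-Wolfe duality gap (stopping criterion)
    if gap < 1e-6:
        break
    gamma = 2.0 / (t + 2.0)              # standard step-size schedule
    x = (1 - gamma) * x + gamma * s      # convex combination stays in the simplex

print(x.round(3), "gap:", float(gap))
```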
