
    Integration of Action and Language Knowledge: A Roadmap for Developmental Robotics

    This position paper proposes that the study of embodied cognitive agents, such as humanoid robots, can advance our understanding of the cognitive development of complex sensorimotor, linguistic, and social learning skills. This in turn will benefit the design of cognitive robots capable of learning to handle and manipulate objects and tools autonomously, to cooperate and communicate with other robots and humans, and to adapt their abilities to changing internal, environmental, and social conditions. Four key areas of research challenges are discussed, specifically the issues related to understanding: 1) how agents learn and represent compositional actions; 2) how agents learn and represent compositional lexica; 3) the dynamics of social interaction and learning; and 4) how compositional action and language representations are integrated to bootstrap the cognitive system. The review of specific issues and progress in these areas is then translated into a practical roadmap based on a series of milestones. These milestones provide a possible set of cognitive robotics goals and test scenarios, thus acting as a research roadmap for future work on cognitive developmental robotics.

    Context-Independent Task Knowledge for Neurosymbolic Reasoning in Cognitive Robotics

    One of the current main goals of artificial intelligence and robotics research is the creation of an artificial assistant with flexible, human-like behavior that can accomplish everyday tasks. Much of what humans treat as context-independent task knowledge is what enables this flexibility at multiple levels of cognition. In this scope, the author analyzes how to acquire, represent, and disambiguate symbolic knowledge representing context-independent task knowledge abstracted from multiple instances: this thesis elaborates on the problems incurred, implementation constraints, current state-of-the-art practices, and ultimately the solutions newly introduced in this scope. The author specifically discusses the acquisition of context-independent task knowledge from large amounts of human-written text and its reusability in the robotics domain; the acquisition of knowledge on human musculoskeletal dependencies constraining motion, which allows a better higher-level representation of observed trajectories; and the means of verbalizing partial contextual and instruction knowledge, increasing interaction possibilities with the human as well as contextual adaptation. All the aforementioned points are supported by evaluation in heterogeneous setups, to give a view on how to make optimal use of statistical and symbolic approaches (i.e., neurosymbolic reasoning) in cognitive robotics. This work has been performed to enable context-adaptable artificial assistants by bringing together knowledge on what is usually regarded as context-independent task knowledge.
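    As a purely illustrative aside, the following minimal Python sketch shows one way such context-independent task knowledge could be represented symbolically and then grounded in a concrete scene; the Task/Step structure, the "make_tea" example, and all role and object names are hypothetical and are not the formalism used in the thesis.

    from dataclasses import dataclass, field

    # Hypothetical sketch: a context-independent task is a sequence of abstract
    # steps over object roles; grounding binds those roles to concrete scene objects.

    @dataclass
    class Step:
        action: str                                   # e.g. "pour"
        roles: dict = field(default_factory=dict)     # role name -> abstract object type

    @dataclass
    class Task:
        name: str
        steps: list

    def ground(task: Task, scene: dict) -> list:
        """Bind each abstract role to a concrete object observed in the scene."""
        grounded = []
        for step in task.steps:
            bindings = {role: scene.get(obj_type, f"<missing {obj_type}>")
                        for role, obj_type in step.roles.items()}
            grounded.append((step.action, bindings))
        return grounded

    if __name__ == "__main__":
        make_tea = Task("make_tea", [
            Step("fill", {"container": "kettle", "source": "tap"}),
            Step("boil", {"container": "kettle"}),
            Step("pour", {"source": "kettle", "target": "cup"}),
        ])
        scene = {"kettle": "kettle_01", "tap": "tap_kitchen", "cup": "mug_blue"}
        for action, bindings in ground(make_tea, scene):
            print(action, bindings)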

    Can Human Association Norm Evaluate Latent Semantic Analysis?

    This paper presents a comparison of word association norms created in a psycholinguistic experiment with association lists generated by algorithms operating on text corpora. We compare lists generated by the Church and Hanks algorithm with lists generated by the LSA algorithm. An argument is presented on how those automatically generated lists reflect real semantic relations.
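    For readers unfamiliar with the Church and Hanks approach, the following is a minimal sketch of how association lists can be generated from a corpus by ranking co-occurring words with pointwise mutual information; the window size, tokenization, thresholds, and probability estimates are illustrative assumptions rather than the paper's exact settings.

    import math
    from collections import Counter

    def pmi_association_lists(corpus_sentences, window=5, min_count=2):
        """Rank word pairs by pointwise mutual information (Church & Hanks style)."""
        word_counts = Counter()
        pair_counts = Counter()
        total_tokens = 0
        for sentence in corpus_sentences:
            tokens = sentence.lower().split()
            total_tokens += len(tokens)
            word_counts.update(tokens)
            # count co-occurrences within a forward sliding window
            for i, w in enumerate(tokens):
                for v in tokens[i + 1:i + 1 + window]:
                    if w != v:
                        pair_counts[(w, v)] += 1

        associations = {}
        for (w, v), c in pair_counts.items():
            if c < min_count:
                continue
            # rough probability estimates for a sketch; a real system would
            # normalize pair counts by the number of co-occurrence windows
            p_wv = c / total_tokens
            p_w = word_counts[w] / total_tokens
            p_v = word_counts[v] / total_tokens
            associations.setdefault(w, []).append((v, math.log2(p_wv / (p_w * p_v))))

        # sort each cue word's associates by descending PMI
        return {w: sorted(vs, key=lambda x: -x[1]) for w, vs in associations.items()}

    if __name__ == "__main__":
        corpus = ["the doctor examined the patient in the hospital",
                  "the nurse helped the doctor at the hospital"]
        lists = pmi_association_lists(corpus, window=4, min_count=1)
        print(lists.get("doctor", [])[:5])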

    Brain Rhythms in Object Recognition and Manipulation

    Our manual interactions with objects represent the most fundamental activity in our everyday life. Whereas the grasp of an object is driven by the perceptual senses, using an object for its function relies on the retrieval of learnt experience. Recent theories explain how the brain takes decisions based on perceptual information, yet the question of how it retrieves object knowledge to use tools remains unanswered. Discovering the neuronal implementation of the retrieval of object knowledge would help in understanding praxic impairments and in providing appropriate neurorehabilitation. This thesis reports five investigations on the neuronal oscillatory activity involved in accessing object knowledge. Employing an original paradigm combining EEG recordings with tool-use training in virtual reality, I demonstrated that beta oscillations are crucial to the retrieval of object knowledge during object recognition. Multiple lines of evidence point toward access to object knowledge during the 300 to 400 ms window of visual processing. The different topographies of the beta oscillations suggest that tool knowledge is encoded in distinct brain areas, generally located within the left hemisphere. Importantly, learning action information about an object has consequences for its manipulation. Multiplying tool-use knowledge about an object increases the beta desynchronization and slows down motor control. Furthermore, the present data report an influence of language on object manipulation and beta oscillations, in that learning the name of an object speeds up its use while impeding its grasp. This evidence led to the formulation of three testable hypotheses extending contemporary theories of object manipulation and semantic memory. First, the preparation of object transportation or use could be distinguished by the synchronization/desynchronization patterns of mu and beta rhythms. Second, action competition originates from both perceptuo-motor and memory systems. Third, access to semantic object knowledge during object processing could be indexed by bursts of desynchronization of high-beta oscillations in the brain. (MSCA-ETN SECURE [642667])
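    As background on the kind of measure discussed, the sketch below shows one conventional way to compute beta-band event-related (de)synchronization from epoched EEG using a band-pass filter and the Hilbert envelope; the 15-30 Hz band, the baseline window, and the array layout are assumptions for illustration, not the thesis's actual pipeline.

    import numpy as np
    from scipy.signal import butter, filtfilt, hilbert

    def beta_erd(epochs, sfreq, band=(15.0, 30.0), baseline=(-0.5, 0.0), tmin=-0.5):
        """Event-related (de)synchronization in the beta band.

        epochs : array (n_trials, n_channels, n_times), time-locked EEG
        Returns percent power change relative to the pre-stimulus baseline;
        negative values indicate desynchronization.
        """
        nyq = sfreq / 2.0
        b, a = butter(4, [band[0] / nyq, band[1] / nyq], btype="band")
        filtered = filtfilt(b, a, epochs, axis=-1)
        power = np.abs(hilbert(filtered, axis=-1)) ** 2   # instantaneous beta power
        power = power.mean(axis=0)                        # average over trials

        times = tmin + np.arange(epochs.shape[-1]) / sfreq
        base = power[:, (times >= baseline[0]) & (times < baseline[1])].mean(axis=-1)
        return 100.0 * (power - base[:, None]) / base[:, None], times

    if __name__ == "__main__":
        sfreq = 250.0
        fake = np.random.randn(40, 32, int(2.0 * sfreq))  # 40 trials, 32 channels, 2 s
        erd, times = beta_erd(fake, sfreq)
        print(erd.shape)  # (32, 500): channels x time points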

    Action-oriented Scene Understanding

    In order to allow robots to act autonomously, it is crucial that they not only describe their environment accurately but also identify how to interact with their surroundings. While we have witnessed tremendous progress in descriptive computer vision, approaches that explicitly target action are scarcer. This cumulative dissertation approaches the goal of interpreting visual scenes “in the wild” with respect to the actions implied by the scene. We call this approach action-oriented scene understanding. It involves identifying and judging opportunities for interaction with constituents of the scene (e.g. objects and their parts) as well as understanding object functions and how interactions will impact the future. All of these aspects are addressed on three levels of abstraction: elements, perception, and reasoning. On the elementary level, we investigate semantic and functional grouping of objects by analyzing annotated natural image scenes. We compare object label-based and visual context definitions with respect to their suitability for generating meaningful object class representations. Our findings suggest that representations generated from visual context are on par in terms of semantic quality with those generated from large quantities of text. The perceptive level concerns action identification. We propose a system to identify possible interactions for robots and humans with the environment (affordances) at the pixel level using state-of-the-art machine learning methods. Pixel-wise part annotations of images are transformed into 12 affordance maps. Using these maps, a convolutional neural network is trained to densely predict affordance maps from unknown RGB images. In contrast to previous work, this approach operates exclusively on RGB images during both training and testing, and yet achieves state-of-the-art performance. At the reasoning level, we extend the question from asking what actions are possible to what actions are plausible. For this, we gathered a dataset of household images associated with human ratings of the likelihoods of eight different actions. Based on the judgement provided by the human raters, we train convolutional neural networks to generate plausibility scores from unseen images. Furthermore, having considered only static scenes previously in this thesis, we propose a system that takes video input and predicts plausible future actions. Since this requires careful identification of relevant features in the video sequence, we analyze this particular aspect in detail using a synthetic dataset for several state-of-the-art video models. We identify feature learning as a major obstacle for anticipation in natural video data. The presented projects analyze the role of action in scene understanding from various angles and in multiple settings while highlighting the advantages of assuming an action-oriented perspective. We conclude that action-oriented scene understanding can augment classic computer vision in many real-life applications, in particular robotics.
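    As an illustration of the dense-prediction setup described above, here is a minimal PyTorch sketch of an encoder-decoder that maps an RGB image to 12 per-pixel affordance maps; the architecture, layer sizes, and loss are assumptions for illustration and not the dissertation's actual network.

    import torch
    import torch.nn as nn

    class AffordanceNet(nn.Module):
        """Toy encoder-decoder predicting 12 per-pixel affordance maps from RGB."""

        def __init__(self, n_affordances: int = 12):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, n_affordances, 4, stride=2, padding=1),
            )

        def forward(self, rgb: torch.Tensor) -> torch.Tensor:
            # logits per affordance; apply a sigmoid for independent per-pixel labels
            return self.decoder(self.encoder(rgb))

    if __name__ == "__main__":
        model = AffordanceNet()
        images = torch.randn(2, 3, 128, 128)                       # batch of RGB images
        target = torch.randint(0, 2, (2, 12, 128, 128)).float()    # fake affordance maps
        loss = nn.BCEWithLogitsLoss()(model(images), target)
        loss.backward()
        print(model(images).shape)  # torch.Size([2, 12, 128, 128])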

    No fruits without color: Cross-modal priming and EEG reveal different roles for different features across semantic categories

    Category-specific impairments witnessed in patients with semantic deficits have been broadly dissociated into natural and artificial kinds. However, how the category of food (more specifically, fruits and vegetables) fits into this distinction has been difficult to interpret, given a pattern of deficit that has inconsistently mapped onto either kind, despite its intuitive membership in the natural domain. The present study explores the effects of a manipulation of a visual sensory (i.e., color) or functional (i.e., orientation) feature on the subsequent semantic processing of fruits and vegetables (and tools, by comparison), first at the behavioral and then at the neural level. The categorization of natural (i.e., fruits/vegetables) and artificial (i.e., utensils) entities was investigated via cross-modal priming. Reaction time analysis indicated a reduction in priming for color-modified natural entities and orientation-modified artificial entities. Standard event-related potential (ERP) analysis was performed, in addition to linear classification. For natural entities, an N400 effect at central channel sites was observed for the color-modified condition relative to the normal and orientation conditions, with this difference confirmed by classification analysis. Conversely, there was no significant difference between conditions for the artificial category in either analysis. These findings provide strong evidence that color is an integral property in the categorization of fruits/vegetables, thus substantiating the claim that feature-based processing varies as a function of semantic category.
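    To make the linear classification analysis mentioned above concrete, the following sketch decodes two ERP conditions from mean amplitudes in an N400-like time window with a linear classifier; the 300-500 ms window, the logistic regression model, and the data layout are assumptions, not the study's exact analysis.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    def decode_conditions(epochs_a, epochs_b, sfreq, tmin, window=(0.3, 0.5)):
        """Linear decoding of two ERP conditions from mean amplitude in a time window.

        epochs_* : arrays (n_trials, n_channels, n_times)
        window   : analysis window in seconds (here a typical N400 range).
        """
        times = tmin + np.arange(epochs_a.shape[-1]) / sfreq
        mask = (times >= window[0]) & (times <= window[1])
        # mean amplitude per channel in the window -> one feature vector per trial
        X = np.concatenate([epochs_a[..., mask].mean(-1), epochs_b[..., mask].mean(-1)])
        y = np.concatenate([np.zeros(len(epochs_a)), np.ones(len(epochs_b))])
        clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        return cross_val_score(clf, X, y, cv=5).mean()

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        normal = rng.normal(size=(60, 32, 300))          # 60 trials, 32 channels
        color_modified = rng.normal(size=(60, 32, 300))
        print(decode_conditions(normal, color_modified, sfreq=250.0, tmin=-0.2))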

    Action-Related Representations

    Theories of grounded cognition state that there is a meaningful connection between action and cognition. Although these claims are widely accepted, the nature and structure of this connection is far from clear and is still a matter of controversy. This book argues for a type of cognitive representation that essentially combines cognition and action, and which is foundational for higher-order cognitive capacities.

    Looking into Actors, Objects and Their Interactions for Video Understanding

    Automatic video understanding is critical for enabling new applications in video surveillance, augmented reality, and beyond. Powered by deep networks that learn holistic representations of video clips and by large-scale annotated datasets, modern systems are capable of accurately recognizing hundreds of human activity classes. However, their performance degrades significantly as the number of actors in the scene or the complexity of the activities increases. Therefore, most of the research thus far has focused on videos that are short and/or contain a few activities performed only by adults. Furthermore, most current systems require expensive spatio-temporal annotations for training. These limitations prevent the deployment of such systems in real-life applications, such as detecting the activities of people and vehicles in extended surveillance videos. To address these limitations, this thesis focuses on developing data-driven, compositional, region-based video understanding models, motivated by the observation that actors, objects, and their spatio-temporal interactions are the building blocks of activities and the main content of video descriptions provided by humans. This thesis makes three main contributions. First, we propose a novel Graph Neural Network for representation learning on heterogeneous graphs that encode spatio-temporal interactions between actor and object regions in videos. This model can learn context-aware representations for detected actors and objects, which we leverage for detecting complex activities. Second, we propose an attention-based deep conditional generative model of sentences whose latent variables correspond to alignments between words in textual descriptions of videos and object regions. Building upon the framework of Conditional Variational Autoencoders, we train this model using only textual descriptions without bounding box annotations, and leverage its latent variables for localizing the actors and objects that are mentioned in generated or ground-truth descriptions of videos. Finally, we propose an actor-centric framework for real-time activity detection in videos that are extended both in space and time. Our framework leverages object detection and tracking to generate actor-centric tubelets, capturing all relevant spatio-temporal context for a single actor, and detects activities per tubelet based on contextual region embeddings. The models described have demonstrably improved the ability to temporally detect activities, as well as to ground words in visual inputs.
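    As a rough illustration of the first contribution, the sketch below implements one round of message passing over actor and object region nodes connected by interaction edges, yielding context-aware node embeddings; the layer design (a linear message function with a GRU update) is a generic assumption and not the thesis's actual Graph Neural Network.

    import torch
    import torch.nn as nn

    class InteractionGNNLayer(nn.Module):
        """One round of message passing over actor/object region nodes."""

        def __init__(self, dim: int):
            super().__init__()
            self.message = nn.Linear(2 * dim, dim)   # message from neighbor j to node i
            self.update = nn.GRUCell(dim, dim)       # fold aggregated messages into node state

        def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
            # node_feats: (n_nodes, dim); adj: (n_nodes, n_nodes) 0/1 interaction edges
            n = node_feats.size(0)
            senders = node_feats.unsqueeze(0).expand(n, n, -1)     # features of neighbor j
            receivers = node_feats.unsqueeze(1).expand(n, n, -1)   # features of node i
            msgs = torch.relu(self.message(torch.cat([receivers, senders], dim=-1)))
            agg = (adj.unsqueeze(-1) * msgs).sum(dim=1)            # sum over neighbors j
            return self.update(agg, node_feats)                    # context-aware embeddings

    if __name__ == "__main__":
        feats = torch.randn(5, 64)                  # 2 actors + 3 objects, 64-d region features
        adj = torch.zeros(5, 5)
        adj[0, 2] = adj[2, 0] = 1.0                 # actor 0 interacts with object 2
        layer = InteractionGNNLayer(64)
        print(layer(feats, adj).shape)              # torch.Size([5, 64])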