    Joint Perceptual Learning and Natural Language Acquisition for Autonomous Robots

    Understanding how children learn the components of their mother tongue and the meanings of individual words has long fascinated linguists and cognitive scientists. Robots face a similar challenge in understanding language and perception to allow for natural and effortless human-robot interaction. Acquiring such knowledge is challenging unless it is preprogrammed, which is no easy task either, and preprogramming solves neither the problem of language differences between individuals nor that of learning the meanings of new words. In this thesis, the problem of bootstrapping knowledge in language and vision for autonomous robots is addressed through novel techniques in grammar induction and word grounding to the perceptual world. The learning is achieved in a cognitively plausible, loosely supervised manner from raw linguistic and visual data. The visual data is collected using different robotic platforms deployed in real-world and simulated environments and equipped with different sensing modalities, while the linguistic data is collected using online crowdsourcing tools and volunteers. The presented framework does not rely on any particular robot or any specific sensors; rather, it is flexible with respect to whatever modalities the robot supports.

    The learning framework is divided into three processes. First, the raw perceptual data is clustered into a number of Gaussian components to learn the ‘visual concepts’. Second, frequent co-occurrence of words and visual concepts is used to learn the language grounding. Finally, the learned language grounding and visual concepts are used to induce probabilistic grammar rules that model the language structure; minimal illustrative sketches of these three steps are given after this abstract.

    In this thesis, the visual concepts refer to: (i) people’s faces and the appearance of their garments; (ii) objects and their perceptual properties; (iii) pairwise spatial relations; (iv) robot actions; and (v) human activities. The visual concepts are learned by first processing the raw visual data to find people and objects in the scene using state-of-the-art techniques in human pose estimation, object segmentation and tracking, and activity analysis. Once found, the concepts are learned incrementally using a combination of techniques: Incremental Gaussian Mixture Models with the Bayesian Information Criterion for simple visual concepts such as object colours and shapes, and spatio-temporal graphs with topic models for more complex visual concepts such as human activities and robot actions.

    Language grounding is enabled by seeking frequent co-occurrence between words and learned visual concepts; finding the correct grounding is formulated as an integer programming problem that seeks the best many-to-many matches between words and concepts. Grammar induction refers to the process of learning a formal grammar (usually a collection of rewrite rules, or productions) from a set of observations. In this thesis, Probabilistic Context Free Grammar rules are generated to model the language by mapping natural language sentences to learned visual concepts, as opposed to traditional supervised grammar induction techniques where learning is only made possible by manually annotated training examples on large datasets.

    The learning framework attains its cognitive plausibility from a number of sources. First, the learning is achieved by providing the robot with pairs of raw linguistic and visual inputs in a “show-and-tell” procedure akin to how human children learn about their environment. Second, no prior knowledge is assumed about the meaning of words or the structure of the language, except that there are different classes of words (corresponding to observable actions, spatial relations, and objects and their observable properties). Third, the knowledge in both language and vision is obtained incrementally, so that the gained knowledge can evolve to adapt to new observations without revisiting previously seen ones. Fourth, the robot learns about the visual world first and only then learns how it maps to language, which aligns with findings from cognitive studies on language acquisition suggesting that human infants develop considerable cognitive understanding of their environment in the pre-linguistic period of their lives. It should be noted that this work does not claim to model how humans learn about objects in their environments; rather, it is inspired by that process.

    For validation, four datasets are used, each containing temporally aligned video clips of people or robots performing activities together with sentences describing those clips. The video clips are collected using four robotic platforms: three robot arms in simple block-world scenarios and a mobile robot deployed in a challenging real-world office environment, observing different people performing complex activities. The linguistic descriptions for these datasets are obtained using Amazon Mechanical Turk and volunteers. The analysis performed on these datasets suggests that the learning framework is suitable for learning from complex real-world scenarios. The experimental results show that the learning framework enables (i) acquiring correct visual concepts from visual data; (ii) learning the word grounding for each of the extracted visual concepts; (iii) inducing correct grammar rules to model the language structure; (iv) using the gained knowledge to understand previously unseen linguistic commands; and (v) using the gained knowledge to generate well-formed natural language descriptions of novel scenes.
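
    The clustering step can be illustrated with a short sketch. The snippet below is a minimal, non-incremental stand-in for the thesis's Incremental Gaussian Mixture Models: it fits batch Gaussian mixtures with scikit-learn and selects the number of components (the ‘visual concepts’) with the Bayesian Information Criterion. The feature data and function names are illustrative assumptions, not the thesis code.

```python
# Minimal sketch: cluster raw perceptual features (e.g. object colours) into
# Gaussian components, choosing the number of components by minimising the
# Bayesian Information Criterion. Batch scikit-learn GMMs stand in for the
# incremental GMMs used in the thesis.
import numpy as np
from sklearn.mixture import GaussianMixture

def learn_visual_concepts(features, max_components=10, seed=0):
    """Return the GMM whose component count minimises the BIC."""
    best_model, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=seed).fit(features)
        bic = gmm.bic(features)
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return best_model

# Toy usage: three colour clusters in a 3-D feature space.
rng = np.random.default_rng(0)
colours = np.vstack([rng.normal(loc=centre, scale=0.05, size=(50, 3))
                     for centre in ([1, 0, 0], [0, 1, 0], [0, 0, 1])])
concepts = learn_visual_concepts(colours)
print("learned", concepts.n_components, "visual concepts")  # typically 3
```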
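
    The grounding step can likewise be sketched as a small integer program: binary variables select word/concept pairs so as to maximise co-occurrence scores (offset by a threshold), while every word and every concept must take part in at least one match, which permits many-to-many groundings. PuLP, the toy score matrix, and the threshold are assumptions made for the sake of the example.

```python
# Minimal sketch of grounding as integer programming: pick word/concept pairs
# that maximise thresholded co-occurrence scores, with every word and every
# concept grounded at least once (many-to-many matches are allowed).
import numpy as np
from pulp import LpProblem, LpVariable, LpMaximize, LpBinary, lpSum, PULP_CBC_CMD

words = ["red", "cube", "ball"]
concepts = ["colour_red", "shape_cube", "shape_sphere"]
score = np.array([[0.9, 0.1, 0.1],     # toy co-occurrence scores
                  [0.2, 0.8, 0.1],
                  [0.1, 0.2, 0.7]])
threshold = 0.5                        # pairs below this carry a penalty

prob = LpProblem("word_grounding", LpMaximize)
x = {(i, j): LpVariable(f"x_{i}_{j}", cat=LpBinary)
     for i in range(len(words)) for j in range(len(concepts))}
prob += lpSum((score[i, j] - threshold) * x[i, j] for (i, j) in x)
for i in range(len(words)):            # every word must be grounded
    prob += lpSum(x[i, j] for j in range(len(concepts))) >= 1
for j in range(len(concepts)):         # every concept must be named
    prob += lpSum(x[i, j] for i in range(len(words))) >= 1
prob.solve(PULP_CBC_CMD(msg=False))

grounding = [(words[i], concepts[j]) for (i, j) in x if x[i, j].value() == 1]
print(grounding)  # e.g. [('red', 'colour_red'), ('cube', 'shape_cube'), ...]
```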
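
    Finally, once words are grounded in concepts, the induced grammar can be written down as probabilistic context-free productions over the concept classes. The toy grammar and probabilities below (using NLTK) are purely illustrative of the representation; in the thesis the rules and probabilities are induced from data rather than written by hand.

```python
# Minimal sketch: a hand-written PCFG over grounded concept classes, used to
# parse a previously unseen command.
import nltk

grammar = nltk.PCFG.fromstring("""
    S      -> Action Object            [0.7]
    S      -> Action Object Rel Object [0.3]
    Action -> 'pick' 'up'              [0.5]
    Action -> 'move'                   [0.5]
    Object -> Colour Shape             [1.0]
    Colour -> 'red'                    [0.5]
    Colour -> 'green'                  [0.5]
    Shape  -> 'cube'                   [0.5]
    Shape  -> 'ball'                   [0.5]
    Rel    -> 'onto'                   [1.0]
""")
parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("move red cube onto green ball".split()):
    print(tree)
```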

    Representational fluidity in embodied (artificial) cognition

    Theories of embodied cognition agree that the body plays some role in human cognition, but disagree on the precise nature of this role. While the body is (together with the environment) fundamentally engrained in the so-called 4E (or multi-E) cognition stance, there also exist interpretations wherein the body is merely an input/output interface for cognitive processes that are entirely computational. In the present paper, we show that even if one takes such a strong computationalist position, the role of the body must be more than an interface to the world. To achieve human cognition, the computational mechanisms of a cognitive agent must be capable not only of appropriate reasoning over a given set of symbolic representations; they must in addition be capable of updating the representational framework itself (leading to the titular representational fluidity). We demonstrate this by considering the necessary properties that an artificial agent with these abilities needs to possess. The core of the argument is that these updates must be falsifiable in the Popperian sense while simultaneously steering representational shifts in a direction that benefits the agent. We show that this is achieved by the progressive, bottom-up symbolic abstraction of low-level sensorimotor connections, followed by top-down instantiation of testable perception-action hypotheses. We then discuss the fundamental limits of this representational updating capacity, concluding that only fully embodied learners exhibiting such a priori perception-action linkages are able to sufficiently ground spontaneously generated symbolic representations and exhibit the full range of human cognitive capabilities. The present paper therefore has consequences both for the theoretical understanding of human cognition and for the design of autonomous artificial agents.

    Epistemic Constraints on Autonomous Symbolic Representation in Natural and Artificial Agents

    We set out to address, in the form of a survey, the fundamental constraints upon self-updating representation in cognitive agents of natural and artificial origin. The foundational epistemic problem encountered by such agents is that of distinguishing errors of representation from inappropriateness of the representational framework. Resolving this conceptual difficulty involves ensuring the empirical falsifiability of both the representational hypotheses and the entities so represented, while at the same time retaining their epistemic distinguishability. We shall thus argue that perception-action frameworks provide an appropriate basis for the development of an empirically meaningful criterion for validating perceptual categories. In this scenario, hypotheses about the agent’s world are defined in terms of environmental affordances (characterised in terms of the agent’s active capabilities). Agents with the capability to abstract this framework hierarchically, to a level consonant with performing syntactic manipulations and making deductive conjectures, are consequently able to form an implicitly symbolic representation of the environment within which new, higher-level modes of environment manipulation are implied (e.g. tool use). This abstraction process is inherently open-ended, admitting a wide range of possible representational hypotheses; only the form of the lowest level of the hierarchy need be constrained a priori (this being the minimally sufficient condition for retaining the ability to falsify high-level hypotheses). In biological agents capable of autonomous cognitive updating, we argue that the grounding of such a priori ‘bootstrap’ representational hypotheses is ensured via the process of natural selection.