13 research outputs found

    Effects of Training Data Variation and Temporal Representation in a QSR-Based Action Prediction System

    Understanding of behaviour is a crucial skill for Artificial Intelligence systems expected to interact with external agents – whether other AI systems, or humans, in scenarios involving co-operation, such as domestic robots capable of helping out with household jobs, or disaster relief robots expected to collaborate and lend assistance to others. It is useful for such systems to be able to quickly learn and re-use models and skills in new situations. Our work centres around a behaviour-learning system utilising Qualitative Spatial Relations to lessen the amount of training data required by the system, and to aid generalisation. In this paper, we provide an analysis of the advantages provided to our system by the use of QSRs. We provide a comparison of a variety of machine learning techniques utilising both quantitative and qualitative representations, and show the effects of varying amounts of training data and temporal representations upon the system. The subject of our work is the game of simulated RoboCup Soccer Keepaway. Our results show that employing QSRs provides clear advantages in scenarios where training data is limited, and provides for better generalisation performance in classifiers. In addition, we show that adopting a qualitative representation of time can provide significant performance gains for QSR systems.
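    To make the idea of a qualitative representation concrete, the following is a minimal sketch of converting quantitative keepaway player positions into pairwise qualitative spatial relations. The thresholds and relation names here are illustrative assumptions, not the paper's actual calculus.

```python
# Minimal sketch: mapping quantitative keepaway positions to qualitative
# spatial relations. Thresholds and relation names are illustrative only.
import math

NEAR, MEDIUM, FAR = "near", "medium", "far"

def qualitative_distance(p, q, near_thresh=5.0, far_thresh=15.0):
    """Discretise the Euclidean distance between two players into a relation."""
    d = math.dist(p, q)
    if d < near_thresh:
        return NEAR
    if d < far_thresh:
        return MEDIUM
    return FAR

def qsr_state(positions):
    """Pairwise qualitative relations for one time step.

    `positions` maps player ids to (x, y) coordinates; the returned dict is
    the qualitative state used in place of raw coordinates.
    """
    ids = sorted(positions)
    return {
        (a, b): qualitative_distance(positions[a], positions[b])
        for i, a in enumerate(ids) for b in ids[i + 1:]
    }

# Example: one frame of a keepaway episode with two keepers and one taker.
frame = {"K1": (0.0, 0.0), "K2": (8.0, 2.0), "T1": (3.0, 1.0)}
print(qsr_state(frame))
# {('K1', 'K2'): 'medium', ('K1', 'T1'): 'near', ('K2', 'T1'): 'medium'}
```

    Learning then proceeds over sequences of such discrete states rather than raw coordinates, which is what reduces the amount of training data needed.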

    Benchmarking qualitative spatial calculi for video activity analysis

    This paper presents a general way of addressing problems in video activity understanding using graph-based relational learning. Video activities are described using relational spatio-temporal graphs that represent qualitative spatio-temporal relations between interacting objects. A wide range of spatio-temporal relations are introduced as being well suited for describing video activities. Then, a formulation is proposed in which standard problems in video activity understanding, such as event detection, are naturally mapped to problems in graph-based relational learning. Experiments on video understanding tasks, for a video dataset consisting of common outdoor verbs, validate the significance of the proposed approach.
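    As a rough illustration of a relational spatio-temporal graph, the sketch below builds spatial edges (a qualitative relation between each object pair per frame) and temporal edges (linking each object to itself in the next frame) from bounding-box tracks. The single connected/disconnected relation is a placeholder for the richer calculi benchmarked in the paper.

```python
# Sketch of a relational spatio-temporal graph from object tracks.
# Nodes are (object, frame) pairs; edges carry a qualitative relation.
from itertools import combinations

def spatial_relation(box_a, box_b):
    """Very coarse qualitative relation between two 2D boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    overlap = not (ax1 < bx0 or bx1 < ax0 or ay1 < by0 or by1 < ay0)
    return "connected" if overlap else "disconnected"

def build_st_graph(tracks):
    """tracks: {object_id: [box per frame]} -> (spatial edges, temporal edges)."""
    n_frames = len(next(iter(tracks.values())))
    spatial_edges = []   # ((obj_a, t), (obj_b, t), relation)
    temporal_edges = []  # ((obj, t), (obj, t + 1))
    for t in range(n_frames):
        for a, b in combinations(sorted(tracks), 2):
            spatial_edges.append(((a, t), (b, t),
                                  spatial_relation(tracks[a][t], tracks[b][t])))
    for obj, boxes in tracks.items():
        temporal_edges += [((obj, t), (obj, t + 1)) for t in range(len(boxes) - 1)]
    return spatial_edges, temporal_edges

tracks = {"person": [(0, 0, 2, 2), (1, 0, 3, 2)],
          "car":    [(5, 0, 8, 2), (2, 0, 5, 2)]}
spatial, temporal = build_st_graph(tracks)
print(spatial)   # the person and car become 'connected' in the second frame
```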

    Determining Interacting Objects in Human-Centric Activities via Qualitative Spatio-Temporal Reasoning

    Abstract. Understanding the activities taking place in a video is a chal-lenging problem in Artificial Intelligence. Complex video sequences con-tain many activities and involve a multitude of interacting objects. De-termining which objects are relevant to a particular activity is the first step in understanding the activity. Indeed many objects in the scene are irrelevant to the main activity taking place. In this work, we consider human-centric activities and look to identify which objects in the scene are involved in the activity. We take an activity-agnostic approach and rank every moving object in the scene with how likely it is to be involved in the activity. We use a comprehensive spatio-temporal representation that captures the joint movement between humans and each object. We then use supervised machine learning techniques to recognize relevant objects based on these features. Our approach is tested on the challeng-ing Mind’s Eye dataset.
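    A hedged sketch of the activity-agnostic ranking idea follows: each candidate object gets a feature vector summarising its joint movement with the person, and a supervised classifier scores how likely it is to be involved. The specific features and classifier here are assumptions for illustration, not the authors' exact pipeline.

```python
# Sketch: rank moving objects by predicted relevance to the person's activity.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def joint_movement_features(person_traj, obj_traj):
    """Summarise person-object joint movement over a clip (both N x 2 arrays)."""
    dists = np.linalg.norm(person_traj - obj_traj, axis=1)
    return np.array([dists.mean(), dists.std(), dists.min(), dists[-1] - dists[0]])

rng = np.random.default_rng(0)
# Toy training set: label 1 = object was involved in the activity.
X = rng.normal(size=(40, 4))
y = rng.integers(0, 2, size=40)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Rank every moving object in a new clip by its predicted relevance.
person = rng.normal(size=(30, 2))
candidates = {"cup": rng.normal(size=(30, 2)), "chair": rng.normal(size=(30, 2))}
scores = {
    name: clf.predict_proba(joint_movement_features(person, traj).reshape(1, -1))[0, 1]
    for name, traj in candidates.items()
}
print(sorted(scores, key=scores.get, reverse=True))  # objects ordered by relevance
```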

    Propositional and Activity Monitoring Using Qualitative Spatial Reasoning

    SM thesis. Communication is the key to effective teamwork regardless of whether the team members are humans or machines. Much of the communication that makes human teams so effective is non-verbal; they are able to recognize the actions that the other team members are performing and take their own actions in order to assist. A robotic team member should be able to make the same inferences, observing the state of the environment and inferring what actions are being taken. In this thesis I introduce a novel approach to the combined problem of activity recognition and propositional monitoring. This approach breaks the problem down into smaller sub-tasks. First, the raw sensor input is parsed into simple, easy-to-understand primitive semantic relationships known as qualitative spatial relations (QSRs). These primitives are then combined to estimate the state of the world in the same language used by most planners: Planning Domain Definition Language (PDDL) propositions. Both the primitives and the propositions are combined to infer the status of the actions that the human is taking. I describe an algorithm for solving each of these smaller problems and describe the modeling process for a variety of tasks from an abstracted electronic component assembly (ECA) scenario. I implemented this scenario on a robotic testbed and collected data of a human performing the example actions.
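    The QSR-to-proposition step can be pictured with a small sketch like the one below. The relation names ("touching", "above") and the PDDL-style propositions are illustrative assumptions, not the thesis's exact models.

```python
# Sketch: lifting per-frame QSR primitives into PDDL-style propositions.
def qsrs_to_propositions(qsrs):
    """qsrs: set of (relation, a, b) triples observed at one time step."""
    props = set()
    for rel, a, b in qsrs:
        # A hand touching an object is read as the human holding it.
        if rel == "touching" and a == "hand":
            props.add(f"(holding human {b})")
        # "Above" plus "touching" between the same pair is read as stacking.
        if rel == "above" and ("touching", a, b) in qsrs:
            props.add(f"(on {a} {b})")
    return props

frame_qsrs = {("touching", "hand", "resistor"),
              ("above", "resistor", "board"),
              ("touching", "resistor", "board")}
print(sorted(qsrs_to_propositions(frame_qsrs)))
# ['(holding human resistor)', '(on resistor board)']
```

    Tracking how such propositions change over time is what then allows the system to infer which action the human is currently performing.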

    Non-parametric Methods for Correlation Analysis in Multivariate Data with Applications in Data Mining

    In this thesis, we develop novel methods for correlation analysis in multivariate data, with a special focus on mining correlated subspaces. Our methods handle major open challenges that arise when combining correlation analysis with subspace mining. Besides traditional correlation analysis, we explore interaction-preserving discretization of multivariate data and causality analysis. We conduct experiments on a variety of real-world data sets. The results validate the benefits of our methods.
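    As a generic illustration of non-parametric correlation over candidate subspaces (here simply all 2-D attribute pairs scored with Spearman's rank correlation), the sketch below is a stand-in for the idea only, not the thesis's actual subspace-mining algorithms.

```python
# Score every 2-D subspace of a multivariate data set with a rank-based
# (non-parametric) correlation measure and report the strongest one.
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n = 200
data = np.column_stack([
    rng.normal(size=n),                        # attr 0: noise
    rng.normal(size=n),                        # attr 1: noise
    np.arange(n) + rng.normal(size=n),         # attr 2
    np.arange(n) ** 2 + rng.normal(size=n),    # attr 3: monotone with attr 2
])

scores = {}
for i, j in combinations(range(data.shape[1]), 2):
    rho, _ = spearmanr(data[:, i], data[:, j])
    scores[(i, j)] = abs(rho)   # rank-based, so robust to the non-linear link

best = max(scores, key=scores.get)
print(best, round(scores[best], 3))   # the (2, 3) subspace should rank highest
```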

    Categorization of Affordances and Prediction of Future Object Interactions using Qualitative Spatial Relations

    The application of deep neural networks on robotic platforms has successfully advanced robot perception in tasks related to human-robot collaboration scenarios. Tasks such as scene understanding, object categorization, affordance detection, and interaction anticipation are facilitated by the acquisition of knowledge about the object interactions taking place in the scene. The contributions of this thesis are two-fold: 1) it shows how representations of object interactions learned in an unsupervised way can be used to predict categories of objects depending on their affordances; 2) it shows how future, frame-independent interactions can be learned in a self-supervised way by exploiting high-level graph representations of the object interactions. The aim of this research is to create representations and perform predictions of interactions which abstract from the image space and attain generalization across various scenes and objects. Interactions can be static, e.g. holding a bottle, as well as dynamic, e.g. playing with a ball, where the temporal aspect of the sequence of several static interactions is of importance to make the dynamic interaction distinguishable. Moreover, occlusion of objects in the 2D domain should be handled to avoid false positive interaction detections. Thus, RGB-D video data is exploited for these tasks.
    As humans tend to use objects in many different ways depending on the scene and the objects' availability, learning object affordances in everyday-life scenarios is a challenging task, particularly in the presence of an open set of interactions and class-agnostic objects. In order to abstract from the continuous representation of spatio-temporal interactions in video data, a novel set of high-level qualitative depth-informed spatial relations is presented. Learning similarities via an unsupervised method exploiting graph representations of object interactions induces a hierarchy of clusters of objects with similar affordances. The proposed method handles object occlusions by effectively capturing possible interactions and without imposing any object or scene constraints.
    Moreover, interaction and action anticipation remains a challenging problem, especially considering the generalizability constraints of models trained on visual data or exploiting visual video embeddings. State-of-the-art methods allow predictions approximately up to three seconds into the future. Hence, most everyday-life activities, which consist of actions of more than five seconds in duration, are not predictable. This thesis presents a novel approach for solving the task of interaction anticipation between objects in a video scene by utilizing high-level qualitative frame-number-independent spatial graphs to represent object interactions. A deep recurrent neural network learns in a self-supervised way to predict graph structures of future object interactions, whilst being decoupled from the visual information, the underlying activity, and the duration of each interaction taking place. Finally, the proposed methods are evaluated on RGB-D video datasets capturing everyday-life activities of human agents, and are compared against closely-related and state-of-the-art methods.
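    The following is a highly simplified, hedged sketch of the anticipation idea: each frame's object interactions are encoded as a vector of qualitative relations (one slot per object pair, integer-coded relation), and a recurrent model is trained in a self-supervised way to predict the next frame's relations. The relation vocabulary, pair ordering, and GRU architecture are assumptions, not the thesis's model.

```python
# Predict the next interaction graph (coded as per-pair qualitative relations)
# from the sequence observed so far, using a small GRU.
import torch
import torch.nn as nn

RELATIONS = ["disconnected", "touching", "above", "behind"]  # illustrative set
N_PAIRS, N_REL = 3, len(RELATIONS)                           # e.g. 3 object pairs

class NextInteractionGraph(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.gru = nn.GRU(input_size=N_PAIRS * N_REL, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, N_PAIRS * N_REL)   # logits per pair/relation

    def forward(self, x):                  # x: (batch, time, N_PAIRS * N_REL), one-hot
        out, _ = self.gru(x)
        return self.head(out).reshape(x.shape[0], x.shape[1], N_PAIRS, N_REL)

# Toy self-supervised target: the relations observed at the next time step.
seq = torch.randint(0, N_REL, (8, 10, N_PAIRS))                # coded relations
x = nn.functional.one_hot(seq, N_REL).float().flatten(2)       # (8, 10, N_PAIRS * N_REL)
model = NextInteractionGraph()
logits = model(x[:, :-1])                                      # predict step t+1 from t
loss = nn.functional.cross_entropy(logits.reshape(-1, N_REL), seq[:, 1:].reshape(-1))
loss.backward()
print(float(loss))
```

    Because the targets come from the video itself (the next observed relations), no manual labels are needed, which is the sense in which the learning is self-supervised.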

    Joint Perceptual Learning and Natural Language Acquisition for Autonomous Robots

    Understanding how children learn the components of their mother tongue and the meanings of each word has long fascinated linguists and cognitive scientists. Equally, robots face a similar challenge in understanding language and perception to allow for a natural and effortless human-robot interaction. Acquiring such knowledge is a challenging task, unless this knowledge is preprogrammed, which is no easy task either, and preprogramming neither solves the problem of language differences between individuals nor that of learning the meanings of new words. In this thesis, the problem of bootstrapping knowledge in language and vision for autonomous robots is addressed through novel techniques in grammar induction and word grounding to the perceptual world. The learning is achieved in a cognitively plausible, loosely-supervised manner from raw linguistic and visual data. The visual data is collected using different robotic platforms deployed in real-world and simulated environments and equipped with different sensing modalities, while the linguistic data is collected using online crowdsourcing tools and volunteers. The presented framework does not rely on any particular robot or any specific sensors; rather, it is flexible to what the modalities of the robot can support.
    The learning framework is divided into three processes. First, the perceptual raw data is clustered into a number of Gaussian components to learn the ‘visual concepts’. Second, frequent co-occurrence of words and visual concepts is used to learn the language grounding, and finally, the learned language grounding and visual concepts are used to induce probabilistic grammar rules to model the language structure. In this thesis, the visual concepts refer to: (i) people’s faces and the appearance of their garments; (ii) objects and their perceptual properties; (iii) pairwise spatial relations; (iv) the robot actions; and (v) human activities. The visual concepts are learned by first processing the raw visual data to find people and objects in the scene using state-of-the-art techniques in human pose estimation, object segmentation and tracking, and activity analysis. Once found, the concepts are learned incrementally using a combination of techniques: Incremental Gaussian Mixture Models and a Bayesian Information Criterion to learn simple visual concepts such as object colours and shapes; spatio-temporal graphs and topic models to learn more complex visual concepts, such as human activities and robot actions. Language grounding is enabled by seeking frequent co-occurrence between words and learned visual concepts. Finding the correct language grounding is formulated as an integer programming problem to find the best many-to-many matches between words and concepts. Grammar induction refers to the process of learning a formal grammar (usually as a collection of re-write rules or productions) from a set of observations. In this thesis, Probabilistic Context Free Grammar rules are generated to model the language by mapping natural language sentences to learned visual concepts, as opposed to traditional supervised grammar induction techniques where the learning is only made possible by using manually annotated training examples on large datasets.
    The learning framework attains its cognitive plausibility from a number of sources. First, the learning is achieved by providing the robot with pairs of raw linguistic and visual inputs in a “show-and-tell” procedure akin to how human children learn about their environment. Second, no prior knowledge is assumed about the meaning of words or the structure of the language, except that there are different classes of words (corresponding to observable actions, spatial relations, and objects and their observable properties). Third, the knowledge in both language and vision is obtained in an incremental manner, where the gained knowledge can evolve to adapt to new observations without the need to revisit previously seen ones. Fourth, the robot learns about the visual world first and then learns how it maps to language, which aligns with the findings of cognitive studies on language acquisition in human infants suggesting that children develop considerable cognitive understanding of their environment in the pre-linguistic period of their lives. It should be noted that this work does not claim to model how humans learn about objects in their environments, but rather is inspired by it.
    For validation, four different datasets are used which contain temporally aligned video clips of people or robots performing activities, and sentences describing these video clips. The video clips are collected using four robotic platforms: three robot arms in simple block-world scenarios and a mobile robot deployed in a challenging real-world office environment observing different people performing complex activities. The linguistic descriptions for these datasets are obtained using Amazon Mechanical Turk and volunteers. The analysis performed on these datasets suggests that the learning framework is suitable for learning from complex real-world scenarios. The experimental results show that the learning framework enables (i) acquiring correct visual concepts from visual data; (ii) learning the word grounding for each of the extracted visual concepts; (iii) inducing correct grammar rules to model the language structure; (iv) using the gained knowledge to understand previously unseen linguistic commands; and (v) using the gained knowledge to generate well-formed natural language descriptions of novel scenes.
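    One piece of the pipeline described above, selecting how many Gaussian components ("visual concepts", e.g. colour clusters) to keep via the Bayesian Information Criterion, can be sketched as follows. This uses a standard batch GMM rather than the incremental variant in the thesis, purely for illustration; the toy colour features are assumptions.

```python
# Choose the number of Gaussian components for colour-like features via BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy colour features (e.g. mean HSV of object segments) drawn from 3 clusters.
colours = np.vstack([rng.normal(loc=c, scale=0.05, size=(60, 3))
                     for c in ([0.0, 0.9, 0.9], [0.33, 0.8, 0.7], [0.66, 0.7, 0.8])])

bics = {k: GaussianMixture(n_components=k, random_state=0).fit(colours).bic(colours)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)   # BIC trades off fit against model complexity
print(best_k)                      # expected to recover roughly 3 colour concepts
```

    Each retained component then serves as one candidate visual concept to be grounded against frequently co-occurring words.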

    Recognition of complex human activities in multimedia streams using machine learning and computer vision

    Modelling human activities observed in multimedia streams as temporal sequences of their constituent actions has been the object of much research effort in recent years. However, most of this work concentrates on tasks where the action vocabulary is relatively small and/or each activity can be performed in a limited number of ways. In this thesis, a novel and robust framework is proposed for modelling and analysing composite, prolonged activities arising in tasks which can be effectively executed in a variety of ways. Additionally, the proposed framework is designed to handle cognitive tasks, which cannot be captured using conventional types of sensors. It is shown that the proposed methodology is able to efficiently analyse and recognise complex activities arising in such tasks and also to detect potential errors in their execution. To achieve this, a novel activity classification method is introduced, comprising a feature selection stage based on the novel Key Actions Discovery method and a classification stage that combines Random Forests and Hierarchical Hidden Markov Models. Experimental results captured in several scenarios arising from real-life applications, including a novel application to a bridge design problem, show that the proposed framework offers higher classification accuracy than current activity identification schemes.
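    A hedged sketch of the classification stage only is given below: activities are represented by counts of their constituent actions, a simple discriminativeness score stands in for the Key Actions Discovery step, and a Random Forest performs the final classification. The Hierarchical Hidden Markov Model stage of the framework is omitted here, and all features and scores are illustrative assumptions.

```python
# Select "key" actions by how much their mean count varies across activity
# classes, then classify activities from those counts with a Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_actions, n_samples = 20, 120
X = rng.poisson(lam=2.0, size=(n_samples, n_actions)).astype(float)   # action counts
y = rng.integers(0, 3, size=n_samples)                                # activity labels
X[y == 1, 3] += 4          # make a few actions genuinely "key" for some activities
X[y == 2, 7] += 4

# Crude key-action score: variance of the per-class mean count for each action.
class_means = np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])
key_actions = np.argsort(class_means.var(axis=0))[::-1][:5]           # top 5 actions

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:, key_actions], y)
print(sorted(key_actions.tolist()), clf.score(X[:, key_actions], y))
```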