    Occlusion resistant learning of intuitive physics from videos

    To reach human performance on complex tasks, a key ability for artificial systems is to understand physical interactions between objects and predict future outcomes of a situation. This ability, often referred to as intuitive physics, has recently received attention, and several methods have been proposed to learn these physical rules from video sequences. Yet, most of these methods are restricted to the case where no, or only limited, occlusions occur. In this work we propose a probabilistic formulation of learning intuitive physics in 3D scenes with significant inter-object occlusions. In our formulation, object positions are modeled as latent variables enabling the reconstruction of the scene. We then propose a series of approximations that make this problem tractable. Object proposals are linked across frames using a combination of a recurrent interaction network, modeling the physics in object space, and a compositional renderer, modeling the way in which objects project onto pixel space. We demonstrate significant improvements over the state of the art on the IntPhys intuitive physics benchmark. We apply our method to a second dataset with increasing levels of occlusions, showing that it realistically predicts segmentation masks up to 30 frames in the future. Finally, we also show results on predicting the motion of objects in real videos.
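    As a rough illustration of the approach summarized above, the sketch below pairs a recurrent interaction network operating on per-object latent states with a compositional renderer that projects those states onto pixel space. All module names, dimensions and design choices here are illustrative assumptions, not the authors' released code.

        # Hypothetical sketch of the two components named in the abstract: a
        # recurrent interaction network over per-object latent states, and a
        # compositional renderer that decodes each state into a mask and
        # composites them. Shapes and architecture choices are assumptions.
        import torch
        import torch.nn as nn

        class RecurrentInteractionNetwork(nn.Module):
            """Predicts the next latent state of every object from the current
            states of all objects (pairwise interaction effects + recurrence)."""
            def __init__(self, state_dim=32, hidden_dim=64):
                super().__init__()
                self.pairwise = nn.Sequential(
                    nn.Linear(2 * state_dim, hidden_dim), nn.ReLU(),
                    nn.Linear(hidden_dim, hidden_dim))
                self.cell = nn.GRUCell(state_dim + hidden_dim, state_dim)

            def forward(self, states):                 # states: (num_objects, state_dim)
                n = states.size(0)
                senders = states.unsqueeze(0).expand(n, n, -1)
                receivers = states.unsqueeze(1).expand(n, n, -1)
                effects = self.pairwise(torch.cat([receivers, senders], dim=-1))
                effects = effects.sum(dim=1)           # aggregate effects per receiver
                return self.cell(torch.cat([states, effects], dim=-1), states)

        class CompositionalRenderer(nn.Module):
            """Decodes each object state into a mask independently, then
            composites the per-object masks into one frame-level mask."""
            def __init__(self, state_dim=32, height=64, width=64):
                super().__init__()
                self.height, self.width = height, width
                self.decoder = nn.Sequential(
                    nn.Linear(state_dim, 256), nn.ReLU(),
                    nn.Linear(256, height * width))

            def forward(self, states):                 # states: (num_objects, state_dim)
                masks = self.decoder(states).view(-1, self.height, self.width)
                return torch.sigmoid(masks).max(dim=0).values  # simple max-composition

        # Rolling the dynamics forward yields future masks from an initial estimate.
        states = torch.randn(3, 32)                    # three detected objects
        dynamics, renderer = RecurrentInteractionNetwork(), CompositionalRenderer()
        for _ in range(30):                            # predict 30 frames ahead
            states = dynamics(states)
        predicted_mask = renderer(states)              # (64, 64) composite mask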

    Apprentissage non-supervisé de la physique intuitive (Unsupervised learning of intuitive physics)

    To reach human performance on complex tasks, a key ability for artificial intelligence systems is to understand physical interactions between objects and predict future outcomes of a situation. In this thesis we investigate how a system can learn this ability, often referred to as intuitive physics, from videos with minimal annotation. Our first contribution is an evaluation benchmark, named IntPhys, which diagnoses how much a system understands intuitive physics. Inspired by work on infant development, we propose a Violation-of-Expectation procedure in which the system must tell apart well-matched videos of possible versus impossible events constructed with a game engine. We describe two Convolutional Neural Networks trained on a forward prediction task, and compare their results with human data acquired with Amazon Mechanical Turk. The analysis of these results shows the limitations of CNN encoder-decoders with no structured representation of objects when it comes to predicting long-term object trajectories, especially in the case of occlusions. In a second work, we propose a probabilistic formulation of learning intuitive physics in 3D scenes with significant inter-object occlusions, in which object positions are modelled as latent variables, enabling the reconstruction of the scene. We propose a series of approximations that make this problem tractable and introduce a compositional neural network demonstrating significant improvements on the IntPhys intuitive physics benchmark. We evaluate this model on a second dataset with increasing levels of occlusions, showing that it realistically predicts segmentation masks up to 30 frames in the future. In a third work, we adapt this approach to a real-life application: predicting future instance masks of objects in the Cityscapes dataset, made of video sequences recorded in the streets of 50 cities. We use a state-of-the-art object detector to estimate object states, then apply the model presented above to predict object instance masks up to 9 frames in the future. In addition, we propose a method to decouple ego-motion from the objects' motion, making it easier to learn long-term object dynamics.
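    The ego-motion decoupling mentioned above can be pictured with the toy sketch below: object tracks are first expressed in a static world frame using per-frame camera poses, object-intrinsic motion is extrapolated there, and the prediction is mapped back into the future camera frame. The 2D setting, the constant-velocity model and all function names are simplifying assumptions, not the thesis implementation.

        # Toy illustration of decoupling ego-motion from object motion: known
        # camera poses (e.g. from odometry) move object tracks into a fixed
        # world frame, so the dynamics model only has to capture the object's
        # own motion. Everything here is a simplified 2D assumption.
        import numpy as np

        def to_world(p_cam, pose):
            """Camera-frame point -> world frame; pose = (R, t), R 2x2, t (2,)."""
            R, t = pose
            return R @ p_cam + t

        def to_camera(p_world, pose):
            """World-frame point -> camera frame for the given pose."""
            R, t = pose
            return R.T @ (p_world - t)

        def predict_future_position(track_cam, poses, future_pose):
            """Constant-velocity extrapolation carried out in the world frame."""
            world = [to_world(p, pose) for p, pose in zip(track_cam, poses)]
            velocity = world[-1] - world[-2]       # object-intrinsic motion only
            return to_camera(world[-1] + velocity, future_pose)

        # Example: the camera drives forward while the object stays still, so the
        # object appears to drift backwards in the image; in the world frame its
        # velocity is zero and the prediction stays consistent.
        I = np.eye(2)
        poses = [(I, np.array([0.0, 0.0])), (I, np.array([0.0, 1.0]))]
        future_pose = (I, np.array([0.0, 2.0]))
        track_cam = [np.array([1.0, 5.0]), np.array([1.0, 4.0])]
        print(predict_future_position(track_cam, poses, future_pose))  # [1. 3.]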

    IntPhys: A Benchmark for Visual Intuitive Physics Reasoning

    In order to reach human performance on complex visual tasks, artificial systems need to incorporate a significant amount of understanding of the world in terms of macroscopic objects, movements, forces, etc. Inspired by work on intuitive physics in infants, we propose an evaluation framework which diagnoses how much a given system understands about physics by testing whether it can tell apart well-matched videos of possible versus impossible events. The test requires systems to compute a physical plausibility score over an entire video. It is free of bias and can test a range of specific physical reasoning skills. We then describe the first release of a benchmark dataset aimed at learning intuitive physics in an unsupervised way, using videos constructed with a game engine. We describe two Deep Neural Network baseline systems trained with a future frame prediction objective and tested on the possible versus impossible discrimination task. The analysis of their results compared to human data gives novel insights into the potential and limitations of next-frame prediction architectures.
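    One simple way to obtain the required plausibility score from a future-frame-prediction model, sketched below, is to accumulate the prediction error over the video and negate it, so that surprising (physically implausible) videos score low; matched possible and impossible videos can then be compared directly. The predictor interface and the pairwise evaluation are assumptions for illustration, not the official benchmark code.

        # Hedged sketch: a video-level plausibility score derived from a
        # frame-prediction model, and a matched possible-vs-impossible
        # comparison in the spirit of the benchmark. Interfaces are assumed.
        import numpy as np

        def plausibility_score(frames, predict_next):
            """frames: array (T, H, W); predict_next maps past frames to the
            predicted next frame. Higher score = more physically plausible."""
            surprise = 0.0
            for t in range(1, len(frames)):
                predicted = predict_next(frames[:t])
                surprise += float(np.mean((predicted - frames[t]) ** 2))
            return -surprise

        def relative_accuracy(matched_sets, predict_next):
            """Each matched set pairs possible and impossible videos of the same
            scene; it counts as correct when every possible video scores higher
            than every impossible one."""
            correct = 0
            for possible_videos, impossible_videos in matched_sets:
                pos = [plausibility_score(v, predict_next) for v in possible_videos]
                neg = [plausibility_score(v, predict_next) for v in impossible_videos]
                correct += int(min(pos) > max(neg))
            return correct / len(matched_sets)

        # Toy usage with a trivial "copy the last frame" predictor.
        rng = np.random.default_rng(0)
        videos = [rng.random((10, 8, 8)) for _ in range(4)]
        copy_last = lambda past: past[-1]
        print(relative_accuracy([((videos[0], videos[1]), (videos[2], videos[3]))],
                                copy_last))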

    IntPhys: A Framework and Benchmark for Visual Intuitive Physics Reasoning

    In order to reach human performance on complex visual tasks, artificial systems need to incorporate a significant amount of understanding of the world in terms of macroscopic objects, movements, forces, etc. Inspired by work on intuitive physics in infants, we propose an evaluation framework which diagnoses how much a given system understands about physics by testing whether it can tell apart well-matched videos of possible versus impossible events. The test requires systems to compute a physical plausibility score over an entire video. It is free of bias and can test a range of specific physical reasoning skills. We then describe the first release of a benchmark dataset aimed at learning intuitive physics in an unsupervised way, using videos constructed with a game engine. We describe two Deep Neural Network baseline systems trained with a future frame prediction objective and tested on the possible versus impossible discrimination task. The analysis of their results compared to human data gives novel insights into the potential and limitations of next-frame prediction architectures.

    IntPhys 2019: A Benchmark for Visual Intuitive Physics Understanding

    In order to reach human performance on complex visual tasks, artificial systems need to incorporate a significant amount of understanding of the world in terms of macroscopic objects, movements, forces, etc. Inspired by work on intuitive physics in infants, we propose an evaluation framework which diagnoses how much a given system understands about physics by testing whether it can tell apart well-matched videos of possible versus impossible events. The test requires systems to compute a physical plausibility score over an entire video. It is free of bias and can test a range of specific physical reasoning skills. We then describe the first release of a benchmark dataset aimed at learning intuitive physics in an unsupervised way, using videos constructed with a game engine. We describe two Deep Neural Network baseline systems trained with a future frame prediction objective and tested on the possible versus impossible discrimination task. The analysis of their results compared to human data gives novel insights into the potential and limitations of next-frame prediction architectures.