
    ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation

    Humans intuitively understand that inanimate objects do not move by themselves, but that state changes are typically caused by human manipulation (e.g., the opening of a book). This is not yet the case for machines. In part this is because there exist no datasets with ground-truth 3D annotations for the study of physically consistent and synchronised motion of hands and articulated objects. To this end, we introduce ARCTIC -- a dataset of two hands that dexterously manipulate objects, containing 2.1M video frames paired with accurate 3D hand and object meshes and detailed, dynamic contact information. It contains bimanual articulation of objects such as scissors or laptops, where hand poses and object states evolve jointly in time. We propose two novel articulated hand-object interaction tasks: (1) Consistent motion reconstruction: Given a monocular video, the goal is to reconstruct two hands and articulated objects in 3D, so that their motions are spatio-temporally consistent. (2) Interaction field estimation: Dense relative hand-object distances must be estimated from images. We introduce two baselines, ArcticNet and InterField respectively, and evaluate them qualitatively and quantitatively on ARCTIC.
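The interaction-field task above asks for dense relative hand-object distances. As a rough illustration of that quantity only (not the ArcticNet or InterField baselines), the sketch below computes, for each hand-mesh vertex, the distance to the closest object-mesh vertex and vice versa; the array shapes, the KD-tree approach and the 5 mm contact threshold are assumptions for the toy example.

```python
# Minimal sketch of a dense hand-object "interaction field": for every hand-mesh
# vertex, the distance to the closest object-mesh vertex (and vice versa).
# Illustrative only; the mesh arrays and the KD-tree are assumptions, not ARCTIC tooling.
import numpy as np
from scipy.spatial import cKDTree

def interaction_field(hand_verts: np.ndarray, obj_verts: np.ndarray):
    """hand_verts: (H, 3), obj_verts: (O, 3) vertex positions in a shared frame."""
    hand_to_obj = cKDTree(obj_verts).query(hand_verts)[0]   # (H,) distances
    obj_to_hand = cKDTree(hand_verts).query(obj_verts)[0]   # (O,) distances
    return hand_to_obj, obj_to_hand

# Toy usage with random vertex clouds standing in for a hand mesh and an object mesh.
rng = np.random.default_rng(0)
h2o, o2h = interaction_field(rng.normal(size=(778, 3)), rng.normal(size=(4000, 3)))
contact_mask = h2o < 0.005   # e.g. flag hand vertices within 5 mm as "in contact"
```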

    Getting the Upper Hand: Natural Gesture Interfaces Improve Instructional Efficiency on a Conceptual Computer Lesson

    As gesture-based interactions with computer interfaces become more technologically feasible for educational and training systems, it is important to consider what interactions are best for the learner. Computer interactions should not interfere with learning nor increase the mental effort of completing the lesson. The purpose of the current set of studies was to determine whether natural gesture-based interactions, or instruction of those gestures, help the learner in a computer lesson by increasing learning and reducing mental effort. First, two studies were conducted to determine what gestures were considered natural by participants. Then, those gestures were implemented in an experiment to compare type of gesture and type of gesture instruction on learning conceptual information from a computer lesson. The goal of these studies was to determine the instructional efficiency (that is, the extent of learning taking into account the amount of mental effort) of implementing gesture-based interactions in a conceptual computer lesson. To test whether the type of gesture interaction affects conceptual learning in a computer lesson, the gesture-based interactions were either naturally or arbitrarily mapped to the learning material on the fundamentals of optics. The optics lesson presented conceptual information about reflection and refraction, and participants used the gesture-based interactions during the lesson to manipulate on-screen lenses and mirrors in a beam of light. The beam of light refracted or reflected at the angle corresponding to the type of lens or mirror. The natural gesture-based interactions were those that mimicked the physical movement used to manipulate the lenses and mirrors in the optics lesson, while the arbitrary gestures were those that did not match the movement of the lens or mirror being manipulated. The natural gestures implemented in the computer lesson were determined from Study 1, in which participants performed gestures they considered natural for a set of actions, and were rated in Study 2 as most closely resembling the physical interaction they represent. The arbitrary gestures were those rated by participants as most arbitrary for each computer action in Study 2. To test whether the effect of novel gesture-based interactions depends on how they are taught, the way the gestures were instructed was varied in the main experiment by using either video- or text-based tutorials. Results of the experiment indicate that natural gesture-based interactions were better for learning than arbitrary gestures, and that instruction of the gestures largely did not affect learning or the amount of mental effort felt during the task. To further investigate the factors affecting instructional efficiency in using gesture-based interactions for a computer lesson, individual differences of the learner were taken into account. Results indicated that the instructional efficiency of the gestures and their instruction depended on an individual's spatial ability, such that arbitrary gesture interactions taught with a text-based tutorial were particularly inefficient for those with lower spatial ability. These findings are explained in the context of Embodied Cognition and Cognitive Load Theory, and guidelines are provided for instructional design of computer lessons using natural user interfaces.
The theoretical frameworks of Embodied Cognition and Cognitive Load Theory were used to explain why the gesture-based interactions and their instruction affected instructional efficiency in the computer lesson. Gesture-based interactions that are natural (i.e., that mimic the physical interaction and so correspond to the learning material) were more instructionally efficient than arbitrary gestures, because natural gestures may aid schema development of conceptual information through physical enactment of the learning material. Furthermore, natural gestures resulted in lower cognitive load than arbitrary gestures, because arbitrary gestures that do not match the learning material may increase working-memory processing that is not associated with the learning material during the lesson. Additionally, the way in which the gesture-based interactions were taught was varied by instructing the gestures with either video- or text-based tutorials. It was hypothesized that video-based tutorials would be a better way to instruct gesture-based interactions because the videos may help the learner visualize the interactions and create a more easily recalled sensorimotor representation of the gestures; however, this hypothesis was not supported, and there was no strong evidence that video-based tutorials were more instructionally efficient than text-based instructions. The results of the current set of studies can be applied to educational and training systems that incorporate a gesture-based interface. The finding that more natural gestures are better for learning efficiency, cognitive load, and a variety of usability factors should encourage instructional designers and researchers to keep the user in mind when developing gesture-based interactions.
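The abstract defines instructional efficiency as learning taken relative to the mental effort expended, but does not spell out the computation here. A common formulation in this literature is the Paas and van MerriĂ«nboer (1993) relative-efficiency score, E = (z_performance - z_effort) / sqrt(2); the sketch below assumes that measure (the formula itself is standard, but whether it is the exact one used in these studies is an assumption).

```python
# Hedged sketch of the standard relative instructional-efficiency measure:
# E = (z_performance - z_effort) / sqrt(2). Learning scores and self-reported
# mental-effort ratings are z-standardized across conditions; a condition is
# "efficient" when performance is high relative to the effort invested.
import numpy as np

def instructional_efficiency(performance: np.ndarray, effort: np.ndarray) -> np.ndarray:
    z_perf = (performance - performance.mean()) / performance.std(ddof=1)
    z_eff = (effort - effort.mean()) / effort.std(ddof=1)
    return (z_perf - z_eff) / np.sqrt(2)

# Toy example with hypothetical data for two conditions (natural vs. arbitrary gestures).
perf = np.array([0.82, 0.79, 0.85, 0.61, 0.58, 0.66])   # test scores
eff = np.array([3.1, 3.4, 2.9, 5.2, 5.6, 4.8])          # 1-9 mental-effort ratings
print(instructional_efficiency(perf, eff))              # positive values = efficient
```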

    Long-term future prediction under uncertainty and multi-modality

    Humans have an innate ability to excel at activities that involve prediction of complex object dynamics, such as predicting the possible trajectory of a billiard ball after it has been hit by the player or predicting the motion of pedestrians on the road. A key feature that enables humans to perform such tasks is anticipation. There has been continuous research in the areas of Computer Vision and Artificial Intelligence to mimic this human ability so that autonomous agents can succeed in real-world scenarios. Recent advances in the field of deep learning and the availability of large-scale datasets have enabled the pursuit of fully autonomous agents with complex decision-making abilities, such as self-driving vehicles or robots. One of the main challenges in deploying these agents in the real world is their ability to perform anticipation tasks with at least human-level efficiency.
To advance the field of autonomous systems, particularly self-driving agents, in this thesis we focus on the task of future prediction in diverse real-world settings, ranging from deterministic scenarios, such as predicting the paths of balls on a billiard table, to predicting the future of non-deterministic street scenes. Specifically, we identify certain core challenges for long-term future prediction: long-term prediction, uncertainty, multi-modality, and exact inference. To address these challenges, this thesis makes the following core contributions. Firstly, for accurate long-term predictions, we develop approaches that effectively utilize available observed information in the form of image boundaries in videos or interactions in street scenes. Secondly, as uncertainty increases into the future in non-deterministic scenarios, we leverage Bayesian inference frameworks to capture calibrated distributions of likely future events. Finally, to further improve performance in highly multi-modal non-deterministic scenarios such as street scenes, we develop deep generative models based on conditional variational autoencoders as well as normalizing-flow-based exact inference methods. Furthermore, we introduce a novel dataset with dense pedestrian-vehicle interactions to further aid the development of anticipation methods for autonomous driving applications in urban environments.
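One way to make the multi-modality point above concrete is a conditional VAE over future trajectories: a latent variable captures which of several plausible futures occurs, so sampling different latents for the same observed past yields different hypotheses. The sketch below is a generic illustration of this idea under assumed toy dimensions, not the architectures developed in the thesis.

```python
# Minimal conditional-VAE sketch for multi-modal trajectory prediction.
# All dimensions and names are assumptions for the illustration.
import torch
import torch.nn as nn

class TrajectoryCVAE(nn.Module):
    def __init__(self, past_dim=16, fut_dim=24, z_dim=8, hidden=64):
        super().__init__()
        # Recognition network q(z | past, future), used only during training.
        self.enc = nn.Sequential(nn.Linear(past_dim + fut_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))
        # Decoder p(future | past, z).
        self.dec = nn.Sequential(nn.Linear(past_dim + z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, fut_dim))
        self.z_dim = z_dim

    def forward(self, past, future):
        mu, logvar = self.enc(torch.cat([past, future], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()        # reparameterisation
        recon = self.dec(torch.cat([past, z], dim=-1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return ((recon - future) ** 2).mean() + kl                   # ELBO-style loss

    @torch.no_grad()
    def sample_futures(self, past, n=5):
        # Draw several futures for the same observed past (multi-modality).
        z = torch.randn(n, past.shape[0], self.z_dim)
        return torch.stack([self.dec(torch.cat([past, zi], dim=-1)) for zi in z])

model = TrajectoryCVAE()
past, future = torch.randn(32, 16), torch.randn(32, 24)    # flattened toy trajectories
loss = model(past, future)
futures = model.sample_futures(past[:1], n=5)               # 5 hypotheses, shape (5, 1, 24)
```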

    Visual system identification: learning physical parameters and latent spaces from pixels

    In this thesis, we develop machine learning systems that are able to leverage the knowledge of equations of motion (scene-specific or scene-agnostic) to perform object discovery, physical parameter estimation, position and velocity estimation, camera pose estimation, and to learn structured latent spaces that satisfy physical dynamics rules. These systems are unsupervised, learning from unlabelled videos, and use as inductive biases the general equations of motion followed by the objects of interest in the scene. This is an important task, as in many complex real-world environments ground-truth states are not available, although there is physical knowledge of the underlying system. Our goals with this approach, i.e. the integration of physics knowledge with unsupervised learning models, are to improve vision-based prediction, enable new forms of control, increase data-efficiency and provide model interpretability, all of which are key areas of interest in machine learning. With the above goals in mind, we start by asking the following question: given a scene in which the objects’ motions are known up to some physical parameters (e.g. a ball bouncing off the floor with unknown restitution coefficient), how do we build a model that uses such knowledge to discover the objects in the scene and estimate these physical parameters? Our first model, PAIG (Physics-as-Inverse-Graphics), approaches this problem from a vision-as-inverse-graphics perspective, describing the visual scene as a composition of objects defined by their location and appearance, which are rendered onto the frame in a graphics manner. This is a known approach in the unsupervised learning literature, where the fundamental problem then becomes that of derendering, that is, inferring and discovering these locations and appearances for each object. In PAIG we introduce a key rendering component, the Coordinate-Consistent Decoder, which enables the integration of the known equations of motion with an inverse-graphics autoencoder architecture (trainable end-to-end), to perform simultaneous object discovery and physical parameter estimation. Although trained on simple simulated 2D scenes, we show that knowledge of the physical equations of motion of the objects in the scene can be used to greatly improve future prediction and provide physical scene interpretability. Our second model, V-SysId, tackles the limitations shown by the PAIG architecture, namely the training difficulty, the restriction to simulated 2D scenes, and the need for noiseless scenes without distractors. Here, we approach the problem from first principles by asking the question: are neural networks a necessary component to solve this problem? Can we use simpler ideas from classical computer vision instead? With V-SysId, we approach the problem of object discovery and physical parameter estimation from a keypoint extraction, tracking and selection perspective, composed of 3 separate stages: proposal keypoint extraction and tracking, 3D equation fitting and camera pose estimation from 2D trajectories, and entropy-based trajectory selection. Since all the stages use lightweight algorithms and optimisers, V-SysId is able to perform joint object discovery, physical parameter and camera pose estimation from even a single video, drastically improving data-efficiency. Additionally, because it does not use a rendering/derendering approach, it can be used in real 3D scenes with many distractor objects.
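As a rough illustration of the equation-fitting idea behind these models (not the PAIG or V-SysId pipelines themselves, which couple it with object discovery and camera pose estimation), the sketch below recovers the physical parameters of a ballistic point mass from a noisy tracked trajectory by linear least squares; every name and number in it is an assumption for the toy example.

```python
# Hedged sketch: fit known equations of motion to an observed 1D trajectory
# to recover physical parameters (initial height, initial velocity, gravity).
import numpy as np

def fit_ballistic(t: np.ndarray, y: np.ndarray):
    """Fit y(t) = y0 + v0*t - 0.5*g*t^2 by least squares, returning (y0, v0, g)."""
    A = np.stack([np.ones_like(t), t, -0.5 * t**2], axis=1)
    y0, v0, g = np.linalg.lstsq(A, y, rcond=None)[0]
    return y0, v0, g

# Simulate a noisy vertical track as a stand-in for extracted keypoint positions.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 60)
y_true = 2.0 + 3.0 * t - 0.5 * 9.81 * t**2
y_obs = y_true + rng.normal(scale=0.01, size=t.shape)
print(fit_ballistic(t, y_obs))   # approximately (2.0, 3.0, 9.81)
```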
We show that V-SysId enables a number of interesting applications, such as vision-based robot end-effector localisation and remote breath-rate measurement. Finally, we move into the area of structured recurrent variational models from vision, where we are motivated by the following observation: in existing models, applying a force in the direction from a start point to an end point (in latent space) does not result in a movement from the start point towards the end point, even in the simplest unconstrained environments. This means that the latent space learned by these models does not follow Newton’s law, where the acceleration vector has the same direction as the force vector (in point-mass systems), and it prevents the use of PID controllers, which are the simplest and most well understood type of controller. We solve this problem by building inductive biases from Newtonian physics into the latent variable model, which we call NewtonianVAE. Crucially, Newtonian correctness in the latent space brings about the ability to perform proportional (or PID) control, as opposed to the more computationally expensive model predictive control (MPC). PID controllers are ubiquitous in industrial applications, but had thus far lacked integration with unsupervised vision models. We show that the NewtonianVAE learns physically correct latent spaces in simulated 2D and 3D control systems, which can be used to perform goal-based discovery and control in imitation learning, and path following via Dynamic Motion Primitives.
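To make the control claim concrete: if the latent state behaves like a Newtonian point mass, a simple proportional-derivative law on the latent error is enough to drive the state to a latent goal, with no model-predictive rollout. The sketch below illustrates only that argument on toy dynamics; the gains and dimensions are assumptions, and it is not the NewtonianVAE itself.

```python
# Hedged sketch: PD control in a latent space that obeys point-mass dynamics (a = u).
import numpy as np

def pd_control_to_goal(x_goal, x0, v0, kp=4.0, kd=3.0, dt=0.01, steps=1000):
    x, v = np.array(x0, float), np.array(v0, float)
    for _ in range(steps):
        u = kp * (x_goal - x) - kd * v   # force computed from the latent-space error
        v += u * dt                      # Newtonian point-mass latent dynamics
        x += v * dt
    return x

print(pd_control_to_goal(x_goal=np.array([1.0, -0.5]), x0=[0.0, 0.0], v0=[0.0, 0.0]))
# Converges near [1.0, -0.5] precisely because acceleration follows the applied force.
```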

    An investigation of factual and counterfactual feedback information in early visual cortex

    Primary visual cortex receives approximately 90% of the output from the retina; however, this accounts for only around 5% of the input to V1 (Muckli, 2010). The majority of the input to V1 in fact comes from other cortical and sub-cortical parts of the brain, arriving via lateral and feedback pathways. It is therefore critical to our knowledge of visual perception to understand how these feedback responses influence visual processing. The aim of this thesis is to investigate different sources of non-visual feedback to early visual cortex. To do this we use a combination of an occlusion paradigm, derived from F. W. Smith and Muckli (2010), and functional magnetic resonance imaging. Occlusion offers us a method to inhibit the feedforward flow of information from a specific part of the visual field. By inhibiting the feedforward information we exploit the highly precise retinotopic organisation of visual cortex, rendering a corresponding patch of cortex free of feedforward input. From this isolated patch of cortex we can ask questions about the information content of purely feedback information. In Chapter 3 we investigated whether or not information about valence was present in non-stimulated early visual cortex. We constructed a 900-image set that contained an equal number of images of neutral, positive and negative valence across animal, food and plant categories. We used an m-sequence design to allow us to present the image set within a standard period of time for fMRI. We were concerned about low-level image properties being a potential confound, and a large image set allowed us to average out these low-level properties. We occluded the lower-right quadrant of each image and presented each image only once to our subjects. The image set was rated for valence and arousal after fMRI so that individual subjectivity could be accounted for. We used multivariate pattern analysis (MVPA) to decode pairs of neutral, positive and negative valence. We found that in both stimulated and non-stimulated V1, V2 and V3, and in the amygdala and pulvinar, only information about negative valence could be decoded. In a second analysis we again used MVPA to cross-decode between pairs of valence and category. By training the classifier on pairs of valences that each contained two categories, we could ask whether the classifier generalises to the left-out category for the same pair of valences. We found that valence does generalise across category in both stimulated and non-stimulated cortex, and in the amygdala and pulvinar. These results demonstrate that information about valence, particularly negative valence, is represented in low-level visual areas and is generalisable across animal, food and plant categories. In Chapter 4 we explored the retinotopic organisation of object and scene sound responses in non-stimulated early visual cortex. We embedded a repeating object sound (axe chopping or motor starting) into a scene sound (blizzard wind or forest) and used MVPA to read out object or scene information from non-stimulated early visual cortex. We found that object sounds were decodable in the fovea and scene sounds were decodable in the periphery. This finding demonstrates that auditory feedback to visual cortex has an eccentricity bias corresponding to the functional role involved.
We suggest that object information feeds back to the fovea for fine-scaled discrimination, whereas abstract information feeds back to the periphery to provide a modulatory contextual template for vision. In a second experiment in Chapter 4 we further explored the similarity between categorical representations of sound and video stimuli in non-stimulated early visual cortex. We use video stimuli and separate the audio and visual parts into unimodal stimuli. We occlude the bottom-right quadrant of the videos and use MVPA to cross-decode between sounds and videos (and vice versa) from responses in occluded cortex. We find that a classifier trained on one modality can decode the other in occluded cortex. This finding tells us that there is an overlap in the neural representation of aural and visual stimuli in early visual cortex. In Chapter 5 we probe the internal thought processes of subjects after occluding a short video sequence. We use a priming sequence to generate predictions as subjects are asked to imagine how events from a video unfold during occlusion. We then probe these predictions with a series of test frames corresponding to different points in time: close in time to the offset of the video, just before the video would be expected to reappear, the matching frame from when the video would be expected to reappear, or a frame from the very distant future. In an adaptation paradigm we find that predictions best match the test frames around the point in time at which subjects expect the video to reappear. The test frame from a point close in time to the offset of the video was rarely a match. This tells us that the predictions that subjects make are not related to the offset of the priming sequence but represent a future state of the world that they have not seen. In a second control experiment we show that these predictions are absent when the priming sequence is randomised, and that predictions take between 600 ms and 1200 ms to fully develop. These findings demonstrate the dynamic flexibility of internal models, that information about these predictions can be read out in early visual cortex, and that stronger representations form if given additional time. In Chapter 6 we again probe internal dynamic predictions using a virtual navigation paradigm. We use virtual reality to train subjects in a new environment where they can build strong representations of four categorical rooms (kitchen, bedroom, office and game room). Later, in fMRI, we provide subjects with a direction cue and a starting room and ask them to predict the upcoming room by combining the two pieces of information. The starting room is shown as a short video clip with the bottom-right quadrant occluded. During the video sequence of the starting room, we find that we can read out information about the future room from non-stimulated early visual cortex. In a second control experiment, when we remove the direction cue, information about the future room can no longer be decoded. This finding demonstrates that dynamic predictions about the immediate future are present in early visual cortex during simultaneous visual stimulation and that we can read out these predictions with 3T fMRI. These findings increase our knowledge about the types of non-visual information available to early visual cortical areas and provide insight into the influence they have on vision.
These results lend support to the idea that early visual areas may act as a blackboard for read and write operations used in communication around the brain (Muckli et al., 2015; Mumford, 1991; Murray et al., 2016; Roelfsema & de Lange, 2016; Williams et al., 2008). Current models of predictive coding will need to be updated to account for the brain's ability to switch between two different processing streams: one that is factual and related to an external stimulus, and one that is stimulus-independent and internal.
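A minimal sketch of the cross-decoding logic used throughout these analyses (train a linear classifier on multivoxel patterns from one modality or category pairing, then test it on the other), assuming synthetic data and a linear SVM; none of this is the thesis's actual preprocessing or classifier pipeline.

```python
# Hedged cross-decoding sketch: above-chance transfer of a classifier across
# modalities suggests a shared representation. Data are synthetic placeholders.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_trials, n_voxels = 80, 200
labels = rng.integers(0, 2, size=n_trials)                  # two stimulus categories
signal = np.outer(labels - 0.5, rng.normal(size=n_voxels))  # shared category pattern

X_audio = signal + rng.normal(scale=2.0, size=(n_trials, n_voxels))
X_video = signal + rng.normal(scale=2.0, size=(n_trials, n_voxels))

clf = LinearSVC(C=0.1).fit(X_audio, labels)                 # train on one modality
print("cross-modal accuracy:", clf.score(X_video, labels))  # test on the other
```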
