19 research outputs found

    A Binocular, Foveated Active Vision System

    Get PDF
    This report documents the design and implementation of a binocular, foveated active vision system as part of the Cog project at the MIT Artificial Intelligence Laboratory. The active vision system features a three degree of freedom mechanical platform that supports four color cameras, a motion control system, and a parallel network of digital signal processors for image processing. To demonstrate the capabilities of the system, we present results from four sample visual-motor tasks

    Perception-driven approaches to real-time remote immersive visualization

    Get PDF
    In remote immersive visualization systems, real-time 3D perception through RGB-D cameras, combined with modern Virtual Reality (VR) interfaces, enhances the user’s sense of presence in a remote scene through 3D reconstruction, particularly when there is a need to visualize, explore, and perform tasks in environments that are inaccessible, hazardous, or distant. However, such a system requires that the entire pipeline from 3D data acquisition to VR rendering satisfies demands on speed, throughput, and visual realism. Especially when using point clouds, network latency and throughput limitations create a fundamental quality gap between the acquired data of the physical world and the displayed data, which degrades the sense of presence and provokes cybersickness. This thesis addresses these problems by taking the human visual system as inspiration, from sensor data acquisition to VR rendering. Human vision is not uniform across the field of view: acuity is sharpest at the center and falls off towards the periphery, where lower-resolution vision guides eye movements so that central vision visits the important parts of the scene. As a first contribution, the thesis develops remote visualization strategies that exploit this acuity fall-off in the processing, transmission, buffering, and VR rendering of 3D reconstructed scenes, reducing throughput requirements and latency. As a second contribution, the thesis investigates attentional mechanisms for selecting and drawing user engagement to specific information in the dynamic spatio-temporal environment. It proposes a strategy that analyzes the remote scene in terms of its 3D structure, its layout, and the spatial, functional, and semantic relationships between objects, using models of human visual perception; a larger share of computational resources is allocated to objects of interest, yielding a more realistic visualization. As a supplementary contribution, a new volumetric point-cloud density-based Peak Signal-to-Noise Ratio (PSNR) metric is proposed to evaluate the introduced techniques. An in-depth evaluation of the presented systems, a comparative examination of the proposed point cloud metric, user studies, and experiments demonstrate that the methods introduced in this thesis are visually superior while significantly reducing latency and throughput
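
    A minimal illustrative sketch (not the thesis implementation) of the first contribution's core idea: using the acuity fall-off to thin a point cloud around the current gaze direction before transmission and rendering. The function name, the exponential fall-off curve, and all parameters are assumptions for illustration only.

    ```python
    import numpy as np

    def foveated_downsample(points, gaze_dir, keep_center=1.0, keep_periphery=0.05,
                            falloff_deg=30.0, rng=None):
        """Keep each point with a probability that falls off with its angular
        eccentricity from the gaze direction (hypothetical acuity model)."""
        rng = np.random.default_rng() if rng is None else rng
        dirs = points / np.linalg.norm(points, axis=1, keepdims=True)
        ecc_deg = np.degrees(np.arccos(np.clip(dirs @ gaze_dir, -1.0, 1.0)))
        keep_prob = keep_periphery + (keep_center - keep_periphery) * np.exp(-ecc_deg / falloff_deg)
        return points[rng.random(len(points)) < keep_prob]

    # Toy usage: a random cloud in front of the viewer, gaze along +Z.
    cloud = np.random.uniform(-1, 1, size=(100_000, 3)) + np.array([0.0, 0.0, 3.0])
    reduced = foveated_downsample(cloud, np.array([0.0, 0.0, 1.0]))
    print(len(cloud), "->", len(reduced), "points kept")
    ```

    Peripheral points are mostly dropped while foveal points are kept, which is one way the described throughput and latency reductions could be realised.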

    Memory-Based Active Visual Search for Humanoid Robots

    Get PDF

    Gaze control for visually guided manipulation

    Get PDF
    Human studies have shown that gaze shifts are mostly driven by the task. One explanation is that fixations gather information about task-relevant properties, where task relevance is signalled by reward. This thesis primarily pursues an engineering science goal: to determine what mechanisms a rational decision maker could employ to select a gaze location optimally, or near optimally, given limited information and limited computation time. To do so we formulate and characterise three computational models of gaze shifting (implemented on a simulated humanoid robot), which use lookahead to imagine the informational effects of possible gaze fixations. Our first model selects the gaze that most reduces uncertainty in the scene (Unc), the second maximises expected rewards by reducing uncertainty (Rew+Unc), and the third maximises the expected gain in cumulative reward by reducing uncertainty (Rew+Unc+Gain). We also show how a visual search process can be integrated into the Rew+Unc+Gain gaze scheme. Our secondary goal concerns the way in which humans might select the next gaze location. We compare the hand-eye coordination timings of our models to previously published human data, and we provide evidence that only the models that incorporate both uncertainty and reward (Rew+Unc and Rew+Unc+Gain) match the human data
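
    As a rough illustration of the uncertainty-driven (Unc-style) scheme described above, the sketch below scores candidate fixations by the expected reduction in entropy of a discrete scene belief, imagining the observations each fixation could return. The data structures and names are hypothetical; the reward-weighted variants (Rew+Unc, Rew+Unc+Gain) would additionally weight the lookahead by task reward.

    ```python
    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())

    def expected_posterior_entropy(belief, likelihood):
        """belief: (S,) prior over scene states; likelihood: (O, S) p(obs | state)."""
        exp_h = 0.0
        for obs_lik in likelihood:              # imagine each possible observation
            joint = obs_lik * belief
            p_obs = joint.sum()
            if p_obs > 0:
                exp_h += p_obs * entropy(joint / p_obs)
        return exp_h

    def select_gaze(belief, fixation_likelihoods):
        """Pick the fixation whose imagined observation most reduces uncertainty."""
        gains = [entropy(belief) - expected_posterior_entropy(belief, lik)
                 for lik in fixation_likelihoods]
        return int(np.argmax(gains)), gains

    # Toy usage: two candidate fixations over a 3-state belief; the second is more informative.
    belief = np.array([0.5, 0.3, 0.2])
    fix_a = np.array([[0.6, 0.5, 0.4], [0.4, 0.5, 0.6]])
    fix_b = np.array([[0.9, 0.1, 0.1], [0.1, 0.9, 0.9]])
    best, gains = select_gaze(belief, [fix_a, fix_b])
    print("chosen fixation:", best, "information gains:", gains)
    ```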

    Active vision for sociable robots

    Full text link

    Computational principles for an autonomous active vision system

    Full text link
    Vision research has uncovered computational principles that generalize across species and brain areas. However, these biological mechanisms are not frequently implemented in computer vision algorithms. In this thesis, models suitable for application in computer vision were developed to address the benefits of two biologically-inspired computational principles: multi-scale sampling and active, space-variant vision. The first model investigated the role of multi-scale sampling in motion integration. It is known that receptive fields of different spatial and temporal scales exist in the visual cortex; however, models addressing how this basic principle is exploited by biological vision are sparse and do not adequately explain the data. The developed model showed that the solution to a classical problem in motion integration, the aperture problem, can be reframed as an emergent property of multi-scale sampling facilitated by fast, parallel, bi-directional connections at different spatial resolutions. Humans and most other mammals actively move their eyes to sample a scene (active vision); moreover, the resolution of detail in this sampling is not uniform across spatial locations (space-variant vision). It is known that these eye movements are not simply guided by image saliency, but are also influenced by factors such as spatial attention, scene layout, and task relevance. However, it is seldom asked how previous eye movements shape how a continuously-learning system learns and recognizes an object. To explore this question, a model (CogEye) was developed that integrates active, space-variant sampling with eye-movement selection (the where visual stream) and object recognition (the what visual stream). The model hypothesizes that a signal from the recognition system helps the where stream select fixation locations that best disambiguate object identity between competing alternatives. The third study used eye tracking coupled with an object disambiguation psychophysics experiment to validate the second model, CogEye. Although humans outperformed the model in recognition accuracy, the model's eye movement patterns were more similar to human patterns when it used information from the recognition pathway to select future fixations than when it relied on image saliency alone. Taken together, these results show that computational principles in the mammalian visual system can be used to improve computer vision models
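
    The space-variant sampling mentioned above is often approximated with a log-polar grid, where resolution is highest at the fixation and coarsest in the periphery. The sketch below is a minimal, assumed implementation of that general idea (it is not CogEye itself); grid sizes and names are illustrative.

    ```python
    import numpy as np

    def log_polar_sample(image, fixation, n_rings=32, n_wedges=64, r_min=1.0):
        """Sample a grayscale image on a log-polar grid centred on the fixation;
        ring radii grow exponentially, so detail falls off with eccentricity."""
        h, w = image.shape
        cy, cx = fixation
        r_max = np.hypot(max(cy, h - cy), max(cx, w - cx))
        radii = r_min * (r_max / r_min) ** (np.arange(n_rings) / (n_rings - 1))
        thetas = np.linspace(0.0, 2.0 * np.pi, n_wedges, endpoint=False)
        rr, tt = np.meshgrid(radii, thetas, indexing="ij")
        ys = np.clip(np.round(cy + rr * np.sin(tt)).astype(int), 0, h - 1)
        xs = np.clip(np.round(cx + rr * np.cos(tt)).astype(int), 0, w - 1)
        return image[ys, xs]                    # (n_rings, n_wedges) "retinal" image

    # Toy usage: fixate the centre of a synthetic 256x256 gradient image.
    img = np.linspace(0.0, 1.0, 256 * 256).reshape(256, 256)
    retina = log_polar_sample(img, fixation=(128, 128))
    print(retina.shape)                         # (32, 64)
    ```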

    A hierarchical active binocular robot vision architecture for scene exploration and object appearance learning

    Get PDF
    This thesis presents an investigation of a computational model of hierarchical visual behaviours within an active binocular robot vision architecture. The robot vision system is able to localise multiple instances of the same object class, while simultaneously maintaining vergence and directing its gaze to attend to and recognise objects within cluttered, complex scenes. This is achieved by implementing all image analysis in an egocentric symbolic space without creating explicit pixel-space maps and without the need for calibration or other knowledge of the camera geometry. An important aspect of the active binocular vision paradigm is that visual features in both camera eyes must be bound together in order to drive visual search to saccade to, locate, and recognise putative objects or salient locations in the robot's field of view. The system structure is based on the “attentional spotlight” metaphor of biological systems and a collection of abstract and reactive visual behaviours arranged in a hierarchical structure. Several studies have shown that the human brain represents and learns objects for recognition through snapshots of 2-dimensional views of the imaged scene that happen to contain the object of interest during active interaction with (exploration of) the environment. Likewise, psychophysical findings indicate that the primate visual cortex represents common everyday objects by a hierarchical structure of their parts or sub-features and, consequently, recognises them from simple but imperfect 2D view approximations of object parts. This thesis incorporates the above observations into an active visual learning behaviour in the hierarchical active binocular robot vision architecture. By actively exploring the object viewing sphere (as higher mammals do), the robot vision system automatically synthesises its own part-based object representation from multiple observations while a human teacher indicates the object and supplies a classification name. It is proposed to adopt the computational concepts of a visual learning exploration mechanism that controls the accumulation of visual evidence and directs attention towards spatially salient object parts. The behavioural structure of the binocular robot vision architecture is loosely modelled on the WHAT and WHERE visual streams. The WHERE stream maintains and binds spatial attention on the object part coordinates that egocentrically characterise the location of the object of interest, and extracts spatio-temporal properties of feature coordinates and descriptors. The WHAT stream either determines the identity of an object or triggers a learning behaviour that stores view-invariant feature descriptions of the object parts. The robot vision system is therefore capable of performing a collection of specific visual tasks such as vergence, detection, discrimination, recognition, localisation, and multiple same-instance identification. This repertoire of tasks enables the robot vision system to execute and fulfil specified high-level tasks, e.g. autonomous scene exploration and active object appearance learning
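
    To make the WHAT/WHERE division concrete, the following toy sketch caricatures the recognise-or-learn decision described above: attended part descriptors are either matched against a part-based memory or stored under a teacher-supplied label. All classes, thresholds, and features are hypothetical, and vergence, saccade control, and the egocentric symbolic space are elided.

    ```python
    from dataclasses import dataclass, field

    @dataclass
    class PartBasedMemory:
        """WHAT-stream store: sets of part descriptors per object label (illustrative)."""
        objects: dict = field(default_factory=dict)

        def identify(self, descriptors, min_matches=3):
            for label, parts in self.objects.items():
                if len(parts & descriptors) >= min_matches:
                    return label
            return None

        def learn(self, label, descriptors):
            self.objects.setdefault(label, set()).update(descriptors)

    def explore_scene(attended_regions, memory, teacher_label=None):
        """Toy WHERE/WHAT loop: attend each region, then recognise or learn its parts."""
        report = []
        for region, descriptors in attended_regions:
            label = memory.identify(descriptors)            # WHAT: recognition attempt
            if label is None and teacher_label is not None:
                memory.learn(teacher_label, descriptors)    # WHAT: learning behaviour
                label = teacher_label + " (learned)"
            report.append((region, label))
        return report

    # Toy usage: learn a "mug" from one view, then recognise it at another location.
    memory = PartBasedMemory()
    print(explore_scene([((10, 20), {"handle", "rim", "body", "base"})], memory, "mug"))
    print(explore_scene([((80, 40), {"rim", "handle", "body", "logo"})], memory))
    ```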

    TOWARDS A COMPUTATIONAL MODEL OF RETINAL STRUCTURE AND BEHAVIOR

    Get PDF
    Human vision is our most important sensory system, allowing us to perceive our surroundings. It is an extremely complex process that starts with light entering the eye and ends inside the brain, and many of its mechanisms remain to be explained. When we observe a scene, the optics of the eye focus an image on the retina, where light signals are processed and sent on to the visual cortex, enabling our visual sensation. Progress in retinal research, especially on the topography of photoreceptors, is often tied to progress in retinal imaging systems. The latest adaptive optics techniques have been essential for studying the photoreceptors and their spatial characteristics, leading to discoveries that challenge existing theories of color sensation. The organization of the retina is associated with various perceptual phenomena. Some are straightforward and strictly related to visual performance, such as visual acuity or contrast sensitivity; others are harder to analyze and test and can be related to the submosaics of the three classes of cone photoreceptors, for example the fact that large interpersonal differences in the ratio of cone classes result in negligible differences in color sensation, suggesting compensation mechanisms at some stage of the visual system. This dissertation discusses and addresses issues regarding the spatial organization of the photoreceptors in the human retina. A computational model has been developed, organized into a modular pipeline of extensible methods, each simulating a different stage of visual processing. It creates a model of the spatial distribution of cones in a retina, then applies descriptive statistics so that each photoreceptor contributes to a graphical representation, based on a behavioral model that determines photoreceptor absorptions; the resulting apparent color stimuli are reconstructed into a representation of the observed scene. The model allows testing of different parameters regulating photoreceptor topography, in order to formulate hypotheses on the perceptual differences arising from variations in spatial organization
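
    As a small illustration of the kind of pipeline stage the dissertation describes, the sketch below places cones with a density that falls off from the fovea, assigns them to L/M/S submosaics in a chosen ratio, and computes a trivial per-cone absorption. The radial warp, the default L:M:S ratio, and the absorption rule are illustrative assumptions, not the dissertation's fitted parameters.

    ```python
    import numpy as np

    def make_cone_mosaic(n_cones=5000, retina_radius=1.0, lms_ratio=(0.6, 0.3, 0.1), rng=None):
        """Place cones more densely near the centre and assign L/M/S classes."""
        rng = np.random.default_rng() if rng is None else rng
        r = retina_radius * rng.random(n_cones) ** 2        # bias radii toward the fovea
        theta = rng.uniform(0.0, 2.0 * np.pi, n_cones)
        xy = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
        classes = rng.choice(list("LMS"), size=n_cones, p=lms_ratio)
        return xy, classes

    def cone_absorptions(stimulus_lms, classes):
        """Each cone reports the stimulus component of its own class (toy behavioural model)."""
        idx = np.array(["LMS".index(c) for c in classes])
        return stimulus_lms[idx]

    # Toy usage: a uniform stimulus sampled by a mosaic with a 6:3:1 L:M:S ratio.
    xy, classes = make_cone_mosaic()
    absorptions = cone_absorptions(np.array([0.8, 0.6, 0.2]), classes)
    print(xy.shape, {c: int((classes == c).sum()) for c in "LMS"})
    ```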

    Conference on Intelligent Robotics in Field, Factory, Service, and Space (CIRFFSS 1994), volume 1

    Get PDF
    The AIAA/NASA Conference on Intelligent Robotics in Field, Factory, Service, and Space (CIRFFSS '94) was originally proposed because of the strong belief that America's problems of global economic competitiveness and job creation and preservation can partly be solved by the use of intelligent robotics, which are also required for human space exploration missions. Individual sessions addressed nuclear industry, agile manufacturing, security/building monitoring, on-orbit applications, vision and sensing technologies, situated control and low-level control, robotic systems architecture, environmental restoration and waste management, robotic remanufacturing, and healthcare applications