105 research outputs found
Modeling the Contribution of Central Versus Peripheral Vision in Scene, Object, and Face Recognition
It is commonly believed that the central visual field is important for
recognizing objects and faces, and the peripheral region is useful for scene
recognition. However, the relative importance of central versus peripheral
information for object, scene, and face recognition is unclear. In a behavioral
study, Larson and Loschky (2009) investigated this question by measuring
scene recognition accuracy as a function of visual angle, and demonstrated that
peripheral vision was indeed more useful in recognizing scenes than central
vision. In this work, we modeled and replicated the result of Larson and
Loschky (2009), using deep convolutional neural networks. Having fit the data
for scenes, we used the model to predict future data for large-scale scene
recognition as well as for objects and faces. Our results suggest that the
relative order of importance of using central visual field information is face
recognition > object recognition > scene recognition, and vice versa for
peripheral information.
Comment: CogSci 2016 Conference Paper
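The window/scotoma manipulation behind this comparison can be simulated with a simple radial mask over model inputs. A minimal NumPy sketch (the 224-pixel image size, grey fill value, and 56-pixel radius are our illustrative assumptions, not values from the paper):

```python
import numpy as np

def radial_mask(h, w, radius_px, keep_center=True):
    """Boolean mask over an image: True inside a disc of `radius_px`
    around the centre ("window"), or outside it ("scotoma")."""
    ys, xs = np.ogrid[:h, :w]
    dist = np.sqrt((ys - h / 2) ** 2 + (xs - w / 2) ** 2)
    inside = dist <= radius_px
    return inside if keep_center else ~inside

# Central-vision condition: keep a central window, grey out the periphery;
# peripheral-vision condition is the complement (a central scotoma).
img = np.random.rand(224, 224, 3)           # stand-in for a scene photo
mask = radial_mask(224, 224, radius_px=56)  # radius chosen for illustration only
central_only = np.where(mask[..., None], img, 0.5)     # grey fill outside the window
peripheral_only = np.where(mask[..., None], 0.5, img)  # grey fill inside the scotoma
```

Feeding both conditions at varying radii to a trained CNN and plotting accuracy against radius mirrors the visual-angle curves being modeled.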
Foveated image processing for faster object detection and recognition in embedded systems using deep convolutional neural networks
Object detection and recognition algorithms using deep convolutional neural networks (CNNs) tend to be computationally intensive. This presents a particular challenge for embedded systems, such as mobile robots, where computational resources are far scarcer than on workstations. As an alternative to standard, uniformly sampled images, we propose foveated image sampling to reduce the size of images, which are then faster to process in a CNN due to the reduced number of convolution operations. We evaluate object detection and recognition on the Microsoft COCO database, using foveated image sampling at image sizes ranging from 416×416 to 96×96 pixels, on an embedded GPU – an NVIDIA Jetson TX2 with 256 CUDA cores. The results show that it is possible to achieve a 4× speed-up in frame rate, from 3.59 FPS to 15.24 FPS, using 416×416 and 128×128 pixel images respectively. With foveated sampling, this image size reduction led to only a small decrease in recall in the foveal region, to 92.0% of the baseline performance with full-sized images, compared with a drop to 50.1% of baseline recall for uniformly sampled images, demonstrating the advantage of foveated sampling.
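The core idea — spending pixels densely in the fovea and sparsely in the periphery so the CNN sees a smaller input — can be sketched as a two-resolution resample. This is an illustrative simplification, not the paper's exact sampling scheme; `foveated_sample` and its parameters are our own names:

```python
import numpy as np

def foveated_sample(img, out_size=128, fovea_frac=0.4):
    """Two-resolution foveated resample (sketch, not the paper's exact scheme):
    the whole frame is coarsely subsampled to `out_size`, then a central
    crop (the 'fovea') is re-inserted at its native pixel density."""
    h, w, _ = img.shape
    # coarse periphery: nearest-neighbour subsampling by strided indexing
    ys = np.arange(out_size) * h // out_size
    xs = np.arange(out_size) * w // out_size
    coarse = img[np.ix_(ys, xs)]
    # fovea: central crop of side fovea_frac * out_size at full resolution,
    # pasted over the centre of the coarse image
    f = int(out_size * fovea_frac)
    cy, cx = h // 2, w // 2
    fovea = img[cy - f // 2: cy - f // 2 + f, cx - f // 2: cx - f // 2 + f]
    out = coarse.copy()
    o = (out_size - f) // 2
    out[o:o + f, o:o + f] = fovea
    return out

frame = np.random.rand(416, 416, 3)
small = foveated_sample(frame)   # 128x128 input for the detector
# conv cost scales roughly with pixel count: (416/128)^2 ≈ 10.6x fewer
# multiply-accumulates per layer, which is where the frame-rate gain comes from
```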
Cross-Resolution Flow Propagation for Foveated Video Super-Resolution
The demand for high-resolution video content has grown over the years.
However, the delivery of high-resolution video is constrained by either
computational resources required for rendering or network bandwidth for remote
transmission. To remedy this limitation, we leverage the eye trackers found
alongside existing augmented and virtual reality headsets. We propose the
application of video super-resolution (VSR) technique to fuse low-resolution
context with regional high-resolution context for resource-constrained
consumption of high-resolution content without perceivable drop in quality. Eye
trackers provide us the gaze direction of a user, aiding us in the extraction
of the regional high-resolution context. As only pixels that fall within the
gaze region can be resolved by the human eye, a large amount of the delivered
content is redundant, since we cannot perceive quality differences in regions
beyond the observed region. To generate a visually pleasing frame from
the fusion of the high-resolution and low-resolution regions, we study the
capability of a deep neural network to transfer the context of the observed
region to other (low-resolution) regions of the current and future frames. We
label this task Foveated Video Super-Resolution (FVSR), as we need to
super-resolve the low-resolution regions of current and future frames through
the fusion of pixels from the gaze region. We propose Cross-Resolution Flow
Propagation (CRFP) for FVSR. We train and evaluate CRFP on the REDS dataset on
the task of 8x FVSR, i.e. a combination of 8x VSR and fusion of the foveated
region. Departing from the conventional evaluation of per-frame quality using
SSIM or PSNR, we propose the evaluation of past foveated region, measuring the
capability of a model to leverage the noise present in eye trackers during
FVSR. Code is made available at https://github.com/eugenelet/CRFP.
Comment: 12 pages, 8 figures, to appear in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 202
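The input an FVSR model refines — a low-resolution frame fused with high-resolution pixels around the gaze point — can be mocked up in a few lines. This is an illustrative stand-in for CRFP's input construction; the function name and parameters are ours, not the paper's:

```python
import numpy as np

def fuse_foveated(lr, hr_patch, gaze_xy, scale=8):
    """Naive fusion of a low-resolution frame with a high-resolution gaze
    patch (illustrative stand-in, not CRFP itself). `lr` is upsampled by
    nearest-neighbour, then the HR patch is written back around the gaze
    point, mimicking the input an FVSR model would learn to refine."""
    H, W = lr.shape[0] * scale, lr.shape[1] * scale
    up = np.repeat(np.repeat(lr, scale, axis=0), scale, axis=1)  # 8x nearest upsampling
    ph, pw = hr_patch.shape[:2]
    gy, gx = gaze_xy
    # clamp the patch so it stays inside the frame even for edge gazes
    y0 = np.clip(gy - ph // 2, 0, H - ph)
    x0 = np.clip(gx - pw // 2, 0, W - pw)
    up[y0:y0 + ph, x0:x0 + pw] = hr_patch
    return up

lr = np.random.rand(32, 32, 3)    # 8x-downsampled frame from the network stream
hr = np.random.rand(64, 64, 3)    # high-res pixels delivered around the gaze
fused = fuse_foveated(lr, hr, gaze_xy=(128, 128))
```

A learned model then propagates detail from the gaze region outward and across frames instead of leaving this hard seam.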
Learning Foveated Reconstruction to Preserve Perceived Image Statistics
Foveated image reconstruction recovers a full image from a sparse set of samples distributed according to the human visual system's retinal sensitivity, which drops rapidly with eccentricity. The use of Generative Adversarial Networks was recently shown to be a promising solution for this task, as they can successfully hallucinate missing image information. As with other supervised learning approaches, the definition of the loss function and the training strategy heavily influence the output quality. In this work, we ask how to efficiently guide the training of foveated reconstruction techniques such that they are fully aware of the human visual system's capabilities and limitations, and therefore reconstruct visually important image features. Given the nature of GAN-based solutions, we concentrate on human sensitivity to hallucination at different input sample densities. We present new psychophysical experiments, a dataset, and a procedure for training foveated image reconstruction. The strategy gives the generator network flexibility by penalizing only perceptually important deviations in the output. As a result, the method aims to preserve perceived image statistics rather than natural image statistics. We evaluate our strategy and compare it to alternative solutions using a newly trained objective metric and user experiments.
Learning GAN-based Foveated Reconstruction to Recover Perceptually Important Image Features
A foveated image can be entirely reconstructed from a sparse set of samples distributed according to the retinal sensitivity of the human visual system, which decreases rapidly with increasing eccentricity. The use of Generative Adversarial Networks has recently been shown to be a promising solution for such a task, as they can successfully hallucinate missing image information. As in the case of other supervised learning approaches, the definition of the loss function and the training strategy heavily influence the quality of the output. In this work, we consider the problem of efficiently guiding the training of foveated reconstruction techniques such that they are more aware of the capabilities and limitations of the human visual system, and thus can reconstruct visually important image features. Our primary goal is to make the training procedure less sensitive to distortions that humans cannot detect and to focus on penalizing perceptually important artifacts. Given the nature of GAN-based solutions, we focus on the sensitivity of human vision to hallucination for input samples of different densities. We propose psychophysical experiments, a dataset, and a procedure for training foveated image reconstruction. The proposed strategy renders the generator network flexible by penalizing only perceptually important deviations in the output. As a result, the method emphasizes the recovery of perceptually important image features. We evaluated our strategy and compared it with alternative solutions using a newly trained objective metric, a recent foveated video quality metric, and user experiments. Our evaluations revealed significant improvements in perceived image reconstruction quality compared with the standard GAN-based training approach.
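The eccentricity-dependent sample density that both works start from can be sketched as a random keep-mask whose probability decays with distance from the gaze point. The 1/(1 + slope·r) falloff below is our illustrative choice, not the papers' calibrated sensitivity model:

```python
import numpy as np

def sampling_mask(h, w, gaze, p0=1.0, slope=0.01, rng=None):
    """Random sample mask whose density falls off with eccentricity.
    The falloff p0 / (1 + slope * r) only loosely mimics retinal
    sensitivity; it is an illustrative choice, not a calibrated model."""
    rng = rng or np.random.default_rng(0)
    ys, xs = np.ogrid[:h, :w]
    r = np.sqrt((ys - gaze[0]) ** 2 + (xs - gaze[1]) ** 2)  # eccentricity in pixels
    density = p0 / (1.0 + slope * r)    # per-pixel probability of keeping a sample
    return rng.random((h, w)) < density

mask = sampling_mask(256, 256, gaze=(128, 128))
# a reconstruction network is shown img * mask and must hallucinate the rest;
# the training question above is which hallucinations deserve a penalty
```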
Biologically-inspired hierarchical architectures for object recognition
PhD Thesis. Existing methods for machine vision translate three-dimensional
objects in the real world into two-dimensional images. These methods
have achieved acceptable performance in recognising objects. However,
recognition performance drops dramatically when objects undergo
transformations, for instance of background, orientation, position in the
image, or scale. The human visual cortex has evolved to form an efficient
invariant representation of objects within a scene. The superior
performance of humans can be explained by the feed-forward multi-layer
hierarchical structure of the human visual cortex, in addition to the
utilisation of different fields of vision depending on the recognition task.
Therefore, the research community has investigated building systems that
mimic the hierarchical architecture of the human visual cortex as an
ultimate objective.
The aim of this thesis can be summarised as developing hierarchical
models of visual processing that tackle the remaining challenges of
object recognition. To enhance the existing models of object recognition
and to overcome the above-mentioned issues, three major contributions
are made, which can be summarised as follows:
1. building a hierarchical model within an abstract architecture that
achieves good performance on challenging image object datasets;
2. investigating the contribution of each region of vision for object
and scene images, in order to increase recognition performance
and decrease the size of the processed data;
3. further enhancing the performance of existing models of object
recognition by introducing hierarchical topologies that utilise the
context in which an object is found to determine its identity.
Sponsor: Higher Committee For Education Development in Iraq (HCED)