Appropriate kernels for Divisive Normalization explained by Wilson-Cowan equations
Cascades of standard Linear+NonLinear-Divisive Normalization transforms [Carandini&Heeger12] can be easily fitted, using the formulation introduced in [Martinez17a], to reproduce the perception of image distortion in naturalistic environments. However, consistent with [Rust&Movshon05], training the model on naturalistic stimuli does not guarantee the prediction of well-known phenomena illustrated by artificial stimuli. For example, the cascade of Divisive Normalizations fitted with image quality databases has to be modified to account for a variety of masking effects in simple patterns. Specifically, the standard Gaussian kernels of [Watson&Solomon97] have to be augmented with extra weights [Martinez17b]. These can be introduced ad hoc, guided by the empirical failures of the original model, but a more principled justification for this modification would be desirable. In this work we give a theoretical justification of this empirical modification of the Watson&Solomon kernel based on the Wilson-Cowan [WilsonCowan73] model of cortical interactions. Specifically, we show that the analytical relation proposed here between the Divisive Normalization model and the Wilson-Cowan model leads to the kind of extra factors that have to be included, and to their qualitative dependence on frequency.
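To fix ideas, a minimal numpy sketch of the canonical divisive normalization stage of [Carandini&Heeger12] follows. The kernel `H`, the semisaturation constant `b`, and the exponent `g` are illustrative placeholders, not the fitted parameters or the augmented kernel discussed in the abstract.

```python
import numpy as np

def divisive_normalization(x, H, b=0.1, g=2.0):
    """Canonical divisive normalization stage.

    x : responses of the preceding linear stage
    H : interaction kernel (e.g. Gaussian weights over neighbouring
        frequencies/positions); the abstract argues this kernel needs
        extra, frequency-dependent factors
    b, g : semisaturation constant and exponent (illustrative values)
    """
    e = np.abs(x) ** g
    # Each response is divided by a weighted pool of its neighbours.
    return np.sign(x) * e / (b + H @ e)
```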
Derivatives and Inverse of a Linear-Nonlinear Multi-Layer Spatial Vision Model
Analyzing the mathematical properties of perceptually meaningful linear-nonlinear transforms is interesting because this computation is at the core of many vision models. Here we carry out such an analysis in detail using a specific model [Malo & Simoncelli, SPIE Human Vision Electr. Imag. 2015], which is illustrative because it consists of a cascade of standard linear-nonlinear modules. The interest of the analytic results and of the numerical methods involved transcends the particular model because of the ubiquity of the linear-nonlinear structure.
Here we extend [Malo&Simoncelli 15] by considering 4 layers: (1) linear spectral integration and nonlinear brightness response, (2) definition of local contrast by using linear filters and divisive normalization, (3) linear CSF filter and nonlinear local contrast masking, and (4) linear wavelet-like decomposition and nonlinear divisive normalization to account for orientation and scale-dependent masking. The extra layers were measured using Maximum Differentiation [Malo et al. VSS 2016].
First, we describe the general architecture using a unified notation in which every module is composed of isomorphic linear and nonlinear transforms. The chain rule simplifies the analysis of systems with this modular architecture, and invertibility is related to the non-singularity of the Jacobian matrices. Second, we consider the details of the four layers in our particular model, and how they improve the original version of the model. Third, we explicitly list the derivatives of every module, which are relevant for the definition of perceptual distances, perceptual gradient descent, and characterization of the deformation of space. Fourth, we address the inverse, and we find different analytical and numerical problems in each specific module. Solutions are proposed for all of them. Finally, we describe through examples how to use the toolbox to apply and check the above theory.
In summary, the formulation and toolbox are ready to explore the geometric and perceptual issues addressed in the introductory section, giving all the technical information that was missing in [Malo&Simoncelli 15].
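The chain-rule structure described above can be sketched in a few lines. This is not the toolbox itself: the pointwise power nonlinearity and the matrices are hypothetical stand-ins for the model's actual modules, chosen only to show how per-module Jacobians compose across a linear-nonlinear cascade.

```python
import numpy as np

def module(x, W, g=0.5, eps=1e-6):
    """One linear+nonlinear module: linear map W followed by a
    pointwise signed power nonlinearity (illustrative choice)."""
    z = W @ x
    return np.sign(z) * (np.abs(z) + eps) ** g

def module_jacobian(x, W, g=0.5, eps=1e-6):
    """Jacobian of one module at x: diag(f'(Wx)) @ W (chain rule)."""
    z = W @ x
    dfdz = g * (np.abs(z) + eps) ** (g - 1.0)
    return np.diag(dfdz) @ W

def cascade_jacobian(x, Ws):
    """Total Jacobian of the cascade: product of per-module Jacobians,
    each evaluated at the previous module's output. The cascade is
    invertible wherever this matrix is non-singular."""
    J = np.eye(len(x))
    for W in Ws:
        J = module_jacobian(x, W) @ J
        x = module(x, W)
    return J
```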
Vision models for wide color gamut imaging in cinema
Gamut mapping is the problem of transforming the colors of image or video content so as to fully exploit the color palette of the display device where the content will be shown, while preserving the artistic intent of the original content's creator. In particular, in the cinema industry, the rapid advancement in display technologies has created a pressing need to develop automatic and fast gamut mapping algorithms. In this article, we propose a novel framework that is based on vision science models, performs both gamut reduction and gamut extension, is of low computational complexity, produces results that are free from artifacts, and outperforms state-of-the-art methods according to psychophysical tests. Our experiments also highlight the limitations of existing objective metrics for the gamut mapping problem.
Contrast Sensitivity Functions in Autoencoders
Three decades ago, Atick et al. suggested that human frequency sensitivity may emerge from the enhancement required for a more efficient analysis of retinal images. Here we reassess the relevance of low-level vision tasks in the explanation of the Contrast Sensitivity Functions (CSFs) in light of (1) the current trend of using artificial neural networks for studying vision, and (2) the current knowledge of retinal image representations. As a first contribution, we show that a very popular type of convolutional neural network (CNN), the autoencoder, may develop human-like CSFs in the spatio-temporal and chromatic dimensions when trained to perform some basic low-level vision tasks (like retinal noise and optical blur removal), but not others (like chromatic adaptation, or pure reconstruction after simple bottlenecks). As an illustrative example, the best CNN (in the considered set of simple architectures for enhancement of the retinal signal) reproduces the CSFs with an RMSE of 11% of the maximum sensitivity.
As a second contribution, we provide experimental evidence that, for some functional goals (at a low abstraction level), deeper CNNs that are better at reaching the quantitative goal are actually worse at replicating human-like phenomena (such as the CSFs). This low-level result (for the explored networks) is not necessarily in contradiction with other works that report advantages of deeper nets in modeling higher-level vision goals. However, in line with a growing body of literature, our results suggest another word of caution about CNNs in vision science, since the use of simplified units or unrealistic architectures in goal optimization may be a limitation for the modeling and understanding of human vision.
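The basic idea of reading a "CSF" out of a trained network can be sketched as follows. This probing procedure is a simplified stand-in for the psychophysically matched one used in the work: it measures the response amplitude of any image-to-image model to sinusoidal gratings of increasing frequency, relative to the input amplitude. The grating generator and the probe contrast are illustrative choices, not the stimuli of the paper.

```python
import numpy as np

def grating(freq, size=64, contrast=0.5):
    """Horizontal sinusoidal grating (freq in cycles per image)
    around a mid-grey background."""
    x = np.arange(size) / size
    row = 0.5 + contrast * np.sin(2 * np.pi * freq * x)
    return row[None, :] * np.ones((size, 1))

def sensitivity_curve(model, freqs, contrast=0.1):
    """Empirical frequency sensitivity of a network: output amplitude
    over input amplitude for gratings of each frequency. `model` is
    any image-to-image callable (e.g. a trained autoencoder)."""
    out = []
    for f in freqs:
        g = grating(f, contrast=contrast)
        # Response with the uniform background response subtracted.
        r = model(g) - model(np.full_like(g, 0.5))
        out.append(r.std() / g.std())
    return np.array(out)
```

For the identity model the curve is flat at 1; a human-like network would instead show the band-pass shape of the CSF.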
Video inpainting of occluding and occluded objects
We present a basic technique to fill in missing parts of a video sequence taken from a static camera. Two important cases are considered. The first case is concerned with the removal of non-stationary objects that occlude a stationary background. We use a priority-based spatio-temporal synthesis scheme for inpainting the stationary background. The second and more difficult case involves filling in moving objects when they are partially occluded. For this, we propose a priority scheme that first inpaints the occluded moving objects and then fills in the remaining area with stationary background using the method proposed for the first case. We use as input an optical-flow-based mask, which indicates whether an undamaged pixel is moving or stationary. The moving object is inpainted by copying patches from undamaged frames, and this copying is independent of the background of the moving object in either frame. This work has applications in a variety of areas, including video special effects and the restoration and enhancement of damaged videos. The examples shown in the paper illustrate these ideas.
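The moving/stationary mask that drives the priority scheme can be sketched minimally. The paper derives this mask from optical flow; as a simplified stand-in, the sketch below thresholds the temporal intensity difference between consecutive frames of the static-camera sequence.

```python
import numpy as np

def motion_mask(prev, curr, thresh=0.05):
    """Binary moving/stationary mask for undamaged pixels.

    prev, curr : consecutive grayscale frames (values in [0, 1])
    thresh     : minimum temporal change to label a pixel as moving
                 (illustrative value; the paper uses optical flow
                 rather than frame differencing)
    """
    return np.abs(curr.astype(float) - prev.astype(float)) > thresh
```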