
    Canonical views of scenes depend on the shape of the space

    When recognizing or depicting objects, people show a preference for particular “canonical” views. Are there similar preferences for particular views of scenes? We investigated this question using panoramic images, which show a 360-degree view of a location. Observers used an interactive viewer to explore the scene and select the best view. We found that agreement between observers on the “best” view of each scene was generally high. We attempted to predict the selected views using a model based on the shape of the space around the camera location and on the navigational constraints of the scene. The model performance suggests that observers select views which capture as much of the surrounding space as possible, but do not consider navigational constraints when selecting views. These results seem analogous to findings with objects, which suggest that canonical views maximize the visible surfaces of an object, but are not necessarily functional views.
    Funding: National Science Foundation (U.S.) (NSF CAREER Award 0546262); National Science Foundation (U.S.) (Grant 0705677); National Institutes of Health (U.S.) (Grant 1016862); National Eye Institute (Grant EY02484); National Science Foundation (U.S.) (NSF Graduate Research Fellowship)
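    As a rough illustration of the space-based account (not the authors' actual model), the sketch below scores candidate headings in a panorama by how much surrounding area falls inside the field of view, given a hypothetical 360-degree depth profile around the camera; the heading with the largest visible area is taken as the "best" view.

```python
import numpy as np

def best_view_heading(depth_profile, fov_deg=100.0):
    """Pick the heading (in degrees) whose field of view 'sees' the most space.

    depth_profile: 1D array of visible distances sampled at equal angular steps
    around the camera (e.g. length 360). The score for a heading is the area of
    the visible sector, approximated by summing r^2 / 2 per angular wedge.
    """
    n = len(depth_profile)
    wedge_area = 0.5 * depth_profile ** 2 * np.deg2rad(360.0 / n)
    half = int(round(fov_deg / 2 * n / 360.0))
    scores = np.empty(n)
    for h in range(n):
        idx = np.arange(h - half, h + half + 1) % n  # wrap around 360 degrees
        scores[h] = wedge_area[idx].sum()
    return float(np.argmax(scores) * 360.0 / n)

# Example: a room that extends farther toward 90 degrees pulls the best view that way.
profile = 2.0 + np.cos(np.deg2rad(np.arange(360) - 90.0))
print(best_view_heading(profile))  # approximately 90.0
```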

    Improving Denoising Diffusion Models via Simultaneous Estimation of Image and Noise

    This paper introduces two key contributions aimed at improving the speed and quality of images generated through inverse diffusion processes. The first contribution reparameterizes the diffusion process in terms of the angle on a quarter-circular arc between the image and noise, specifically setting the conventional $\sqrt{\bar{\alpha}} = \cos(\eta)$. This reparameterization eliminates two singularities and allows the diffusion evolution to be expressed as a well-behaved ordinary differential equation (ODE). In turn, this allows higher-order ODE solvers such as Runge-Kutta methods to be used effectively. The second contribution is to directly estimate both the image ($\mathbf{x}_0$) and the noise ($\boldsymbol{\epsilon}$) using our network, which enables more stable calculation of the update step in the inverse diffusion process, since accurate estimates of both the image and the noise are crucial at different stages of the process. Together, these changes achieve faster generation, with the ability to converge on high-quality images more quickly, and higher quality of the generated images, as measured by metrics such as Fréchet Inception Distance (FID), spatial Fréchet Inception Distance (sFID), precision, and recall.
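    A minimal sketch of the idea, assuming the usual forward process $x_\eta = \cos(\eta)\,\mathbf{x}_0 + \sin(\eta)\,\boldsymbol{\epsilon}$, so that $dx/d\eta = -\sin(\eta)\,\mathbf{x}_0 + \cos(\eta)\,\boldsymbol{\epsilon}$: if a network jointly predicts $\mathbf{x}_0$ and $\boldsymbol{\epsilon}$, the ODE velocity is available directly and a classical Runge-Kutta (RK4) step can integrate from pure noise toward the image. The `model` interface and the integration schedule below are assumptions for illustration, not the paper's implementation.

```python
import torch

def velocity(model, x, eta):
    """dx/deta for x(eta) = cos(eta)*x0 + sin(eta)*eps, using the network's
    joint estimate of (x0, eps). `model` is a placeholder returning both."""
    x0_hat, eps_hat = model(x, eta)
    return -torch.sin(eta) * x0_hat + torch.cos(eta) * eps_hat

def rk4_step(model, x, eta, d_eta):
    """One classical fourth-order Runge-Kutta step of the ODE in eta."""
    k1 = velocity(model, x, eta)
    k2 = velocity(model, x + 0.5 * d_eta * k1, eta + 0.5 * d_eta)
    k3 = velocity(model, x + 0.5 * d_eta * k2, eta + 0.5 * d_eta)
    k4 = velocity(model, x + d_eta * k3, eta + d_eta)
    return x + (d_eta / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

@torch.no_grad()
def sample(model, shape, steps=20, device="cpu"):
    """Integrate from pure noise (eta = pi/2) back toward the image (eta = 0)."""
    x = torch.randn(shape, device=device)
    etas = torch.linspace(torch.pi / 2, 0.0, steps + 1, device=device)
    for i in range(steps):
        x = rk4_step(model, x, etas[i], etas[i + 1] - etas[i])
    return x
```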

    Functional Properties of Subretinal Transplants in Rabbit

    Get PDF

    Rethinking the Role of Top-Down Attention in Vision: Effects Attributable to a Lossy Representation in Peripheral Vision

    According to common wisdom in the field of visual perception, top-down selective attention is required in order to bind features into objects. In this view, even simple tasks, such as distinguishing a rotated T from a rotated L, require selective attention since they require feature binding. Selective attention, in turn, is commonly conceived as involving volition, intention, and at least implicitly, awareness. There is something non-intuitive about the notion that we might need so expensive (and possibly human) a resource as conscious awareness in order to perform so basic a function as perception. In fact, we can carry out complex sensorimotor tasks, seemingly in the near absence of awareness or volitional shifts of attention (“zombie behaviors”). More generally, the tight association between attention and awareness, and the presumed role of attention in perception, is problematic. We propose that under normal viewing conditions, the main processes of feature binding and perception proceed largely independently of top-down selective attention. Recent work suggests that there is a significant loss of information in early stages of visual processing, especially in the periphery. In particular, our texture tiling model (TTM) represents images in terms of a fixed set of “texture” statistics computed over local pooling regions that tile the visual input. We argue that this lossy representation produces the perceptual ambiguities that have previously been ascribed to a lack of feature binding in the absence of selective attention. At the same time, the TTM representation is sufficiently rich to explain performance in such complex tasks as scene gist recognition, pop-out target search, and navigation. A number of phenomena that have previously been explained in terms of voluntary attention can be explained more parsimoniously with the TTM. In this model, peripheral vision introduces a specific kind of information loss, and the information available to an observer varies greatly depending upon shifts of the point of gaze (which usually occur without awareness). The available information, in turn, provides a key determinant of the visual system’s capabilities and deficiencies. This scheme dissociates basic perceptual operations, such as feature binding, from both top-down attention and conscious awareness.
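    To make the "fixed statistics over local pooling regions" idea concrete, here is a highly simplified sketch: pooling regions grow with eccentricity from the point of gaze, and each region is summarized by a few moment statistics. The real texture tiling model uses a much richer set of texture statistics; the constants and statistics below are illustrative assumptions only.

```python
import numpy as np

def pooling_region_size(ecc, alpha=0.5, min_size=8):
    """Pooling regions grow roughly linearly with eccentricity.
    `alpha` and `min_size` are illustrative constants, not fitted TTM parameters."""
    return max(min_size, int(alpha * ecc))

def tile_statistics(image, fixation, step=16):
    """Summarize a grayscale image by simple per-region statistics (mean, std,
    skewness) over pooling regions tiling the image; a crude stand-in for the
    richer texture statistics used by the texture tiling model."""
    fy, fx = fixation
    H, W = image.shape
    stats = []
    for y in range(0, H, step):
        for x in range(0, W, step):
            ecc = np.hypot(y - fy, x - fx)          # distance from fixation
            s = pooling_region_size(ecc)
            patch = image[max(0, y - s // 2): y + s // 2,
                          max(0, x - s // 2): x + s // 2]
            if patch.size == 0:
                continue
            m, sd = patch.mean(), patch.std() + 1e-8
            skew = ((patch - m) ** 3).mean() / sd ** 3
            stats.append((y, x, m, sd, skew))
    return np.array(stats)

# Example: a random "image" fixated at its center.
rng = np.random.default_rng(0)
img = rng.random((128, 128))
print(tile_statistics(img, fixation=(64, 64)).shape)
```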

    Estimating scene typicality from human ratings and image features

    Scenes, like objects, are visual entities that can be categorized into functional and semantic groups. One of the core concepts of human categorization is the idea that category membership is graded: some exemplars are more typical than others. Here, we obtain human typicality rankings for more than 120,000 images from 706 scene categories through an online rating task on Amazon Mechanical Turk. We use these rankings to identify the most typical examples of each scene category. Using computational models of scene classification based on global image features, we find that images which are rated as more typical examples of their category are more likely to be classified correctly. This indicates that the most typical scene examples contain the diagnostic visual features that are relevant for their categorization. Objectless, holistic representations of scenes might serve as a good basis for understanding how semantic categories are defined in terms of perceptual representations.
    Funding: National Science Foundation (U.S.) (NSF CAREER Award 0546262); National Science Foundation (U.S.) (Grant 0705677); National Science Foundation (U.S.) (Grant 1016862); National Eye Institute (Grant EY02484); National Science Foundation (U.S.) (CAREER Award 0747120); Google (Firm) (Research Award)
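    The core analysis can be sketched as follows: bin images by their human typicality score and measure classifier accuracy within each bin, expecting accuracy to rise with typicality. The function and toy data below are illustrative, not the study's actual pipeline.

```python
import numpy as np

def accuracy_by_typicality(typicality, correct, n_bins=5):
    """Relate human typicality scores to classifier correctness.

    typicality: per-image typicality score (higher = more typical), shape (N,)
    correct:    per-image boolean, True if the scene classifier was right
    Returns classifier accuracy within each typicality quantile bin,
    ordered from least to most typical.
    """
    typicality = np.asarray(typicality, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    edges = np.quantile(typicality, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, typicality, side="right") - 1, 0, n_bins - 1)
    return np.array([correct[bins == b].mean() for b in range(n_bins)])

# Toy example: synthetic data in which more typical images are classified better.
rng = np.random.default_rng(1)
typ = rng.random(1000)
corr = rng.random(1000) < (0.4 + 0.4 * typ)
print(accuracy_by_typicality(typ, corr))  # accuracy should rise across bins
```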

    Use of Local Image Information in Depth Edge Classification by Humans and Neural Networks

    Humans can use local cues to distinguish image edges caused by a depth change from other types of edges (Vilankar et al., 2014). But which local cues? Here we use the SYNS database (Adams et al., 2016) to automatically label edges in images of natural scenes as depth or non-depth. We use this ground truth to identify the cues used by human observers and convolutional neural networks (CNNs) for edge classification. Eight observers viewed square image patches, each centered on an image edge, ranging in width from 0.6 to 2.4 degrees (8 to 32 pixels). Human judgments (depth/non-depth) were compared to responses of a CNN trained on the same task. Human performance improved with patch size (65-74% correct) but remained well below CNN accuracy (82-86% correct). Agreement between humans and the CNN was above chance but lower than human-human agreement. Decision Variable Correlation (Sebastian & Geisler, in press) was used to evaluate the relationships between depth responses and local edge cues. Humans seem to rely primarily on contrast cues, specifically luminance contrast and red-green contrast across the edge. The CNN also relies on luminance contrast, but unlike humans it seems to make use of mean luminance and red-green intensity as well. These local luminance and color features provide valid cues for depth edge discrimination in natural scenes.
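    A hedged sketch of the CNN side of such an experiment: a small binary classifier that maps an edge-centered RGB patch to a depth/non-depth decision. The architecture is an assumption for illustration; the paper's network details are not given here.

```python
import torch
import torch.nn as nn

class DepthEdgeNet(nn.Module):
    """Small binary classifier for edge-centered patches (depth vs. non-depth)."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, 1)  # logit for "depth edge"

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1)).squeeze(1)

# A batch of 32x32 RGB patches centered on edges -> depth-edge logits.
net = DepthEdgeNet()
patches = torch.randn(8, 3, 32, 32)
print(net(patches).shape)  # torch.Size([8])
```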

    Invertible Concept-based Explanations for CNN Models with Non-negative Concept Activation Vectors

    Convolutional neural network (CNN) models for computer vision are powerful but lack explainability in their most basic form. This deficiency remains a key challenge when applying CNNs in important domains. Recent work for explanations through feature importance of approximate linear models has moved from input-level features (pixels or segments) to features from mid-layer feature maps in the form of concept activation vectors (CAVs). CAVs contain concept-level information and could be learnt via clustering. In this work, we rethink the ACE algorithm of Ghorbani et al., proposing an alternative invertible concept-based explanation (ICE) framework to overcome its shortcomings. Based on the requirements of fidelity (approximate models to target models) and interpretability (being meaningful to people), we design measurements and evaluate a range of matrix factorization methods with our framework. We find that non-negative concept activation vectors (NCAVs) from non-negative matrix factorization provide superior performance in interpretability and fidelity based on computational and human subject experiments. Our framework provides both local and global concept-level explanations for pre-trained CNN models.
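    A hedged reconstruction of the general NCAV recipe (not the ICE implementation itself): collect mid-layer, post-ReLU feature maps for a set of images, treat each spatial position as a sample, and factorize with non-negative matrix factorization; the component rows act as concept vectors and the weights give per-position concept presence.

```python
import numpy as np
from sklearn.decomposition import NMF

def learn_ncavs(feature_maps, n_concepts=10, seed=0):
    """Learn non-negative concept activation vectors from mid-layer activations.

    feature_maps: array of shape (N, H, W, C) of non-negative (e.g. post-ReLU)
    CNN activations for N images. Each spatial position is treated as a sample.
    Returns (ncavs, scores): ncavs has shape (n_concepts, C); scores has shape
    (N, H, W, n_concepts) giving how strongly each concept is present where.
    """
    N, H, W, C = feature_maps.shape
    V = feature_maps.reshape(-1, C)                      # positions x channels
    nmf = NMF(n_components=n_concepts, init="nndsvda",
              random_state=seed, max_iter=500)
    weights = nmf.fit_transform(V)                       # positions x concepts
    ncavs = nmf.components_                              # concepts x channels
    return ncavs, weights.reshape(N, H, W, n_concepts)

# Toy example with random post-ReLU-like activations.
acts = np.abs(np.random.default_rng(0).normal(size=(16, 7, 7, 64)))
ncavs, scores = learn_ncavs(acts, n_concepts=5)
print(ncavs.shape, scores.shape)  # (5, 64) (16, 7, 7, 5)
```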

    Basic level scene understanding: categories, attributes and structures

    A longstanding goal of computer vision is to build a system that can automatically understand a 3D scene from a single image. This requires extracting semantic concepts and 3D information from 2D images which can depict an enormous variety of environments that comprise our visual world. This paper summarizes our recent efforts toward these goals. First, we describe the richly annotated SUN database which is a collection of annotated images spanning 908 different scene categories with object, attribute, and geometric labels for many scenes. This database allows us to systematically study the space of scenes and to establish a benchmark for scene and object recognition. We augment the categorical SUN database with 102 scene attributes for every image and explore attribute recognition. Finally, we present an integrated system to extract the 3D structure of the scene and objects depicted in an image.
    Funding: Google U.S./Canada Ph.D. Fellowship in Computer Vision; National Science Foundation (U.S.) (Grant 1016862); Google Faculty Research Award; National Science Foundation (U.S.) (CAREER Award 1149853); National Science Foundation (U.S.) (CAREER Award 0747120); United States Office of Naval Research Multidisciplinary University Research Initiative (Grant N000141010933)