177 research outputs found

    Mapping the spatio-temporal dynamics of vision in the human brain

    Recognition of objects and scenes is a fundamental function of the human brain, necessitating a complex neural machinery that transforms low-level visual information into semantic content. Despite significant advances in characterizing the locus and function of key visual areas, integrating the temporal and spatial dynamics of this processing stream has posed a decades-long challenge to human neuroscience. In this talk I will describe a brain mapping approach that combines magnetoencephalography (MEG), functional MRI (fMRI) measurements, and convolutional neural networks (CNNs) through representational similarity analysis to yield a spatially and temporally integrated characterization of neuronal representations when observers perceive visual events. The approach is well suited to characterizing the duration and sequencing of perceptual and cognitive tasks, and to placing new constraints on the computational architecture of cognition. In collaboration with: D. Pantazis, R.M. Cichy, A. Torralba, S.M. Khaligh-Razavi, C. Mullin, Y. Mohsenzadeh, B. Zhou, A. Khosl
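
As a rough illustration of how representational similarity analysis can relate two measurement modalities, the sketch below builds a representational dissimilarity matrix (RDM) for each modality and compares the two with a Spearman correlation. Everything here is an assumption made for illustration: the toy patterns, the names `meg_patterns` and `fmri_patterns`, and the choice of 1 minus Pearson correlation as the dissimilarity measure. It is not the pipeline used in the talk.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rdm(patterns):
    """Upper triangle of the RDM: 1 - Pearson r for every pair of stimuli."""
    pairs = combinations(range(len(patterns)), 2)
    return [1.0 - pearson(patterns[i], patterns[j]) for i, j in pairs]


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in order[i:j + 1]:
            result[k] = average_rank
        i = j + 1
    return result


def rdm_similarity(patterns_a, patterns_b):
    """Spearman correlation between the RDMs of two modalities."""
    return pearson(ranks(rdm(patterns_a)), ranks(rdm(patterns_b)))


# Hypothetical response patterns: one row per stimulus, one column per channel.
meg_patterns = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 3.0, 2.0, 1.0],
    [2.0, 1.0, 4.0, 3.0],
]
# A rescaled copy stands in for a second modality with the same geometry.
fmri_patterns = [[2.0 * v + 1.0 for v in row] for row in meg_patterns]

similarity = rdm_similarity(meg_patterns, fmri_patterns)
print(round(similarity, 6))
```

Because the comparison happens between RDMs rather than raw signals, the two modalities do not need to share units or channel counts; they only need the same stimulus set.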

    Interpreting Deep Visual Representations via Network Dissection

    The success of recent deep convolutional neural networks (CNNs) depends on learning hidden representations that can summarize the important factors of variation behind the data. However, CNNs are often criticized as being black boxes that lack interpretability, since they have millions of unexplained model parameters. In this work, we describe Network Dissection, a method that interprets networks by providing labels for the units of their deep visual representations. The proposed method quantifies the interpretability of CNN representations by evaluating the alignment between individual hidden units and a set of visual semantic concepts. By identifying the best alignments, units are given human-interpretable labels across a range of objects, parts, scenes, textures, materials, and colors. The method reveals that deep representations are more transparent and interpretable than expected: we find that representations are significantly more interpretable than they would be under a random equivalently powerful basis. We apply the method to interpret and compare the latent representations of various network architectures trained to solve different supervised and self-supervised training tasks. We then examine factors affecting the network interpretability, such as the number of training iterations, regularizations, different initializations, and the network depth and width. Finally, we show that the interpreted units can be used to provide explicit explanations of a prediction given by a CNN for an image. Our results highlight that interpretability is an important property of deep neural networks that provides new insights into their hierarchical structure. Comment: *B. Zhou and D. Bau contributed equally to this work. 15 pages, 27 figure
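
One way to picture the alignment step described above is to threshold a unit's activation map and score its overlap with each concept's segmentation mask, then keep the best-scoring concept as the unit's label. The sketch below uses intersection over union as the overlap score. The threshold value, the toy 3x3 maps, and the helper names `iou` and `best_label` are assumptions for illustration, not the authors' implementation.

```python
def iou(activation, concept_mask, threshold):
    """Intersection over union between a thresholded activation map and a mask."""
    intersection = 0
    union = 0
    for act_row, mask_row in zip(activation, concept_mask):
        for act, in_concept in zip(act_row, mask_row):
            unit_on = act > threshold
            concept_on = bool(in_concept)
            intersection += unit_on and concept_on
            union += unit_on or concept_on
    return intersection / union if union else 0.0


def best_label(activation, concept_masks, threshold):
    """Return the concept whose mask aligns best with the unit, and its score."""
    scores = {
        name: iou(activation, mask, threshold)
        for name, mask in concept_masks.items()
    }
    label = max(scores, key=scores.get)
    return label, scores[label]


# Hypothetical activation map of one hidden unit on one image.
unit_activation = [
    [0.9, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.0, 0.1, 0.0],
]
# Hypothetical pixel-level concept annotations for the same image.
concepts = {
    "sky": [[1, 1, 1], [0, 0, 0], [0, 0, 0]],
    "grass": [[0, 0, 0], [0, 0, 0], [1, 1, 1]],
}

label, score = best_label(unit_activation, concepts, threshold=0.5)
print(label, score)
```

In practice the overlap would be accumulated over a whole annotated dataset rather than a single image, and a unit would only receive a label if its best score cleared some minimum.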

    Global Depth Perception from Familiar Scene Structure

    In the absence of cues for absolute depth measurement, such as binocular disparity, motion, or defocus, the absolute distance between the observer and a scene cannot be measured. The interpretation of shading, edges and junctions may provide a 3D model of the scene but it will not inform about the actual "size" of the space. One possible source of information for absolute depth estimation is the image size of known objects. However, this is computationally complex due to the difficulty of the object recognition process. Here we propose a source of information for absolute depth estimation that does not rely on specific objects: we introduce a procedure for absolute depth estimation based on the recognition of the whole scene. The shape of the space of the scene and the structures present in the scene are strongly related to the scale of observation. We demonstrate that, by recognizing the properties of the structures present in the image, we can infer the scale of the scene, and therefore its absolute mean depth. We illustrate the interest in computing the mean depth of the scene with application to scene recognition and object detection

    Recognition of natural scenes from global properties: Seeing the forest without representing the trees

    Human observers are able to rapidly and accurately categorize natural scenes, but the representation mediating this feat is still unknown. Here we propose a framework of rapid scene categorization that does not segment a scene into objects and instead uses a vocabulary of global, ecological properties that describe spatial and functional aspects of scene space (such as navigability or mean depth). In Experiment 1, we obtained ground truth rankings on global properties for use in Experiments 2–4. To what extent do human observers use global property information when rapidly categorizing natural scenes? In Experiment 2, we found that global property resemblance was a strong predictor of both false alarm rates and reaction times in a rapid scene categorization experiment. To what extent is global property information alone a sufficient predictor of rapid natural scene categorization? In Experiment 3, we found that the performance of a classifier representing only these properties is indistinguishable from human performance in a rapid scene categorization task in terms of both accuracy and false alarms. To what extent is this high predictability unique to a global property representation? In Experiment 4, we compared two models that represent scene object information to human categorization performance and found that these models had lower fidelity at representing the patterns of performance than the global property model. These results provide support for the hypothesis that rapid categorization of natural scenes may not be mediated primarily through objects and parts, but also through global properties of structure and affordance. National Science Foundation (U.S.) (Graduate Research Fellowship); National Science Foundation (U.S.) (Grant 0705677); National Science Foundation (U.S.) (Career Award 0546262); NEC Corporation Fund for Research in Computers and Communication

    The Briefest of Glances: The Time Course of Natural Scene Understanding

    What information is available from a brief glance at a novel scene? Although previous efforts to answer this question have focused on scene categorization or object detection, real-world scenes contain a wealth of information whose perceptual availability has yet to be explored. We compared image exposure thresholds in several tasks involving basic-level categorization or global-property classification. All thresholds were remarkably short: Observers achieved 75%-correct performance with presentations ranging from 19 to 67 ms, reaching maximum performance at about 100 ms. Global-property categorization was performed with significantly less presentation time than basic-level categorization, which suggests that there exists a time during early visual processing when a scene may be classified as, for example, a large space or navigable, but not yet as a mountain or lake. Comparing the relative availability of visual information reveals bottlenecks in the accumulation of meaning. Understanding these bottlenecks provides critical insight into the computations underlying rapid visual understanding. National Science Foundation (U.S.) (CAREER Award (0546262)); National Science Foundation (U.S.) (Grant 0705677); National Science Foundation (U.S.) (Graduate Research Fellowship

    Canonical views of scenes depend on the shape of the space

    When recognizing or depicting objects, people show a preference for particular “canonical” views. Are there similar preferences for particular views of scenes? We investigated this question using panoramic images, which show a 360-degree view of a location. Observers used an interactive viewer to explore the scene and select the best view. We found that agreement between observers on the “best” view of each scene was generally high. We attempted to predict the selected views using a model based on the shape of the space around the camera location and on the navigational constraints of the scene. The model performance suggests that observers select views which capture as much of the surrounding space as possible, but do not consider navigational constraints when selecting views. These results seem analogous to findings with objects, which suggest that canonical views maximize the visible surfaces of an object, but are not necessarily functional views. National Science Foundation (U.S.) (NSF Career award (0546262)); National Science Foundation (U.S.) (Grant 0705677); National Institutes of Health (U.S.) (Grant 1016862); National Eye Institute (grant EY02484); National Science Foundation (U.S.) (NSF Graduate Research Fellowship

    Artifact magnification on deepfake videos increases human detection and subjective confidence

    The development of technologies for easily and automatically falsifying video has raised practical questions about people's ability to detect false information online. How vulnerable are people to deepfake videos? What technologies can be applied to boost their performance? Human susceptibility to deepfake videos is typically measured in laboratory settings, which do not reflect the challenges of real-world browsing. In typical browsing, deepfakes are rare, engagement with the video may be short, participants may be distracted, or the video streaming quality may be degraded. Here, we tested deepfake detection under these ecological viewing conditions, and found that detection was lowered in all cases. Principles from signal detection theory indicated that different viewing conditions affected different dimensions of detection performance. Overall, this suggests that the current literature underestimates people's susceptibility to deepfakes. Next, we examined how computer vision models might be integrated into users' decision process to increase accuracy and confidence during deepfake detection. We evaluated the effectiveness of communicating the model's prediction to the user by amplifying artifacts in fake videos. We found that artifact amplification was highly effective at making fake video distinguishable from real, in a manner that was robust across viewing conditions. Additionally, compared to a traditional text-based prompt, artifact amplification was more convincing: people accepted the model's suggestion more often, and reported higher final confidence in their model-supported decision, particularly for more challenging videos. Overall, this suggests that visual indicators that cause distortions on fake videos may be highly effective at mitigating the impact of falsified video. Comment: 8 pages, 4 figure
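
Signal detection theory typically separates detection performance into sensitivity (d') and response bias (criterion), which is one way a viewing condition can affect one "dimension" of performance and not another. The sketch below computes both from hit and false-alarm rates using the standard equal-variance Gaussian formulas. The rates and condition names are invented for illustration and are not results from the study.

```python
from statistics import NormalDist

_standard_normal = NormalDist()  # mean 0, standard deviation 1


def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity: z(hit rate) - z(false-alarm rate)."""
    z = _standard_normal.inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


def criterion(hit_rate, false_alarm_rate):
    """Response bias: -(z(hit rate) + z(false-alarm rate)) / 2.

    Positive values indicate a conservative observer who rarely says "fake".
    """
    z = _standard_normal.inv_cdf
    return -(z(hit_rate) + z(false_alarm_rate)) / 2


# Hypothetical (hit rate, false-alarm rate) pairs for two viewing conditions.
conditions = {
    "lab_baseline": (0.80, 0.20),
    "rare_deepfakes": (0.60, 0.08),
}

for name, (hits, false_alarms) in conditions.items():
    sensitivity = d_prime(hits, false_alarms)
    bias = criterion(hits, false_alarms)
    print(f"{name}: d'={sensitivity:.2f}, c={bias:.2f}")
```

Rates of exactly 0 or 1 have no finite z-score, so real analyses usually apply a small correction before calling functions like these.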

    Quadri-stability of a spatially ambiguous auditory illusion

    In addition to vision, audition plays an important role in sound localization in our world. One way we estimate the motion of an auditory object moving towards or away from us is from changes in volume intensity. However, the human auditory system has unequally distributed spatial resolution, including difficulty distinguishing sounds in front vs. behind the listener. Here, we introduce a novel quadri-stable illusion, the Transverse-and-Bounce Auditory Illusion, which combines front-back confusion with changes in volume levels of a nonspatial sound to create ambiguous percepts of an object approaching and withdrawing from the listener. The sound can be perceived as traveling transversely from front to back or back to front, or “bouncing” to remain exclusively in front of or behind the observer. Here we demonstrate how human listeners experience this illusory phenomenon by comparing ambiguous and unambiguous stimuli for each of the four possible motion percepts. When asked to rate their confidence in perceiving each sound’s motion, participants reported equal confidence for the illusory and unambiguous stimuli. Participants perceived all four illusory motion percepts, and could not distinguish the illusion from the unambiguous stimuli. These results show that this illusion is effectively quadri-stable. In a second experiment, the illusory stimulus was looped continuously in headphones while participants identified its perceived path of motion to test properties of perceptual switching, locking, and biases. Participants were biased towards perceiving transverse compared to bouncing paths, and they became perceptually locked into alternating between front-to-back and back-to-front percepts, perhaps reflecting how auditory objects commonly move in the real world. 
    This multi-stable auditory illusion opens opportunities for studying the perceptual, cognitive, and neural representation of objects in motion, as well as exploring multimodal perceptual awareness. United States. Dept. of Defense (National Defense Science and Engineering Graduate (NDSEG) Fellowships