Recognition of natural scenes from global properties: Seeing the forest without representing the trees
Human observers are able to rapidly and accurately categorize natural scenes, but the representation mediating this feat is still unknown. Here we propose a framework of rapid scene categorization that does not segment a scene into objects and instead uses a vocabulary of global, ecological properties that describe spatial and functional aspects of scene space (such as navigability or mean depth). In Experiment 1, we obtained ground truth rankings on global properties for use in Experiments 2–4. To what extent do human observers use global property information when rapidly categorizing natural scenes? In Experiment 2, we found that global property resemblance was a strong predictor of both false alarm rates and reaction times in a rapid scene categorization experiment. To what extent is global property information alone a sufficient predictor of rapid natural scene categorization? In Experiment 3, we found that the performance of a classifier representing only these properties is indistinguishable from human performance in a rapid scene categorization task in terms of both accuracy and false alarms. To what extent is this high predictability unique to a global property representation? In Experiment 4, we compared two models that represent scene object information to human categorization performance and found that these models had lower fidelity at representing the patterns of performance than the global property model. These results provide support for the hypothesis that rapid categorization of natural scenes may be mediated not primarily through objects and parts, but also through global properties of structure and affordance. National Science Foundation (U.S.) (Graduate Research Fellowship); National Science Foundation (U.S.) (Grant 0705677); National Science Foundation (U.S.) (Career Award 0546262); NEC Corporation Fund for Research in Computers and Communication.
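As a rough illustration of the kind of classifier tested in Experiment 3, the sketch below trains a scene-category classifier on global-property rankings alone and reports cross-validated accuracy. The property names, data shapes, and values are invented for the example and are not the study's materials or code.

# Illustrative sketch (not the authors' code): predict scene category from
# global-property rankings alone, in the spirit of Experiment 3.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
properties = ["openness", "mean_depth", "navigability", "temperature", "transience"]
n_scenes, n_categories = 200, 8

# Stand-in data: one ranking per scene on each global property, plus category labels.
X = rng.random((n_scenes, len(properties)))
y = rng.integers(0, n_categories, size=n_scenes)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)
print(f"Cross-validated category accuracy from global properties alone: {scores.mean():.2f}")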
The Briefest of Glances: The Time Course of Natural Scene Understanding
What information is available from a brief glance at a novel scene? Although previous efforts to answer this question have focused on scene categorization or object detection, real-world scenes contain a wealth of information whose perceptual availability has yet to be explored. We compared image exposure thresholds in several tasks involving basic-level categorization or global-property classification. All thresholds were remarkably short: Observers achieved 75%-correct performance with presentations ranging from 19 to 67 ms, reaching maximum performance at about 100 ms. Global-property categorization was performed with significantly less presentation time than basic-level categorization, which suggests that there exists a time during early visual processing when a scene may be classified as, for example, a large space or navigable, but not yet as a mountain or lake. Comparing the relative availability of visual information reveals bottlenecks in the accumulation of meaning. Understanding these bottlenecks provides critical insight into the computations underlying rapid visual understanding. National Science Foundation (U.S.) (CAREER Award 0546262); National Science Foundation (U.S.) (Grant 0705677); National Science Foundation (U.S.) (Graduate Research Fellowship).
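The 75%-correct exposure thresholds can be illustrated with a standard psychometric-function fit. The sketch below uses made-up durations and accuracies, not the paper's data or analysis code; with a 50% guess rate, the fitted midpoint alpha is exactly the 75%-correct point.

# Illustrative sketch (assumed data): estimate a 75%-correct exposure-duration
# threshold by fitting a guess-rate-corrected logistic psychometric function.
import numpy as np
from scipy.optimize import curve_fit

def psychometric(t, alpha, beta, guess=0.5):
    # Performance rises from chance (guess rate) toward 1.0 as exposure grows.
    return guess + (1.0 - guess) / (1.0 + np.exp(-(t - alpha) / beta))

durations = np.array([12, 24, 35, 47, 59, 82, 106])                # ms of image exposure
accuracy  = np.array([0.52, 0.61, 0.72, 0.81, 0.88, 0.93, 0.96])   # made-up proportions correct

(alpha, beta), _ = curve_fit(psychometric, durations, accuracy, p0=[40, 10])

# With a 50% guess rate, the logistic midpoint alpha is the 75%-correct threshold.
print(f"Estimated 75%-correct threshold: {alpha:.1f} ms")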
Disentangling the Independent Contributions of Visual and Conceptual Features to the Spatiotemporal Dynamics of Scene Categorization
Human scene categorization is characterized by its remarkable speed. While many visual and conceptual features have been linked to this ability, significant correlations exist between feature spaces, impeding our ability to determine their relative contributions to scene categorization. Here, we used a whitening transformation to decorrelate a variety of visual and conceptual features and assess the time course of their unique contributions to scene categorization. Participants (both sexes) viewed 2250 full-color scene images drawn from 30 different scene categories while having their brain activity measured through 256-channel EEG. We examined the variance explained at each electrode and time point of visual event-related potential (vERP) data from nine different whitened encoding models. These ranged from low-level features obtained from filter outputs to high-level conceptual features requiring human annotation. The amount of category information in the vERPs was assessed through multivariate decoding methods. Behavioral similarity measures were obtained in separate crowdsourced experiments. We found that all nine models together explained 78% of the variance in human scene similarity assessments and were within the noise ceiling of the vERP data. Low-level models explained earlier vERP variability (88 ms after image onset), whereas high-level models explained later variance (169 ms). Critically, only high-level models shared vERP variability with behavior. Together, these results suggest that scene categorization is primarily a high-level process, but reliant on previously extracted low-level features.
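The whitening step can be sketched generically: decorrelate a stacked feature matrix so that covarying feature spaces no longer share variance, then measure variance explained in the EEG signal over time. The example below uses simulated data and assumed array shapes, not the study's features or vERPs.

# Minimal sketch of the decorrelation idea (simulated data, not the study's pipeline):
# ZCA-whiten a feature matrix, then track variance explained in EEG across time points.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n_images, n_features, n_timepoints = 300, 40, 100

X = rng.standard_normal((n_images, n_features))        # stacked feature spaces (stand-in)
eeg = rng.standard_normal((n_images, n_timepoints))    # one electrode, all time points (stand-in)

# ZCA whitening: rotate and rescale X so its covariance becomes the identity.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (n_images - 1)
evals, evecs = np.linalg.eigh(cov)
W = evecs @ np.diag(1.0 / np.sqrt(evals + 1e-8)) @ evecs.T
Xw = Xc @ W

# Variance explained by the whitened features at each time point (in-sample R^2).
r2 = [LinearRegression().fit(Xw, eeg[:, t]).score(Xw, eeg[:, t]) for t in range(n_timepoints)]
print(f"Peak in-sample R^2 across time: {max(r2):.2f}")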
A global framework for scene gist
Thesis (Ph. D.) -- Massachusetts Institute of Technology, Dept. of Brain and Cognitive Sciences, 2009. Cataloged from PDF version of thesis. Includes bibliographical references. Human observers are able to rapidly and accurately categorize natural scenes, but the representation mediating this feat is still unknown. Here we propose a framework of rapid scene categorization that does not segment a scene into objects and instead uses a vocabulary of global, ecological properties that describe spatial and functional aspects of scene space (such as navigability or mean depth). In Chapter 1, four experiments explore the human sensitivity to global properties for rapid scene categorization, as well as the computational sufficiency of these properties for predicting scene categories. Chapter 2 explores the time course of scene understanding, finding that global properties can be perceived with less image exposure than the computation of a scene's basic-level category. Finally, in Chapter 3, I explore aftereffects to adaptation to global properties, showing that repeated exposure to many global properties produces robust high-level aftereffects, thus providing evidence for the neural coding of these properties. Altogether, these results provide support for the hypothesis that rapid categorization of natural scenes may be mediated not primarily through objects and parts, but also through global properties of structure and affordance. By Michelle R. Greene. Ph.D.
Dynamic Electrode-to-Image (DETI) mapping reveals the human brain’s spatiotemporal code of visual information
A number of neuroimaging techniques have been employed to understand how visual information is transformed along the visual pathway. Although each technique has spatial and temporal limitations, they can each provide important insights into the visual code. While the BOLD signal of fMRI can be quite informative, the visual code is not static, and its dynamics are obscured by fMRI's poor temporal resolution. In this study, we leveraged the high temporal resolution of EEG to develop an encoding technique based on the distribution of responses generated by a population of real-world scenes. This approach maps neural signals to each pixel within a given image and reveals location-specific transformations of the visual code, providing a spatiotemporal signature for the image at each electrode. Our analyses of the mapping results revealed that scenes undergo a series of nonuniform transformations that prioritize different spatial frequencies at different regions of scenes over time. This mapping technique offers a potential avenue for future studies to explore how dynamic feedforward and recurrent processes inform and refine high-level representations of our visual world.
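A toy version of the mapping idea, using simulated images and responses rather than the DETI pipeline itself, is to relate every pixel to an electrode's response at every time point, which yields a pixel-by-time map for that electrode. Array shapes and data below are assumptions for illustration.

# Toy pixel-to-electrode correlation map (simulated data, not the DETI method).
import numpy as np

rng = np.random.default_rng(2)
n_images, h, w, n_timepoints = 120, 32, 32, 50

images = rng.random((n_images, h, w))                  # grayscale scene stand-ins
erp = rng.standard_normal((n_images, n_timepoints))    # one electrode's responses (stand-in)

# Correlate every pixel with the electrode signal at every time point.
pix = images.reshape(n_images, -1)
pix = pix - pix.mean(axis=0)
sig = erp - erp.mean(axis=0)
corr = (pix / pix.std(axis=0)).T @ (sig / sig.std(axis=0)) / n_images   # (pixels, time)
maps = corr.reshape(h, w, n_timepoints)
print("Spatiotemporal map shape (height x width x time):", maps.shape)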
Digital Divides in Scene Recognition: Uncovering Socioeconomic Biases in Deep Learning Systems
Computer-based scene understanding has influenced fields ranging from urban planning to autonomous vehicle performance, yet little is known about how well these technologies work across social differences. We investigate the biases of deep convolutional neural networks (dCNNs) in scene classification, using nearly one million images from global and US sources, including user-submitted home photographs and Airbnb listings. We applied statistical models to quantify the impact of socioeconomic indicators such as family income, Human Development Index (HDI), and demographic factors from public data sources (CIA and US Census) on dCNN performance. Our analyses revealed significant socioeconomic bias: pretrained dCNNs demonstrated lower classification accuracy, lower classification confidence, and a higher tendency to assign labels that could be offensive when applied to homes (e.g., "ruin", "slum"), especially in images from homes with lower socioeconomic status (SES). This trend is consistent across two datasets of international images and within the diverse economic and racial landscapes of the United States. This research contributes to understanding biases in computer vision, emphasizing the need for more inclusive and representative training datasets. By mitigating bias in computer vision pipelines, we can ensure fairer and more equitable outcomes for applied computer vision, including home valuation and smart home security systems. There is urgency in addressing these biases, which can significantly impact critical decisions in urban development and resource allocation. Our findings also motivate the development of AI systems that better understand and serve diverse communities, moving towards technology that equitably benefits all sectors of society. Comment: 20 pages, 3 figures, 3 tables.
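One way to picture the kind of statistical model described above, using invented data rather than the study's, is a logistic regression of per-image classification correctness on socioeconomic indicators such as log income and HDI. Variable names and values are hypothetical.

# Hypothetical sketch of the bias analysis (illustrative data, not the study's):
# regress per-image dCNN correctness on socioeconomic indicators.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n_images = 1000

income = rng.lognormal(mean=10, sigma=0.6, size=n_images)   # stand-in household income
hdi = rng.uniform(0.4, 0.95, size=n_images)                 # stand-in HDI of image origin
correct = rng.binomial(1, 0.7, size=n_images)               # 1 if the dCNN label was correct

X = sm.add_constant(np.column_stack([np.log(income), hdi]))
model = sm.Logit(correct, X).fit(disp=0)
print(model.summary(xname=["const", "log_income", "hdi"]))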
Towards a state-space geometry of neural responses to natural scenes: A steady-state approach
Our understanding of information processing by the mammalian visual system has come through a variety of techniques ranging from psychophysics and fMRI to single unit recording and EEG. Each technique provides unique insights into the processing framework of the early visual system. Here, we focus on the nature of the information that is carried by steady state visual evoked potentials (SSVEPs). To study the information provided by SSVEPs, we presented human participants with a population of natural scenes and measured the relative SSVEP response. Rather than focus on particular features of this signal, we focused on the full state-space of possible responses and investigated how the evoked responses are mapped onto this space. Our results show that it is possible to map the relatively high-dimensional signal carried by SSVEPs onto a 2-dimensional space with little loss. We also show that a simple biologically plausible model can account for a high proportion of the explainable variance (~73%) in that space. Finally, we describe a technique for measuring the mutual information that is available about images from SSVEPs. The techniques introduced here represent a new approach to understanding the nature of the information carried by SSVEPs. Crucially, this approach is general and can provide a means of comparing results across different neural recording methods. Altogether, our study sheds light on the encoding principles of early vision and provides a much-needed reference point for understanding subsequent transformations of the early visual response space into deeper knowledge structures that link different visual environments.
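The projection of a high-dimensional response space onto two dimensions can be illustrated with a generic PCA on simulated response vectors; this is only a sketch of the dimensionality-reduction step under assumed data, not the paper's state-space analysis.

# Sketch of the dimensionality-reduction step (simulated responses, not SSVEP recordings).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
n_scenes, n_channels = 150, 64

# Simulated response vectors with low-dimensional structure plus noise.
latent = rng.standard_normal((n_scenes, 2))
mixing = rng.standard_normal((2, n_channels))
responses = latent @ mixing + 0.3 * rng.standard_normal((n_scenes, n_channels))

pca = PCA(n_components=2).fit(responses)
print(f"Variance retained by a 2-D state space: {pca.explained_variance_ratio_.sum():.2f}")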
Distinct contributions of functional and deep neural network features to representational similarity of scenes in human brain and behavior
Inherent correlations between visual and semantic features in real-world scenes make it difficult to determine how different scene properties contribute to neural representations. Here, we assessed the contributions of multiple properties to scene representation by partitioning the variance explained in human behavioral and brain measurements by three feature models whose inter-correlations were minimized a priori through stimulus preselection. Behavioral assessments of scene similarity reflected unique contributions from a functional feature model indicating potential actions in scenes as well as high-level visual features from a deep neural network (DNN). In contrast, similarity of cortical responses in scene-selective areas was uniquely explained by mid- and high-level DNN features only, while an object label model did not contribute uniquely to either domain. The striking dissociation between functional and DNN features in their contribution to behavioral and brain representations of scenes indicates that scene-selective cortex represents only a subset of behaviorally relevant scene information.
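Variance partitioning of this sort can be sketched with a toy regression: the unique contribution of one feature model is the drop in R^2 when that model is removed from the full model. The predictors and data below are simulated stand-ins, not the study's measurements.

# Toy variance-partitioning sketch (simulated data, not the study's analysis).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n_pairs = 500

functions = rng.standard_normal((n_pairs, 1))   # functional (action) model predictor
dnn = rng.standard_normal((n_pairs, 1))         # DNN feature model predictor
objects = rng.standard_normal((n_pairs, 1))     # object label model predictor
behavior = 0.5 * functions[:, 0] + 0.4 * dnn[:, 0] + 0.2 * rng.standard_normal(n_pairs)

full = np.hstack([functions, dnn, objects])
r2_full = LinearRegression().fit(full, behavior).score(full, behavior)

# Unique variance of each model = R^2 of the full model minus R^2 without that model.
for name, keep in [("functional", [1, 2]), ("DNN", [0, 2]), ("object label", [0, 1])]:
    reduced = full[:, keep]
    r2_reduced = LinearRegression().fit(reduced, behavior).score(reduced, behavior)
    print(f"Unique variance of the {name} model: {r2_full - r2_reduced:.3f}")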