
    Finding any Waldo: zero-shot invariant and efficient visual search

    Searching for a target object in a cluttered scene constitutes a fundamental challenge in daily vision. Visual search must be selective enough to discriminate the target from distractors, invariant to changes in the appearance of the target, efficient to avoid exhaustive exploration of the image, and must generalize to locate novel target objects with zero-shot training. Previous work has focused on searching for perfect matches of a target after extensive category-specific training. Here we show for the first time that humans can efficiently and invariantly search for natural objects in complex scenes. To gain insight into the mechanisms that guide visual search, we propose a biologically inspired computational model that can locate targets without exhaustive sampling and generalize to novel objects. The model provides an approximation to the mechanisms integrating bottom-up and top-down signals during search in natural scenes.
    Comment: 6 figures, 1 supplementary figure
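    The abstract does not spell out the architecture, but a minimal sketch of this kind of zero-shot search model can be written down: a pretrained CNN supplies bottom-up scene features, the same network encodes the target image into a top-down template, and their convolution yields an attention map that is sampled with inhibition of return rather than scanned exhaustively. The VGG backbone, the pooling of the target into a 1x1 kernel, and the inhibition-of-return radius below are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch (assumptions noted above): top-down target features
# modulate a bottom-up scene feature map to produce sequential fixations.
import torch
import torch.nn.functional as F
from torchvision import models

backbone = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()

@torch.no_grad()
def attention_map(scene, target):
    """scene, target: (1, 3, H, W) tensors normalized for VGG."""
    scene_feats = backbone(scene)    # (1, C, h, w) bottom-up feature map
    target_feats = backbone(target)  # (1, C, ht, wt) top-down template
    # Pool the target into a 1x1 kernel and convolve it over the scene
    # features: high response = locations resembling the target.
    kernel = F.adaptive_avg_pool2d(target_feats, 1)  # (1, C, 1, 1)
    return F.conv2d(scene_feats, kernel)             # (1, 1, h, w)

@torch.no_grad()
def search(scene, target, max_fixations=10, ior_radius=2):
    """Sequential fixations with inhibition of return (no exhaustive scan)."""
    amap = attention_map(scene, target).squeeze()
    fixations = []
    for _ in range(max_fixations):
        idx = torch.argmax(amap)
        y, x = divmod(idx.item(), amap.shape[1])
        fixations.append((y, x))
        # Suppress the visited neighborhood so the next fixation moves on.
        amap[max(0, y - ior_radius):y + ior_radius + 1,
             max(0, x - ior_radius):x + ior_radius + 1] = float("-inf")
    return fixations
```

    Because nothing in this loop is trained on the target's category, the same procedure applies unchanged to novel targets, which is the zero-shot property the abstract emphasizes.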

    Bottom-up retinotopic organization supports top-down mental imagery

    Finding a path between locations is a routine task in daily life. Mental navigation is often used to plan a route to a destination that is not visible from the current location. We first used functional magnetic resonance imaging (fMRI) and surface-based averaging methods to find high-level brain regions involved in imagined navigation between locations in a building very familiar to each participant. This revealed a mental navigation network that includes the precuneus, retrosplenial cortex (RSC), parahippocampal place area (PPA), occipital place area (OPA), supplementary motor area (SMA), premotor cortex, and areas along the medial and anterior intraparietal sulcus. We then visualized retinotopic maps in the entire cortex using wide-field, natural scene stimuli in a separate set of fMRI experiments. This revealed five distinct visual streams or ‘fingers’ that extend anteriorly into middle temporal, superior parietal, medial parietal, retrosplenial and ventral occipitotemporal cortex. By using spherical morphing to overlap these two data sets, we showed that the mental navigation network primarily occupies areas that also contain retinotopic maps. Specifically, scene-selective regions RSC, PPA and OPA have a common emphasis on the far periphery of the upper visual field. These results suggest that bottom-up retinotopic organization may help to efficiently encode scene and location information in an eye-centered reference frame for top-down, internally generated mental navigation. This study pushes the border of visual cortex further anterior than previously expected.
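    The overlap analysis described here (morphing two group maps onto a common spherical surface and asking where they coincide) can be illustrated, once both maps are expressed as per-vertex masks, with something as simple as a Dice coefficient. The sketch below is a toy stand-in, not the authors' pipeline; the random masks are placeholders for thresholded group maps on a shared surface.

```python
# A toy illustration (not the authors' pipeline) of quantifying how much a
# task-defined network overlaps retinotopic cortex on a common surface.
import numpy as np

def dice_overlap(mask_a, mask_b):
    """Dice coefficient between two boolean vertex masks (0 = disjoint, 1 = identical)."""
    a, b = np.asarray(mask_a, bool), np.asarray(mask_b, bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 0.0

# In practice the masks would come from thresholded group maps morphed
# onto a common spherical surface; here they are random placeholders.
rng = np.random.default_rng(0)
navigation_mask = rng.random(10_000) > 0.9
retinotopy_mask = rng.random(10_000) > 0.8
print(f"Dice overlap: {dice_overlap(navigation_mask, retinotopy_mask):.3f}")
```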

    Adapting Visual Question Answering Models for Enhancing Multimodal Community Q&A Platforms

    Question categorization and expert retrieval methods have been crucial for information organization and accessibility in community question answering (CQA) platforms. Research in this area, however, has dealt with only the text modality. With the increasing multimodal nature of web content, we focus on extending these methods to CQA questions accompanied by images. Specifically, we leverage the success of representation learning for text and images in the visual question answering (VQA) domain, and adapt the underlying concept and architecture for automated category classification and expert retrieval on image-based questions posted on Yahoo! Chiebukuro, the Japanese counterpart of Yahoo! Answers. To the best of our knowledge, this is the first work to tackle the multimodality challenge in CQA and to adapt VQA models for tasks on a more ecologically valid source of visual questions. Our analysis of the differences between visual QA and community QA data drives our proposal of novel augmentations of an attention method tailored for CQA, and the use of auxiliary tasks for learning better grounding features. Our final model markedly outperforms the text-only and VQA model baselines on both tasks of classification and expert retrieval on real-world multimodal CQA data.
    Comment: Submitted for review at CIKM 201
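    The paper's exact architecture is not given in the abstract, but the core adaptation it describes (VQA-style, question-guided attention over image regions feeding a category classifier) can be sketched as follows. The encoder dimensions, the single attention head, and the region count are illustrative assumptions, not the paper's configuration.

```python
# A minimal sketch of a VQA-style fusion model for CQA category
# classification; dimensions and heads are assumptions, see lead-in.
import torch
import torch.nn as nn

class MultimodalCQAClassifier(nn.Module):
    def __init__(self, text_dim=768, img_dim=2048, hidden=512, n_categories=20):
        super().__init__()
        self.q_proj = nn.Linear(text_dim, hidden)
        self.v_proj = nn.Linear(img_dim, hidden)
        self.attn_score = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden * 2, n_categories)

    def forward(self, q_feat, img_regions):
        """q_feat: (B, text_dim) question embedding;
        img_regions: (B, R, img_dim) per-region image features."""
        q = torch.tanh(self.q_proj(q_feat))       # (B, H)
        v = torch.tanh(self.v_proj(img_regions))  # (B, R, H)
        # Question-guided attention over image regions: the VQA-style
        # step being adapted for community QA questions.
        scores = self.attn_score(v * q.unsqueeze(1))  # (B, R, 1)
        weights = torch.softmax(scores, dim=1)
        v_attended = (weights * v).sum(dim=1)         # (B, H)
        return self.classifier(torch.cat([q, v_attended], dim=-1))

# Usage with random stand-in features (4 questions, 36 regions each):
model = MultimodalCQAClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 36, 2048))
print(logits.shape)  # torch.Size([4, 20])
```

    An expert-retrieval head could share the same fused representation, and the auxiliary tasks the abstract mentions would presumably add further losses on top of it.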

    Task-demands can immediately reverse the effects of sensory-driven saliency in complex visual stimuli

    In natural vision both stimulus features and task-demands affect an observer's attention. However, the relationship between sensory-driven (“bottom-up”) and task-dependent (“top-down”) factors remains controversial: Can task-demands counteract strong sensory signals fully, quickly, and irrespective of bottom-up features? To measure attention under naturalistic conditions, we recorded eye-movements in human observers while they viewed photographs of outdoor scenes. In the first experiment, smooth modulations of contrast biased the stimuli's sensory-driven saliency towards one side. In free viewing, observers' eye-positions were immediately biased toward the high-contrast, i.e., high-saliency, side. However, this sensory-driven bias disappeared entirely when observers searched for a bull's-eye target embedded with equal probability on either side of the stimulus. When the target always occurred on the low-contrast side, observers' eye-positions were immediately biased towards this low-saliency side, i.e., the sensory-driven bias reversed. Hence, task-demands not only override sensory-driven saliency but actively countermand it. In a second experiment, a 5-Hz flicker replaced the contrast gradient. Whereas the flicker-induced bias was less persistent in free viewing, its overriding and reversal took longer to deploy. Hence, insufficient sensory-driven saliency cannot account for the bias reversal. In a third experiment, subjects searched for a spot of locally increased contrast (“oddity”) instead of the bull's-eye (“template”). In contrast to the other search conditions, a slight sensory-driven free-viewing bias persisted in this condition. In a fourth experiment, we demonstrate that template targets are detected faster than oddity targets at known locations, suggesting that the former induce a stronger top-down drive when used as search targets. Taken together, task-demands can override sensory-driven saliency in complex visual stimuli almost immediately, and the extent of overriding depends on the search target and the overridden feature, but not on the latter's free-viewing saliency.
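    The central measurement in these experiments is a time course: the fraction of fixations landing on the high-saliency half of the image as a function of fixation index, computed separately for free viewing and for each search task. A toy version of that analysis, with random stand-in data in place of recorded eye positions, might look like this:

```python
# A toy sketch of the bias time course implied by the abstract; the trial
# data are random placeholders, not recorded eye movements.
import numpy as np

def side_bias(fix_x, image_width, high_salience_side="left"):
    """Return True for fixations landing on the high-saliency half."""
    on_left = np.asarray(fix_x) < image_width / 2
    return on_left if high_salience_side == "left" else ~on_left

def bias_timecourse(fixations_by_trial, image_width):
    """fixations_by_trial: list of per-trial x-coordinate arrays in
    fixation order. Returns, per fixation index, the fraction of trials
    whose fixation fell on the high-saliency side."""
    n = max(len(t) for t in fixations_by_trial)
    frac = np.full(n, np.nan)
    for i in range(n):
        hits = [side_bias(t[i:i + 1], image_width)[0]
                for t in fixations_by_trial if len(t) > i]
        frac[i] = np.mean(hits)
    return frac

rng = np.random.default_rng(1)
trials = [rng.uniform(0, 800, size=rng.integers(3, 8)) for _ in range(50)]
print(np.round(bias_timecourse(trials, image_width=800), 2))
```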