
    Chunks hierarchies and retrieval structures: Comments on Saariluoma and Laine

    The empirical results of Saariluoma and Laine (in press) are discussed, and their computer simulations are compared with CHREST, a computational model of perception, memory, and learning in chess. Mathematical functions such as power functions and logarithmic functions account very well both for Saariluoma and Laine's (in press) correlation heuristic and for CHREST. However, these functions fit the human data well only with game positions, not with random positions. As CHREST, which learns using spatial proximity, accounts for the human data as well as Saariluoma and Laine's (in press) correlation heuristic does, their conclusion that frequency-based heuristics match the data better than proximity-based heuristics is questioned. The idea of a flat chunk organisation and its relation to retrieval structures is discussed. In conclusion, emphasis is given to the need for detailed empirical data, including information about chunk structure and types of errors, to discriminate between various learning algorithms.
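
    As a concrete illustration of the curve fitting mentioned above, the sketch below fits a power function y = a*x^b and a logarithmic function y = a + b*ln(x) to invented recall data and compares their residual error; the data values and the use of SciPy are assumptions for illustration, not the authors' actual analysis.

        import numpy as np
        from scipy.optimize import curve_fit

        # Hypothetical data: number of presentations vs. pieces recalled correctly.
        trials = np.array([1, 2, 4, 8, 16], dtype=float)
        recalled = np.array([8.0, 12.0, 15.5, 18.0, 20.0])

        def power_fn(x, a, b):
            return a * np.power(x, b)

        def log_fn(x, a, b):
            return a + b * np.log(x)

        # Fit both function families and compare sum-of-squares error.
        for name, fn in [("power", power_fn), ("logarithmic", log_fn)]:
            params, _ = curve_fit(fn, trials, recalled)
            sse = float(np.sum((recalled - fn(trials, *params)) ** 2))
            print(f"{name}: a={params[0]:.2f}, b={params[1]:.2f}, SSE={sse:.3f}")

    A lower SSE for one family on game positions but not on random positions would mirror the pattern the abstract describes.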

    Visual-Linguistic Semantic Alignment: Fusing Human Gaze and Spoken Narratives for Image Region Annotation

    Advanced image-based application systems such as image retrieval and visual question answering depend heavily on semantic image region annotation. However, improvements in image region annotation are limited because of our inability to understand how humans, the end users, process these images and image regions. In this work, we expand a framework for capturing image region annotations where interpreting an image is influenced by the end user's visual perception skills, conceptual knowledge, and task-oriented goals. Human image understanding is reflected by individuals' visual and linguistic behaviors, but the meaningful computational integration and interpretation of their multimodal representations (e.g. gaze, text) remain a challenge. Our work explores the hypothesis that eye movements can help us understand experts' perceptual processes and that spoken language descriptions can reveal conceptual elements of image inspection tasks. We propose that there exists a meaningful relation between gaze, spoken narratives, and image content. Using unsupervised bitext alignment, we create meaningful mappings between participants' eye movements (which reveal key areas of images) and spoken descriptions of those images. The resulting alignments are then used to annotate image regions with concept labels. Our alignment accuracy exceeds that of baseline alignments obtained using both simultaneous and fixed-delay temporal correspondence. Additionally, a comparison of alignment accuracy between a method that identifies clusters in the images based on eye movements and a method that identifies clusters using image features shows that the two approaches perform well on different types of images and concept labels. This suggests that an image annotation framework could integrate information from more than one technique to handle heterogeneous images. The resulting alignments can be used to create a database of low-level image features and high-level semantic annotations corresponding to perceptually important image regions. We demonstrate the applicability of the proposed framework with two datasets: one consisting of general-domain images and another with images from the domain of medicine. This work is an important contribution toward the highly challenging problem of fusing human-elicited multimodal data sources, a problem that will become increasingly important as low-resource scenarios become more common.
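
    The bitext-alignment idea can be made concrete with a minimal IBM-Model-1-style EM sketch that aligns gaze-cluster tokens to spoken words. The cluster names, words, and toy corpus below are invented; this is a simplified stand-in for the unsupervised aligner the abstract refers to, not its actual implementation.

        from collections import defaultdict

        # Toy parallel "sentences": gaze fixation clusters (source side) and
        # spoken-narrative words (target side) for three image-viewing trials.
        corpus = [
            (["cluster_lung", "cluster_rib"], ["opacity", "lung"]),
            (["cluster_lung"], ["lung"]),
            (["cluster_rib", "cluster_heart"], ["rib", "heart"]),
        ]

        # IBM Model 1: uniform initialisation of t(word | cluster).
        clusters = {c for gaze, _ in corpus for c in gaze}
        words = {w for _, narr in corpus for w in narr}
        t = {(w, c): 1.0 / len(words) for w in words for c in clusters}

        for _ in range(10):                          # EM iterations
            count = defaultdict(float)
            total = defaultdict(float)
            for gaze, narr in corpus:
                for w in narr:
                    z = sum(t[(w, c)] for c in gaze)  # normaliser (E-step)
                    for c in gaze:
                        p = t[(w, c)] / z
                        count[(w, c)] += p
                        total[c] += p
            t = {(w, c): count[(w, c)] / total[c]     # M-step
                 for (w, c) in count}

        # Annotate each fixation cluster with its most probable word label.
        for c in sorted(clusters):
            best = max(words, key=lambda w: t.get((w, c), 0.0))
            print(c, "->", best)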

    Ranking algorithms for implicit feedback

    This report presents novel algorithms that use eye movements as implicit relevance feedback in order to improve search performance. The algorithms are evaluated on the "Transport Rank Five" dataset, which was previously collected in Task 8.3. We demonstrate that a simple linear combination or tensor product of eye-movement and image features can improve retrieval accuracy.
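
    A minimal sketch of the two fusion schemes named above: a linear combination over concatenated features, and a tensor (outer) product capturing pairwise interactions. The feature dimensions are invented, and the weights are random placeholders where the report would learn them from implicit feedback.

        import numpy as np

        rng = np.random.default_rng(0)
        n, d_eye, d_img = 100, 8, 16

        # Hypothetical per-image features (dimensions invented for illustration).
        eye_feats = rng.random((n, d_eye))    # e.g. fixation counts, durations
        img_feats = rng.random((n, d_img))    # e.g. colour/texture descriptors

        # (a) Linear combination: concatenate, then score with a weight vector.
        w_lin = rng.random(d_eye + d_img)     # placeholder for learned weights
        lin_scores = np.concatenate([eye_feats, img_feats], axis=1) @ w_lin

        # (b) Tensor product: all pairwise eye-feature x image-feature terms.
        tensor_feats = np.einsum("ni,nj->nij", eye_feats, img_feats).reshape(n, -1)
        w_tp = rng.random(d_eye * d_img)      # placeholder for learned weights
        tp_scores = tensor_feats @ w_tp

        # Rank images by descending score (higher = predicted more relevant).
        ranking = np.argsort(-tp_scores)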

    A perceptual comparison of empirical and predictive region-of-interest video

    When viewing multimedia presentations, a user attends to only a relatively small part of the video display at any one point in time. By shifting the allocation of bandwidth from peripheral areas to the locations where a user’s gaze is most likely to rest, attentive displays can be produced. Attentive displays aim to reduce resource requirements while minimizing negative user perception, understood in this paper as not only a user’s ability to assimilate and understand information but also his/her subjective satisfaction with the video content. This paper introduces and discusses a perceptual comparison between two region-of-interest display (RoID) adaptation techniques. A RoID is an attentive display in which bandwidth has been preallocated around measured or highly probable areas of user gaze. In this paper, video content was manipulated using two sources of data: empirically measured data (captured using eye-tracking technology) and predictive data (calculated from the physical characteristics of the video). Results show that display adaptation causes significant variation in users’ understanding of specific multimedia content. Interestingly, both RoID adaptation and the type of video being presented affect user perception of video quality. Moreover, the use of frame rates below 15 frames per second, for any video adaptation technique, caused a significant reduction in user-perceived quality, suggesting that users are not only aware of the quality reduction but that it also affects their level of information assimilation and understanding. Results also highlight that users’ level of enjoyment is significantly affected by the type of video, yet is not as affected by the quality or type of video adaptation: an interesting implication for the field of entertainment.
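
    A toy sketch of region-of-interest adaptation, assuming a single gaze point per frame: pixels near the gaze keep full quality while the periphery is blended toward a low-quality version. A real RoID pipeline would steer encoder bit allocation rather than blend pixels; all names and values here are illustrative.

        import numpy as np

        def roi_weight_map(h, w, gaze_xy, sigma=60.0):
            """Gaussian weights: 1.0 at the gaze point, falling off in the periphery."""
            ys, xs = np.mgrid[0:h, 0:w]
            gx, gy = gaze_xy
            d2 = (xs - gx) ** 2 + (ys - gy) ** 2
            return np.exp(-d2 / (2.0 * sigma ** 2))

        def adapt_frame(frame, low_quality_frame, gaze_xy):
            """Blend full-quality pixels near the gaze with a cheap low-quality version."""
            w = roi_weight_map(frame.shape[0], frame.shape[1], gaze_xy)[..., None]
            return w * frame + (1.0 - w) * low_quality_frame

        # Usage with dummy data: a 360x640 frame and a heavily degraded stand-in.
        frame = np.random.rand(360, 640, 3)
        cheap = frame.mean(axis=(0, 1), keepdims=True) * np.ones_like(frame)
        out = adapt_frame(frame, cheap, gaze_xy=(320, 180))

    The gaze point would come either from an eye tracker (empirical RoID) or from a saliency prediction computed from the video content (predictive RoID).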

    Attention mechanisms in the CHREST cognitive architecture

    In this paper, we describe the attention mechanisms in CHREST, a computational architecture of human visual expertise. CHREST organises information acquired through direct experience of the world in the form of chunks. These chunks are searched for and verified by a unique set of heuristics that together comprise the attention mechanism. We explain how the attention mechanism combines bottom-up and top-down heuristics drawing on internal and external sources of information. We describe experimental evidence demonstrating the correspondence between CHREST’s perceptual mechanisms and those of human subjects. Finally, we discuss how visual attention can play an important role in the actions carried out by human experts in domains such as chess.
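
    The combination of bottom-up and top-down heuristics can be sketched as a simple scoring rule over candidate fixation locations. This is not CHREST’s actual mechanism; the salience functions, weights, and board encoding below are invented for illustration.

        import random

        def salience_bottom_up(square, board):
            # Bottom-up: e.g. prefer squares occupied by pieces.
            return 1.0 if board.get(square) else 0.1

        def salience_top_down(square, expected_chunks):
            # Top-down: prefer squares that would verify a hypothesised chunk.
            return 2.0 if any(square in chunk for chunk in expected_chunks) else 0.0

        def next_fixation(board, expected_chunks, squares):
            scores = {s: salience_bottom_up(s, board) +
                         salience_top_down(s, expected_chunks)
                      for s in squares}
            top = max(scores.values())
            return random.choice([s for s, v in scores.items() if v == top])

        board = {"e4": "P", "d5": "p"}       # toy position
        chunks = [{"e4", "d5"}]              # a chunk awaiting verification
        print(next_fixation(board, chunks, ["e4", "d5", "a1", "h8"]))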

    The CHREST architecture of cognition : the role of perception in general intelligence

    This paper argues that the CHREST architecture of cognition can shed important light on developing artificial general intelligence. The key theme is that "cognition is perception." A description of the main components and mechanisms of the architecture is followed by a discussion of several domains where CHREST has already been successfully applied, such as the psychology of expert behaviour, the acquisition of language by children, and the learning of multiple representations in physics. The characteristics of CHREST that enable it to account for empirical data include self-organisation, an emphasis on cognitive limitations, the presence of a perception-learning cycle, and the use of naturalistic data as input for learning. We argue that some of these characteristics can help shed light on the hard questions facing theorists developing artificial general intelligence, such as intuition, the acquisition and use of concepts, and the role of embodiment.
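
    Chunk learning through a discrimination network is central to CHREST’s "cognition is perception" theme and can be sketched as follows. This toy version performs only discrimination (growing one new test per presentation) and omits familiarisation; the chess-pattern tokens are illustrative, not CHREST’s actual encoding.

        class Node:
            def __init__(self, image=()):
                self.image = tuple(image)   # the chunk stored at this node
                self.children = {}          # test (next element) -> child node

            def sort(self, pattern):
                """Follow matching tests down the network as far as possible."""
                node, i = self, 0
                while i < len(pattern) and pattern[i] in node.children:
                    node = node.children[pattern[i]]
                    i += 1
                return node, i

            def learn(self, pattern):
                node, i = self.sort(pattern)
                if i < len(pattern):        # discrimination: grow a new branch
                    node.children[pattern[i]] = Node(pattern[: i + 1])

        root = Node()
        for _ in range(3):
            root.learn(("Pe4", "Pd5", "Nf3"))   # repeated study grows the chunk
        print(root.sort(("Pe4", "Pd5", "Nf3"))[0].image)

    Each presentation extends the stored chunk by one element, a simple analogue of the incremental, experience-driven learning the abstract describes.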

    Prediction of Search Targets From Fixations in Open-World Settings

    Previous work on predicting the target of visual search from human fixations has only considered closed-world settings, in which training labels are available and predictions are performed for a known set of potential targets. In this work, we go beyond the state of the art by studying search target prediction in an open-world setting, in which we no longer assume that fixation data are available to train for the search targets. We present a dataset containing fixation data of 18 users searching for natural images from three image categories within synthesised image collages of about 80 images. In a closed-world baseline experiment, we show that we can predict the correct target image out of a candidate set of five images. We then present a new problem formulation for search target prediction in the open-world setting that is based on learning compatibilities between fixations and potential targets.
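
    One plausible reading of "learning compatibilities between fixations and potential targets" is a bilinear scoring function f(x, t) = x @ W @ t trained with a margin ranking loss, which can then score targets unseen at training time. The sketch below uses synthetic data and is an assumption about the general setup, not the paper's exact formulation.

        import numpy as np

        rng = np.random.default_rng(0)
        dx, dt = 6, 5                         # fixation / target feature dims
        W = np.zeros((dx, dt))                # compatibility matrix to learn
        true_W = rng.normal(size=(dx, dt))    # synthetic "ground truth"

        for _ in range(2000):
            x = rng.normal(size=dx)           # aggregated fixation features
            t_pos, t_neg = rng.normal(size=dt), rng.normal(size=dt)
            if x @ true_W @ t_pos < x @ true_W @ t_neg:
                t_pos, t_neg = t_neg, t_pos   # label pair by synthetic truth
            margin = 1.0 - x @ W @ (t_pos - t_neg)
            if margin > 0:                    # hinge-loss SGD step
                W += 0.01 * np.outer(x, t_pos - t_neg)

        # Open-world test: rank previously unseen candidate targets.
        x = rng.normal(size=dx)
        candidates = rng.normal(size=(5, dt))
        print(np.argsort(-(candidates @ W.T @ x)))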

    Culture shapes how we look at faces

    Background: Face processing, amongst many basic visual skills, is thought to be invariant across all humans. From as early as 1965, studies of eye movements have consistently revealed a systematic triangular sequence of fixations over the eyes and the mouth, suggesting that faces elicit a universal, biologically determined information extraction pattern. Methodology/Principal Findings: Here we monitored the eye movements of Western Caucasian and East Asian observers while they learned, recognized, and categorized by race Western Caucasian and East Asian faces. Western Caucasian observers reproduced a scattered triangular pattern of fixations for faces of both races and across tasks. Contrary to intuition, East Asian observers focused more on the central region of the face. Conclusions/Significance: These results demonstrate that face processing can no longer be considered as arising from a universal series of perceptual events. The strategy employed to extract visual information from faces differs across cultures.
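
    The kind of fixation-distribution analysis implied by these findings can be sketched by classifying fixations into coarse face regions and comparing the resulting distributions across observer groups. The region boxes and fixation coordinates below are invented, not the study's data.

        # Region boxes as (x0, y0, x1, y1) in normalised face coordinates.
        regions = {
            "left_eye":  (0.20, 0.25, 0.40, 0.40),
            "right_eye": (0.60, 0.25, 0.80, 0.40),
            "centre":    (0.40, 0.40, 0.60, 0.60),
            "mouth":     (0.35, 0.70, 0.65, 0.85),
        }

        def region_of(fx, fy):
            for name, (x0, y0, x1, y1) in regions.items():
                if x0 <= fx <= x1 and y0 <= fy <= y1:
                    return name
            return "other"

        def distribution(fixations):
            counts = {name: 0 for name in list(regions) + ["other"]}
            for fx, fy in fixations:
                counts[region_of(fx, fy)] += 1
            total = max(sum(counts.values()), 1)
            return {k: v / total for k, v in counts.items()}

        wc_fix = [(0.3, 0.3), (0.7, 0.32), (0.5, 0.78)]    # eyes-mouth triangle
        ea_fix = [(0.5, 0.5), (0.48, 0.45), (0.52, 0.55)]  # central bias
        print(distribution(wc_fix))
        print(distribution(ea_fix))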