2,034 research outputs found

    A network model of referent identification by toddlers in a visual world task

    We present a neural network model of referent identification in a visual world task. Inputs are visual representations of item pairs, presented together with unfolding sequences of phonemes identifying the target item. The model is trained to output the semantic representation of the target and to suppress the distractor. The training set uses a 200-word lexicon typically known by toddlers. The phonological, visual, and semantic representations are derived from real corpora. Successful performance requires correct association between labels and visual and semantic representations, as well as correct location identification. The model reproduces experimental evidence that phonological, perceptual, and categorical relationships modulate item preferences. The model provides an account of how language can drive visual attention in the inter-modal preferential looking task.
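    The abstract gives no implementation details, but the mapping it describes (a visual item pair plus an unfolding phoneme sequence in, target semantics and target location out) can be illustrated with a minimal PyTorch-style sketch. All layer choices, names, and dimensions below are assumptions for illustration, not the authors' architecture.

        # Hypothetical sketch, not the paper's model: phoneme sequence + two visual
        # vectors in, target semantics + target location out.
        import torch
        import torch.nn as nn

        class ReferentIdentifier(nn.Module):
            def __init__(self, n_phonemes=40, vis_dim=100, sem_dim=200, hid_dim=128):
                super().__init__()
                self.phon_rnn = nn.GRU(n_phonemes, hid_dim, batch_first=True)  # unfolding phonemes
                self.fuse = nn.Linear(hid_dim + 2 * vis_dim, hid_dim)           # combine with both items
                self.sem_out = nn.Linear(hid_dim, sem_dim)                      # semantics of the target
                self.loc_out = nn.Linear(hid_dim, 2)                            # which location holds the target

            def forward(self, phonemes, vis_left, vis_right):
                _, h = self.phon_rnn(phonemes)      # phonemes: (batch, time, n_phonemes)
                fused = torch.tanh(self.fuse(torch.cat([h[-1], vis_left, vis_right], dim=-1)))
                return self.sem_out(fused), self.loc_out(fused)

        # Training would push sem_out toward the target's semantic vector (and away from
        # the distractor's) while loc_out learns the target's location.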

    Semantic Visual Localization

    Robust visual localization under a wide range of viewing conditions is a fundamental problem in computer vision. Handling the difficult cases of this problem is not only very challenging but also of high practical relevance, e.g., in the context of life-long localization for augmented reality or autonomous robots. In this paper, we propose a novel approach based on a joint 3D geometric and semantic understanding of the world, enabling it to succeed under conditions where previous approaches failed. Our method leverages a novel generative model for descriptor learning, trained on semantic scene completion as an auxiliary task. The resulting 3D descriptors are robust to missing observations by encoding high-level 3D geometric and semantic information. Experiments on several challenging large-scale localization datasets demonstrate reliable localization under extreme viewpoint, illumination, and geometry changes.
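    The paper's exact network is not reproduced here; the following is only a rough sketch of the general idea of learning a 3D descriptor with semantic scene completion as an auxiliary task. The layer sizes, class count, and voxel resolution are invented for illustration, and PyTorch is assumed.

        # Illustrative sketch only: an encoder maps a possibly incomplete semantic
        # voxel grid to a compact descriptor, and an auxiliary decoder is trained
        # to complete the scene from that descriptor.
        import torch.nn as nn

        class DescriptorEncoder(nn.Module):
            def __init__(self, n_classes=12, desc_dim=64):
                super().__init__()
                self.conv = nn.Sequential(
                    nn.Conv3d(n_classes, 16, 4, stride=2, padding=1), nn.ReLU(),
                    nn.Conv3d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
                )
                self.fc = nn.Linear(32 * 8 * 8 * 8, desc_dim)  # assumes 32^3 input volumes

            def forward(self, vox):                        # vox: (B, n_classes, 32, 32, 32)
                return self.fc(self.conv(vox).flatten(1))  # descriptor used for matching

        class CompletionDecoder(nn.Module):
            """Auxiliary head: reconstruct the full semantic volume from the descriptor,
            encouraging it to encode geometry and semantics and to tolerate missing data."""
            def __init__(self, n_classes=12, desc_dim=64):
                super().__init__()
                self.fc = nn.Linear(desc_dim, 32 * 8 * 8 * 8)
                self.deconv = nn.Sequential(
                    nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                    nn.ConvTranspose3d(16, n_classes, 4, stride=2, padding=1),
                )

            def forward(self, desc):
                return self.deconv(self.fc(desc).view(-1, 32, 8, 8, 8))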

    Interactions between view changes and shape changes in picture-picture matching

    Four studies presented pictures of different morphs of novel, complex, three-dimensional objects, similar to objects that we must identify in the real world. We investigated how viewpoint changes influence our ability to discriminate between morphs. View changes had a powerful effect on performance in picture-picture matching tasks when similarly shaped morphs had to be discriminated. Shape changes were detected faster and more accurately when morphs were depicted from the same rather than different views. In contrast, view change had no effect when dissimilarly shaped morphs had to be discriminated. This interaction between the effects of view change and shape change was found for both simultaneous stimulus presentation and sequential presentation with interstimulus intervals of up to 3600 ms. The interaction was found following repeated presentations of the stimuli prior to the matching task and following practice at the matching task, as well as after no such pre-exposure to the stimuli or to the task. The results demonstrate the importance of view changes relative to other task manipulations in modulating the shape discrimination abilities of the human visual recognition system.

    Modeling Visual Rhetoric and Semantics in Multimedia

    Recent advances in machine learning have enabled computer vision algorithms to model complicated visual phenomena with accuracies unthinkable a mere decade ago. Their high performance on a plethora of vision-related tasks has enabled computer vision researchers to begin to move beyond traditional visual recognition problems to tasks requiring higher-level image understanding. However, most computer vision research still focuses on describing what images, text, or other media literally portray. In contrast, in this dissertation we focus on learning how and why such content is portrayed. Rather than viewing media for its content, we recast the problem as understanding visual communication and visual rhetoric. For example, the same content may be portrayed in different ways in order to present the story the author wishes to convey. We thus seek to model not only the content of the media, but also its authorial intent and latent messaging. Understanding how and why visual content is portrayed a certain way requires understanding higher-level abstract semantic concepts which are themselves latent within visual media. By latent, we mean the concept is not readily visually accessible within a single image (e.g., right vs. left political bias), in contrast to explicit visual semantic concepts such as objects. Specifically, we study the problems of modeling photographic style (how professional photographers portray their subjects), understanding visual persuasion in image advertisements, modeling political bias in multimedia (image and text) news articles, and learning cross-modal semantic representations. While most past research in vision and natural language processing studies the case where visual content and paired text are highly aligned (as in the case of image captions), we target the case where each modality conveys complementary information to tell a larger story. We particularly focus on the problem of learning cross-modal representations from multimedia exhibiting weak alignment between the image and text modalities. A variety of techniques are presented which improve modeling of multimedia rhetoric in real-world data and enable more robust artificially intelligent systems.
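    The dissertation's own models are not described in enough detail to reproduce here. As a generic illustration of cross-modal representation learning for weakly aligned image-text pairs, a symmetric contrastive objective of the following kind is commonly used; the function below is an assumption-laden sketch (PyTorch assumed), not the author's method.

        # Generic InfoNCE-style objective for a shared image-text embedding space.
        # Weakly aligned pairs are assumed to sit on the diagonal of the batch.
        import torch
        import torch.nn.functional as F

        def cross_modal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
            """img_emb, txt_emb: (B, D) embeddings of loosely paired images and texts."""
            img_emb = F.normalize(img_emb, dim=-1)
            txt_emb = F.normalize(txt_emb, dim=-1)
            logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
            targets = torch.arange(img_emb.size(0), device=img_emb.device)
            # Each image should score highest with its own (weakly aligned) text, and vice versa.
            return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))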

    Evolution of A Common Vector Space Approach to Multi-Modal Problems

    A set of methods to address computer vision problems has been developed. Video understanding has been an active area of research in recent years. If one can accurately identify salient objects in a video sequence, these components can be used in information retrieval and scene analysis. This research started with the development of a coarse-to-fine framework to extract salient objects in video sequences. Previous work on image and video-frame background modeling involved methods that ranged from simple and efficient to accurate but computationally complex. It is shown in this research that the novel approach to object extraction is efficient and effective, outperforming existing state-of-the-art methods. However, the drawback of this method is its inability to deal with non-rigid motion. With the rapid development of artificial neural networks, deep learning approaches are explored as a solution to computer vision problems in general. Focusing on image and text, image (or video frame) understanding can be achieved using a Common Vector Space (CVS). With this concept, modality generation and other relevant applications, such as automatic image description and text paraphrasing, can be explored. Specifically, video sequences can be modeled by Recurrent Neural Networks (RNNs); greater RNN depth leads to smaller error, but it also makes the gradients in the network unstable during training. To overcome this problem, a Batch-Normalized Recurrent Highway Network (BNRHN) was developed and tested on the image captioning (image-to-text) task. In the BNRHN, the highway layers incorporate batch normalization, which diminishes the vanishing and exploding gradient problems. In addition, a sentence-to-vector encoding framework suitable for advanced natural language processing is developed. This semantic text embedding makes use of an encoder-decoder model trained on sentence paraphrase pairs (text-to-text). With this scheme, the latent representation of the text is shown to encode sentences with common semantic information using similar vector representations. In addition to image-to-text and text-to-text, an image generation model is developed to generate an image from text (text-to-image) or from another image (image-to-image) based on the semantics of the content. The developed model, referred to as the Multi-Modal Vector Representation (MMVR), encodes different modalities into a common vector space, preserving semantics and making conversion between text and image bidirectional. The concept of the CVS is introduced in this research to deal with multi-modal conversion problems. In theory, this method works not only on text and image but can also be generalized to other modalities, such as video and audio. The characteristics and performance are supported by both theoretical analysis and experimental results. Interestingly, the MMVR model is only one of many possible ways to build a CVS. In the final stages of this research, a simple and straightforward framework to build a CVS, which is considered an alternative to the MMVR model, is presented.
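    The BNRHN itself is not specified in the abstract; the snippet below only sketches the basic ingredient it names, a highway transform combined with batch normalization, whose ungated carry path lets deep stacks pass gradients through unchanged. Where the normalization sits and how the gates are defined are assumptions here, and PyTorch is assumed.

        # Sketch of a batch-normalized highway layer, the kind of building block a
        # Batch-Normalized Recurrent Highway Network stacks inside its recurrent cell.
        # Details are illustrative, not the thesis implementation.
        import torch
        import torch.nn as nn

        class BNHighwayLayer(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.h = nn.Linear(dim, dim)      # candidate transform H(x)
                self.t = nn.Linear(dim, dim)      # transform gate T(x)
                self.bn = nn.BatchNorm1d(dim)     # normalization stabilizes activations and gradients

            def forward(self, x):
                h = torch.tanh(self.bn(self.h(x)))
                t = torch.sigmoid(self.t(x))
                # Gated mix of the transformed input and the untouched input; the carry
                # path (1 - t) * x is what keeps gradients from vanishing across layers.
                return t * h + (1.0 - t) * x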

    Types of interference and their resolution in monolingual language production

    There is accumulating evidence that speakers recruit inhibitory control to manage the conflicting demands of online language production, e.g., when selecting from among co-activated representations during object naming or when suppressing alternative competing terms in referential language use. However, little is known about the types of conflict resolution mechanisms underlying the production processes. The aim of this research was to assess the relative contribution of various forms of interference arising at different stages of information processing, as well as their control, to single- and multi-word utterance production. A systematic review of picture-word interference (PWI) studies (Study 1) was conducted to trace the origins of semantic context effects in order to address the question of whether spoken word production can be seen as a competitive process. The various manipulations of PWI task parameters in the reviewed studies produced a mixture of findings that were either contradictory, unable to discriminate between the rival theories of lexical access, or of questionable validity. Critically, manipulations of distractor format and of whole-part relations with varied association strength produced sufficiently strong evidence to discount post-lexical non-competitive accounts as the dominant explanations for observed interference effects, constraining their locus to early rather than late processing stages. The viability of competitive hypotheses was upheld; however, this is contingent on the relative contribution of pre-lexical processes, which remains to be confirmed by future research. The relative contribution of different conflict resolution mechanisms (measured by the anti-saccade, arrow flanker, and Simon arrow tasks) to object naming under prepotent competition (the PWI task) and underdetermined competition (a picture naming task with a name agreement, NA, manipulation) was further investigated in Study 2, while Study 3 extended the notion of separability of the inhibitory processes to grammatical encoding (grammatical voice construction and number agreement computation). In Study 2, only the flanker effect was a significant predictor of the PWI effect, but not of the NA effect, while the remaining inhibitory measures made no significant contribution to either the PWI or NA effect. Participants with smaller flanker effects, indicative of better resolution of representational conflict, were faster to name objects in the face of competing stimuli. In Study 3, only utterance repairs were reliably predicted by the flanker and anti-saccade effects. Those who resolved representational conflict or inhibited incorrect eye saccades more efficiently were found to self-correct less often during online passive voice construction than those with poorer resolution of inhibition at the representational and motor output levels. No association was found between the various inhibitory measures and subject-verb agreement computation. The negative priming study with novel associations (Study 4) attempted to establish a causal link between inhibition and object naming, and specifically whether inhibition that is ostensibly applied to irrelevant representations spreads to their associatively related nodes. Response times to associated probe targets that had served as distractors in previous prime trials were no different from response times to non-associated probe targets. Possible explanations are discussed for the lack of an associative negative priming effect.
    The studies described here implicate two types of interference resolution abilities as potential sources of variability in online production skills, with the underlying assumption that better resolution of conflict at the representational and motor output levels translates to faster naming and more fluent speech. There is insufficient evidence to determine whether the representational conflict is lexical or conceptual in nature, or indeed whether it is inhibitory in the strict sense. It also remains to be established whether interference that likely ensues at the response output stage is due to some criterion-checking process (self-monitoring), recruitment of an inhibitory mechanism (response blocking), or both.

    A graph theory-based online keywords model for image semantic extraction

    Image captions and keywords are semantic descriptions of the dominant visual content features in a targeted visual scene. Traditional image keyword extraction processes involve intensive data- and knowledge-level operations using computer vision and machine learning techniques. However, recent studies have shown that the gap between pixel-level processing and the semantic definition of an image is difficult to bridge by relying only on visual features. In this paper, augmented image semantic information is introduced by harnessing the functions of online image search engines. A graphical model named the “Head-words Relationship Network” (HWRN) has been devised to tackle the aforementioned problems. The proposed algorithm starts by retrieving online images whose visual features are similar to those of the input image; the text content of their hosting webpages is then extracted, classified, and analysed for semantic clues. The relationships of those “head-words” from relevant webpages can then be modelled and quantified using linguistic tools. Experiments on the prototype system have demonstrated the effectiveness of this novel approach. Performance evaluation against benchmark state-of-the-art approaches has also shown satisfactory results and promising future applications.
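    The paper's implementation is not given in the abstract; the sketch below only illustrates the head-word graph idea in outline. The retrieval and head-word extraction steps are stubbed, and the function names and the use of networkx are assumptions rather than the authors' code.

        # Outline of the pipeline described above: pages found via visually similar
        # online images -> head-words -> co-occurrence graph -> ranked keywords.
        from itertools import combinations
        import networkx as nx

        def build_headword_graph(webpage_texts, extract_headwords):
            """webpage_texts: texts of pages hosting visually similar images.
            extract_headwords: callable returning the head-words of one page (stubbed here)."""
            g = nx.Graph()
            for text in webpage_texts:
                heads = set(extract_headwords(text))
                for a, b in combinations(sorted(heads), 2):
                    # Edge weight counts how often two head-words co-occur across pages.
                    w = g.get_edge_data(a, b, default={"weight": 0})["weight"]
                    g.add_edge(a, b, weight=w + 1)
            return g

        def top_keywords(g, k=5):
            # Rank candidate keywords by weighted degree in the relationship graph.
            scores = {n: sum(d["weight"] for _, _, d in g.edges(n, data=True)) for n in g}
            return sorted(scores, key=scores.get, reverse=True)[:k]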