
    Computing Thresholds of Linguistic Saliency

PACLIC 21 / Seoul National University, Seoul, Korea / November 1-3, 2007

    What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?

In neural image captioning systems, a recurrent neural network (RNN) is typically viewed as the primary 'generation' component. This view suggests that the image features should be 'injected' into the RNN. This is in fact the dominant view in the literature. Alternatively, the RNN can instead be viewed as only encoding the previously generated words. This view suggests that the RNN should only be used to encode linguistic features and that only the final representation should be 'merged' with the image features at a later stage. This paper compares these two architectures. We find that, in general, late merging outperforms injection, suggesting that RNNs are better viewed as encoders, rather than generators.
Comment: Appears in: Proceedings of the 10th International Conference on Natural Language Generation (INLG'17)
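The inject/merge contrast is easy to see in code. Below is a minimal PyTorch sketch of the two architectures (not the authors' implementation; layer names and sizes are illustrative assumptions): in the inject variant the image conditions the RNN's state so the RNN acts as the generator, while in the merge variant the RNN encodes only the word prefix and the image is combined with its output afterwards.

```python
# Minimal sketch contrasting 'inject' and 'merge' captioners.
# All sizes and names are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn

VOCAB, EMB, HID, IMG = 10000, 256, 512, 2048

class InjectCaptioner(nn.Module):
    """'Inject': image features initialize the RNN state, so the
    RNN itself is conditioned on the image (RNN as generator)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.img_to_h0 = nn.Linear(IMG, HID)   # image conditions the RNN
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, img_feats, word_ids):
        h0 = torch.tanh(self.img_to_h0(img_feats)).unsqueeze(0)
        hs, _ = self.rnn(self.embed(word_ids), h0)
        return self.out(hs)                    # next-word logits

class MergeCaptioner(nn.Module):
    """'Merge': the RNN only encodes the word prefix; image features
    are combined with that encoding at a later stage (RNN as encoder)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.img_proj = nn.Linear(IMG, HID)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, img_feats, word_ids):
        hs, _ = self.rnn(self.embed(word_ids))       # language only
        img = self.img_proj(img_feats).unsqueeze(1)  # merge late
        return self.out(hs + img.expand_as(hs))
```

In both sketches the logits at step t predict word t+1; the paper's finding is that the merge arrangement, where the RNN is a pure prefix encoder, tends to perform better.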

    Abstractive Multi-Document Summarization based on Semantic Link Network

The key to realizing advanced document summarization is semantic representation of documents. This paper investigates the role of the Semantic Link Network in representing and understanding documents for multi-document summarization. It proposes a novel abstractive multi-document summarization framework that first transforms documents into a Semantic Link Network of concepts and events, and then transforms the Semantic Link Network into a summary of the documents based on the selection of important concepts and events while preserving semantic coherence. Experiments on benchmark datasets show that the proposed summarization approach significantly outperforms relevant state-of-the-art baselines and that the Semantic Link Network plays an important role in representing and understanding documents.
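As a rough illustration of the representation-and-selection step only (not the paper's algorithm, which is abstractive), the sketch below builds a small graph of concepts and events with networkx and uses PageRank centrality as a stand-in for the paper's importance-based selection:

```python
# Toy sketch: nodes are concepts/events, weighted links approximate
# semantic relations, and centrality stands in for importance selection.
# All names and data are illustrative assumptions.
import networkx as nx

def summarize(doc_units, links, budget=3):
    """doc_units: {node_id: text}; links: [(u, v, weight)] semantic links."""
    g = nx.Graph()
    g.add_weighted_edges_from(links)
    scores = nx.pagerank(g, weight="weight")  # importance of concepts/events
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [doc_units[n] for n in ranked[:budget] if n in doc_units]

units = {"e1": "Company X acquired startup Y.",
         "e2": "The deal closed in March.",
         "c1": "acquisition"}
print(summarize(units, [("e1", "c1", 2.0), ("e2", "c1", 1.0), ("e1", "e2", 1.0)]))
```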

    Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition

This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system's robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their faces. Furthermore, the method does not rely on external annotations, thus complying with cognitive development. Instead, the method uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method using a large multi-person face-to-face interaction dataset. The results show good performance in a speaker-dependent setting; in a speaker-independent setting, however, the proposed method yields significantly lower performance. We believe that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions.
Comment: 10 pages, IEEE Transactions on Cognitive and Developmental Systems
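The self-supervision loop can be sketched as follows (an assumed toy setup, not the paper's model): a simple energy-based voice-activity detector on the audio produces noisy per-frame labels, which then supervise a visual classifier over face crops.

```python
# Sketch of the self-supervision idea only: audio energy provides weak
# frame labels for a visual model. The network and thresholds are assumptions.
import numpy as np
import torch
import torch.nn as nn

def audio_pseudo_labels(wave, sr, hop=0.04, thresh=0.02):
    """Label each 40 ms audio frame 'speaking' when its RMS energy
    exceeds a threshold; these labels are noisy by design."""
    n = int(sr * hop)
    frames = wave[: len(wave) // n * n].reshape(-1, n)
    return (np.sqrt((frames ** 2).mean(axis=1)) > thresh).astype(np.float32)

visual_net = nn.Sequential(          # stand-in for a face-crop CNN
    nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(visual_net.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(face_frames, wave, sr):
    """face_frames: (T, 64, 64) tensor time-aligned with the audio frames."""
    y = torch.from_numpy(audio_pseudo_labels(wave.numpy(), sr))
    T = min(len(face_frames), len(y))
    logits = visual_net(face_frames[:T]).squeeze(1)
    loss = loss_fn(logits, y[:T])
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```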

    The information gathering framework - a cognitive model of regressive eye movements during reading

In this article we present a new eye movement control framework that describes the interaction between fixation durations and regressive saccades during reading: the Information Gathering Framework (IGF). Based on the FC model proposed by Bicknell and Levy (2010), the basic idea of the IGF is that a confidence level is computed for each word and monitored against three independent thresholds. These thresholds shape eye movement behavior by increasing fixation duration, triggering a regression, or guiding regression target selection. In this way, the IGF not only accounts for regressive eye movements but also provides a framework able to model eye movement control during reading across different scenarios. Importantly, the IGF assumes that two different types of regressive eye movements exist, which differ with regard to their triggers (integration difficulties vs. missing evidence) as well as their time course. We tested the predictions of the IGF by re-analyzing an experiment by Weiss et al. (2018) and found, inter alia, clear evidence for shorter fixation durations before regressive saccades relative to progressive saccades, with the exception of the last region. This clearly supports the assumptions of the IGF. In addition, we found evidence that a window of about 15–20 characters to the left of the current fixation plays an important role in target selection, probably indicating the perceptual span during a regressive saccade.
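The three-threshold monitoring idea can be illustrated with a toy decision function (threshold names and values are assumptions, not the IGF's fitted parameters; dictionary keys stand in for word onset positions in characters):

```python
# Illustrative decision logic only, not the IGF's actual computational model.
def igf_step(confidence, t_move=0.7, t_regress=0.4, window=20):
    """Map the current word's confidence to an eye-movement action.

    confidence: {onset_position_in_chars: confidence in (0, 1)} for words read so far
    window: leftward span (in characters) considered for regression targets
    """
    pos = max(confidence)                    # current (rightmost) word
    c = confidence[pos]
    if c >= t_move:
        return ("move_forward", None)
    if c >= t_regress:
        return ("extend_fixation", None)     # low confidence: keep fixating
    # very low confidence: regress to the weakest word within the window
    candidates = {p: v for p, v in confidence.items()
                  if p < pos and pos - p <= window}
    if not candidates:
        return ("extend_fixation", None)
    return ("regress_to", min(candidates, key=candidates.get))

print(igf_step({3: 0.9, 30: 0.35, 45: 0.2}))   # -> ('regress_to', 30)
```

Note how the window constraint excludes the word at position 3 as a regression target, mirroring the 15–20 character span reported above.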

    Quantifying aesthetics of visual design applied to automatic design

In today's Instagram world, with advances in ubiquitous computing and access to social networks, digital media has been adopted by art and culture. In this dissertation, we study what makes a good design by investigating mechanisms that move the aesthetics of design from the realm of subjectivity toward objectivity. These mechanisms combine three main approaches: learning theories and principles of design by collaborating with professional designers, mathematically and statistically modeling good designs from large-scale datasets, and crowdsourcing to model the perceived aesthetics of designs from general-public responses. We then apply the knowledge gained in automatic design creation tools that help non-designers with self-publishing, and designers with inspiration and creativity. Arguably, unlike visual arts, where the main goals may be abstract, visual design is conceptualized and created to convey a message and communicate with audiences. Therefore, we develop a semantic design mining framework to automatically link the design elements, layout, color, typography, and photos to linguistic concepts. The inferred semantics are applied in a design expert system that leverages user interactions to create personalized designs via recommendation algorithms based on the user's preferences.
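As a toy illustration of the "statistically modeling good designs from large-scale datasets" ingredient (the features, data, and linear model below are all assumptions, not the dissertation's models), one can regress crowd aesthetic ratings onto simple design features:

```python
# Hypothetical example: predict a crowd aesthetic rating from design features.
import numpy as np
from sklearn.linear_model import Ridge

# Assumed features per design: [white_space_ratio, color_harmony, font_count]
X = np.array([[0.40, 0.8, 2], [0.10, 0.3, 5], [0.35, 0.7, 3], [0.05, 0.2, 6]])
y = np.array([4.5, 2.0, 4.0, 1.5])        # mean crowd rating on a 1-5 scale

model = Ridge(alpha=1.0).fit(X, y)
print(model.predict([[0.30, 0.6, 3]]))    # score a new candidate design
```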

    Weakly-supervised Visual Grounding of Phrases with Linguistic Structures

We propose a weakly-supervised approach that takes image-sentence pairs as input and learns to visually ground (i.e., localize) arbitrary linguistic phrases in the form of spatial attention masks. Specifically, the model is trained with images and their associated image-level captions, without any explicit region-to-phrase correspondence annotations. To this end, we introduce an end-to-end model which learns visual groundings of phrases with two types of carefully designed loss functions. In addition to the standard discriminative loss, which enforces that attended image regions and phrases are consistently encoded, we propose a novel structural loss which makes use of the parse tree structures induced by the sentences. In particular, we ensure complementarity among the attention masks that correspond to sibling noun phrases, and compositionality of attention masks among the children and parent phrases, as defined by the sentence parse tree. We validate the effectiveness of our approach on the Microsoft COCO and Visual Genome datasets.
Comment: CVPR 2017
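The two structural terms can be sketched directly (the forms and names below are assumptions based on the abstract, not the paper's exact losses): sibling masks are penalized for overlapping, and a parent mask is pulled toward the union of its children's masks.

```python
# Sketch of the two structural loss terms described in the abstract.
import torch

def sibling_complementarity(masks):
    """Penalize overlap among attention masks of sibling noun phrases.
    masks: (K, H, W), each value in [0, 1]."""
    overlap = torch.zeros(())
    K = masks.shape[0]
    for i in range(K):
        for j in range(i + 1, K):
            overlap = overlap + (masks[i] * masks[j]).mean()
    return overlap

def parent_child_compositionality(parent, children):
    """Encourage the parent phrase's mask to match the union of its
    children's masks (union approximated by an element-wise max)."""
    union = children.max(dim=0).values        # (H, W)
    return ((parent - union) ** 2).mean()

sibs = torch.rand(3, 14, 14)                  # masks for three sibling NPs
parent = torch.rand(14, 14)                   # mask for their parent phrase
loss = sibling_complementarity(sibs) + parent_child_compositionality(parent, sibs)
```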