    Clue: Cross-modal Coherence Modeling for Caption Generation

    We use coherence relations inspired by computational models of discourse to study the information needs and goals of image captioning. Using an annotation protocol specifically devised for capturing image--caption coherence relations, we annotate 10,000 instances from publicly-available image--caption pairs. We introduce a new task for learning inferences in imagery and text, coherence relation prediction, and show that these coherence annotations can be exploited to learn relation classifiers as an intermediary step, and also to train coherence-aware, controllable image captioning models. The results show a dramatic improvement in the consistency and quality of the generated captions with respect to information needs specified via coherence relations.
    Comment: Accepted as a long paper to ACL 202
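    The relation-classifier step described above can be sketched as a simple multimodal classifier: concatenate an image feature vector and a caption feature vector, then score each coherence relation label. The label names, feature dimensions, and linear weights below are illustrative assumptions, not the paper's trained model.

```python
import numpy as np

# Hypothetical coherence-relation labels, in the spirit of the
# paper's annotation scheme (illustrative, not the exact set).
RELATIONS = ["Visible", "Subjective", "Action", "Story", "Meta"]

rng = np.random.default_rng(0)
DIM = 8  # toy feature size; real encoders would be far larger
W = rng.normal(size=(2 * DIM, len(RELATIONS))) * 0.1  # stand-in weights

def predict_relation(image_feat, caption_feat):
    """Score each relation from concatenated image and caption
    features; the features and weights stand in for learned encoders."""
    x = np.concatenate([image_feat, caption_feat])
    return RELATIONS[int((x @ W).argmax())]

print(predict_relation(rng.normal(size=DIM), rng.normal(size=DIM)))
```

    In the paper's pipeline such a classifier is an intermediary: its predicted relation then conditions a controllable captioning model.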

    Sequence Transduction Neural Networks With Localized Self-Attention

    A system for transducing an input sequence into a target sequence is described. The system includes a sequence transduction neural network for transducing an input sequence having a respective network input at each of a plurality of input positions in an input order into an output sequence having a respective network output at each of a plurality of output positions in an output order. The sequence transduction neural network includes an encoder neural network and a decoder neural network.

    The encoder neural network is configured to receive the input sequence and generate a respective encoded representation of each of the network inputs in the input sequence. The encoder neural network includes a sequence of one or more encoder subnetworks, in which each encoder subnetwork is configured to receive a respective encoder subnetwork input for each of the plurality of input positions and to generate a respective encoder subnetwork output for each of the plurality of input positions. Each encoder subnetwork includes an encoder localized self-attention module that is configured to receive the subnetwork input for each of the plurality of input positions and, for each particular input position in the input order, the encoder localized self-attention module is configured to apply a localized self-attention mechanism over the encoder subnetwork inputs at input positions within a window of a fixed size of the particular input position to generate a respective output for the particular input position.

    The decoder neural network is configured to receive the encoded representations and generate the output sequence.
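    The core mechanism, attending only within a fixed-size window around each position, can be sketched in a few lines. This is a minimal NumPy illustration of windowed (localized) self-attention, not the patent's exact formulation; the function name, window parameter, and single-head, no-projection setup are simplifying assumptions.

```python
import numpy as np

def localized_self_attention(x, window=2):
    """Toy localized self-attention: each position attends only to
    positions at most `window` steps away (single head, no learned
    projections, for illustration)."""
    seq_len, dim = x.shape
    # Scaled dot-product scores for every pair of positions.
    scores = x @ x.T / np.sqrt(dim)
    # Mask out positions outside the fixed-size window.
    idx = np.arange(seq_len)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(mask, scores, -np.inf)
    # Softmax over the unmasked scores, then aggregate the values.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

x = np.random.default_rng(0).normal(size=(6, 4))
out = localized_self_attention(x, window=1)
print(out.shape)  # (6, 4)
```

    Because the mask is band-shaped, each output depends only on a local neighborhood, which is what bounds the per-position cost compared to full self-attention.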

    TeaForN: Teacher-Forcing with N-grams

    Sequence generation models trained with teacher-forcing suffer from issues related to exposure bias and lack of differentiability across timesteps. Our proposed method, Teacher-Forcing with N-grams (TeaForN), addresses both these problems directly, through the use of a stack of N decoders trained to decode along a secondary time axis that allows model parameter updates based on N prediction steps. TeaForN can be used with a wide class of decoder architectures and requires minimal modifications from a standard teacher-forcing setup. Empirically, we show that TeaForN boosts generation quality on one Machine Translation benchmark, WMT 2014 English-French, and two News Summarization benchmarks, CNN/Dailymail and Gigaword.
    Comment: to be published in EMNLP 202
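    The intuition, scoring N consecutive future tokens from each position while feeding the model's own predictions forward along the secondary axis, can be illustrated with a toy loss. Everything here (the linear "decoder", embeddings, vocabulary size, and the argmax feed-forward) is a simplified assumption for illustration, not the paper's stacked-decoder implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 5, 8
E = rng.normal(size=(VOCAB, DIM)) * 0.1   # toy token embeddings
W = rng.normal(size=(DIM, VOCAB)) * 0.1   # toy "decoder" projection

def step(state):
    """One toy decoder step: project a state to a vocab distribution."""
    logits = state @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()

def teaforn_style_loss(targets, n=2):
    """From each position, score up to n consecutive future tokens,
    feeding the model's own argmax prediction back in for steps
    beyond the first (the "secondary time axis" idea, in miniature)."""
    total, count = 0.0, 0
    for t in range(len(targets)):
        state = E[targets[t]]
        for k in range(1, n + 1):
            if t + k >= len(targets):
                break
            probs = step(state)
            total += -np.log(probs[targets[t + k]] + 1e-9)
            state = E[int(probs.argmax())]  # use prediction, not ground truth
            count += 1
    return total / count

print(teaforn_style_loss([0, 1, 2, 3, 4], n=2))
```

    With n=1 this reduces to ordinary teacher forcing; larger n exposes the model to its own predictions during training, which is how the method targets exposure bias.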