Hierarchically Structured Reinforcement Learning for Topically Coherent Visual Story Generation
We propose a hierarchically structured reinforcement learning approach to
address the challenges of planning for generating coherent multi-sentence
stories for the visual storytelling task. Within our framework, the task of
generating a story given a sequence of images is divided across a two-level
hierarchical decoder. The high-level decoder constructs a plan by generating a
semantic concept (i.e., topic) for each image in sequence. The low-level
decoder generates a sentence for each image using a semantic compositional
network, which effectively grounds the sentence generation conditioned on the
topic. The two decoders are jointly trained end-to-end using reinforcement
learning. We evaluate our model on the visual storytelling (VIST) dataset.
Empirical results from both automatic and human evaluations demonstrate that
the proposed hierarchically structured reinforced training achieves
significantly better performance compared to a strong flat deep reinforcement
learning baseline.
Comment: Accepted to AAAI 2019
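As a concrete reading of the two-level decomposition above, here is a minimal PyTorch sketch of a topic planner, a topic-conditioned sentence decoder, and the policy-gradient update that ties them together. It is not the authors' code: the module names, dimensions, vocabulary size, and the placeholder reward are assumptions, and a real setup would score sampled stories with a sequence-level metric such as CIDEr.

```python
# Hypothetical sketch of the two-level hierarchy; names and sizes are assumed.
import torch
import torch.nn as nn

class HighLevelDecoder(nn.Module):
    """Emits one topic distribution per image in the sequence (the 'plan')."""
    def __init__(self, img_dim=512, hid_dim=256, n_topics=64):
        super().__init__()
        self.rnn = nn.GRU(img_dim, hid_dim, batch_first=True)
        self.topic_head = nn.Linear(hid_dim, n_topics)

    def forward(self, img_feats):                  # (B, T, img_dim)
        h, _ = self.rnn(img_feats)                 # (B, T, hid_dim)
        return self.topic_head(h)                  # topic logits (B, T, n_topics)

class LowLevelDecoder(nn.Module):
    """Generates a sentence for one image, conditioned on its sampled topic."""
    def __init__(self, img_dim=512, n_topics=64, emb_dim=256,
                 hid_dim=256, vocab=10000):
        super().__init__()
        self.topic_emb = nn.Embedding(n_topics, emb_dim)
        self.word_emb = nn.Embedding(vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.init_h = nn.Linear(img_dim + emb_dim, hid_dim)
        self.word_head = nn.Linear(hid_dim, vocab)

    def forward(self, img_feat, topic_id, words):  # (B,D), (B,), (B,L)
        ctx = torch.cat([img_feat, self.topic_emb(topic_id)], dim=-1)
        h0 = torch.tanh(self.init_h(ctx)).unsqueeze(0)   # (1, B, hid_dim)
        out, _ = self.rnn(self.word_emb(words), h0)
        return self.word_head(out)                 # word logits (B, L, vocab)

# REINFORCE-style joint update: sample a topic plan, generate words under it,
# and weight the combined log-probabilities by a story-level reward.
B, T = 2, 5
imgs = torch.randn(B, T, 512)
planner, speaker = HighLevelDecoder(), LowLevelDecoder()
dist = torch.distributions.Categorical(logits=planner(imgs))
topics = dist.sample()                             # (B, T) sampled topic plan
words = torch.randint(0, 10000, (B, 4))            # toy sampled sentence, image 0
word_logits = speaker(imgs[:, 0], topics[:, 0], words)
word_lp = torch.distributions.Categorical(logits=word_logits).log_prob(words)
reward = torch.randn(B, 1)                         # placeholder story reward
loss = -(reward * (dist.log_prob(topics).sum(1, keepdim=True)
                   + word_lp.sum(1, keepdim=True))).mean()
loss.backward()
```

Because the reward multiplies the log-probabilities of both the topic plan and the words, a single sequence-level signal trains the two decoders end-to-end, which is the point of the hierarchical formulation.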
How to Describe Images in a More Funny Way? Towards a Modular Approach to Cross-Modal Sarcasm Generation
Sarcasm generation has been investigated in previous studies by considering
it as a text-to-text generation problem, i.e., generating a sarcastic sentence
for an input sentence. In this paper, we study a new problem of cross-modal
sarcasm generation (CMSG), i.e., generating a sarcastic description for a given
image. CMSG is challenging, as models need to capture the characteristics of sarcasm as well as the correlation between the two modalities. In addition, there should be some inconsistency between the two modalities, which requires imagination. Moreover, high-quality paired training data is scarce. To address
these problems, we take a step toward generating sarcastic descriptions from
images without paired training data and propose an
Extraction-Generation-Ranking based Modular method (EGRM) for cross-modal
sarcasm generation. Specifically, EGRM first extracts diverse information from
an image at different levels and uses the obtained image tags, sentimental
descriptive caption, and commonsense-based consequence to generate candidate
sarcastic texts. Then, a comprehensive ranking algorithm, which considers
image-text relation, sarcasticness, and grammaticality, is proposed to select a
final text from the candidate texts. Human evaluation on five criteria over a total of 1,200 generated image-text pairs from eight systems, together with auxiliary automatic evaluation, shows the superiority of our method.
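The ranking stage lends itself to a compact illustration. In the sketch below, the Candidate fields and the weights are hypothetical stand-ins for the scorers the paper would use (an image-text relevance model, a sarcasm classifier, and a grammaticality model); only the weighted-combination idea is taken from the abstract.

```python
# Hypothetical ranking over pre-scored candidates; scorers and weights assumed.
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    text: str
    relation: float   # image-text relation score, e.g. from a CLIP-style model
    sarcasm: float    # sarcasticness score from a classifier
    grammar: float    # grammaticality score from a language model

def rank(cands: List[Candidate], w=(0.4, 0.4, 0.2)) -> Candidate:
    """Pick the candidate scoring best on the weighted combination of the
    three criteria named in the abstract."""
    def score(c: Candidate) -> float:
        return w[0] * c.relation + w[1] * c.sarcasm + w[2] * c.grammar
    return max(cands, key=score)

cands = [
    Candidate("What a lovely traffic jam.", 0.8, 0.9, 0.95),
    Candidate("Cars on a road.",            0.9, 0.1, 0.99),
]
print(rank(cands).text)   # the sarcastic caption wins under these weights
```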
Improving Radiology Summarization with Radiograph and Anatomy Prompts
The impression is crucial for referring physicians to grasp key information, since it is drawn from the findings and the radiologists' reasoning. To alleviate the workload of radiologists and reduce repetitive
human labor in impression writing, many researchers have focused on automatic
impression generation. However, recent works on this task mainly summarize the
corresponding findings and pay less attention to the radiology images. In clinical practice, radiographs can provide more detailed and valuable observations that enhance radiologists' impression writing, especially for complicated cases. Besides, each sentence in the findings usually focuses on a single anatomy, so it only needs to be matched to the corresponding anatomical region instead of the whole image, which benefits the alignment of textual and visual features.
Therefore, we propose a novel anatomy-enhanced multimodal model to promote
impression generation. In detail, we first construct a set of rules to extract anatomies and insert these anatomy prompts into each sentence to highlight anatomical characteristics. Then, two separate encoders are applied to extract features
from the radiograph and findings. Afterward, we utilize a contrastive learning
module to align these two representations at the overall level and use a
co-attention module to fuse them at the sentence level with the help of the anatomy-enhanced sentence representations. Finally, the decoder takes the fused
information as the input to generate impressions. The experimental results on
two benchmark datasets confirm the effectiveness of the proposed method, which
achieves state-of-the-art results.
Comment: 11 pages, ACL 2023 Findings
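To make the two alignment levels concrete, the sketch below pairs a symmetric InfoNCE-style contrastive loss (overall level) with cross-attention fusion (sentence level). It is not the authors' implementation; the dimensions, temperature, mean pooling, and head count are assumptions.

```python
# Hypothetical alignment-and-fusion sketch; all hyperparameters are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_align(img_vec, txt_vec, tau=0.07):
    """Overall-level alignment: symmetric InfoNCE over a batch, pulling each
    radiograph toward its own findings and away from the others."""
    img = F.normalize(img_vec, dim=-1)
    txt = F.normalize(txt_vec, dim=-1)
    logits = img @ txt.t() / tau                   # (B, B) similarity matrix
    target = torch.arange(img.size(0))
    return (F.cross_entropy(logits, target) +
            F.cross_entropy(logits.t(), target)) / 2

class CoAttentionFusion(nn.Module):
    """Sentence-level fusion: each anatomy-prompted sentence attends over
    radiograph region features, and the attended context is fused back in."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, sent_feats, region_feats):   # (B,S,D), (B,R,D)
        ctx, _ = self.attn(sent_feats, region_feats, region_feats)
        return self.fuse(torch.cat([sent_feats, ctx], dim=-1))

B, S, R, D = 4, 6, 49, 512
sent, regions = torch.randn(B, S, D), torch.randn(B, R, D)
loss = contrastive_align(regions.mean(1), sent.mean(1))  # overall level
fused = CoAttentionFusion()(sent, regions)               # decoder input
```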
Learning a Recurrent Visual Representation for Image Caption Generation
In this paper we explore the bi-directional mapping between images and their
sentence-based descriptions. We propose learning this mapping using a recurrent
neural network. Unlike previous approaches that map both sentences and images
to a common embedding, we enable the generation of novel sentences given an
image. Using the same model, we can also reconstruct the visual features
associated with an image given its visual description. We use a novel recurrent
visual memory that automatically learns to remember long-term visual concepts
to aid in both sentence generation and visual feature reconstruction. We
evaluate our approach on several tasks. These include sentence generation,
sentence retrieval and image retrieval. State-of-the-art results are shown for
the task of generating novel image descriptions. When compared to human
generated captions, our automatically generated captions are preferred by
humans over of the time. Results are better than or comparable to
state-of-the-art results on the image and sentence retrieval tasks for methods
using similar visual features.
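The bi-directional idea, a single recurrent state serving both next-word prediction and visual-feature reconstruction, can be sketched as follows. This is an illustrative stand-in rather than the paper's exact model; the GRU, the layer sizes, and the unweighted sum of the two losses are assumptions.

```python
# Hypothetical sketch of a recurrent state trained in both directions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentVisualMemory(nn.Module):
    def __init__(self, vocab=10000, emb=256, hid=512, vis=4096):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.rnn = nn.GRU(emb, hid, batch_first=True)
        self.word_head = nn.Linear(hid, vocab)   # generation direction
        self.vis_head = nn.Linear(hid, vis)      # reconstruction direction

    def forward(self, words):
        h, _ = self.rnn(self.emb(words))
        # The hidden state acts as a long-term visual memory: it is trained
        # both to predict the next word and to recover the image features
        # from the sentence alone.
        return self.word_head(h), self.vis_head(h[:, -1])

model = RecurrentVisualMemory()
words = torch.randint(0, 10000, (2, 7))          # toy token ids
img = torch.randn(2, 4096)                       # toy visual features
word_logits, vis_rec = model(words)
loss = (F.cross_entropy(word_logits[:, :-1].reshape(-1, 10000),
                        words[:, 1:].reshape(-1))    # next-word prediction
        + F.mse_loss(vis_rec, img))                  # feature reconstruction
```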