Modeling Paragraph-Level Vision-Language Semantic Alignment for Multi-Modal Summarization
Most current multi-modal summarization methods follow a cascaded approach, where an off-the-shelf object detector first extracts visual features, which are then fused with language representations to generate the summary with an encoder-decoder model. This cascaded pipeline cannot capture the semantic alignments between images and paragraphs, which are crucial to a precise summary. In this paper, we propose ViL-Sum to jointly model paragraph-level Vision-Language Semantic Alignment and Multi-Modal Summarization. The core of ViL-Sum is a joint multi-modal encoder with two well-designed tasks, image reordering and image selection. The joint multi-modal encoder captures the interactions between modalities, where the reordering task guides the model to learn paragraph-level semantic alignment and the selection task guides it to select summary-related images for the final summary. Experimental results show that our proposed ViL-Sum significantly outperforms current state-of-the-art methods. Further analysis shows that the two well-designed tasks and the joint multi-modal encoder effectively guide the model to learn reasonable paragraph-image and summary-image relations.
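The abstract does not include an implementation, but the architecture it describes can be sketched roughly as follows: a joint multi-modal encoder over concatenated paragraph and image embeddings, with one head predicting each image's original position (the reordering task) and one head scoring whether an image should appear in the summary (the selection task). All module and argument names below are illustrative assumptions, not ViL-Sum's actual code.

```python
# Hypothetical sketch of a joint multi-modal encoder with reordering and
# selection heads, in the spirit of the abstract above (not the authors' code).
import torch
import torch.nn as nn

class JointMultiModalEncoder(nn.Module):
    def __init__(self, d_model: int = 768, num_images: int = 8, n_layers: int = 6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.reorder_head = nn.Linear(d_model, num_images)  # predict each image's original position
        self.select_head = nn.Linear(d_model, 1)            # score: keep this image in the summary?

    def forward(self, text_emb, image_emb):
        # text_emb: [B, T, d], image_emb: [B, I, d] (both already projected to d_model)
        joint = self.encoder(torch.cat([text_emb, image_emb], dim=1))
        img_states = joint[:, text_emb.size(1):]                   # hidden states of image tokens
        reorder_logits = self.reorder_head(img_states)             # [B, I, num_images]
        select_logits = self.select_head(img_states).squeeze(-1)   # [B, I]
        return joint, reorder_logits, select_logits
```

In such a setup the two auxiliary losses (cross-entropy over positions and binary cross-entropy over selection) would be trained jointly with the summarization objective, which is what lets the encoder learn paragraph-image alignment rather than treating vision as a fixed, cascaded input.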
Low-Resource Response Generation with Template Prior
We study open-domain response generation with limited message-response pairs. The problem exists in real-world applications but is less explored by existing work. Since paired data alone are no longer enough to train a neural generation model, we consider leveraging large-scale unpaired data, which are much easier to obtain, and propose response generation with both paired and unpaired data. The generation model is defined by an encoder-decoder architecture with templates as a prior, where the templates are estimated from the unpaired data by a neural hidden semi-Markov model. In this way, response generation learned from the small paired dataset can be aided by the semantic and syntactic knowledge in the large unpaired data. To balance the effect of the prior and the input message on response generation, we propose learning the whole generation model with an adversarial approach. Empirical studies on question response generation and sentiment response generation indicate that when only a few pairs are available, our model can significantly outperform several state-of-the-art response generation models in terms of both automatic
and human evaluation.
Comment: Accepted by EMNLP 2019
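As a rough illustration only (the paper's hidden semi-Markov template model and adversarial balancing are omitted), the core idea of conditioning an encoder-decoder on a template prior can be sketched as below; TemplateConditionedSeq2Seq, fuse, and template_vec are hypothetical names, not the paper's implementation.

```python
# Illustrative sketch: fuse a template representation (assumed to come from a
# template model trained on unpaired responses) with the encoded message before
# decoding. This is a simplified stand-in for the approach described above.
import torch
import torch.nn as nn

class TemplateConditionedSeq2Seq(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)   # combine message state and template prior
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, message_ids, template_vec, response_ids):
        # message_ids: [B, Tm], template_vec: [B, d], response_ids: [B, Tr]
        _, h = self.encoder(self.embed(message_ids))                       # h: [1, B, d]
        h0 = torch.tanh(self.fuse(torch.cat([h[-1], template_vec], dim=-1)))
        dec_out, _ = self.decoder(self.embed(response_ids), h0.unsqueeze(0))
        return self.out(dec_out)                                           # [B, Tr, vocab]
```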
Retrieval-Augmented Classification with Decoupled Representation
Retrieval-augmented methods have shown promising results in various classification tasks. However, existing methods focus on retrieving extra context to enrich the input, which is noise-sensitive and non-expandable. In this paper, following this line, we propose a k-nearest-neighbor (KNN)-based method for retrieval-augmented classification, which interpolates the predicted label distribution with the label distributions of retrieved instances. Different from the standard KNN process, we propose a decoupling mechanism, as we find that a shared representation for classification and retrieval hurts performance and leads to training instability. We evaluate our method on a wide range of classification datasets. Experimental results demonstrate the effectiveness and robustness of our proposed method. We also conduct extra experiments to analyze the contributions of different components in our model (code: https://github.com/xnliang98/knn-cls-w-decoupling).
Comment: Preprint
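A minimal sketch of the kind of KNN interpolation with decoupled representations described above, assuming a shared text encoder, a datastore of key vectors with labels, and an interpolation weight lam; all names and hyperparameters here are illustrative assumptions, not taken from the linked repository.

```python
# Sketch (not the authors' code): interpolate the classifier's label distribution
# with a label distribution built from retrieved neighbours, using a separate
# projection head for retrieval so the two representations are decoupled.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalAugmentedClassifier(nn.Module):
    def __init__(self, hidden: int, num_labels: int, k: int = 8, lam: float = 0.3):
        super().__init__()
        self.cls_head = nn.Linear(hidden, num_labels)   # classification branch
        self.ret_proj = nn.Linear(hidden, hidden)       # decoupled retrieval branch
        self.k, self.lam, self.num_labels = k, lam, num_labels

    def forward(self, h, datastore_keys, datastore_labels):
        # h: [B, hidden] encoder outputs (e.g. [CLS] vectors from any text encoder)
        p_model = F.softmax(self.cls_head(h), dim=-1)

        # Retrieve k nearest neighbours in the decoupled retrieval space.
        q = F.normalize(self.ret_proj(h), dim=-1)                    # [B, hidden]
        keys = F.normalize(self.ret_proj(datastore_keys), dim=-1)    # [N, hidden]
        top_sim, top_idx = (q @ keys.t()).topk(self.k, dim=-1)       # [B, k]

        # Turn neighbour labels into a distribution weighted by similarity.
        neigh = F.one_hot(datastore_labels[top_idx], self.num_labels).float()  # [B, k, C]
        weights = F.softmax(top_sim, dim=-1).unsqueeze(-1)                     # [B, k, 1]
        p_knn = (weights * neigh).sum(dim=1)                                   # [B, C]

        # Interpolate the model prediction with the retrieved label distribution.
        return self.lam * p_knn + (1.0 - self.lam) * p_model
```

The decoupling is the key design point this sketch tries to reflect: the retrieval projection is trained separately from the classification head, so improving neighbour quality does not destabilize the classifier's own representation.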