Bag-of-Words as Target for Neural Machine Translation
A sentence can be translated into more than one correct sentence. However, most existing neural machine translation models use only one of the correct translations as the target, and the other correct translations are penalized as incorrect during training. Since most correct translations of a sentence share a similar bag-of-words, it is possible to distinguish the correct translations from the incorrect ones by the bag-of-words. In this paper, we propose an approach that uses both the sentences and the bag-of-words as targets in the training stage, in order to encourage the model to generate potentially correct sentences that do not appear in the training set. We evaluate our model on a Chinese-English translation dataset, and experiments show that our model outperforms the strong baselines by 4.55 BLEU.
Comment: accepted by ACL 201
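A minimal sketch of the combined objective the abstract describes, assuming a PyTorch decoder; the weight alpha and the exact form of the bag-of-words term are illustrative assumptions, not the paper's formulation. The usual word-level cross-entropy is paired with a term that sums the per-step output distributions and asks them to cover the reference's words regardless of order.

```python
import torch
import torch.nn.functional as F

def sentence_plus_bow_loss(logits, target_ids, alpha=1.0):
    """logits: (T, V) decoder outputs for one unpadded target sentence.
    target_ids: (T,) gold token ids of that sentence."""
    # Word-level cross-entropy against the single reference translation.
    ce = F.cross_entropy(logits, target_ids)
    # Bag-of-words term: sum per-step probabilities into a sentence-level
    # distribution, then require it to cover the reference's bag of words,
    # regardless of word order.
    probs = logits.softmax(dim=-1)                # (T, V)
    bow_pred = probs.sum(dim=0).clamp(min=1e-8)   # (V,) expected word counts
    bow_target = torch.zeros_like(bow_pred)
    bow_target.scatter_add_(0, target_ids,
                            torch.ones_like(target_ids, dtype=bow_pred.dtype))
    bow_nll = -(bow_target * bow_pred.log()).sum() / bow_target.sum()
    return ce + alpha * bow_nll
```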
Decoding-History-Based Adaptive Control of Attention for Neural Machine Translation
The attention-based sequence-to-sequence model has proved successful in Neural Machine Translation (NMT). However, attention that does not take the decoding history, i.e., the past information in the decoder and the attention mechanism, into account often causes much repetition. To address this problem, we propose decoding-history-based Adaptive Control of Attention (ACA) for the NMT model. ACA learns to control the attention by keeping track of the decoding history and the current information with a memory vector, so that the model can take both the already-translated content and the current information into consideration. Experiments on Chinese-English and English-Vietnamese translation demonstrate that our model significantly outperforms the strong baselines. The analysis shows that our model is capable of generating translations with less repetition and higher accuracy. The code will be available at https://github.com/lancopk
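A hedged sketch of what history-aware attention control could look like; the GRU-based memory update and the sigmoid gate are assumptions for illustration, not the ACA equations themselves.

```python
import torch
import torch.nn as nn

class HistoryAwareAttention(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.score = nn.Linear(2 * hidden, 1)
        self.memory_update = nn.GRUCell(hidden, hidden)  # tracks decoding history
        self.gate = nn.Linear(2 * hidden, hidden)

    def forward(self, dec_state, enc_outputs, memory):
        # dec_state: (B, H), enc_outputs: (B, S, H), memory: (B, H)
        B, S, H = enc_outputs.shape
        # Mix the current decoder state with the history memory before scoring,
        # so source positions that were already covered can be down-weighted.
        query = torch.sigmoid(self.gate(torch.cat([dec_state, memory], -1))) * dec_state
        scores = self.score(
            torch.cat([query.unsqueeze(1).expand(B, S, H), enc_outputs], -1)).squeeze(-1)
        attn = scores.softmax(dim=-1)                                    # (B, S)
        context = torch.bmm(attn.unsqueeze(1), enc_outputs).squeeze(1)  # (B, H)
        memory = self.memory_update(context, memory)  # fold new context into the history
        return context, attn, memory
```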
Autoencoder as Assistant Supervisor: Improving Text Representation for Chinese Social Media Text Summarization
Most current abstractive text summarization models are based on the sequence-to-sequence model (Seq2Seq). The source content of social media is long and noisy, so it is difficult for Seq2Seq to learn an accurate semantic representation. Compared with the source content, the annotated summary is short and well written; moreover, it shares the same meaning as the source content. In this work, we supervise the learning of the representation of the source content with that of the summary. In implementation, we regard a summary autoencoder as an assistant supervisor of Seq2Seq. Following previous work, we evaluate our model on a popular Chinese social media dataset. Experimental results show that our model achieves state-of-the-art performance on the benchmark dataset.
Comment: accepted by ACL 201
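A rough sketch of the assistant-supervisor idea under assumed names: besides the generation and reconstruction losses, the source representation is pulled toward the representation the summary autoencoder assigns to the gold summary. The MSE distance, the weight lambda_sup, and the seq2seq/summary_ae interfaces are illustrative assumptions.

```python
import torch.nn.functional as F

def assistant_supervisor_loss(seq2seq, summary_ae, source, summary, lambda_sup=1.0):
    # Hypothetical interfaces: both models return (logits, pooled_representation).
    gen_logits, src_repr = seq2seq(source, summary)       # teacher-forced generation
    recon_logits, sum_repr = summary_ae(summary)          # autoencode the gold summary
    gen_loss = F.cross_entropy(gen_logits.flatten(0, 1), summary.flatten())
    recon_loss = F.cross_entropy(recon_logits.flatten(0, 1), summary.flatten())
    # Supervise the source-side representation with the (detached) summary one.
    sup_loss = F.mse_loss(src_repr, sum_repr.detach())
    return gen_loss + recon_loss + lambda_sup * sup_loss
```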
Future-Prediction-Based Model for Neural Machine Translation
We propose a novel model for Neural Machine Translation (NMT). Unlike the conventional method, our model predicts the length and the words of the untranslated text at each decoding time step, so that generation can be guided by this prediction of the future. With such information, the model does not stop generating before it has translated enough content. Experimental results demonstrate that our model significantly outperforms the baseline models. Moreover, our analysis shows that our model is effective at predicting the length and words of the untranslated content.
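One way such future prediction could be wired in, as a hedged sketch: auxiliary heads on the decoder state predict, at every step, how many target words remain and which words are still untranslated. The head design, loss forms, and the omission of padding masks in the losses are assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuturePredictionHeads(nn.Module):
    def __init__(self, hidden, vocab, max_remaining=200):
        super().__init__()
        self.length_head = nn.Linear(hidden, max_remaining)  # remaining-length classifier
        self.word_head = nn.Linear(hidden, vocab)             # which words are still untranslated

    def auxiliary_loss(self, dec_states, target_ids, pad_id=0):
        # dec_states: (B, T, H) decoder states; target_ids: (B, T) gold target tokens.
        B, T, _ = dec_states.shape
        mask = (target_ids != pad_id).float()                     # (B, T)
        lengths = mask.sum(dim=1, keepdim=True)                   # (B, 1)
        positions = torch.arange(T, device=target_ids.device).float().unsqueeze(0)
        remaining = (lengths - positions - 1).clamp(min=0).long() # words left after step t
        remaining = remaining.clamp(max=self.length_head.out_features - 1)
        len_loss = F.cross_entropy(
            self.length_head(dec_states).flatten(0, 1), remaining.flatten())
        # Multi-label target: every not-yet-generated word should score high.
        future_words = torch.zeros(B, T, self.word_head.out_features,
                                   device=dec_states.device)
        for t in range(T - 1):
            future_words[:, t].scatter_(1, target_ids[:, t + 1:], 1.0)
        future_words[..., pad_id] = 0.0
        word_loss = F.binary_cross_entropy_with_logits(
            self.word_head(dec_states), future_words)
        return len_loss + word_loss
```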
An Auto-Encoder Matching Model for Learning Utterance-Level Semantic Dependency in Dialogue Generation
Generating semantically coherent responses is still a major challenge in
dialogue generation. Different from conventional text generation tasks, the
mapping between inputs and responses in conversations is more complicated,
which highly demands the understanding of utterance-level semantic dependency,
a relation between the whole meanings of inputs and outputs. To address this
problem, we propose an Auto-Encoder Matching (AEM) model to learn such
dependency. The model contains two auto-encoders and one mapping module. The
auto-encoders learn the semantic representations of inputs and responses, and
the mapping module learns to connect the utterance-level representations.
Experimental results from automatic and human evaluations demonstrate that our
model is capable of generating responses of high coherence and fluency compared
to baseline models. The code is available at https://github.com/lancopku/AMM
Comment: Accepted by EMNLP 201
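A hedged sketch of the matching part, with the two autoencoders abstracted away: a small mapping network connects the post's utterance-level code to the response's, and the matching loss is added to the two reconstruction losses. Module shapes and the MSE matching loss are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class UtteranceMapping(nn.Module):
    """Maps the post's utterance-level representation to the response's."""
    def __init__(self, hidden):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden))

    def forward(self, post_repr):
        return self.net(post_repr)

def aem_losses(post_repr, resp_repr, mapping, post_recon_loss, resp_recon_loss):
    # post_repr / resp_repr: (B, H) utterance-level codes from the two autoencoders;
    # the reconstruction losses come from decoding each code back to its own text.
    match_loss = F.mse_loss(mapping(post_repr), resp_repr.detach())
    return post_recon_loss + resp_recon_loss + match_loss
```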
A Deep Reinforced Sequence-to-Set Model for Multi-Label Text Classification
Multi-label text classification (MLTC) aims to assign multiple labels to each sample in the dataset, and these labels usually have internal correlations. However, traditional methods tend to ignore the correlations between labels. To capture these correlations, the sequence-to-sequence (Seq2Seq) model views the MLTC task as a sequence generation problem and achieves excellent performance on this task. However, the Seq2Seq model is not essentially suitable for the MLTC task: it requires humans to predefine the order of the output labels, while the output labels in MLTC are essentially an unordered set rather than an ordered sequence, which conflicts with the Seq2Seq model's strict requirement on label order. In this paper, we propose a novel sequence-to-set framework utilizing deep reinforcement learning, which not only captures the correlations between labels but also reduces the dependence on the label order. Extensive experimental results show that our proposed method outperforms the competitive baselines by a large margin.
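A minimal sketch of how an order-insensitive reward can drive training, assuming a REINFORCE-style estimator and set-level F1 as the reward; this is illustrative, not the paper's exact algorithm.

```python
import torch

def set_f1(pred_labels, gold_labels):
    """Order-insensitive reward: F1 between predicted and gold label sets."""
    pred, gold = set(pred_labels), set(gold_labels)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def reinforce_loss(sampled_log_probs, sampled_labels, gold_labels, baseline=0.0):
    # sampled_log_probs: list of log-probability tensors, one per sampled label.
    # The reward depends only on the resulting set, so any generation order
    # that produces the right labels is rewarded equally.
    reward = set_f1(sampled_labels, gold_labels) - baseline
    return -reward * torch.stack(sampled_log_probs).sum()
```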
Global Encoding for Abstractive Summarization
In neural abstractive summarization, the conventional sequence-to-sequence (seq2seq) model often suffers from repetition and semantic irrelevance. To tackle this problem, we propose a global encoding framework, which controls the information flow from the encoder to the decoder based on the global information of the source context. It consists of a convolutional gated unit that performs global encoding to improve the representations of the source-side information. Evaluations on LCSTS and the English Gigaword both demonstrate that our model outperforms the baseline models, and the analysis shows that our model is capable of reducing repetition.
Comment: Accepted by ACL 201
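A minimal sketch of a convolutional gated unit over encoder outputs, under assumed shapes: a 1-D convolution summarizes the source context and its sigmoid output gates the original representations. The kernel size and single-layer form are illustrative choices.

```python
import torch
import torch.nn as nn

class ConvGatedUnit(nn.Module):
    def __init__(self, hidden, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size // 2)

    def forward(self, enc_outputs):
        # enc_outputs: (B, S, H) encoder representations.
        g = self.conv(enc_outputs.transpose(1, 2)).transpose(1, 2)  # (B, S, H)
        # The gate filters the information flow from encoder to decoder
        # based on context beyond each single position.
        return enc_outputs * torch.sigmoid(g)
```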
Understanding and Improving Layer Normalization
Layer normalization (LayerNorm) is a technique for normalizing the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where its effectiveness stems from. In this paper, our main contribution is to take a step further in understanding LayerNorm. Many previous studies believe that the success of LayerNorm comes from forward normalization. Unlike them, we find that the derivatives of the mean and variance are more important than forward normalization, since they re-center and re-scale backward gradients. Furthermore, we find that the parameters of LayerNorm, including the bias and gain, increase the risk of over-fitting and do not work in most cases. Experiments show that a simple version of LayerNorm (LayerNorm-simple) without the bias and gain outperforms LayerNorm on four datasets and obtains state-of-the-art performance on En-Vi machine translation. To address the over-fitting problem, we propose a new normalization method, Adaptive Normalization (AdaNorm), which replaces the bias and gain with a new transformation function. Experiments show that AdaNorm demonstrates better results than LayerNorm on seven out of eight datasets.
Comment: Accepted by NeurIPS 201
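A hedged sketch contrasting LayerNorm-simple with an AdaNorm-style transformation as described in the abstract; the constants C and k and the choice to keep the scaling factor out of the gradient reflect one reading of the method and should be treated as assumptions.

```python
import torch
import torch.nn as nn

class LayerNormSimple(nn.Module):
    """LayerNorm without the learned bias and gain."""
    def __init__(self, eps=1e-6):
        super().__init__()
        self.eps = eps

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True, unbiased=False)
        return (x - mean) / (std + self.eps)      # no bias, no gain

class AdaNorm(LayerNormSimple):
    """Replaces bias/gain with an input-dependent scaling phi(y) = C * (1 - k*y)."""
    def __init__(self, C=1.0, k=0.1, eps=1e-6):
        super().__init__(eps)
        self.C, self.k = C, k

    def forward(self, x):
        y = super().forward(x)
        phi = self.C * (1.0 - self.k * y)
        return phi.detach() * y                   # scaling kept out of the gradient
```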
Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models
The complementary potential of Large Language Models (LLMs) assumes that off-the-shelf LLMs have heterogeneous expertise across a wide range of domains and tasks, so that an ensemble of LLMs can achieve consistently better performance. Existing ensemble methods for LLMs mainly focus on reward-model ranking of outputs, which incurs significant computation overhead. To address this issue, we revisit the complementary potential of LLMs and further elaborate on it by mining latent expertise with off-the-shelf reward models. We propose Zooter, a reward-guided routing method that distills rewards on training queries to train a routing function, which can precisely distribute each query to the LLM with expertise on it. We also integrate a tag-based label enhancement to mitigate noise from uncertainty when using rewards as silver supervision. Zooter is computation-efficient at inference, as it introduces only the minor overhead of a routing function compared with reward-model ranking methods. We evaluate Zooter on a comprehensive benchmark collection with 26 subsets across different domains and tasks. Zooter outperforms the best single model on average and ranks first on 44% of tasks, even surpassing multiple reward-model ranking methods.
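A hedged sketch of reward-guided routing: per-query reward scores for the candidate LLMs are normalized into soft labels, a small routing head is distilled to match them, and inference routes each query to the top expert without calling the reward model. The query encoder, temperature, and KL loss form are assumptions, not Zooter's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardGuidedRouter(nn.Module):
    def __init__(self, query_dim, num_llms):
        super().__init__()
        self.head = nn.Linear(query_dim, num_llms)

    def distill_loss(self, query_embeddings, reward_scores, temperature=1.0):
        # query_embeddings: (B, D) from any sentence encoder (assumed available);
        # reward_scores: (B, num_llms) reward-model scores of each LLM's output.
        soft_labels = (reward_scores / temperature).softmax(dim=-1)
        log_probs = self.head(query_embeddings).log_softmax(dim=-1)
        return F.kl_div(log_probs, soft_labels, reduction="batchmean")

    def route(self, query_embedding):
        # Pick one expert LLM per query; no reward model is needed at inference.
        return self.head(query_embedding).argmax(dim=-1)
```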
M6-v0: Vision-and-Language Interaction for Multi-modal Pretraining
Multi-modal pretraining for learning high-level multi-modal representations is a further step towards deep learning and artificial intelligence. In this work, we propose a novel model, namely InterBERT (BERT for Interaction), which is the first model in our series of multi-modal pretraining methods M6 (MultiModality-to-MultiModality Multitask Mega-transformer). The model has a strong capability for modeling interaction between the information flows of different modalities. The single-stream interaction module effectively processes information from multiple modalities, and the two-stream module on top preserves the independence of each modality to avoid performance degradation on single-modal tasks. We pretrain the model with three pretraining tasks, including masked segment modeling (MSM), masked region modeling (MRM), and image-text matching (ITM), and finetune the model on a series of vision-and-language downstream tasks. Experimental results demonstrate that InterBERT outperforms a series of strong baselines, including the most recent multi-modal pretraining methods, and the analysis shows that MSM and MRM are effective for pretraining and that our method achieves performance comparable to BERT on single-modal tasks. Besides, we propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT, which is the first Chinese multi-modal pretrained model. We pretrain the Chinese InterBERT on our proposed dataset of 3.1M image-text pairs from mobile Taobao, the largest Chinese e-commerce platform. We finetune the model for text-based image retrieval, and we have recently deployed the model online for topic-based recommendation.
Comment: 11 page
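A hedged sketch of the described layout, with illustrative layer counts and dimensions: a shared single-stream Transformer over the concatenated text and image-region embeddings models cross-modal interaction, and separate per-modality Transformers on top keep each modality's representation independent.

```python
import torch
import torch.nn as nn

class InteractionThenTwoStream(nn.Module):
    def __init__(self, dim=768, heads=12, shared_layers=6, stream_layers=3):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.shared = nn.TransformerEncoder(layer(), num_layers=shared_layers)
        self.text_stream = nn.TransformerEncoder(layer(), num_layers=stream_layers)
        self.image_stream = nn.TransformerEncoder(layer(), num_layers=stream_layers)

    def forward(self, text_emb, image_emb):
        # text_emb: (B, Lt, dim) token embeddings; image_emb: (B, Li, dim) region features.
        fused = self.shared(torch.cat([text_emb, image_emb], dim=1))  # single-stream interaction
        text_out = self.text_stream(fused[:, : text_emb.size(1)])     # two-stream module on top
        image_out = self.image_stream(fused[:, text_emb.size(1):])
        return text_out, image_out
```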