Learning When to Concentrate or Divert Attention: Self-Adaptive Attention Temperature for Neural Machine Translation
Most Neural Machine Translation (NMT) models are based on the
sequence-to-sequence (Seq2Seq) architecture with an encoder-decoder framework
equipped with an attention mechanism. However, the conventional attention
mechanism treats every decoding time step identically, using the same
parameters, which is problematic because the softness of the attention should
differ for different types of words (e.g., content words and function words).
We therefore propose a new model with a mechanism called Self-Adaptive Control
of Temperature (SACT), which controls the softness of attention through an
attention temperature. Experimental results on Chinese-English and
English-Vietnamese translation demonstrate that our model outperforms the
baseline models, and the analysis and case study show that our model attends to
the most relevant elements in the source-side contexts and generates
high-quality translations.
Comment: To appear in EMNLP 201
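The core of SACT is a softmax whose temperature is predicted from the current decoder state, so the model can concentrate attention for some words and soften it for others. Below is a minimal PyTorch sketch of temperature-scaled attention in that spirit; the module name, the dot-product scoring, and the tanh/exp parameterization of the temperature are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of temperature-controlled attention (a hypothetical
# simplification of SACT): the decoder state predicts a per-step temperature
# that sharpens or flattens the attention distribution over encoder states.
import torch
import torch.nn as nn

class TemperatureAttention(nn.Module):
    def __init__(self, hidden_size, max_log_temp=1.0):
        super().__init__()
        # Predicts a scalar in (-1, 1) from the decoder state; exponentiating
        # it gives a temperature in (1/e, e) -- an assumed parameterization.
        self.temp_predictor = nn.Linear(hidden_size, 1)
        self.max_log_temp = max_log_temp

    def forward(self, decoder_state, encoder_states):
        # decoder_state: (batch, hidden); encoder_states: (batch, src_len, hidden)
        scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)
        log_temp = self.max_log_temp * torch.tanh(self.temp_predictor(decoder_state))
        temperature = torch.exp(log_temp)  # (batch, 1)
        # Low temperature -> concentrated attention (e.g., content words);
        # high temperature -> softer attention (e.g., function words).
        weights = torch.softmax(scores / temperature, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)
        return context, weights
```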
Query and Output: Generating Words by Querying Distributed Word Representations for Paraphrase Generation
Most recent approaches use the sequence-to-sequence model for paraphrase
generation. The existing sequence-to-sequence model tends to memorize the words
and patterns in the training dataset instead of learning the meaning of the
words. As a result, the generated sentences are often grammatically correct but
semantically improper. In this work, we introduce a novel model based on the
encoder-decoder framework, called Word Embedding Attention Network (WEAN). Our
proposed model generates words by querying distributed word representations
(i.e., neural word embeddings), aiming to capture the meaning of the
corresponding words. Following previous work, we evaluate our model on two
paraphrase-oriented tasks, namely text simplification and short-text
abstractive summarization. Experimental results show that our model outperforms
the sequence-to-sequence baseline by 6.3 and 5.5 BLEU points on two English
text simplification datasets, and by 5.7 ROUGE-2 F1 points on a Chinese
summarization dataset. Moreover, our model achieves state-of-the-art
performance on these three benchmark datasets.
Comment: arXiv admin note: text overlap with arXiv:1710.0231
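The key idea is that the output layer is not an independent softmax projection: the decoder state queries the word-embedding table, and the word whose embedding best matches the query is generated. A minimal sketch of such an embedding-querying output layer follows; the class name, the single linear query projection, and the dot-product scoring are assumptions made for illustration rather than WEAN's exact architecture.

```python
# Minimal sketch of generating words by querying word embeddings: the decoder
# state acts as a query, the shared embedding table provides the keys, and the
# output distribution is a softmax over query-key scores.
import torch
import torch.nn as nn

class EmbeddingQueryOutput(nn.Module):
    def __init__(self, hidden_size, embed_size, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)  # shared word embeddings
        self.query_proj = nn.Linear(hidden_size, embed_size)   # assumed query transformation

    def forward(self, decoder_state):
        # decoder_state: (batch, hidden) -> query: (batch, embed)
        query = self.query_proj(decoder_state)
        # Score every vocabulary embedding against the query.
        scores = query @ self.embedding.weight.t()              # (batch, vocab)
        return torch.log_softmax(scores, dim=-1)                # log-probabilities over words
```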
Layer-Wise Cross-View Decoding for Sequence-to-Sequence Learning
In sequence-to-sequence learning, the decoder relies on the attention
mechanism to efficiently extract information from the encoder. While it is
common practice to draw information from only the last encoder layer, recent
work has proposed to use representations from different encoder layers for
diversified levels of information. Nonetheless, the decoder still obtains only
a single view of the source sequences, which might lead to insufficient
training of the encoder layer stack due to the hierarchy bypassing problem. In
this work, we propose layer-wise cross-view decoding: for each decoder layer,
the representations from the last encoder layer, which serve as a global view,
are supplemented with those from other encoder layers to form a stereoscopic
view of the source sequences. Systematic experiments show that we successfully
address the hierarchy bypassing problem and substantially improve the
performance of sequence-to-sequence learning with deep representations on
diverse tasks.
Comment: 9 pages, 6 figures
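Concretely, a cross-view decoder layer keeps the usual cross-attention to the last encoder layer (the global view) and adds a second cross-attention to the representation of another encoder layer (the cross view). The following PyTorch sketch illustrates one way such a layer could be wired up; the choice of which encoder layer supplies the cross view, the additive merge of the two contexts, and the omission of attention masks are simplifying assumptions rather than the paper's exact design.

```python
# Minimal sketch of a decoder layer with layer-wise cross-view decoding:
# it attends to the top encoder layer (global view) and to another encoder
# layer (cross view), then merges the two contexts. Masks are omitted for brevity.
import torch.nn as nn

class CrossViewDecoderLayer(nn.Module):
    def __init__(self, d_model, nhead):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.global_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, tgt, last_enc, cross_enc):
        # tgt: (batch, tgt_len, d_model); last_enc / cross_enc: (batch, src_len, d_model)
        x = self.norms[0](tgt + self.self_attn(tgt, tgt, tgt)[0])
        g = self.global_attn(x, last_enc, last_enc)[0]    # global view: top encoder layer
        c = self.cross_attn(x, cross_enc, cross_enc)[0]   # cross view: another encoder layer
        x = self.norms[1](x + g + c)                      # assumed merge: simple sum
        return self.norms[2](x + self.ffn(x))
```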