We consider the problem of learning general-purpose, paraphrastic sentence
embeddings, revisiting the setting of Wieting et al. (2016b). While they found
LSTM recurrent networks to underperform word averaging, we present several
developments that together produce the opposite conclusion. These include
training on sentence pairs rather than phrase pairs, averaging states to
represent sequences, and regularizing aggressively. These improve LSTMs in both
transfer learning and supervised settings. We also introduce a new recurrent
architecture, the Gated Recurrent Averaging Network, that is inspired by
averaging and LSTMs while outperforming them both. We analyze our learned
models, finding evidence of preferences for particular parts of speech and
dependency relations.
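
A minimal sketch of the state-averaging idea mentioned above (not the authors' released code): rather than taking the final LSTM hidden state as the sentence embedding, all hidden states are averaged over time, analogous to word averaging. PyTorch, the vocabulary size, dimensions, and the toy batch are illustrative assumptions, not the paper's actual setup.

    import torch
    import torch.nn as nn

    # Illustrative sizes (assumptions, not the paper's hyperparameters).
    vocab_size, emb_dim, hidden_dim = 10000, 300, 300

    embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
    lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def sentence_embedding(token_ids, lengths):
        """Average LSTM hidden states over non-padding positions."""
        states, _ = lstm(embed(token_ids))             # (batch, time, hidden)
        mask = (token_ids != 0).unsqueeze(-1).float()  # zero out padding
        summed = (states * mask).sum(dim=1)
        return summed / lengths.unsqueeze(-1).float()

    # Toy batch: two padded sentences of lengths 4 and 2.
    ids = torch.tensor([[5, 8, 2, 9], [3, 7, 0, 0]])
    lengths = torch.tensor([4, 2])
    emb = sentence_embedding(ids, lengths)             # (2, hidden_dim)
    print(emb.shape)

A paraphrase score between two sentences could then be computed as, for example, the cosine similarity of their embeddings.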