14 research outputs found
Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations
We propose a novel data augmentation method for labeled sentences, called
contextual augmentation. We assume an invariance: sentences remain natural
even when their words are replaced with other words that hold paradigmatic
relations. We stochastically replace words with alternatives predicted by a
bi-directional language model at each word position. Words predicted
according to the context are numerous yet suitable substitutes for the
original words. Furthermore, we retrofit the language model with a
label-conditional architecture, which allows it to augment sentences without
breaking label compatibility. In experiments on six text classification
tasks, we demonstrate that the proposed method improves classifiers based on
convolutional or recurrent neural networks.
Comment: NAACL 2018
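The stochastic replacement procedure described in the abstract can be sketched as follows. This is a minimal toy stand-in, not the authors' implementation: `toy_lm` and its candidate lists are hypothetical placeholders for the paper's bi-directional, label-conditional language model.

```python
import random

def contextual_augment(tokens, predict_fn, replace_prob=0.15, rng=None):
    """Stochastically replace tokens with context-predicted alternatives.

    predict_fn(tokens, i) stands in for the paper's bi-directional
    language model: given the sentence and a position, it returns
    candidate words that fit that slot (paradigmatic relations).
    """
    rng = rng or random.Random(0)
    out = list(tokens)
    for i in range(len(out)):
        if rng.random() < replace_prob:
            candidates = predict_fn(tokens, i)
            if candidates:
                out[i] = rng.choice(candidates)
    return out

# Hypothetical toy stand-in for the bi-directional LM.
def toy_lm(tokens, i):
    paradigms = {"movie": ["film", "picture"], "great": ["fine", "superb"]}
    return paradigms.get(tokens[i], [])

augmented = contextual_augment(["the", "movie", "was", "great"], toy_lm,
                               replace_prob=1.0)
```

In the paper the predictor is additionally conditioned on the sentence's label, so the sampled substitutes cannot flip, say, a positive review into a negative one.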
Data Augmentation for Lyrics Emotion Estimation
Lyrics emotion estimation enables song retrieval and recommendation systems based not only on text retrieval or melody matching but also on the emotions in lyrics and the transitions of emotions across a whole song. This requires phrase-level lyrics emotion corpora. However, building large-scale lyrics emotion corpora is difficult because emotions must be labelled manually. In this paper, we propose a method to augment lyrics emotion corpora. Using it, we augmented a corpus of 366 phrases into a larger corpus of 2,145 phrases. We also evaluate the proposed method with two convolutional neural networks trained on the original and augmented corpora, respectively. We define the target emotion classes as Joy, Love, Anger, Sorrow, and Anxiety. The mean accuracy of the model trained on the augmented corpus was 75.9%, whilst the model trained on the original corpus achieved 70.7%.
Applying the Transformer to Character-level Transduction
The transformer has been shown to outperform recurrent neural network-based
sequence-to-sequence models in various word-level NLP tasks. Yet for
character-level transduction tasks, e.g. morphological inflection generation
and historical text normalization, there are few works that outperform
recurrent models using the transformer. In an empirical study, we uncover that,
in contrast to recurrent sequence-to-sequence models, the batch size plays a
crucial role in the performance of the transformer on character-level tasks,
and we show that with a large enough batch size, the transformer does indeed
outperform recurrent models. We also introduce a simple technique to handle
feature-guided character-level transduction that further improves performance.
With these insights, we achieve state-of-the-art performance on morphological
inflection and historical text normalization. We also show that the transformer
outperforms a strong baseline on two other character-level transduction tasks:
grapheme-to-phoneme conversion and transliteration.
Comment: EACL 2021
Data Boost: Text Data Augmentation Through Reinforcement Learning Guided Conditional Generation
Data augmentation is proven to be effective in many NLU tasks, especially for
those suffering from data scarcity. In this paper, we present a powerful and
easy-to-deploy text augmentation framework, Data Boost, which augments data
through reinforcement learning guided conditional generation. We evaluate Data
Boost on three diverse text classification tasks under five different
classifier architectures. The result shows that Data Boost can boost the
performance of classifiers especially in low-resource data scenarios. For
instance, Data Boost improves F1 for the three tasks by 8.7% on average when
given only 10% of the whole data for training. We also compare Data Boost with
six prior text augmentation methods. Through human evaluations (N=178), we
confirm that Data Boost augmentation has comparable quality as the original
data with respect to readability and class consistency.
Comment: In proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020). Online
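The reward-guided generation idea can be sketched as a filter-by-reward loop. This is a simplified stand-in, not Data Boost itself (which steers the generator's decoding with policy-gradient updates rather than post-hoc filtering); `toy_generate`, `toy_reward`, and the template sentences are hypothetical.

```python
import random

def reward_guided_augment(generate, reward, label, n_samples=50,
                          threshold=0.8, rng=None):
    """Sample candidate sentences for a target label and keep only those
    a reward model scores as label-consistent."""
    rng = rng or random.Random(0)
    kept = []
    for _ in range(n_samples):
        text = generate(label, rng)
        if reward(text, label) >= threshold:
            kept.append(text)
    return kept

# Hypothetical toy generator and reward model, for illustration only.
TEMPLATES = {"positive": ["great movie", "loved it", "boring plot"],
             "negative": ["awful movie", "hated it", "loved it"]}

def toy_generate(label, rng):
    return rng.choice(TEMPLATES[label])

def toy_reward(text, label):
    # Stand-in for a classifier's probability of the target label.
    good = {("great movie", "positive"), ("loved it", "positive"),
            ("awful movie", "negative"), ("hated it", "negative")}
    return 1.0 if (text, label) in good else 0.0

samples = reward_guided_augment(toy_generate, toy_reward, "positive")
```

The reward model is what keeps the augmented data label-consistent, which is the property the paper's human evaluation measures.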
Pushing the Limits of Low-Resource Morphological Inflection
Recent years have seen exceptional strides in the task of automatic
morphological inflection generation. However, for a long tail of languages the
necessary resources are hard to come by, and state-of-the-art neural methods
that work well under higher resource settings perform poorly in the face of a
paucity of data. In response, we propose a battery of improvements that greatly
improve performance under such low-resource conditions. First, we present a
novel two-step attention architecture for the inflection decoder. In addition,
we investigate the effects of cross-lingual transfer from single and multiple
languages, as well as monolingual data hallucination. The macro-averaged
accuracy of our models outperforms the state-of-the-art by 15 percentage
points. Also, we identify the crucial factors for success with cross-lingual
transfer for morphological inflection: typological similarity and a common
representation across languages.
Comment: to appear at EMNLP 2019
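The "data hallucination" mentioned above is commonly implemented by swapping the shared stem of a (lemma, inflected form) pair for a random character string, preserving the affixation pattern. The sketch below is a minimal illustration under that assumption, not the authors' code; the alphabet and the minimum-stem-length cutoff are arbitrary choices.

```python
import random

def hallucinate_pair(lemma, inflected, alphabet="abcdefghij", rng=None):
    """Replace the longest common substring (treated as the stem) of a
    lemma/inflected pair with a random string, yielding a synthetic
    training pair that keeps the affixation pattern."""
    rng = rng or random.Random(0)
    # Longest common substring via a simple O(n*m) scan.
    best, bi, bj = 0, 0, 0
    for i in range(len(lemma)):
        for j in range(len(inflected)):
            k = 0
            while (i + k < len(lemma) and j + k < len(inflected)
                   and lemma[i + k] == inflected[j + k]):
                k += 1
            if k > best:
                best, bi, bj = k, i, j
    if best < 3:  # too little shared material to treat as a stem
        return lemma, inflected
    fake = "".join(rng.choice(alphabet) for _ in range(best))
    return (lemma[:bi] + fake + lemma[bi + best:],
            inflected[:bj] + fake + inflected[bj + best:])

new_lemma, new_infl = hallucinate_pair("walk", "walked")
```

Because the affixes survive intact, a model trained on such pairs still sees the real inflection pattern, which is what makes hallucinated data useful in low-resource settings.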