Data augmentation techniques are widely used in low-resource automatic
morphological inflection to address the issue of data sparsity. However, the
full implications of these techniques remain poorly understood. In this study,
we aim to shed light on the theoretical aspects of the data augmentation
strategy StemCorrupt, a method that generates synthetic examples by randomly
substituting stem characters in existing gold-standard training examples. Our
analysis shows that StemCorrupt fundamentally changes the underlying data
distribution, revealing inherent compositional concatenative
structure. To complement our theoretical analysis, we investigate the
data efficiency of StemCorrupt. Evaluating on seven typologically distinct
languages, we demonstrate that selecting a subset of datapoints with both high
diversity and high predictive uncertainty significantly improves the data
efficiency of StemCorrupt compared to competitive baselines. Furthermore, we
explore the impact of typological
features on the choice of augmentation strategy and find that languages
incorporating non-concatenativity, such as morphonological alternations, derive
less benefit from synthetic examples with high predictive uncertainty. We
attribute this effect to phonotactic violations induced by StemCorrupt,
emphasizing the need for further research to ensure optimal performance across
the entire spectrum of natural language morphology.

Comment: 13 pages, 7 figures
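
To make the augmentation step concrete, the following is a minimal sketch of
stem-character substitution as described above. It is not the implementation
evaluated in this work. It rests on several assumptions: the stem is
approximated by the longest substring shared by the lemma and the inflected
form, each stem character is replaced independently with probability p, and
replacements are drawn uniformly from a supplied alphabet. The names
stem_corrupt, alphabet, and p are hypothetical and chosen for illustration.

```python
import random
from difflib import SequenceMatcher


def stem_corrupt(lemma, form, alphabet, p=0.5, rng=None):
    """Return a synthetic (lemma, form) pair with a randomly corrupted stem.

    Assumption: the stem is approximated by the longest common substring of
    the lemma and the inflected form. Real systems may use character
    alignment instead.
    """
    rng = rng or random.Random()
    matcher = SequenceMatcher(None, lemma, form, autojunk=False)
    m = matcher.find_longest_match(0, len(lemma), 0, len(form))
    if m.size == 0:
        # No shared material: leave the example unchanged.
        return lemma, form

    stem = lemma[m.a:m.a + m.size]
    # Replace each stem character with probability p. A replacement is drawn
    # uniformly from the alphabet, so it may coincide with the original.
    new_stem = "".join(
        rng.choice(alphabet) if rng.random() < p else ch for ch in stem
    )
    # Substitute the same corrupted stem into both strings so that the
    # material outside the stem (the affixes) is preserved.
    new_lemma = lemma[:m.a] + new_stem + lemma[m.a + m.size:]
    new_form = form[:m.b] + new_stem + form[m.b + m.size:]
    return new_lemma, new_form


if __name__ == "__main__":
    pair = stem_corrupt("walk", "walked", "abcdefghijklmnopqrstuvwxyz",
                        rng=random.Random(0))
    print(pair)
```

Because the same corrupted stem is written into both the lemma and the
inflected form, the suffix "ed" in the usage example survives unchanged, which
is the sense in which the synthetic pairs expose concatenative structure. The
sketch draws replacements without regard to phonotactics, so it can produce
the kind of ill-formed stems discussed above.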