Abstract
In recent years, neural network models such as the Transformer have achieved significant success in machine translation. However, training these models relies on abundant labeled data, which poses a challenge for low-resource machine translation because parallel corpora are limited in scale. This limitation often leads to subpar performance and a tendency to overfit high-frequency vocabulary, reducing the model’s generalization ability on the test set. To alleviate these issues, this paper proposes a gradient weight modification strategy: on top of the Adam algorithm, the gradients produced by each new batch are multiplied by a coefficient that increases incrementally, weakening the model’s dependence on high-frequency features during early training while preserving the algorithm’s rapid convergence in later stages. The paper also describes the modified training procedure, including the adjustment and decay of the coefficient, so that different aspects are emphasized at different training stages. The goal of this strategy is to increase attention to low-frequency vocabulary and prevent the model from overfitting to high-frequency terms. Translation experiments are conducted on three low-resource bilingual datasets, and the proposed method improves over the baseline model by 0.72, 1.37, and 1.04 BLEU points on the respective test sets.
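As a rough illustration of the coefficient-scaled update described above, the sketch below wraps PyTorch's Adam optimizer and multiplies each batch's gradients by a coefficient that ramps up toward 1.0 before the update is applied. The wrapper name, the linear ramp schedule, and the parameters min_scale and warmup_steps are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of gradient weight modification on top of Adam.
# The linear ramp and the names min_scale / warmup_steps are assumptions
# for illustration; the paper's coefficient schedule may differ.
import torch


class ScaledAdam:
    """Wraps torch.optim.Adam and multiplies each new batch's gradients
    by a coefficient that grows from min_scale toward 1.0 over training."""

    def __init__(self, params, lr=1e-3, min_scale=0.1, warmup_steps=4000):
        self.params = list(params)
        self.opt = torch.optim.Adam(self.params, lr=lr)
        self.min_scale = min_scale
        self.warmup_steps = warmup_steps
        self.step_num = 0

    def step(self):
        self.step_num += 1
        # Coefficient increases incrementally, then saturates at 1.0 so the
        # later stages keep Adam's usual fast convergence.
        scale = min(
            1.0,
            self.min_scale
            + (1.0 - self.min_scale) * self.step_num / self.warmup_steps,
        )
        for p in self.params:
            if p.grad is not None:
                # Down-weight gradients early in training to reduce reliance
                # on high-frequency features.
                p.grad.mul_(scale)
        self.opt.step()

    def zero_grad(self):
        self.opt.zero_grad()
```

In use, the wrapper replaces the plain Adam optimizer in the training loop: call zero_grad(), compute the loss and backward pass as usual, then call step(), which scales the freshly computed gradients before Adam applies its update.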