1 research outputs found

    A Dirichlet Process Mixture Based Name Origin Clustering and Alignment Model for Transliteration

    No full text
    In machine transliteration, it is common that the transliterated names in the target language come from multiple language origins. A conventional maximum likelihood based single model can not deal with this issue very well and often suffers from overfitting. In this paper, we exploit a coupled Dirichlet process mixture model (cDPMM) to address overfitting and names multiorigin cluster issues simultaneously in the transliteration sequence alignment step over the name pairs. After the alignment step, the cDPMM clusters name pairs into many groups according to their origin information automatically. In the decoding step, in order to use the learned origin information sufficiently, we use a cluster combination method (CCM) to build clustering-specific transliteration models by combining small clusters into large ones based on the perplexities of name language and transliteration model, which makes sure each origin cluster has enough data for training a transliteration model. On the three different Western-Chinese multiorigin names corpora, the cDPMM outperforms two state-of-the-art baseline models in terms of both the top-1 accuracy and mean F-score, and furthermore the CCM significantly improves the cDPMM
    corecore