30 research outputs found

    Character-level Chinese-English Translation through ASCII Encoding

    Character-level Neural Machine Translation (NMT) models have recently achieved impressive results on many language pairs. They mainly do well for Indo-European language pairs, where the languages share the same writing system. However, for translating between Chinese and English, the gap between the two different writing systems poses a major challenge because of a lack of systematic correspondence between the individual linguistic units. In this paper, we enable character-level NMT for Chinese by breaking down Chinese characters into linguistic units similar to those of Indo-European languages. We use the Wubi encoding scheme, which preserves the original shape and semantic information of the characters while also being reversible. We show promising results from training Wubi-based models at the character and subword levels with recurrent as well as convolutional models. Comment: 7 pages, 3 figures, 3rd Conference on Machine Translation (WMT18), 201

    Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training

    We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters. We name the two core modules of CDBERT Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries and Jiezi refers to the process of enhancing characters' glyph representations with structure understanding. To facilitate dictionary understanding, we propose three pre-training tasks: Masked Entry Modeling, Contrastive Learning for Synonym and Antonym, and Example Learning. We evaluate our method on both the modern Chinese understanding benchmark CLUE and the ancient Chinese benchmark CCLUE. Moreover, we propose a new polysemy discrimination task, PolyMRC, based on the collected dictionary of ancient Chinese. Our paradigm demonstrates consistent improvements over previous Chinese PLMs across all tasks. Moreover, our approach yields a significant boost in the few-shot setting of ancient Chinese understanding. Comment: To appear at ACL 2023 Finding

    Chinese–Japanese Unsupervised Neural Machine Translation Using Sub-character Level Information


    Chinese Character Decomposition for Neural MT with Multi-Word Expressions

    Chinese character decomposition has been used as a feature to enhance Machine Translation (MT) models, combining radicals into character- and word-level models. Recent work has investigated ideograph- or stroke-level embedding. However, questions remain about which decomposition levels of Chinese character representations, radicals or strokes, are best suited for MT. To investigate the impact of Chinese decomposition embedding in detail, i.e., at the radical, stroke, and intermediate levels, and how well these decompositions represent the meaning of the original character sequences, we carry out analysis with both automated and human evaluation of MT. Furthermore, we investigate whether the combination of decomposed Multiword Expressions (MWEs) can enhance model learning. MWE integration into MT has seen more than a decade of exploration. However, decomposed MWEs have not previously been explored. Comment: Accepted to publish in NoDaLiDa202

    An Investigation on Cognitive-Linguistic Skills of English-Chinese Bilingual Learners with and without Dyslexia in Singapore

    This thesis investigates dyslexia and the cognitive-linguistic skills, namely phonological awareness, orthographic knowledge, morphological awareness and rapid naming, of bilingual learners in Singapore whose first language is English and second language is Chinese. The two main research aims are to investigate whether English-Chinese bilingual learners with dyslexia diagnosed only in English are weaker than their typical counterparts in reading and all cognitive-linguistic skills in both languages or in either language, and to investigate which cognitive-linguistic skills are strong predictors of reading in each language. Results show that the bilingual learners with dyslexia performed significantly worse than their typical counterparts in reading and all cognitive-linguistic skills in both languages, although their dyslexia was diagnosed only in English. Results also show that all English cognitive-linguistic skills were predictive of English word reading, especially the unique predictive roles of morphological awareness and orthographic knowledge after rapid naming and phonological awareness were controlled. However, only rapid naming and morphological awareness were found to be predictive of Chinese word reading. The results suggest that dyslexia may manifest differently in the reading and cognitive-linguistic skills of English and Chinese in English-Chinese bilingual learners, based on the two different predictive models with different empirically and theoretically supported orders of cognitive-linguistic skills as predictors of reading development in the two languages. The difference in the unique contributions of the four cognitive-linguistic skills underlying reading development in the two languages may suggest that the difference lies in language structure and instruction. Keywords: dyslexia, bilingualism, English reading, Chinese reading, cognitive-linguistic skill