
    BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages

    We present BPEmb, a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE). In an evaluation using fine-grained entity typing as testbed, BPEmb performs competitively, and for some languages better than alternative subword approaches, while requiring vastly fewer resources and no tokenization. BPEmb is available at https://github.com/bheinzerling/bpemb
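
    The repository ships a small Python package alongside the embedding files. A minimal sketch of how the embeddings might be loaded and queried, assuming the bpemb package's published interface; the language code, vocabulary size, and dimensionality below are illustrative choices from the published range of models:

    # pip install bpemb
    from bpemb import BPEmb

    # Load pre-trained English subword embeddings; the constructor downloads
    # the BPE model and embedding file on first use. vs (vocabulary size)
    # and dim are illustrative values.
    bpemb_en = BPEmb(lang="en", vs=50000, dim=100)

    # No tokenizer needed: raw text is segmented directly into BPE subwords.
    subwords = bpemb_en.encode("tokenization-free embeddings")
    print(subwords)  # e.g. ['▁token', 'ization', '-', 'free', '▁embed', 'd', 'ings']

    # Each subword maps to a pre-trained vector; embed() stacks them.
    vectors = bpemb_en.embed("tokenization-free embeddings")
    print(vectors.shape)  # (number of subwords, 100)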

    Paradigm Completion for Derivational Morphology

    The generation of complex derived word forms has been an overlooked problem in NLP; we fill this gap by applying neural sequence-to-sequence models to the task. We overview the theoretical motivation for a paradigmatic treatment of derivational morphology, and introduce the task of derivational paradigm completion as a parallel to inflectional paradigm completion. State-of-the-art neural models, adapted from the inflection task, are able to learn a range of derivation patterns, and outperform a non-neural baseline by 16.4%. However, due to semantic, historical, and lexical considerations involved in derivational morphology, future work will be needed to achieve performance parity with inflection-generating systems. Comment: EMNLP 2017
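
    To make the task setup concrete: following the inflection-task convention the abstract parallels, a derivational example can be framed as character-level sequence transduction from a base form plus a derivation tag to the derived form. A minimal sketch of that framing; the tag scheme and word pairs below are hypothetical illustrations, not the paper's actual data:

    # Frame derivational paradigm completion as sequence-to-sequence:
    # source = base-form characters plus a derivation tag token,
    # target = derived-form characters. Tags and pairs are hypothetical.

    def make_example(base: str, tag: str, derived: str):
        src = list(base) + [f"<{tag}>"]  # characters plus a single tag token
        tgt = list(derived)
        return src, tgt

    pairs = [
        ("complete", "RESULT",  "completion"),
        ("teach",    "AGENT",   "teacher"),
        ("happy",    "NOMINAL", "happiness"),
    ]

    for base, tag, derived in pairs:
        src, tgt = make_example(base, tag, derived)
        print(" ".join(src), "->", " ".join(tgt))
    # c o m p l e t e <RESULT> -> c o m p l e t i o n

    A character-level encoder-decoder is then trained on such pairs, exactly as in inflectional paradigm completion; only the tag inventory changes.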

    MiLMo: Minority Multilingual Pre-trained Language Model

    Pre-trained language models are trained on large-scale unsupervised data and can then be fine-tuned on small-scale labeled datasets to achieve good results. Multilingual pre-trained language models can be trained on multiple languages, so a single model can understand several languages at once. At present, research on pre-trained models mainly focuses on resource-rich languages, while there is relatively little work on low-resource languages such as minority languages, and public multilingual pre-trained language models do not work well for them. Therefore, this paper constructs a multilingual pre-trained model named MiLMo that performs better on minority-language tasks, covering Mongolian, Tibetan, Uyghur, Kazakh, and Korean. To address the scarcity of minority-language datasets and to verify the effectiveness of the MiLMo model, this paper constructs a minority multilingual text classification dataset named MiTC and trains a word2vec model for each language. By comparing the word2vec model and the pre-trained model on the text classification task, this paper provides an optimal scheme for downstream-task research on minority languages. The final experimental results show that the pre-trained model outperforms the word2vec model, achieving the best results in minority multilingual text classification. The multilingual pre-trained model MiLMo, the multilingual word2vec models, and the multilingual text classification dataset MiTC are published at http://milmo.cmli-nlp.com/
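
    The word2vec baseline described here is the standard one: average the word vectors of a document and feed the result to a linear classifier. A minimal sketch using gensim and scikit-learn; the toy corpus and labels are placeholders, and MiLMo's own training setup is not reproduced here:

    # pip install gensim scikit-learn
    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.linear_model import LogisticRegression

    # Placeholder corpus: tokenized documents with class labels.
    docs = [["good", "news", "story"], ["sports", "match", "result"],
            ["good", "story"], ["match", "score"]]
    labels = [0, 1, 0, 1]

    # Train a small word2vec model (MiLMo trains one per language).
    w2v = Word2Vec(sentences=docs, vector_size=50, min_count=1, epochs=50)

    def doc_vector(tokens):
        # Average the vectors of in-vocabulary tokens.
        vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

    X = np.stack([doc_vector(d) for d in docs])
    clf = LogisticRegression().fit(X, labels)
    print(clf.predict(X))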

    Text Representation for Nonconcatenative Morphology

    The last six years have seen immense improvements in NMT translation quality. With the help of neural networks, NMT has achieved state-of-the-art results in translation quality; however, it still falls short of human-level translation. In this thesis, we propose new approaches to improving the language representation given as input to the NMT system, which can be achieved by exploiting language-specific knowledge such as phonetic alternations, morphology, and syntax. We propose a new approach to improving the language representation by exploiting morphological phenomena in Turkish and Hebrew, and show that the proposed segmentation approaches can improve translation quality. We used several different segmentation approaches and compared them with each other; all of them are rooted in language-specific morphological analysis of Turkish and Hebrew. We also examined the effect of each segmentation approach on translation quality, training six Transformer models with different segmentation approaches and comparing them with each other. For each segmentation approach, we evaluated translation quality using two automatic metrics and human evaluation. We observed that the segmentation approaches can improve translation quality under human evaluation, but not under the automatic metrics. We emphasize the importance of human evaluation for NMT and show that automatic metrics can often be misleading.
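
    The thesis's segmentation schemes rely on language-specific morphological analyzers for Turkish and Hebrew, which are not reproduced here; as a stand-in, the sketch below shows the general pipeline shape with an unsupervised subword model from the sentencepiece library. File names, vocabulary size, and the example word are illustrative assumptions:

    # pip install sentencepiece
    import sentencepiece as spm

    # Train a subword segmentation model on the source-side training corpus.
    # A morphology-aware scheme would replace this step with the output of a
    # language-specific analyzer; the pipeline shape stays the same.
    spm.SentencePieceTrainer.train(
        input="train.tr",        # illustrative corpus file name
        model_prefix="tr_seg",
        vocab_size=8000,         # illustrative vocabulary size
        model_type="bpe",
    )

    sp = spm.SentencePieceProcessor(model_file="tr_seg.model")

    # Segment a sentence into subword units before feeding it to the NMT model.
    pieces = sp.encode("evlerimizden", out_type=str)
    print(pieces)  # e.g. ['▁ev', 'ler', 'imiz', 'den']

    # After translation, merged surface text is recovered with decode().
    print(sp.decode(pieces))  # 'evlerimizden'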