BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages
We present BPEmb, a collection of pre-trained subword unit embeddings in 275
languages, based on Byte-Pair Encoding (BPE). In an evaluation using
fine-grained entity typing as testbed, BPEmb performs competitively, and for
some languages better than alternative subword approaches, while requiring
vastly fewer resources and no tokenization. BPEmb is available at
https://github.com/bheinzerling/bpem
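A minimal sketch of the Byte-Pair Encoding merge procedure that such subword embeddings are built on (the toy corpus and merge count below are illustrative, not from the paper):

```python
import re
from collections import Counter

def learn_bpe(vocab, num_merges):
    """Learn BPE merges from a dict mapping space-separated
    symbol sequences (with word-final marker </w>) to frequencies."""
    vocab = dict(vocab)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # merge "a b" into "ab", but only at whole-symbol boundaries
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), w): f for w, f in vocab.items()}
    return merges

def segment(word, merges):
    """Apply the learned merges, in order, to segment a new word."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges = learn_bpe(corpus, 3)
print(merges)
print(segment("lowest", merges))
```

Because the merge table is learned purely from character co-occurrence statistics, the same procedure works for any of the 275 languages without a language-specific tokenizer, which is the property the abstract highlights.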
Paradigm Completion for Derivational Morphology
The generation of complex derived word forms has been an overlooked problem
in NLP; we fill this gap by applying neural sequence-to-sequence models to the
task. We overview the theoretical motivation for a paradigmatic treatment of
derivational morphology, and introduce the task of derivational paradigm
completion as a parallel to inflectional paradigm completion. State-of-the-art
neural models, adapted from the inflection task, are able to learn a range of
derivation patterns, and outperform a non-neural baseline by 16.4%. However,
due to semantic, historical, and lexical considerations involved in
derivational morphology, future work will be needed to achieve performance
parity with inflection-generating systems. Comment: EMNLP 201
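A rough sketch of the kind of non-neural, suffix-rule baseline that derivational paradigm completion can be compared against (the rules, category names, and examples below are illustrative, not taken from the paper):

```python
# Toy suffix-rule baseline for derivational paradigm completion:
# map a base form plus a target derivational category to a derived
# form via hand-written suffix rules. Purely illustrative.
RULES = {
    "ADJ->NOMINAL": [("y", "iness"), ("", "ness")],  # happy -> happiness
    "V->AGENT": [("e", "er"), ("", "er")],           # bake -> baker
}

def derive(base, category):
    """Return the derived form, applying the first matching rule."""
    for old_suffix, new_suffix in RULES[category]:
        if base.endswith(old_suffix):
            stem = base[: len(base) - len(old_suffix)] if old_suffix else base
            return stem + new_suffix
    return base

print(derive("happy", "ADJ->NOMINAL"))
print(derive("bake", "V->AGENT"))
```

Such rule baselines fail exactly where the abstract says neural models help: irregular, semantically conditioned, or historically lexicalized derivations that no fixed suffix table covers.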
MiLMo: Minority Multilingual Pre-trained Language Model
Pre-trained language models are trained on large-scale unsupervised data and
can then be fine-tuned on small-scale labeled datasets to achieve good
results. Multilingual pre-trained language models can be trained on multiple
languages, so a single model can understand multiple languages at the same
time. At present, research on pre-trained models mainly focuses on
rich-resource languages, while there is relatively little work on low-resource
languages such as minority languages, and public multilingual pre-trained
language models do not work well for minority languages. Therefore, this paper
constructs a multilingual pre-trained model named MiLMo that performs better on
minority language tasks, including Mongolian, Tibetan, Uyghur, Kazakh and
Korean. To solve the problem of scarcity of datasets on minority languages and
verify the effectiveness of the MiLMo model, this paper constructs a minority
multilingual text classification dataset named MiTC, and trains a word2vec
model for each language. By comparing the word2vec model and the pre-trained
model in the text classification task, this paper provides an optimal scheme
for the downstream task research of minority languages. The final experimental
results show that the performance of the pre-trained model is better than that
of the word2vec model, and it has achieved the best results in minority
multilingual text classification. The multilingual pre-trained model MiLMo,
multilingual word2vec model and multilingual text classification dataset MiTC
are published on http://milmo.cmli-nlp.com/
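A minimal sketch of the averaged-word-vector classification baseline that such pre-trained models are compared against (the vectors and class centroids below are hand-made for illustration; a real baseline would use trained word2vec vectors, e.g. from gensim):

```python
import math

# Toy averaged-word-vector text classifier, sketching a word2vec-style
# baseline: represent a document as the mean of its word vectors and
# assign the label of the nearest class centroid by cosine similarity.
VECS = {
    "good": (0.9, 0.1), "great": (0.8, 0.2),
    "bad": (0.1, 0.9), "awful": (0.2, 0.8),
}
CENTROIDS = {"pos": (0.85, 0.15), "neg": (0.15, 0.85)}

def doc_vector(tokens):
    """Average the vectors of all in-vocabulary tokens."""
    known = [VECS[t] for t in tokens if t in VECS]
    if not known:
        return (0.0, 0.0)
    return tuple(sum(dim) / len(known) for dim in zip(*known))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(tokens):
    vec = doc_vector(tokens)
    return max(CENTROIDS, key=lambda label: cosine(vec, CENTROIDS[label]))

print(classify(["good", "great"]))
```

A pre-trained model replaces the static lookup table with contextual representations, which is why the abstract reports it outperforming the word2vec baseline on the MiTC classification task.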
Text Representation for Nonconcatenative Morphology
The last six years have seen immense improvement in NMT translation quality. With the help of neural networks, NMT has achieved state-of-the-art results in translation quality, yet it still falls short of human-level translation. In this thesis, we propose new approaches to improve the language representation given as input to the NMT system. This can be achieved by exploiting language-specific knowledge such as phonetic alternations, morphology, and syntax. We propose a new approach that exploits morphological phenomena in Turkish and Hebrew, and show that the proposed segmentation approaches can improve translation quality. We used several different segmentation approaches and compared them with each other; all of them are rooted in language-specific morphological analysis of Turkish and Hebrew. We also examined the effect of each segmentation approach on translation quality, training six different Transformer models with different segmentation approaches and comparing them with each other. For each segmentation approach, we evaluated translation quality using two automatic metrics and human evaluation. We observed that the segmentation approaches can improve translation quality according to the human evaluation, but not according to the automatic metrics. This underlines the importance of human evaluation for NMT and shows that automatic metrics can often be misleading.
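A rough sketch of the kind of morphology-aware segmentation such preprocessing performs: greedy longest-match splitting against a morpheme inventory, with a continuation marker so the split can be undone after translation (the tiny Turkish inventory and the "@@" marker convention below are illustrative, not the thesis's actual analyzer):

```python
# Greedy longest-match morpheme segmentation. A real system would use
# a full morphological analyzer; this toy inventory is illustrative.
MORPHEMES = {"ev", "ler", "imiz", "de", "kitap", "lar"}

def morph_split(word, inventory=MORPHEMES):
    """Split a word into known morphemes, longest match first."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in inventory:
                pieces.append(word[i:j])
                i = j
                break
        else:  # no known morpheme starts here: emit one character as-is
            pieces.append(word[i])
            i += 1
    # mark non-final pieces so detokenization can rejoin the word
    return [p + "@@" if k < len(pieces) - 1 else p
            for k, p in enumerate(pieces)]

# e.g. Turkish "evlerimizde" ("in our houses") = ev + ler + imiz + de
print(morph_split("evlerimizde"))
```

Feeding such morpheme-level units to the Transformer shrinks the effective vocabulary of agglutinative languages like Turkish, which is the representational gain the thesis investigates.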