
    A non-projective greedy dependency parser with bidirectional LSTMs

    The LyS-FASTPARSE team presents BIST-COVINGTON, a neural implementation of the Covington (2001) algorithm for non-projective dependency parsing. The bidirectional LSTM approach of Kiperwasser and Goldberg (2016) is used to train a greedy parser with a dynamic oracle to mitigate error propagation. The model participated in the CoNLL 2017 UD Shared Task. Despite using no ensemble methods and relying on the baseline segmentation and PoS tagging, the parser obtained good results on both macro-average LAS and UAS in the big treebanks category (55 languages), ranking 7th out of 33 teams. In the all treebanks category we ranked 16th and 12th on LAS and UAS, respectively. The gap between the all and big categories is mainly due to poor performance on four parallel PUD treebanks, suggesting that some 'suffixed' treebanks (e.g. Spanish-AnCora) perform poorly in cross-treebank settings, which does not occur with the corresponding 'unsuffixed' treebank (e.g. Spanish). By changing that, we obtain the 11th best LAS among all runs (official and unofficial). The code is available at https://github.com/CoNLL-UD-2017/LyS-FASTPARSE
    Comment: 12 pages, 2 figures, 5 tables

    MiLMo: Minority Multilingual Pre-trained Language Model

    Pre-trained language models are trained on large-scale unsupervised data and can then be fine-tuned on small-scale labeled datasets to achieve good results. Multilingual pre-trained language models are trained on multiple languages, so a single model can understand several languages at the same time. At present, research on pre-trained models mainly focuses on resource-rich languages, while there is relatively little research on low-resource languages such as minority languages, and the public multilingual pre-trained language models do not work well for minority languages. Therefore, this paper constructs a multilingual pre-trained model named MiLMo that performs better on minority language tasks, covering Mongolian, Tibetan, Uyghur, Kazakh and Korean. To address the scarcity of datasets for minority languages and to verify the effectiveness of the MiLMo model, this paper constructs a minority multilingual text classification dataset named MiTC and trains a word2vec model for each language. By comparing the word2vec models and the pre-trained model on the text classification task, this paper provides an optimal scheme for downstream task research on minority languages. The final experimental results show that the pre-trained model outperforms the word2vec models and achieves the best results in minority multilingual text classification. The multilingual pre-trained model MiLMo, the multilingual word2vec models and the multilingual text classification dataset MiTC are published at http://milmo.cmli-nlp.com/