
    Bayesian phylolinguistics infers the internal structure and the time-depth of the Turkic language family

    Despite more than 200 years of research, the internal structure of the Turkic language family remains subject to debate. Classifications of Turkic so far have been based on both classical historical–comparative linguistics and distance-based quantitative approaches. Although these studies yield an internal structure for the Turkic family, they cannot tell us how statistically robust the proposed branches are, nor can they reliably infer absolute divergence dates without assuming constant rates of change. Here we use computational Bayesian phylogenetic methods to build a phylogeny of the Turkic languages, express the reliability of the proposed branches in terms of probability, and estimate the time-depth of the family within credibility intervals. To this end, we collect a new dataset of 254 basic vocabulary items for thirty-two Turkic language varieties based on the recently introduced Leipzig–Jakarta list. Our application of Bayesian phylogenetic inference to lexical data of the Turkic languages is unprecedented. The resulting phylogenetic tree supports a binary structure for Turkic and replicates most of the conventional sub-branches within the Common Turkic branch. We calculate the robustness of the inferences for subgroups and for individual languages whose position in the tree is debatable. We infer the time-depth of the Turkic family at around 2,100 years before present, thus providing a reliable quantitative basis for previous estimates based on classical historical linguistics and lexicostatistics.
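
    The kind of data such an analysis starts from can be illustrated with a small sketch: cognate-class codings for basic vocabulary items are binarized into presence/absence characters and written out as a NEXUS data block, the usual input for Bayesian phylogenetics tools such as MrBayes or BEAST2. The toy codings, file name, and layout below are assumptions for illustration only; the Bayesian inference itself would be run in a dedicated tool, not in this script.

        # Hedged sketch: binarize cognate-class codings into a presence/absence
        # matrix and write a minimal NEXUS data block.  The three-variety toy
        # data are invented for illustration, not taken from the paper.
        cognacy = {                      # variety -> cognate class per concept
            "Turkish": {"water": "A", "fire": "A", "dog": "A"},
            "Kazakh":  {"water": "A", "fire": "A", "dog": "B"},
            "Chuvash": {"water": "B", "fire": "A", "dog": "C"},
        }

        # One binary character per attested cognate class of each concept.
        characters = sorted({(c, cls) for sets in cognacy.values()
                             for c, cls in sets.items()})

        def row(variety):
            sets = cognacy[variety]
            return "".join("1" if sets.get(c) == cls else "0"
                           for c, cls in characters)

        with open("turkic.nex", "w", encoding="utf-8") as f:
            f.write("#NEXUS\nBEGIN DATA;\n")
            f.write(f"  DIMENSIONS NTAX={len(cognacy)} NCHAR={len(characters)};\n")
            f.write('  FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=?;\n  MATRIX\n')
            for variety in cognacy:
                f.write(f"    {variety:<10} {row(variety)}\n")
            f.write("  ;\nEND;\n")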

    MiLMo: Minority Multilingual Pre-trained Language Model

    Pre-trained language models are trained on large-scale unsupervised data and can then be fine-tuned on small-scale labeled datasets, achieving good results. Multilingual pre-trained language models can be trained on multiple languages, so a single model can understand several languages at the same time. At present, research on pre-trained models focuses mainly on high-resource languages, while there is relatively little work on low-resource languages such as minority languages, and publicly available multilingual pre-trained language models do not work well for them. This paper therefore constructs a multilingual pre-trained model named MiLMo that performs better on minority-language tasks, covering Mongolian, Tibetan, Uyghur, Kazakh and Korean. To address the scarcity of minority-language datasets and to verify the effectiveness of MiLMo, this paper also constructs a minority multilingual text classification dataset named MiTC and trains a word2vec model for each language. By comparing the word2vec models and the pre-trained model on the text classification task, this paper provides an optimal scheme for downstream research on minority languages. The final experimental results show that the pre-trained model outperforms the word2vec models and achieves the best results in minority multilingual text classification. The multilingual pre-trained model MiLMo, the multilingual word2vec models and the multilingual text classification dataset MiTC are published at http://milmo.cmli-nlp.com/.
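
    As a rough illustration of the pre-trained-model side of this comparison, the sketch below fine-tunes a multilingual checkpoint for text classification with the Hugging Face transformers library. The checkpoint name, the CSV layout (columns "text" and integer "label"), and the hyperparameters are illustrative assumptions, not the authors' released code.

        # Hedged sketch: fine-tune a multilingual pre-trained model for text
        # classification, analogous to a MiLMo-vs-word2vec comparison.
        # "xlm-roberta-base" and the MiTC-style CSV files are placeholders.
        from datasets import load_dataset
        from transformers import (AutoModelForSequenceClassification,
                                  AutoTokenizer, Trainer, TrainingArguments)

        checkpoint = "xlm-roberta-base"   # placeholder multilingual checkpoint
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)

        dataset = load_dataset("csv", data_files={"train": "mitc_train.csv",
                                                  "test": "mitc_test.csv"})
        num_labels = len(set(dataset["train"]["label"]))  # integer class ids

        def tokenize(batch):
            return tokenizer(batch["text"], truncation=True, max_length=256)

        dataset = dataset.map(tokenize, batched=True)

        model = AutoModelForSequenceClassification.from_pretrained(
            checkpoint, num_labels=num_labels)

        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir="milmo-cls", num_train_epochs=3,
                                   per_device_train_batch_size=16),
            train_dataset=dataset["train"],
            eval_dataset=dataset["test"],
            tokenizer=tokenizer,          # enables padding via the default collator
        )
        trainer.train()
        print(trainer.evaluate())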

    Cross-Lingual and Low-Resource Sentiment Analysis

    Identifying sentiment in a low-resource language is essential for understanding opinions internationally and for responding to the urgent needs of locals affected by disaster incidents in different world regions. While tools and resources for recognizing sentiment in high-resource languages are plentiful, determining the most effective methods for achieving this task in a low-resource language that lacks annotated data is still an open research question. Most existing approaches for cross-lingual sentiment analysis have relied on high-resource machine translation systems, large amounts of parallel data, or resources only available for Indo-European languages. This work presents methods, resources, and strategies for identifying sentiment cross-lingually in a low-resource language. We introduce a cross-lingual sentiment model that can be trained on a high-resource language and applied directly to a low-resource language. The model offers the feature of lexicalizing the training data using a bilingual dictionary, but can perform well without any translation into the target language. Through an extensive experimental analysis, evaluated on 17 target languages, we show that the model performs well with bilingual word vectors pre-trained on an appropriate translation corpus. We compare in-genre and in-domain parallel corpora, out-of-domain parallel corpora, in-domain comparable corpora, and monolingual corpora, and show that a relatively small, in-domain parallel corpus works best as a transfer medium if it is available. We describe the conditions under which other resources and embedding generation methods are successful, and these include our strategies for leveraging in-domain comparable corpora for cross-lingual sentiment analysis. To enhance the ability of the cross-lingual model to identify sentiment in the target language, we present new feature representations for sentiment analysis that are incorporated in the cross-lingual model: bilingual sentiment embeddings that are used to create bilingual sentiment scores, and a method for updating the sentiment embeddings during training by lexicalization of the target language. This feature configuration works best for the largest number of target languages in both untargeted and targeted cross-lingual sentiment experiments. The cross-lingual model is studied further by evaluating the role of the source language, which has traditionally been assumed to be English. We build cross-lingual models using 15 source languages, including two non-European and non-Indo-European source languages: Arabic and Chinese. We show that language families play an important role in the performance of the model, as does the morphological complexity of the source language. In the last part of the work, we focus on sentiment analysis towards targets. We study Arabic as a representative morphologically complex language and develop models and morphological representation features for identifying entity targets and sentiment expressed towards them in Arabic open-domain text. Finally, we adapt our cross-lingual sentiment models for the detection of sentiment towards targets. Through cross-lingual experiments on Arabic and English, we demonstrate that our findings regarding resources, features, and language also hold true for the transfer of targeted sentiment.
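
    The direct-transfer setup described above can be sketched as follows: a classifier is trained on source-language sentences embedded in a shared bilingual vector space and applied unchanged to target-language sentences. The embedding file names, their plain word2vec text format, the 300-dimensional vectors, and the toy sentences are assumptions for illustration; the source and target vectors are assumed to have already been aligned into one space, for example with a bilingual dictionary and a Procrustes mapping.

        # Hedged sketch of cross-lingual sentiment transfer with pre-aligned
        # bilingual word embeddings.  File names and data are placeholders.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def load_vectors(path):
            """Read word vectors in the plain word2vec text format."""
            vecs = {}
            with open(path, encoding="utf-8") as f:
                next(f)                    # skip the "count dim" header line
                for line in f:
                    parts = line.rstrip().split(" ")
                    vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
            return vecs

        def sentence_embedding(tokens, vecs, dim=300):
            """Average the vectors of in-vocabulary tokens (zeros if none)."""
            found = [vecs[t] for t in tokens if t in vecs]
            return np.mean(found, axis=0) if found else np.zeros(dim, np.float32)

        # Assumed to be pre-aligned into a single bilingual space.
        src_vecs = load_vectors("english.aligned.vec")
        tgt_vecs = load_vectors("uyghur.aligned.vec")

        # Toy labeled source data and tokenized target sentences (placeholders).
        src_sents = [["great", "movie"], ["terrible", "service"]]
        src_labels = [1, 0]
        tgt_sents = [["..."], ["..."]]

        X_train = np.stack([sentence_embedding(s, src_vecs) for s in src_sents])
        clf = LogisticRegression(max_iter=1000).fit(X_train, src_labels)

        X_test = np.stack([sentence_embedding(s, tgt_vecs) for s in tgt_sents])
        print(clf.predict(X_test))   # predicted sentiment for target sentences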

    CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

    The Conference on Computational Natural Language Learning (CoNLL) features a shared task in which participants train and test their learning systems on the same data sets. In 2017, one of two tasks was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on the input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and the evaluation methodology, describe the data preparation, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.
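
    The shared task's headline metric is the labeled attachment score (LAS): the proportion of words whose predicted head and dependency relation both match the gold annotation. The sketch below computes a simplified LAS over two CoNLL-U files, assuming gold and system output share the same tokenization; the official evaluator additionally aligns system tokens with gold tokens, which is omitted here, and the file names are placeholders.

        # Hedged sketch: simplified labeled attachment score (LAS) over two
        # CoNLL-U files with identical tokenization.
        def read_conllu(path):
            """Yield (head, deprel) pairs for each syntactic word."""
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line or line.startswith("#"):
                        continue
                    cols = line.split("\t")
                    # Skip multiword-token ranges ("3-4") and empty nodes ("3.1").
                    if "-" in cols[0] or "." in cols[0]:
                        continue
                    yield cols[6], cols[7]   # HEAD and DEPREL columns

        def las(gold_path, system_path):
            gold = list(read_conllu(gold_path))
            system = list(read_conllu(system_path))
            correct = sum(g == s for g, s in zip(gold, system))
            return correct / len(gold)

        # Example usage (file names are placeholders):
        # print(f"LAS: {las('gold.conllu', 'system.conllu'):.2%}")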

    Proceedings of the 1st Conference on Central Asian Languages and Linguistics (ConCALL)

    The Conference on Central Asian Languages and Linguistics (ConCALL) was founded in 2014 at Indiana University by Dr. Öner Özçelik, director of the Center for Languages of the Central Asian Region (CeLCAR). As the nation's sole U.S. Department of Education-funded Language Resource Center focusing on the languages of the Central Asian region, CeLCAR's main mission is to strengthen and improve the nation's capacity for teaching and learning Central Asian languages through teacher training, research, materials development projects, and dissemination. As part of this mission, CeLCAR's ultimate goal is to unify and fortify the Central Asian language learning community by facilitating networking between linguists and language educators, encouraging research projects that will inform language instruction, and providing opportunities for professionals in the field to both showcase their work and receive feedback from their peers. ConCALL was thus established as the first international academic conference to bring together linguists and language educators working on the languages of the Central Asian region, including both the Altaic and Eastern Indo-European languages spoken there, to focus on research into how these languages are represented formally and acquired by second/foreign language learners, and to present research-driven teaching methods. Languages served by ConCALL include, but are not limited to: Azerbaijani, Dari, Karakalpak, Kazakh, Kyrgyz, Lokaabharan, Mari, Mongolian, Pamiri, Pashto, Persian, Russian, Shughnani, Tajiki, Tibetan, Tofalar, Tungusic, Turkish, Tuvan, Uyghur, Uzbek, Wakhi and more. The conference, held at Indiana University on 16-17 May 2014, was made possible through the generosity of our sponsors: the Center for Languages of the Central Asian Region (CeLCAR), Ostrom Grant Programs, IU's College of Arts and Humanities Center (CAHI), the Inner Asian and Uralic National Resource Center (IAUNRC), IU's School of Global and International Studies (SGIS), IU's College of Arts and Sciences, the Sinor Research Institute for Inner Asian Studies (SRIFIAS), IU's Department of Central Eurasian Studies (CEUS), and IU's Department of Linguistics.

    Recognition and Classification of Ancient Dwellings based on Elastic Grid and GLCM

    A rectangle algorithm is designed to extract ancient dwellings from village satellite images according to their pixel and shape features. Objects that this step fails to recognize are then distinguished by further extracting their texture features. In order to obtain standardized samples, three pre-processing operations, namely rotation, scaling, and clipping, are designed to unify their sizes and orientations.
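
    The title's GLCM (gray-level co-occurrence matrix) features can be sketched with scikit-image as below: a candidate patch is standardized and a small texture feature vector is extracted for a downstream classifier. The image path, patch size, distance/angle settings, and property list are illustrative assumptions rather than the paper's exact setup, and the rotation step that unifies orientation is omitted.

        # Hedged sketch: GLCM texture features with scikit-image (>= 0.19 for
        # the "gray" spelling of the function names).  Inputs are placeholders.
        import numpy as np
        from skimage import io
        from skimage.color import rgb2gray
        from skimage.feature import graycomatrix, graycoprops
        from skimage.transform import resize
        from skimage.util import img_as_ubyte

        # Standardize the sample: convert to gray and rescale to a fixed size.
        patch = io.imread("dwelling_candidate.png")   # assumed RGB patch
        gray = img_as_ubyte(resize(rgb2gray(patch), (64, 64), anti_aliasing=True))

        # Gray-level co-occurrence matrix over four directions at distance 1.
        glcm = graycomatrix(gray, distances=[1],
                            angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                            levels=256, symmetric=True, normed=True)

        # Concatenate a few GLCM properties into a texture feature vector.
        features = np.concatenate([
            graycoprops(glcm, prop).ravel()
            for prop in ("contrast", "homogeneity", "energy", "correlation")
        ])
        print(features.shape)   # 4 properties x 4 angles = 16 features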