85 research outputs found
Myanmar named entity corpus and its use in syllable-based neural named entity recognition
Myanmar is a low-resource language, and this is one of the main reasons why Myanmar Natural Language Processing has lagged behind that of other languages. Currently, there is no publicly available named entity corpus for the Myanmar language. As part of this work, the first manually annotated named-entity-tagged corpus for the Myanmar language was developed and proposed to support the evaluation of named entity extraction. At present, our named entity corpus contains approximately 170,000 named entities and 60,000 sentences. This work also contributes the first evaluation of various deep neural network architectures on Myanmar Named Entity Recognition. Experimental results of 10-fold cross-validation revealed that syllable-based neural sequence models without additional feature engineering can give better results than the baseline CRF model. This work also aims to discover the effectiveness of neural network approaches to text processing for the Myanmar language, as well as to promote future research on this understudied language.
Tackling Hate Speech in Low-resource Languages with Context Experts
Given Myanmar's historical and socio-political context, hate speech spread on
social media has escalated into offline unrest and violence. This paper
presents findings from our remote study on the automatic detection of hate
speech online in Myanmar. We argue that effectively addressing this problem
will require community-based approaches that combine the knowledge of context
experts with machine learning tools that can analyze the vast amount of data
produced. To this end, we develop a systematic process to facilitate this
collaboration covering key aspects of data collection, annotation, and model
validation strategies. We highlight challenges in this area stemming from small
and imbalanced datasets, the need to balance non-glamorous data work and
stakeholder priorities, and closed data-sharing practices. Stemming from these
findings, we discuss avenues for further work in developing and deploying hate
speech detection systems for low-resource languages. Comment: ICTD 2022 conference paper
Impact of Tokenization on Language Models: An Analysis for Turkish
Tokenization is an important text preprocessing step to prepare input tokens
for deep language models. WordPiece and BPE are de facto methods employed by
important models, such as BERT and GPT. However, the impact of tokenization can
be different for morphologically rich languages, such as Turkic languages,
where many words can be generated by adding prefixes and suffixes. We compare
five tokenizers at different granularity levels, i.e. their outputs vary from
smallest pieces of characters to the surface form of words, including a
Morphological-level tokenizer. We train these tokenizers and pretrain
medium-sized language models using the RoBERTa pretraining procedure on the Turkish
split of the OSCAR corpus. We then fine-tune our models on six downstream
tasks. Our experiments, supported by statistical tests, reveal that the
Morphological-level tokenizer performs competitively with the de facto
tokenizers. Furthermore, we find that increasing the vocabulary size improves
the performance of Morphological and Word-level tokenizers more than that of de
facto tokenizers. The ratio of the number of vocabulary parameters to the total
number of model parameters can be empirically chosen as 20% for de facto
tokenizers and 40% for other tokenizers to obtain a reasonable trade-off
between model size and performance. Comment: submitted to ACM TALLI
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
We present NusaCrowd, a collaborative initiative to collect and unify
existing resources for Indonesian languages, including opening access to
previously non-public resources. Through this initiative, we have brought
together 137 datasets and 118 standardized data loaders. The quality of the
datasets has been assessed manually and automatically, and their value is
demonstrated through multiple experiments. NusaCrowd's data collection enables
the creation of the first zero-shot benchmarks for natural language
understanding and generation in Indonesian and the local languages of
Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual
automatic speech recognition benchmark in Indonesian and the local languages of
Indonesia. Our work strives to advance natural language processing (NLP)
research for languages that are under-represented despite being widely spoken.
Language and Culture in Northeast India and Beyond: In Honor of Robbins Burling
This volume celebrates the life and work of Robbins Burling, Emeritus Professor
of Anthropology and Linguistics at the University of Michigan, giant in the
fields of anthropological linguistics, language evolution, and language pedagogy,
and pioneer in the ethnography and linguistics of Tibeto-Burman-speaking
groups in the Northeast Indian region. We offer it to Professor Burling
– Rob – on the occasion of his 90th birthday, on the occasion of the 60th year of
his extraordinary scholarly productivity, and on the occasion of yet another –
yet another! – field trip to Northeast India, where his career in anthropology and
linguistics effectively began so many decades ago, and where he has amassed so
many devoted friends and colleagues – including ourselves. (First paragraph of Editor's Introduction)