Impact of Tokenization on Language Models: An Analysis for Turkish
Tokenization is an important text preprocessing step to prepare input tokens
for deep language models. WordPiece and BPE are the de facto methods employed by
prominent models such as BERT and GPT. However, the impact of tokenization can
be different for morphologically rich languages, such as Turkic languages,
where many words can be generated by adding prefixes and suffixes. We compare
five tokenizers at different granularity levels, i.e., their outputs range from
the smallest character pieces to the surface forms of words, and include a
Morphological-level tokenizer. We train these tokenizers and pretrain
medium-sized language models using RoBERTa pretraining procedure on the Turkish
split of the OSCAR corpus. We then fine-tune our models on six downstream
tasks. Our experiments, supported by statistical tests, reveal that the
Morphological-level tokenizer performs competitively with the de facto
tokenizers. Furthermore, we find that increasing the vocabulary size improves
the performance of the Morphological- and Word-level tokenizers more than that
of the de facto tokenizers. The ratio of vocabulary parameters to total model
parameters can be chosen empirically as 20% for the de facto tokenizers and 40%
for the other tokenizers to obtain a reasonable trade-off between model size
and performance.
Comment: submitted to ACM TALLI
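The reported vocabulary-to-model parameter ratios can be sanity-checked with a rough parameter count. A minimal sketch, assuming a standard BERT/RoBERTa-style encoder with hidden size 768, 12 layers, and FFN size 3072 (our assumptions for illustration, not figures taken from the paper; LayerNorm, biases, and the MLM head are ignored):

```python
# Rough estimate of the embedding (vocabulary) parameter share in a
# BERT/RoBERTa-style encoder. Biases, LayerNorm, and output head omitted.
def transformer_params(vocab_size, hidden=768, layers=12, ffn=3072, max_pos=512):
    embed = vocab_size * hidden            # token embedding matrix
    pos = max_pos * hidden                 # position embeddings
    per_layer = (4 * hidden * hidden       # Q, K, V, and output projections
                 + 2 * hidden * ffn)       # feed-forward up/down projections
    return embed + pos + layers * per_layer, embed

for vocab in (32_000, 64_000, 128_000):
    total, embed = transformer_params(vocab)
    print(f"vocab={vocab}: embedding share = {embed / total:.0%}")
```

Under these assumptions, a 32k vocabulary already puts the embedding share near 20%, and doubling the vocabulary pushes it toward 40%, which matches the order of magnitude of the trade-off the abstract describes.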
Survey on Publicly Available Sinhala Natural Language Processing Tools and Research
Sinhala is the native language of the Sinhalese people who make up the
largest ethnic group of Sri Lanka. The language belongs to the globe-spanning
Indo-European language family. However, lacking both linguistic and economic
capital, Sinhala remains, from the perspective of Natural Language Processing
tools and research, a resource-poor language: it has neither the economic
drive of its cousin English nor the sheer weight of numbers of a language such
as Chinese. A number of research groups from Sri Lanka have
noticed this dearth and the resultant dire need for proper tools and research
for Sinhala natural language processing. However, due to various reasons, these
attempts seem to lack coordination and awareness of each other. The objective
of this paper is to fill that gap with a comprehensive literature survey of the
publicly available Sinhala natural language tools and research, so that
researchers working in this field can better utilize the contributions of their
peers. As such, we shall upload this paper to arXiv and update it periodically
to reflect the advances made in the field.
AxomiyaBERTa: A Phonologically-aware Transformer Model for Assamese
Despite their successes in NLP, Transformer-based language models still
require extensive computing resources and suffer in low-resource or low-compute
settings. In this paper, we present AxomiyaBERTa, a novel BERT model for
Assamese, a morphologically-rich low-resource language (LRL) of Eastern India.
AxomiyaBERTa is trained only on the masked language modeling (MLM) task,
without the typical additional next sentence prediction (NSP) objective, and
our results show that in resource-scarce settings for very low-resource
languages like Assamese, MLM alone can be successfully leveraged for a range of
tasks. AxomiyaBERTa achieves SOTA on token-level tasks like Named Entity
Recognition and also performs well on "longer-context" tasks like Cloze-style
QA and Wiki Title Prediction, with the assistance of a novel embedding
disperser and phonological signals respectively. Moreover, we show that
AxomiyaBERTa can leverage phonological signals for even more challenging tasks,
such as a novel cross-document coreference task on a translated version of the
ECB+ corpus, where we present a new SOTA result for an LRL. Our source code and
evaluation scripts may be found at https://github.com/csu-signal/axomiyaberta.
Comment: 16 pages, 6 figures, 8 tables, appearing in Findings of the ACL: ACL
2023. This version was compiled using a pdfLaTeX-compatible Assamese script
font; Assamese text may appear differently here than in the official ACL 2023
proceedings.
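The MLM-only objective the model relies on can be illustrated with standard BERT-style input corruption. A minimal sketch, assuming the usual 15% masking rate and 80/10/10 replacement split from the original BERT recipe (these rates are our assumption for illustration, not details given in the abstract):

```python
import random

# Sketch of BERT-style masked language modeling (MLM) input corruption.
# Each selected position is replaced by [MASK] 80% of the time, a random
# vocabulary token 10% of the time, and left unchanged 10% of the time.
def mask_tokens(tokens, vocab, mask_token="[MASK]", p=0.15, rng=random):
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < p:
            labels.append(tok)                    # model must predict original
            r = rng.random()
            if r < 0.8:
                inputs.append(mask_token)
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # random replacement
            else:
                inputs.append(tok)                # kept, but still scored
        else:
            inputs.append(tok)
            labels.append(None)                   # position not scored
    return inputs, labels
```

The loss is then computed only over positions with a non-None label, which is what allows pretraining without any sentence-pair objective such as NSP.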
PLPrepare: A Grammar Checker for Challenging Cases
This study investigates one of the Polish language’s most arbitrary cases: the genitive masculine inanimate singular. It collects and ranks several guidelines to help language learners discern its proper usage and also introduces a framework to provide detailed feedback regarding arbitrary cases. The study tests this framework by implementing and evaluating a hybrid grammar checker called PLPrepare. PLPrepare performs similarly to other grammar checkers and is able to detect genitive case usages and provide feedback based on a number of error classifications.
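The kind of classified feedback such a checker produces can be sketched with a toy lookup-based check. A minimal sketch: the lexicon entries and error labels below are hypothetical illustrations of the idea, not PLPrepare's actual rules, data, or error taxonomy:

```python
# Toy genitive checker for Polish masculine inanimate nouns.
# Hypothetical two-entry lexicon mapping nominative -> genitive singular.
KNOWN_GENITIVES = {"komputer": "komputera", "czas": "czasu"}

def check_genitive(nominative, attempted_genitive):
    """Return an (error_class, suggestion) pair for a learner's attempt."""
    expected = KNOWN_GENITIVES.get(nominative)
    if expected is None:
        return ("unknown", None)             # lexicon miss: no feedback
    if attempted_genitive == expected:
        return ("ok", expected)
    return ("wrong-ending", expected)        # error class plus correct form
```

For example, `check_genitive("komputer", "komputeru")` flags the attempt and suggests "komputera"; a hybrid system would back this lookup with ranked heuristic guidelines for nouns outside the lexicon.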
TENSOR: retrieval and analysis of heterogeneous online content for terrorist activity recognition
The proliferation of terrorist-generated content online is a cause for concern, as it goes hand in hand with the rise of radicalisation and violent extremism. Law enforcement agencies (LEAs) need powerful platforms to help stem the influence of such content. This article showcases the TENSOR project, which focusses on the early detection of online terrorist activities, radicalisation and recruitment. Operating under the H2020 Secure Societies Challenge, TENSOR aims to develop a terrorism intelligence platform for increasing the ability of LEAs to identify, gather and analyse terrorism-related online content. The mechanisms for tackling this challenge by bringing together LEAs, industry, research, and legal experts are presented.