Impact of Tokenization on Language Models: An Analysis for Turkish
Tokenization is an important text preprocessing step to prepare input tokens
for deep language models. WordPiece and BPE are the de facto methods employed by
prominent models such as BERT and GPT. However, the impact of tokenization can
be different for morphologically rich languages, such as Turkic languages,
where many words can be generated by adding prefixes and suffixes. We compare
five tokenizers at different granularity levels, i.e., their outputs range from
the smallest character pieces to the surface forms of words, and include a
Morphological-level tokenizer. We train these tokenizers and pretrain
medium-sized language models using RoBERTa pretraining procedure on the Turkish
split of the OSCAR corpus. We then fine-tune our models on six downstream
tasks. Our experiments, supported by statistical tests, reveal that the
Morphological-level tokenizer performs competitively with the de facto
tokenizers. Furthermore, we find that increasing the vocabulary size improves
the performance of the Morphological- and Word-level tokenizers more than that
of the de facto tokenizers. The ratio of vocabulary parameters to total model
parameters can be chosen empirically as 20% for the de facto tokenizers and 40%
for the other tokenizers to obtain a reasonable trade-off between model size
and performance.
Comment: submitted to ACM TALLI
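The reported vocabulary-to-model parameter ratios can be sanity-checked with a rough parameter count. A minimal sketch, assuming a standard BERT/RoBERTa-style encoder with hidden size 768, 12 layers, and FFN size 3072 (our assumptions for illustration, not figures taken from the paper; LayerNorm, biases, and the MLM head are ignored):

```python
# Rough estimate of the embedding (vocabulary) parameter share in a
# BERT/RoBERTa-style encoder. Biases, LayerNorm, and output head omitted.
def transformer_params(vocab_size, hidden=768, layers=12, ffn=3072, max_pos=512):
    embed = vocab_size * hidden            # token embedding matrix
    pos = max_pos * hidden                 # position embeddings
    per_layer = (4 * hidden * hidden       # Q, K, V, and output projections
                 + 2 * hidden * ffn)       # feed-forward up/down projections
    return embed + pos + layers * per_layer, embed

for vocab in (32_000, 64_000, 128_000):
    total, embed = transformer_params(vocab)
    print(f"vocab={vocab}: embedding share = {embed / total:.0%}")
```

Under these assumptions, a 32k vocabulary already puts the embedding share near 20%, and doubling the vocabulary pushes it toward 40%, which matches the order of magnitude of the trade-off the abstract describes.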
Survey on Publicly Available Sinhala Natural Language Processing Tools and Research
Sinhala is the native language of the Sinhalese people who make up the
largest ethnic group of Sri Lanka. The language belongs to the globe-spanning
Indo-European language family. However, lacking both linguistic and economic
capital, Sinhala remains, from the perspective of Natural Language Processing
tools and research, a resource-poor language: it has neither the economic
drive of its cousin English nor the sheer weight of numbers of a language such
as Chinese. A number of research groups from Sri Lanka have
noticed this dearth and the resultant dire need for proper tools and research
for Sinhala natural language processing. However, due to various reasons, these
attempts seem to lack coordination and awareness of each other. The objective
of this paper is to fill that gap with a comprehensive literature survey of the
publicly available Sinhala natural language tools and research, so that
researchers working in this field can better utilize the contributions of their
peers. As such, we shall upload this paper to arXiv and update it periodically
to reflect the advances made in the field.
AxomiyaBERTa: A Phonologically-aware Transformer Model for Assamese
Despite their successes in NLP, Transformer-based language models still
require extensive computing resources and suffer in low-resource or low-compute
settings. In this paper, we present AxomiyaBERTa, a novel BERT model for
Assamese, a morphologically-rich low-resource language (LRL) of Eastern India.
AxomiyaBERTa is trained only on the masked language modeling (MLM) task,
without the typical additional next sentence prediction (NSP) objective, and
our results show that in resource-scarce settings for very low-resource
languages like Assamese, MLM alone can be successfully leveraged for a range of
tasks. AxomiyaBERTa achieves SOTA on token-level tasks like Named Entity
Recognition and also performs well on "longer-context" tasks like Cloze-style
QA and Wiki Title Prediction, with the assistance of a novel embedding
disperser and phonological signals respectively. Moreover, we show that
AxomiyaBERTa can leverage phonological signals for even more challenging tasks,
such as a novel cross-document coreference task on a translated version of the
ECB+ corpus, where we present a new SOTA result for an LRL. Our source code and
evaluation scripts may be found at https://github.com/csu-signal/axomiyaberta.
Comment: 16 pages, 6 figures, 8 tables, appearing in Findings of the ACL: ACL
2023. This version was compiled using a pdfLaTeX-compatible Assamese script
font; Assamese text may appear differently here than in the official ACL 2023
proceedings.
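The MLM-only objective the model relies on can be illustrated with standard BERT-style input corruption. A minimal sketch, assuming the usual 15% masking rate and 80/10/10 replacement split from the original BERT recipe (these rates are our assumption for illustration, not details given in the abstract):

```python
import random

# Sketch of BERT-style masked language modeling (MLM) input corruption.
# Each selected position is replaced by [MASK] 80% of the time, a random
# vocabulary token 10% of the time, and left unchanged 10% of the time.
def mask_tokens(tokens, vocab, mask_token="[MASK]", p=0.15, rng=random):
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < p:
            labels.append(tok)                    # model must predict original
            r = rng.random()
            if r < 0.8:
                inputs.append(mask_token)
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # random replacement
            else:
                inputs.append(tok)                # kept, but still scored
        else:
            inputs.append(tok)
            labels.append(None)                   # position not scored
    return inputs, labels
```

The loss is then computed only over positions with a non-None label, which is what allows pretraining without any sentence-pair objective such as NSP.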
PLPrepare: A Grammar Checker for Challenging Cases
This study investigates one of the Polish language’s most arbitrary cases: the genitive masculine inanimate singular. It collects and ranks several guidelines to help language learners discern its proper usage and also introduces a framework to provide detailed feedback regarding arbitrary cases. The study tests this framework by implementing and evaluating a hybrid grammar checker called PLPrepare. PLPrepare performs similarly to other grammar checkers and is able to detect genitive case usages and provide feedback based on a number of error classifications.
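The kind of classified feedback such a checker produces can be sketched with a toy lookup-based check. A minimal sketch: the lexicon entries and error labels below are hypothetical illustrations of the idea, not PLPrepare's actual rules, data, or error taxonomy:

```python
# Toy genitive checker for Polish masculine inanimate nouns.
# Hypothetical two-entry lexicon mapping nominative -> genitive singular.
KNOWN_GENITIVES = {"komputer": "komputera", "czas": "czasu"}

def check_genitive(nominative, attempted_genitive):
    """Return an (error_class, suggestion) pair for a learner's attempt."""
    expected = KNOWN_GENITIVES.get(nominative)
    if expected is None:
        return ("unknown", None)             # lexicon miss: no feedback
    if attempted_genitive == expected:
        return ("ok", expected)
    return ("wrong-ending", expected)        # error class plus correct form
```

For example, `check_genitive("komputer", "komputeru")` flags the attempt and suggests "komputera"; a hybrid system would back this lookup with ranked heuristic guidelines for nouns outside the lexicon.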
TENSOR: retrieval and analysis of heterogeneous online content for terrorist activity recognition
The proliferation of terrorist-generated content online is a cause for concern, as it goes hand in hand with the rise of radicalisation and violent extremism. Law enforcement agencies (LEAs) need powerful platforms to help stem the influence of such content. This article showcases the TENSOR project, which focusses on the early detection of online terrorist activities, radicalisation and recruitment. Operating under the H2020 Secure Societies Challenge, TENSOR aims to develop a terrorism intelligence platform for increasing the ability of LEAs to identify, gather and analyse terrorism-related online content. The mechanisms for tackling this challenge by bringing together LEAs, industry, research, and legal experts are presented.