152 research outputs found
Survey on Publicly Available Sinhala Natural Language Processing Tools and Research
Sinhala is the native language of the Sinhalese people who make up the
largest ethnic group of Sri Lanka. The language belongs to the globe-spanning
language tree, Indo-European. However, due to poverty in both linguistic and
economic capital, Sinhala, in the perspective of Natural Language Processing
tools and research, remains a resource-poor language which has neither the
economic drive its cousin English has nor the sheer push of the law of numbers
a language such as Chinese has. A number of research groups from Sri Lanka have
noticed this dearth and the resultant dire need for proper tools and research
for Sinhala natural language processing. However, due to various reasons, these
attempts seem to lack coordination and awareness of each other. The objective
of this paper is to fill that gap of a comprehensive literature survey of the
publicly available Sinhala natural language tools and research so that the
researchers working in this field can better utilize contributions of their
peers. As such, we shall be uploading this paper to arXiv and perpetually
update it periodically to reflect the advances made in the field
Sinhala-English Parallel Word Dictionary Dataset
Parallel datasets are vital for performing and evaluating any kind of
multilingual task. However, in the cases where one of the considered language
pairs is a low-resource language, the existing top-down parallel data such as
corpora are lacking in both tally and quality due to the dearth of human
annotation. Therefore, for low-resource languages, it is more feasible to move
in the bottom-up direction where finer granular pairs such as dictionary
datasets are developed first. They may then be used for mid-level tasks such as
supervised multilingual word embedding alignment. These in turn can later guide
higher-level tasks in the order of aligning sentence or paragraph text corpora
used for Machine Translation (MT). Even though more approachable than
generating and aligning a massive corpus for a low-resource language, for the
same reason of apathy from larger research entities, even these finer granular
data sets are lacking for some low-resource languages. We have observed that
there is no free and open dictionary data set for the low-resource language,
Sinhala. Thus, in this work, we introduce three parallel English-Sinhala word
dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which
help in multilingual Natural Language Processing (NLP) tasks related to English
and Sinhala languages. In this paper, we explain the dataset creation pipeline
as well as the experimental results of the tests we have carried out to verify
the quality of the data sets. The data sets and the related scripts are
available at https://github.com/kasunw22/sinhala-para-dict
Approaches Used to Recognise and Decipher Ancient Inscriptions: A Review
Inscriptions play a vital role in historical studies. In order to boost tourism and academic necessities, archaeological experts, epigraphers and researchers recognised and deciphered a great number of inscriptions using numerous approaches. Due to the technological revolution and inefficiencies of manual methods, humans tend to use automated systems. Hence, computational archaeology plays an important role in the current era. Even though different types of research are conducted in this domain, it still poses a big challenge and needs more accurate and efficient methods. This paper presents a review of manual and computational approaches used to recognise and decipher ancient inscriptions.Keywords: ancient inscriptions, computational archaeology, decipher, script
Leveraging Auxiliary Domain Parallel Data in Intermediate Task Fine-tuning for Low-resource Translation
NMT systems trained on Pre-trained Multilingual Sequence-Sequence (PMSS)
models flounder when sufficient amounts of parallel data is not available for
fine-tuning. This specifically holds for languages missing/under-represented in
these models. The problem gets aggravated when the data comes from different
domains. In this paper, we show that intermediate-task fine-tuning (ITFT) of
PMSS models is extremely beneficial for domain-specific NMT, especially when
target domain data is limited/unavailable and the considered languages are
missing or under-represented in the PMSS model. We quantify the domain-specific
results variations using a domain-divergence test, and show that ITFT can
mitigate the impact of domain divergence to some extent.Comment: Accepted for poster presentation at the Practical Machine Learning
for Developing Countries (PML4DC) workshop, ICLR 202
Neural Machine Translation For Low Resource Languages
Neural Machine translation is a challenging task due to the inherent complex
nature and the fluidity that natural languages bring. Nonetheless, in recent
years, it has achieved state-of-the-art performance in several language pairs.
Although, a lot of traction can be seen in the areas of multilingual neural
machine translation (MNMT) in the recent years, there are no comprehensive
survey done to identify what approaches work well. The goal of this paper is to
investigate the realm of low resource languages and build a Neural Machine
Translation model to achieve state-of-the-art results. The paper looks to build
upon the mBART language model and explore strategies to augment it with various
NLP and Deep Learning techniques like back translation and transfer learning.
This implementation tries to unpack the architecture of the NMT application and
determine the different components which offers us opportunities to amend the
said application within the purview of the low resource languages problem
space
- …