5 research outputs found
Introducing Syllable Tokenization for Low-resource Languages: A Case Study with Swahili
Many attempts have been made in multilingual NLP to ensure that pre-trained
language models, such as mBERT or GPT2, improve and become applicable to
low-resource languages. To achieve multilingualism for pre-trained language
models (PLMs), we need techniques to create word embeddings that capture the
linguistic characteristics of any language. Tokenization is one such technique
because it allows words to be split into characters or subwords,
creating word embeddings that best represent the structure of the language.
Creating such word embeddings is essential to applying PLMs to languages
on which the model was not trained, enabling multilingual NLP. However, most PLMs
use generic tokenization methods like BPE, WordPiece, or unigram, which may not
suit specific languages. We hypothesize that tokenization based on syllables
within the input text, which we call syllable tokenization, should facilitate
the development of syllable-aware language models. Syllable-aware language
models make it possible to apply PLMs to languages that are rich in syllables,
such as Swahili. Previous works introduced subword tokenization, and our work
extends such efforts. Notably, we propose a syllable tokenizer and adopt an
experiment-centric approach to validate the proposed tokenizer on the
Swahili language. We conducted text-generation experiments with GPT2 to
evaluate the effectiveness of the syllable tokenizer. Our results show that the
proposed syllable tokenizer generates syllable embeddings that effectively
represent the Swahili language.
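The idea of splitting text into syllables rather than generic subwords can be illustrated with a minimal sketch. It is not the tokenizer proposed in the paper: it assumes, as a simplification, that every Swahili syllable ends in one of the five vowels, and it ignores syllabic nasals (for example, the initial "m" in "mtoto"). The function names `syllabify` and `tokenize` are hypothetical.

```python
import re

# Simplifying assumption: a syllable is zero or more non-vowel letters
# followed by exactly one vowel (open-syllable structure).
_SYLLABLE = re.compile(r"[^aeiou]*[aeiou]")


def syllabify(word):
    """Split one lowercase word into vowel-final syllables."""
    word = word.lower()
    syllables = _SYLLABLE.findall(word)
    # Any trailing consonants that follow the last vowel are attached
    # to the final syllable (or kept alone if the word has no vowel).
    tail = word[sum(len(s) for s in syllables):]
    if tail:
        if syllables:
            syllables[-1] += tail
        else:
            syllables = [tail]
    return syllables


def tokenize(text):
    """Return a flat list of syllable tokens for all words in the text."""
    tokens = []
    for word in re.findall(r"[a-z]+", text.lower()):
        tokens.extend(syllabify(word))
    return tokens


print(tokenize("Habari ya asubuhi"))
```

A tokenizer used for model training would additionally need a vocabulary, word-boundary markers, and a mapping from syllables to integer ids; those parts are omitted here.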
Image Classification for CSSVD Detection in Cacao Plants
The detection of diseases within plants has attracted a lot of attention from
computer vision enthusiasts. Despite the progress made in detecting diseases in
many plants, there remains a research gap in training image classifiers to detect
cacao swollen shoot virus disease (CSSVD), which affects cacao
plants. This gap has mainly been due to the unavailability of high-quality
labeled training data. Moreover, institutions have been hesitant to share their
data related to CSSVD. To fill these gaps, we propose the development of image
classifiers to detect CSSVD-infected cacao plants. Our proposed solution is
based on VGG16, ResNet50 and Vision Transformer (ViT). We evaluate the
classifiers on the recently released and publicly accessible KaraAgroAI Cocoa
dataset. Our best image classifier, based on ResNet50, achieves 95.39%
precision, 93.75% recall, 94.34% F1-score and 94% accuracy after only 20
epochs. This is a +9.75% improvement in recall compared to previous
works. Our results indicate that the image classifiers learn to identify cacao
plants infected with CSSVD.
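For readers unfamiliar with the reported metrics, the sketch below shows how precision, recall, and F1-score are computed from predicted and true labels for a single positive class. The label encoding (1 for CSSVD-infected, 0 for healthy) and the toy label lists are assumptions made for illustration; the abstract does not state how the paper averages its metrics across classes.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted
    # or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = CSSVD-infected, 0 = healthy.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Recall is the metric that penalizes missed infections, which is why an improvement in recall matters for disease detection.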
Domain Adaptation in Intent Classification Systems: A Review
Dialogue agents, which perform specific tasks, are part of the long-term goal
of NLP researchers to build intelligent agents that communicate with humans in
natural language. Such systems should adapt easily from one domain to another
to assist users in completing tasks. Researchers have developed a broad range
of techniques, objectives, and datasets for intent classification to achieve
such systems. Despite the progress in developing intent classification systems
(ICS), a systematic review of the progress from a technical perspective is yet
to be conducted. As a result, important implementation details of intent
classification remain restricted and unclear, making it hard for
NLP researchers to develop new methods. To fill this gap,
we review contemporary works in intent classification. Specifically, we conduct
a thorough technical review of the datasets, domains, tasks, and methods needed
to train the intent classification part of dialogue systems. Our structured
analysis describes why intent classification is difficult and studies the
limitations to domain adaptation while presenting opportunities for future
work.
Dealing with Imbalanced Classes in Bot-IoT Dataset
With the rapidly spreading usage of Internet of Things (IoT) devices, a
network intrusion detection system (NIDS) plays an important role in detecting
and protecting against various types of attacks in the IoT network. To evaluate the
robustness of NIDSs in the IoT network, existing work proposed a
realistic botnet dataset for the IoT network (Bot-IoT dataset) and applied it to
machine learning-based anomaly detection. This dataset is imbalanced
because the number of normal packets is much smaller
than that of attack packets. Such imbalance may make it difficult
to identify the minority class correctly. In this thesis, to address the class
imbalance problem in the Bot-IoT dataset, we propose a binary classification
method with the synthetic minority over-sampling technique (SMOTE). The proposed
classifier aims to detect attack packets and overcome the class imbalance
problem using the SMOTE algorithm. Through numerical results, we demonstrate
the proposed classifier's fundamental characteristics and the impact of
imbalanced data on its performance.
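The core step of SMOTE is interpolation between a minority-class sample and one of its nearest minority-class neighbours. The sketch below illustrates that step in plain Python; it is not the thesis's implementation. The function name `smote`, its parameters, and the two-dimensional toy feature vectors are hypothetical. In the Bot-IoT setting described above, the minority class would be the normal packets.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples from a list of minority feature vectors.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample.
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        # Choose one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Interpolate at a random position on the segment x -> neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Hypothetical two-feature vectors standing in for normal packets.
normal_packets = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(normal_packets, n_new=6, k=2)
print(len(new_samples))
```

Oversampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution.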
Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models
Large language models (LLMs) have increased interest in vision language
models (VLMs), which process image-text pairs as input. Studies investigating
the visual understanding ability of VLMs have been proposed, but such studies
are still preliminary because existing datasets do not permit a comprehensive
evaluation of the fine-grained visual linguistic abilities of VLMs across
multiple languages. To further explore the strengths of VLMs, such as GPT-4V
\cite{openai2023GPT4}, we developed new datasets for the systematic and
qualitative analysis of VLMs. Our contribution is four-fold: 1) we introduced
nine vision-and-language (VL) tasks (including object recognition, image-text
matching, and more) and constructed multilingual visual-text datasets in four
languages: English, Japanese, Swahili, and Urdu, using templates
containing questions and prompting GPT-4V to generate the
answers and the rationales, 2) introduced a new VL task named
unrelatedness, 3) introduced rationales to enable human understanding
of the VLM reasoning process, and 4) employed human evaluation to measure the
suitability of proposed datasets for VL tasks. We show that VLMs can be
fine-tuned on our datasets. Our work is the first to conduct such analyses in
Swahili and Urdu. Also, it introduces rationales in VL analysis, which
played a vital role in the evaluation.