15 research outputs found
Detecting the Usage of Vulgar Words in Cyberbully Activities from Twitter
Nowadays, nearly everyone uses devices connected to the Internet, and people are accustomed to using information technology devices in their daily lives to interact with others. Many social media platforms, such as Facebook, Twitter, Instagram, and YouTube, have become popular. This study selected the Twitter platform, which continues to grow in popularity. With the rapid growth in users signing up for Twitter accounts, cybercrime on social media platforms has also increased each year. Cyberbullying is one cybercrime practice that has caused a significant impact on targeted victims. The victims experience social pressure that they must bear every day, while the bullies remain free behind a veil of anonymity. This study aims to identify the common vulgar words used by cyberbullies on Twitter. The study also produces essential Twitter features based on the collected tweets. The evaluation covers the occurrences of vulgar words used by cyberbullies on Twitter. A list of vulgar words was extracted and evaluated from a corpus of 50 Twitter users who posted varying numbers of tweets. Detecting vulgar words in tweets enables the tracking of cyberbullying activities. In the evaluation section, we discuss how the usage of vulgar words reflects a user's earnestness in carrying out cyberbullying activities on Twitter. The study shows that some users with a low number of tweets have a high number of vulgar word occurrences, while other users with a high number of tweets have fewer vulgar word occurrences. The information collected in this study is expected to assist in marking users with a high number of vulgar word occurrences, who are more likely to engage in cyberbullying activities.
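As an illustration of the kind of counting involved, the following Python sketch tallies lexicon-word occurrences per user across their tweets; the word list, user identifiers, and tweet data are placeholders, not the lexicon or corpus used in the study.

```python
import re
from collections import Counter

# Hypothetical lexicon; the vulgar-word list extracted in the study is not reproduced here.
VULGAR_WORDS = {"darn", "heck"}

def count_vulgar_occurrences(tweets):
    """Count occurrences of lexicon words across one user's tweets."""
    counts = Counter()
    for tweet in tweets:
        for token in re.findall(r"[a-z']+", tweet.lower()):
            if token in VULGAR_WORDS:
                counts[token] += 1
    return counts

# Illustrative data: user id -> list of tweets posted by that user.
users_tweets = {
    "user_01": ["heck, traffic again", "what the heck is this"],
    "user_02": ["nice weather today"],
}

for user, tweets in users_tweets.items():
    total = sum(count_vulgar_occurrences(tweets).values())
    # Users with few tweets but many occurrences stand out via this ratio.
    print(user, total, round(total / len(tweets), 2))
```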
TOWARDS CURBING CYBER-BULLYING IN MALAYSIA BY AUTHOR IDENTIFICATION OF IBAN AND KADAZANDUSUN OSN TEXT USING DEEP LEARNING
Online Social Networks (OSN) are frequently used to carry out cyber-criminal actions such as cyberbullying. As a developing country in Asia that keeps abreast of ICT advancement, Malaysia is no exception when it comes to cyberbullying. The Author Identification (AI) task plays a vital role in social media forensic investigation (SMF), unveiling the genuine identity of an offender by analysing the text written in OSN by candidate culprits. There are several challenges in AI when dealing with OSN text, including limited text length and informal language full of internet jargon and grammatical errors, which further impact AI performance in SMF. Traditional AI systems that analyse long text documents are inadequate for analysing the writing style of short OSN text. N-gram features have proven to represent authors' writing styles efficiently for short text. However, representing n-grams in a traditional form such as Tf-Idf results in sparse vectors that make it difficult to capture the semantic information in the text. Besides, most AI work has been done in English, while indigenous languages receive far less attention. In East Malaysia, the dominant indigenous languages that transcend ethnic boundaries are Iban in Sarawak and KadazanDusun in Sabah, both of which are inherently under-resourced. This paper presents a proposed AI workflow for short OSN text using tweets in two Under-Resourced Languages (U-RL), Iban and KadazanDusun, to help curb the cyberbullying issue in Malaysia. The paper compares Tf-Idf (sparse) and state-of-the-art (SoA) embedding-based (dense) feature representations to observe which best represents the stylistic features of the authors' writing. N-grams of words, characters, and POS tags were extracted as features. The representation models were learned by different machine learning classifiers (Naïve Bayes, Random Forest, and SVM). A convolutional neural network (CNN), a SoA deep learning model for sentence classification, was tested against the traditional classifiers. Results were observed by combining different representation models and classifiers on three datasets (English, Iban, and KadazanDusun). The best results were achieved when the CNN learned embedding-based models with a combination of all features: KadazanDusun achieved the highest accuracy with 95.76%, followed by English with 95.02% and Iban with 94%.
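For illustration, the sparse baseline described above can be sketched with scikit-learn as a character n-gram Tf-Idf representation fed to a linear SVM; this is an assumption-laden sketch rather than the authors' implementation, and it omits the word and POS n-gram features as well as the CNN and embedding-based models.

```python
# Sketch of a sparse (Tf-Idf) authorship baseline: character n-grams with a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

pipeline = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character 2- to 4-grams
    LinearSVC(),
)

# train_texts / train_authors and test_texts are hypothetical lists of short OSN posts
# and their author labels:
# pipeline.fit(train_texts, train_authors)
# predicted_authors = pipeline.predict(test_texts)
```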
A Review on Grapheme-to-Phoneme Modelling Techniques to Transcribe Pronunciation Variants for Under-Resourced Language
A pronunciation dictionary (PD) is one of the components of an Automatic Speech Recognition (ASR) system, a system used to convert speech to text. The dictionary consists of word-phoneme pairs that map sound units to phonetic units for modelling and prediction. Research has shown that words can be transcribed into phoneme sequences using grapheme-to-phoneme (G2P) models, which can expedite building PDs. G2P models can be developed by training on seed PD data using statistical approaches that require large amounts of data. Consequently, building a PD for an under-resourced language is a great challenge due to the poorly documented grapheme and phoneme systems of these languages. Moreover, some PDs must include pronunciation variants, including the regional accents that native speakers practice. For example, a recent pronunciation dictionary for an ASR system in Iban, an under-resourced language from Malaysia, was built through a bootstrapping G2P method. However, the current Iban pronunciation dictionary has yet to include the pronunciation variants that the Ibans practice. Researchers have carried out recent studies on Iban pronunciation variants, but no computational methods for generating the variants are available yet. Thus, this paper reviews G2P algorithms and processes that we would use to develop pronunciation variants automatically. Specifically, we discuss data-driven techniques such as CRF, JSM, and JMM, which have been used to build PDs for Thai, Arabic, Tunisian, and Swiss-German. This paper also highlights the importance of pronunciation variants and how they can affect ASR performance.
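To make the word-phoneme pairing concrete, the sketch below learns a deliberately simplified grapheme-to-phoneme table from a tiny, invented seed dictionary under a one-to-one alignment assumption; real data-driven G2P models such as CRFs or joint-sequence models do not need this assumption, and the entries shown are not taken from the Iban PD.

```python
from collections import Counter, defaultdict

# Invented seed pronunciation dictionary: word -> phoneme sequence.
seed_pd = {
    "bala": ["b", "a", "l", "a"],
    "laut": ["l", "a", "u", "t"],
}

# Count which phoneme each grapheme most often maps to in the seed data.
counts = defaultdict(Counter)
for word, phonemes in seed_pd.items():
    if len(word) != len(phonemes):
        continue  # skip entries that violate the simplistic 1-to-1 assumption
    for grapheme, phoneme in zip(word, phonemes):
        counts[grapheme][phoneme] += 1

g2p_table = {g: c.most_common(1)[0][0] for g, c in counts.items()}

def transcribe(word):
    """Predict a phoneme sequence for an unseen word from the learned table."""
    return [g2p_table.get(g, g) for g in word]

print(transcribe("tabal"))
```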
Social Versus Physical Distancing: Analysis of Public Health Messages at the Start of COVID-19 Outbreak in Malaysia Using Natural Language Processing
The study presents an attempt to analyse how social media netizens in Malaysia responded to the calls for ``Social Distancing'' and ``Physical Distancing'' as the newly recommended social norm was introduced to the world in response to the COVID-19 global pandemic. The pandemic drove a sharp increase in the use of social media platforms for public health communication since the first wave of the COVID-19 outbreak in Malaysia in April 2020. We analysed thousands of tweets posted by Malaysians daily between January 2020 and August 2021 to determine public perceptions and interaction patterns. The analysis focused on positive and negative reactions and the interchangeable use of the recommended terminologies ``social distancing'' and ``physical distancing''. Using linguistic analysis and natural language processing, the findings predominantly indicate influences from the multilingual and multicultural values held by Malaysian netizens as they embrace the concept of distancing as a measure of global public health safety.
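As a small illustration of one step in such an analysis, the sketch below counts daily mentions of the two terms in a handful of invented tweet records; the field names and data are assumptions, not the study's corpus or schema.

```python
from collections import defaultdict

# Invented tweet records: each has a posting date and text.
tweets = [
    {"date": "2020-03-18", "text": "Please practise social distancing everyone"},
    {"date": "2020-03-18", "text": "WHO now says physical distancing, not social distancing"},
]

daily_counts = defaultdict(lambda: {"social distancing": 0, "physical distancing": 0})
for tweet in tweets:
    text = tweet["text"].lower()
    for term in ("social distancing", "physical distancing"):
        daily_counts[tweet["date"]][term] += text.count(term)

for date in sorted(daily_counts):
    print(date, daily_counts[date])
```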
Discovering Popular Topics of Sarawak Gazette (SaGa) from Twitter Using Deep Learning
The emergence of social media as an information-sharing platform is progressively increasing. With the progress of artificial intelligence, it is now feasible to analyze historical documents through social media. This study aims to understand more about how people use social media to share the content of the Sarawak Gazette (SaGa), one of the valuable historical documents of Sarawak. In the study, a corpus of short Tweet texts relating to SaGa was built according to a set of keyword search criteria. The Tweet corpus was then analyzed to extract topics using topic modeling, specifically Latent Dirichlet Allocation (LDA). The topics were then further classified with a Convolutional Neural Network (CNN) classifier.
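A minimal sketch of the LDA step, using scikit-learn on a couple of invented tweets, is shown below; the actual corpus, preprocessing, and number of topics used in the study may differ.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative keyword-filtered tweets (placeholders, not the SaGa corpus).
tweets = [
    "sarawak gazette 1870 trade report",
    "old sarawak gazette article on rivers",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(tweets)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(doc_term)

# Print the top words of each discovered topic.
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {idx}:", ", ".join(top))
```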
Selecting Requirement Elicitation Methods for Designing ICT Application in Minority Community
In recent years, Information and Communication Technology (ICT) has rapidly acquired a place in society. ICT facilitates communities by providing the latest information updates in various fields such as business, education, sports, and many more. Various ICT applications have been developed to cater to these arising needs. To ensure that a developed ICT application achieves its purpose, user requirements must be fulfilled. Thus, gathering requirements from communities during system development is an important phase. A suitable elicitation technique is needed, as it determines the quality and accuracy of the requirements gathered and thereby the success of the developed system. The same applies when developing systems for minority communities. Hence, this paper explores existing requirement elicitation techniques to gain insight for constructing a suitable requirement elicitation technique for minority communities. As a result, a proposed framework for eliciting requirements in minority communities is discussed.
Visualisation of User Stories to UML Model: A Systematic Literature Review
The usage of Agile methodology in software development projects is growing rapidly among industry professionals and academia. The Unified Modelling Language (UML) conventionally accompanies Agile software development to model software requirements. User stories are fundamental and should be identified to communicate the basic requirements between the development team and the stakeholders before UML models such as use case diagrams, class diagrams, and many others can be designed. However, there are several challenges associated with this process, such as poorly organised user stories, natural language complexity, and the high time consumption of creating the models. A systematic literature review (SLR) was conducted to gain more knowledge about the utilisation of Natural Language Processing (NLP) for UML model generation. A total of 198 papers were initially found in four online databases, namely Scopus, IEEE Xplore, ScienceDirect, and the ACM Digital Library, from 2018 until 2022. After removing duplicates, applying inclusion and exclusion criteria, and conducting the full-text assessment, only 20 papers were included as primary studies. The primary studies were reviewed to discover several important pieces of information: the challenges of designing UML models, the NLP tools and techniques used to generate UML models, the UML models generated, and the validation methods used for measuring the accuracy of the generated models. Finally, this study discusses important elements related to UML model generation with the utilisation of NLP tools and techniques.
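As an illustration of the kind of NLP step such pipelines build on, the sketch below parses the common "As a <role>, I want to <action>, so that <benefit>" user-story template with a regular expression to obtain a candidate actor and use case; the approaches reviewed in the primary studies typically rely on POS tagging and dependency parsing rather than a fixed pattern.

```python
import re

# Pattern for the Connextra-style user-story template (illustrative, not exhaustive).
PATTERN = re.compile(
    r"As an? (?P<role>.+?),\s*I want to (?P<action>.+?)(?:,?\s*so that (?P<benefit>.+))?$",
    re.IGNORECASE,
)

story = "As a librarian, I want to register new members, so that they can borrow books"
match = PATTERN.match(story)
if match:
    print("actor (candidate):", match.group("role"))       # -> use case diagram actor
    print("use case (candidate):", match.group("action"))  # -> use case name
```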
Preservation of Sarawak Ethnic Languages: The Sarawak Language Technology (SaLT) Initiative
The population of speakers of indigenous languages all over the world is decreasing. This drop in numbers is due to the pressures of dominant global languages (such as English, the lingua franca of international commerce, research, and the Internet), rural-urban migration, and exogamy (inter-ethnic group marriages). Similarly, the number of speakers of Sarawak's 63 languages is also declining. Thus, the Sarawak Language Technologies (SaLT) Research Group at Universiti Malaysia Sarawak has initiated a number of research and development projects with the end goal of revitalising and maintaining the ethnic languages of Sarawak. The ongoing projects include building corpora of languages (Iban, Melanau, and Kelabit), as well as research and development of technologies which contribute to the implementation of software for the ethnic languages. Specifically, these projects include the development of morphological analysers and Part of Speech (POS) taggers.