Enhancement of feature sets for subjectivity analysis on Malay-English code-switching text

Abstract

A code-switching sentence is a sentence constructed from two or more languages. Multilingual speakers commonly use code-switching sentences to share objective and subjective textual information on public platforms such as blogs and social media. Classifying large volumes of code-switching text into subjective and objective classes poses a new challenge for current subjectivity analysis solutions, which are designed to process only monolingual text. As a result, subjective code-switching text is ignored, which limits the ability of these solutions to produce accurate subjectivity analysis results on code-switching text. This research therefore aims to find a set of solutions for subjectivity analysis on code-switching text. The research began by addressing the absence of a code-switching subjectivity corpus: a Malay-English code-switching subjectivity corpus was built. The corpus contains 35,067 Malay-English code-switching sentences harvested from Malay-English blog posts, each annotated with either a subjective or an objective label. The research continued with the design of feature sets that represent the subjectivity of the Malay-English code-switching sentences in the corpus. The feature sets were enhanced from a subjective monolingual feature set that was initially designed to represent the subjectivity of English text. The initial subjective monolingual feature set consists of pronouns, adjectives, cardinal numbers, modals and adverbs. The enhanced feature sets comprise three feature sets: the embedded code-switching feature set, the unified code-switching feature set and the stylistic feature set. The embedded code-switching feature set uses the initial monolingual feature set for English and embeds Malay-language features in it.
In the unified code-switching feature set, the extracted Malay and English features were unified using an adapted algorithm known as the Malay-English Unified POS. The algorithm predicts the type of each word in a code-switching sentence according to the language of the word. In the stylistic feature set, emoticons, interjections, signs of subjectivity such as exclamation marks, and words with exaggerated spelling were extracted to represent the subjectivity in the code-switching sentences. The effectiveness of the enhanced feature sets was evaluated using the Malay-English code-switching subjectivity corpus as the data set and two machine learning classifiers, Naïve Bayes and Support Vector Machine. 10-fold cross-validation was applied across different experimental settings and combinations of feature sets to measure the performance of the enhanced feature sets. The combination of the unified code-switching and stylistic feature sets outperformed the other feature sets, consistently achieving an accuracy of 59% with both machine learning classifiers. This consistent performance indicates that the combined feature sets are a viable solution for subjectivity analysis on Malay-English code-switching text.
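
To make the stylistic feature set more concrete, the following is a minimal Python sketch of how such cues might be counted in a sentence. It is an illustration only and not the extraction procedure used in the research: the interjection list, the emoticon pattern, the rule that three or more repeated characters mark exaggerated spelling, and the function name `stylistic_features` are all assumptions.

```python
import re

# Hypothetical, deliberately tiny interjection lexicon (assumption);
# the lists used in the research are not given in the abstract.
INTERJECTIONS = {"wow", "oh", "haha", "wah", "alamak", "aduh"}

# Assumed pattern for a few common Western-style emoticons such as :) ;-( =D
EMOTICON_RE = re.compile(r"[:;=][-^']?[)(DPp/\\|]")

# Assumed rule: the same character three or more times in a row
# marks an exaggerated spelling (e.g. "bestnyaaa").
ELONGATED_RE = re.compile(r"(\w)\1{2,}")

TOKEN_RE = re.compile(r"[A-Za-z]+")


def stylistic_features(sentence):
    """Count simple stylistic cues of subjectivity in one sentence."""
    tokens = [t.lower() for t in TOKEN_RE.findall(sentence)]
    return {
        "emoticons": len(EMOTICON_RE.findall(sentence)),
        "exclamations": sentence.count("!"),
        "interjections": sum(t in INTERJECTIONS for t in tokens),
        "elongated_words": sum(bool(ELONGATED_RE.search(t)) for t in tokens),
    }


if __name__ == "__main__":
    print(stylistic_features("Wah, bestnyaaa movie tu!!! :)"))
    print(stylistic_features("Mesyuarat itu bermula pada pukul 9 pagi."))
```

Counts like these could be appended to the part-of-speech features before classification; how the research actually encoded and combined them is not stated in the abstract.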

Universiti Teknikal Malaysia Melaka (UTeM) Repository

Last time updated on 18/10/2023
