144 research outputs found
Filtering Spam E-Mail from Mixed Arabic and English Messages: A Comparison of Machine Learning Techniques.
Spam is one of the main problems in email communications. Although the volume of non-English spam is increasing, little work has been done in this area. For example, users in the Arab world receive spam written mostly in Arabic, English, or mixed Arabic and English. To filter this kind of message, this research applied several machine learning techniques. Many researchers have used machine learning techniques to filter spam email messages. This study compared six supervised machine learning classifiers: maximum entropy, decision trees, artificial neural nets, naïve Bayes, support vector machines and k-nearest neighbor. The experiments suggested that words in Arabic messages should be stemmed before applying a classifier. In addition, in most cases, experiments showed that classifiers using feature selection techniques can achieve comparable or better performance than filters that do not use them
Ensemble methods for instance-based Arabic language authorship attribution
Authorship Attribution (AA) is considered a subfield of authorship analysis, and it is an important problem because the amount of anonymous information has increased with the fast growth of internet usage worldwide. In other languages such as English, Spanish and Chinese, this issue is quite well studied. However, in Arabic, the AA problem has received less attention from the research community due to the complexity and nature of Arabic sentences. The paper presents an intensive review of previous studies for the Arabic language. Based on that, this study employed the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) method to choose the base classifier of the ensemble methods. In terms of attribution features, hundreds of stylometric features and distinct words were extracted using several tools. Then, AdaBoost and Bagging ensemble methods were applied to an Arabic enquiries (Fatwa) dataset. The findings showed an improvement in the effectiveness of the authorship attribution task in the Arabic language
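To make the TOPSIS step concrete, the following is a minimal sketch of how TOPSIS ranks alternatives, here used to pick a base classifier. The abstract does not say which criteria, weights, or candidate classifiers were used, so the names `classifier_a` to `classifier_c`, the two criteria (accuracy and training time), the weights, and all numbers are hypothetical and serve only to illustrate the method.

```python
import math


def topsis(matrix, weights, benefit):
    """Return one closeness score in [0, 1] per alternative (higher is better).

    matrix:  rows are alternatives, columns are criteria
    weights: importance of each criterion
    benefit: True if larger is better for that criterion, False if smaller is
    """
    n_crit = len(weights)
    # Vector-normalise each criterion column so that units do not dominate.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
                for row in matrix]
    columns = list(zip(*weighted))
    # Ideal solution: best value per criterion; anti-ideal: worst value.
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]
    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)  # distance to the ideal solution
        d_neg = math.dist(row, anti)   # distance to the anti-ideal solution
        total = d_pos + d_neg
        # If all alternatives are identical, both distances are zero.
        scores.append(d_neg / total if total else 0.5)
    return scores


# Hypothetical candidates and measurements (not taken from the paper).
candidates = ["classifier_a", "classifier_b", "classifier_c"]
matrix = [
    [0.90, 10.0],  # accuracy, training time in seconds
    [0.80, 20.0],
    [0.70, 30.0],
]
scores = topsis(matrix, weights=[0.7, 0.3], benefit=[True, False])
best = candidates[max(range(len(scores)), key=scores.__getitem__)]
print(best, scores)
```

The alternative closest to the ideal solution and farthest from the anti-ideal one receives the highest score, which is the sense in which TOPSIS can select a base classifier for Bagging or AdaBoost.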
Classification of Encouragement (Targhib) And Warning (Tarhib) Using Sentiment Analysis on Classical Arabic
The Holy Qur’an is the main religious text of Islam. The Qur’an has its own methods of Targhib (encouragement) and Tarhib (warning), which are important features of the Qur’an. Most of the Quranic verses urge and encourage people to do right and good deeds, and also warn them against committing evil and bad deeds. The method of classifying a text into two opposing opinions has been applied previously in solving the problem of sentiment analysis. Here, it is applied to distinguish between Targhib (encouragement) and Tarhib (warning) verses in the Qur’an. Each verse of the Qur’an can be treated as an encouragement, a warning or neutral. The language of the Holy Qur’an is one of the most challenging natural languages in sentiment analysis. The aim of this work is to classify the verses of encouragement and warning using sentiment analysis and NLP techniques. Several approaches are used in sentiment analysis classification, such as the machine learning approach, the lexicon-based approach and the hybrid approach. In carrying out this aim, the machine learning approach was used, and the impact of different techniques such as POS tagging, N-grams and correlation-based feature selection was evaluated and investigated. 95.6% accuracy was achieved using Naïve Bayes (NB) and 91.5% accuracy was achieved using Support Vector Machines (SVM). This study is significant in extracting information and knowledge from the Holy Qur’an, both for researchers in the field of Islamic studies and for non-specialized researchers
Arabic text classification methods: Systematic literature review of primary studies
Recent research on Big Data proposed and evaluated a number of advanced techniques to gain meaningful information from the complex and large volume of data available on the World Wide Web. To achieve accurate text analysis, a process is usually initiated with a Text Classification (TC) method. Reviewing the very recent literature in this area shows that most studies are focused on English (and other scripts) while attempts at classifying Arabic texts remain very limited. Hence, we intend to contribute the first Systematic Literature Review (SLR) that uses a strict search protocol to summarize key characteristics of the different TC techniques and methods used to classify Arabic text. This work also aims to identify and share scientific evidence of the gap in current literature to help suggest areas for further research. Our SLR explicitly uses empirical evidence as a decision factor for including studies, then concludes which classifier produced more accurate results. Further, our findings identify the lack of standardized corpora for Arabic text; authors compile their own, and most of the work is focused on Modern Arabic with very little done on Colloquial Arabic despite its wide use in social media networks such as Twitter. In total, 1464 papers were surveyed, from which 48 primary studies were included and analyzed
KACST Arabic Text Classification Project: Overview and Preliminary Results
Electronically formatted Arabic free-texts can be found in abundance these days on the World Wide Web, often linked to commercial enterprises and/or government organizations. Vast tracts of knowledge and relations lie hidden within these texts, knowledge that can be exploited once the correct intelligent tools have been identified and applied. For example, text mining may help with text classification and categorization. Text classification aims to automatically assign text to a predefined category based on identifiable linguistic features. Such a process has different useful applications including, but not restricted to, E-Mail spam detection, web pages content filtering, and automatic message routing. In this paper an overview of King Abdulaziz City for Science and Technology (KACST) Arabic Text Classification Project will be illustrated along with some preliminary results. This project will contribute to the better understanding and elaboration of Arabic text classification techniques
Sentiment Analysis of Customers' Reviews Using a Hybrid Evolutionary SVM-Based Approach in an Imbalanced Data Distribution
Online media has an increasing presence in restaurants' activities through social media websites, coinciding with an increase in customers' reviews of these restaurants. These reviews have become the main source of information for both customers and decision-makers in this field. Any customer who is seeking such places will check their reviews first, which usually affect their final choice. In addition, customers' experiences can be enhanced by utilizing other customers' suggestions. Consequently, customers' reviews can influence the success of a restaurant business, since they are considered the final judgment of the overall quality of any restaurant. Thus, decision-makers need to analyze their customers' underlying sentiments in order to meet their expectations and improve the restaurants' services, in terms of food quality, ambiance, price range, and customer service. The number of reviews available for various products and services has dramatically increased these days, and so has the need for automated methods to collect and analyze these reviews. Sentiment Analysis (SA) is a field of machine learning that helps analyze and predict the sentiments underlying these reviews. Usually, SA for customers' reviews faces the challenge of imbalanced datasets, as the majority of these sentiments fall into supporters or resistors of the product or service. This work proposes a hybrid approach by combining the Support Vector Machine (SVM) algorithm with Particle Swarm Optimization (PSO) and different oversampling techniques to handle the imbalanced data problem. SVM is applied as a machine learning classification technique to predict the sentiments of reviews by optimizing the dataset, which contains different reviews of several restaurants in Jordan. Data were collected from Jeeran, a well-known social network for Arabic reviews. A PSO technique is used to optimize the weights of the features, and four different oversampling techniques, namely the Synthetic Minority Oversampling Technique (SMOTE), SVM-SMOTE, Adaptive Synthetic Sampling (ADASYN) and borderline-SMOTE, were examined to produce an optimized dataset and solve the imbalance problem of the dataset. This study shows that the proposed PSO-SVM approach produces the best results compared to different classification techniques in terms of accuracy, F-measure, G-mean and Area Under the Curve (AUC), for different versions of the datasets
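As an illustration of why the abstract reports F-measure and G-mean alongside accuracy, here is a minimal sketch that computes the three metrics from binary labels. The label vectors are made up for the example and are not data from the study; the function names are likewise illustrative.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (recall) and specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall (F1)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up imbalanced labels: 2 minority (1) and 8 majority (0) examples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10                    # ignores the minority class
some_effort = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # finds one minority example

for name, y_pred in [("always_majority", always_majority),
                     ("some_effort", some_effort)]:
    print(name, accuracy(y_true, y_pred), g_mean(y_true, y_pred),
          f_measure(y_true, y_pred))
```

Both predictors reach the same accuracy on this toy data, but the one that always predicts the majority class scores zero on G-mean and F-measure, which is why these metrics are preferred when classes are imbalanced.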
Contextual Semantics for Radicalisation Detection on Twitter
Much research aims to detect online radical content mainly using radicalisation glossaries, i.e., by looking for terms and expressions associated with religion, war, offensive language, etc. However, such crude methods are highly inaccurate towards content that uses radicalisation terminology to simply report on current events, to share harmless religious rhetoric, or even to counter extremism.
Language is complex and the context in which particular terms are used should not be disregarded. In this paper, we propose an approach for building a representation of the semantic context of the terms that are linked to radicalised rhetoric. We use this approach to analyse over 114K tweets that contain radicalisation-terms (around 17K posted by pro-ISIS users, and 97k posted by “general” Twitter users).
We report on how the contextual information differs for the same radicalisation terms in the two datasets, which indicates that contextual semantics can help to better discriminate radical content from content that only uses radical terminology. The classifiers we built to test this hypothesis outperform those that disregard contextual information