17 research outputs found

    The Impact of Text Preprocessing and Term Weighting on Arabic Text Classification

    Get PDF
    This research presents and compares the impact of text preprocessing, which has not been addressed before, on Arabic text classification using popular text classification algorithms: Decision Tree, K-Nearest Neighbors, Support Vector Machines, and Naïve Bayes and its variations. Text preprocessing includes applying different term weighting schemes and Arabic morphological analysis (stemming and light stemming). We implemented and integrated Arabic morphological analysis tools within the leading open source machine learning tools Weka and RapidMiner. The classification algorithms are applied to seven Arabic corpora (3 collected in-house and 4 existing corpora). Experimental results show that: (1) light stemming with term pruning is the best feature reduction technique; (2) Support Vector Machines and Naïve Bayes variations outperform the other algorithms; (3) weighting schemes affect the performance of distance-based classifiers
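As an illustration of the term weighting schemes the abstract refers to, here is a minimal sketch of one classic scheme, TF-IDF, over pre-tokenised documents. Tokenisation and stemming are omitted, and this is not the paper's exact weighting variant.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weights for a list of tokenised documents.

    TF is the raw term count in the document; IDF is log(N / df(t)),
    where df(t) is the number of documents containing term t.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

# A term present in every document gets zero weight (idf = log 1 = 0),
# while terms confined to fewer documents are weighted up.
weights = tfidf([["news", "sport"], ["news", "economy"]])
```

This is why term pruning and weighting interact: terms spread across all classes carry near-zero weight and can be pruned cheaply.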

    OSAC: Open Source Arabic Corpora

    Get PDF
    The acute lack of freely and publicly accessible Arabic corpora is one of the major difficulties that Arabic linguistics researchers face. This paper is a step towards supporting the Arabic linguistics research field. It presents the complex nature of the Arabic language and poses two problems: (1) the lack of free public Arabic corpora, and (2) the lack of high-quality, well-structured Arabic digital content. Finally, the paper presents OSAC, the largest freely accessible Arabic corpora collection, which we collected

    Arabic morphological tools for text mining

    Get PDF
    The Arabic language has complex morphology, which has led to the unavailability of standard Arabic morphological analysis tools until now. In this paper, we present and evaluate existing common Arabic stemming/light-stemming algorithms, and we implement and integrate Arabic morphological analysis tools into the leading open source machine learning and data mining tools, Weka and RapidMiner
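To make the idea of light stemming concrete, here is a deliberately minimal sketch: strip one common prefix and one common suffix while keeping a minimal stem length. The affix lists are a small illustrative sample, not the full set used by any specific published light stemmer.

```python
# Small illustrative samples of common Arabic affixes (not exhaustive).
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")

def light_stem(word):
    """Strip at most one prefix and one suffix, keeping >= 2 letters.

    Light stemming removes surface affixes without reducing the word
    to its root, which is what distinguishes it from full stemming.
    """
    for p in PREFIXES:  # longest prefixes first
        if word.startswith(p) and len(word) - len(p) >= 2:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 2:
            word = word[:-len(s)]
            break
    return word
```

For example, the definite article "ال" is removed, so "الكتاب" (the book) and "كتاب" (book) map to the same feature, shrinking the feature space.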

    ArbDialectID at MADAR Shared Task 1: Language Modelling and Ensemble Learning for Fine Grained Arabic Dialect Identification

    Get PDF
    In this paper, we present a dialect identification system (ArbDialectID) that competed at Task 1 of the MADAR shared task, MADAR Travel Domain Dialect Identification. We build a coarse-grained and a fine-grained identification model to predict the label (corresponding to a dialect of Arabic) of a given text. We build two language models by extracting features at two levels (words and characters). We first build a coarse identification model to classify each sentence into one of six dialects, then use this label as a feature for the fine-grained model that classifies the sentence among 26 dialects from different Arab cities; finally, we apply an ensemble voting classifier over both sub-systems. Our system ranked 1st, achieving an F-score of 67.32%. Both the models and our feature engineering tools are made available to the research community
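The two feature levels mentioned above (words and characters) can be sketched as simple n-gram extractors. This is only an illustration of the feature types; the actual system's feature engineering and classifiers are more elaborate.

```python
def char_ngrams(text, n=3):
    """Character n-grams: the character-level feature family."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(tokens, n=2):
    """Word n-grams: the word-level feature family."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Character trigrams capture sub-word dialect cues (e.g. clitics and
# spelling variants); word bigrams capture lexical choices.
trigrams = char_ngrams("dialect")
bigrams = word_ngrams(["fine", "grained", "model"])
```

In a coarse-to-fine pipeline, the coarse model's predicted region label is simply appended to this feature set before the 26-way fine-grained classifier runs.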

    Building and modelling multilingual subjective corpora

    Get PDF
    Building multilingual opinionated models requires multilingual corpora annotated with opinion labels. Unfortunately, such corpora are rare. In this work we consider opinions as subjective or objective. We introduce an annotation method that can be reliably transferred across topic domains and across languages. The method starts by building a classifier that annotates sentences with a subjective/objective label using training data from the English-language "movie reviews" domain. The annotation can be transferred to another language by classifying the English sentences in a parallel corpus and transferring the same annotation to the corresponding sentences of the other language. We also shed light on the link between opinion mining and statistical language modelling, and on how such corpora are useful for domain-specific language modelling. We show that the distinction between subjective and objective sentences tends to be stable across domains and languages. Our experiments show that language models trained on an objective (respectively subjective) corpus lead to better perplexities on an objective (respectively subjective) test set
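The perplexity comparison in the last sentence can be illustrated with the simplest possible language model, a unigram model with add-one smoothing. This is a toy sketch, not the paper's model: it only shows that a model evaluated on text matching its training distribution scores a lower (better) perplexity.

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed unigram model; lower is better."""
    counts = Counter(train_tokens)
    vocab = len(counts) + 1          # +1 slot for unseen tokens
    total = len(train_tokens)
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab))
                   for t in test_tokens)
    return math.exp(-log_prob / len(test_tokens))

# A model trained on one vocabulary is less perplexed by matching test
# text than by text drawn from a different vocabulary.
in_domain = unigram_perplexity(["good", "movie", "good"], ["good", "movie"])
cross_domain = unigram_perplexity(["good", "movie", "good"], ["bad", "plot"])
```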

    Distributed Data Mining on a Grid Network

    No full text
    Data mining tasks are considered a very complex business problem. In this research, we study the speedup gained by executing data mining tasks on a grid environment. Experiments were performed by running two main data mining algorithms, classification and clustering, and one data sampling method for the classification task, cross validation. These tasks were executed on a large dataset. The grid environment was prepared by installing the GridGain framework on the experimental machines, which were connected by a LAN. Experimental results show a significant enhancement in speedup when executing data mining tasks on a grid of computing nodes
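Cross validation parallelises naturally onto a grid because its folds are independent. A minimal sketch of the fold split (the GridGain dispatch itself is not shown; node scheduling is framework-specific):

```python
def cv_folds(items, k):
    """Split a dataset into k disjoint folds by striding.

    In a grid setting, each train/test round (hold out fold i, train
    on the rest) can be dispatched to a separate computing node, since
    the rounds share no state.
    """
    return [items[i::k] for i in range(k)]

folds = cv_folds(list(range(10)), 5)  # 5 folds of 2 items each
```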

    Arabic text classification using decision trees

    Get PDF
    Text mining has drawn more and more attention recently; it has been applied to different domains, including web mining, opinion mining, and sentiment analysis. Text pre-processing is an important stage in text mining. The major obstacle in text mining is the very high dimensionality and the large size of text data. Natural language processing and morphological tools can be employed to reduce the dimensionality and size of text data. In addition, many term weighting schemes available in the literature may be used to enhance the representation of text as feature vectors. In this paper, we study the impact of text pre-processing and different term weighting schemes on Arabic text classification. In addition, we develop new combinations of term weighting schemes to be applied to Arabic text for classification purposes

    A comparative study of outlier mining and class outlier mining

    Get PDF
    Outliers can significantly affect data mining performance. Outlier mining is an important issue in knowledge discovery and data mining and has attracted increasing interest in recent years. Class outlier mining is a promising research direction in which little research has been done. This paper has two main goals: the first is to show the significance of class outlier mining through a comparative study between a class outlier detection method called Class Outlier Distance Based (CODB) and a conventional outlier detection method; the second is to introduce
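For intuition about the conventional distance-based side of the comparison, here is a simplified k-nearest-neighbour distance score on one-dimensional data. This is a generic sketch of distance-based outlier scoring, not the CODB method itself, which additionally takes class labels and local density into account.

```python
def knn_distance_score(point, others, k=2):
    """Outlier score = mean distance to the k nearest neighbours.

    Points far from their nearest neighbours receive high scores and
    are flagged as outliers.
    """
    dists = sorted(abs(point - o) for o in others)
    return sum(dists[:k]) / k

data = [1.0, 1.1, 0.9, 8.0]  # 8.0 is the obvious outlier
scores = [knn_distance_score(data[i], data[:i] + data[i + 1:])
          for i in range(len(data))]
```

A class outlier method would instead score points that are suspicious *relative to their class*, e.g. a point lying deep inside the cluster of another class.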

    WikiDocsAligner: An Off-the-Shelf Tool for Aligning Wikipedia Documents

    No full text
    The Wikipedia encyclopedia is an attractive source of comparable corpora in many languages. Most researchers develop their own scripts to perform the document alignment task, which requires effort and time. In this paper, we present WikiDocsAligner, an off-the-shelf, handy tool for aligning Wikipedia articles. The implementation of WikiDocsAligner does not require researchers to import/export interlanguage-links databases. The user just needs to download the Wikipedia dumps (interlanguage links and articles) and provide them to the tool, which performs the alignment. This software can easily be used to align Wikipedia documents in any language pair. Finally, we use WikiDocsAligner to align comparable documents from the Arabic Wikipedia and the Egyptian Arabic Wikipedia, shedding light on Wikipedia as a source of Arabic dialect language resources. The produced resources are interesting and useful, as the demand for Arabic/dialect language resources has increased in the last decade
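The core alignment step reduces to joining the interlanguage-links table against the two article collections. A sketch under stated assumptions: the dumps are abstracted here into plain dictionaries, and the titles are hypothetical placeholders; WikiDocsAligner itself operates on the actual dump files.

```python
def align_documents(langlinks, source_articles, target_articles):
    """Pair articles across two Wikipedia editions.

    langlinks maps a source title to its target-language title (as
    recorded in the interlanguage-links dump); the article dicts map
    titles to article texts. Returns (source_text, target_text) pairs
    for titles present on both sides.
    """
    return [(source_articles[s], target_articles[t])
            for s, t in langlinks.items()
            if s in source_articles and t in target_articles]

# Hypothetical titles and texts, for illustration only.
pairs = align_documents(
    {"Article_A": "Article_A_arz"},
    {"Article_A": "source text"},
    {"Article_A_arz": "target text"},
)
```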