Review of coreference resolution in English and Persian
Coreference resolution (CR) is one of the most challenging areas of natural
language processing. This task seeks to identify all textual references to the
same real-world entity. Research in this field is divided into coreference
resolution and anaphora resolution. Because of its role in textual
comprehension and its utility in other tasks such as information extraction
systems, document summarization, and machine translation, this field has
attracted considerable interest, and coreference quality has a significant
effect on the quality of these systems. This article reviews the existing corpora and
evaluation metrics in this field. Then, an overview of the coreference
algorithms, from rule-based methods to the latest deep learning techniques, is
provided. Finally, coreference resolution and pronoun resolution systems in
Persian are investigated. Comment: 44 pages, 11 figures, 5 tables.
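As a minimal illustration of the task the review surveys, the sketch below implements one of the classic rule-based sieves, exact string match, to cluster mentions of the same entity. The mention spans and the single-rule design are illustrative assumptions, not a method taken from the article.

```python
from collections import defaultdict

def cluster_mentions(mentions):
    """Toy rule-based coreference baseline: group mentions whose
    lower-cased surface string matches exactly (exact string match is
    one of the classic sieve rules in rule-based systems)."""
    clusters = defaultdict(list)
    for span, text in mentions:
        clusters[text.lower()].append(span)
    return list(clusters.values())

# Mentions as (character span, surface text) pairs from one document.
mentions = [((0, 5), "Marie"), ((10, 13), "she"),
            ((30, 35), "Marie"), ((50, 53), "she")]
print(cluster_mentions(mentions))
# groups the two "Marie" spans together and the two "she" spans together
```

A real system would add further sieves (head match, pronoun agreement, and so on) on top of this one.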
Twitter Analysis to Predict the Satisfaction of Saudi Telecommunication Companies' Customers
The flexibility in mobile communications allows customers to quickly switch from one service provider to
another, making customer churn one of the most critical challenges for the data and voice telecommunication
service industry. In 2019, the percentage of post-paid telecommunication customers in Saudi Arabia
decreased; this represents a great deal of customer dissatisfaction and subsequent corporate fiscal losses.
Many studies correlate customer satisfaction with customer churn. Telecom companies have depended
on historical customer data to measure churn. However, historical data does not reveal current
customer satisfaction or a customer's future likelihood of switching between telecom companies, and
current methods of analysing churn rates are inadequate, particularly in the Saudi market.
This research was conducted to realize the relationship between customer satisfaction and customer churn
and how to use social media mining to measure customer satisfaction and predict customer churn.
This research conducted a systematic review to address the churn prediction models problems and their
relation to Arabic Sentiment Analysis. The findings show that the current churn models lack integrating
structural data frameworks with real-time analytics to target customers in real-time. In addition, the findings
show that the specific issues in the existing churn prediction models in Saudi Arabia relate to the Arabic
language itself, its complexity, and lack of resources.
As a result, I constructed the first gold-standard corpus of Saudi tweets related to telecom companies,
comprising 20,000 manually annotated tweets, together with a dialect sentiment lexicon extracted from a
larger Twitter dataset I collected to capture the characteristics of social-media text. I developed a
new Arabic Sentiment Analysis (ASA) prediction model for telecommunication that fills the gaps identified
in the ASA literature and fits the telecommunication field. The proposed model proved effective for
Arabic sentiment analysis and churn prediction. This is the first work to use Twitter mining to predict
potential customer loss (churn) in Saudi telecom companies. Because the proposed model is based on text
mining, applying it to fields with different features, such as education, would also be of interest.
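As a hedged sketch of how a dialect sentiment lexicon can feed a churn signal, the example below scores tweets against a tiny illustrative lexicon and flags customers whose average tweet sentiment falls below a threshold. The lexicon entries, the `churn_risk` threshold, and whitespace tokenisation are all assumptions for illustration, not the thesis's actual model.

```python
# Illustrative lexicon: word -> polarity. These entries are examples,
# not taken from the thesis's gold-standard corpus.
LEXICON = {"ممتاز": 1, "سريع": 1, "بطيء": -1, "سيء": -1, "انقطاع": -1}

def sentiment_score(tweet_tokens):
    """Sum lexicon polarities over the tweet's tokens."""
    return sum(LEXICON.get(tok, 0) for tok in tweet_tokens)

def churn_risk(tweets, threshold=-1):
    """Flag a customer as a churn risk when the mean sentiment of their
    tweets falls at or below the threshold (a hypothetical rule)."""
    scores = [sentiment_score(t.split()) for t in tweets]
    return sum(scores) / len(scores) <= threshold
```

In the thesis the sentiment model is learned from 20,000 annotated tweets; the lexicon lookup here only stands in for that component.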
Arabic Language Processing for Text Classification. Contributions to Arabic Root Extraction Techniques, Building An Arabic Corpus, and to Arabic Text Classification Techniques.
The impact and dynamics of Internet-based resources for Arabic-speaking users are increasing in significance, depth, and breadth at a higher pace than ever, and thus require updated mechanisms for the computational processing of Arabic texts. Arabic is a complex language and as such requires in-depth investigation to analyse and improve available automatic processing techniques, such as root extraction methods and text classification techniques, and to develop text collections that are already labeled, whether with single or multiple labels.
This thesis proposes new ideas and methods to improve available automatic processing techniques for Arabic texts. Any automatic processing technique requires data in order to be used, critically reviewed, and assessed, so an attempt to develop a labeled Arabic corpus is also made here. This thesis is composed of three parts: 1- Arabic corpus development, 2- proposing, improving, and implementing root extraction techniques, and 3- proposing and investigating the effect of different pre-processing methods on single-labeled text classification methods for Arabic.
This thesis first develops an Arabic corpus, prepared here for testing both root extraction methods and single-label text classification techniques. It then enhances a rule-based root extraction method by handling irregular cases (which appear in about 34% of texts), proposes and implements two expanded algorithms as well as an adjustment to a weight-based method, incorporates the irregular-case handling algorithm into all of them, and compares the performance of these proposed methods with the originals. The thesis also develops a root extraction system that handles foreign Arabized words by constructing a list of about 7,000 foreign words. The technique with the best accuracy in extracting the correct stem and root for words in texts, an enhanced rule-based method, is used in the third part of the thesis. Finally, the thesis proposes and implements a variant term frequency-inverse document frequency (TF-IDF) weighting method, and investigates how different choices of features in document representation (words, stems, or roots, optionally extended with their respective phrases) affect single-label text classification performance. Forty-seven classifiers are applied to all proposed representations and their performances compared. One challenge for researchers in Arabic text processing is that root extraction techniques reported in the literature are either inaccessible or take a long time to reproduce, while no labeled benchmark Arabic text corpus is fully available online; moreover, to date few machine learning techniques have been investigated on Arabic beyond the usual pre-processing steps before classification. These challenges are addressed in this thesis by developing a new labeled Arabic text corpus for extended applications of computational techniques.
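For context, the standard TF-IDF weighting that the proposed variant builds on can be sketched as follows; the thesis's actual variant is not reproduced here.

```python
import math

def tfidf(docs):
    """Standard TF-IDF: weight(t, d) = tf(t, d) * log(N / df(t)),
    where tf is the term's relative frequency in the document and
    df is the number of documents containing the term."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        tf = {t: doc.count(t) / len(doc) for t in set(doc)}
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["root", "stem", "root"], ["stem", "word"]]
w = tfidf(docs)
# "root" occurs only in the first document, so it gets a positive
# weight there; "stem" occurs in both, so log(2/2) = 0 zeroes it out.
```

The document representations compared in the thesis (words, stems, roots) would simply change what the tokens in `docs` are before weighting.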
Results show that the proposed algorithm for handling irregular words in Arabic improved the performance of all implemented root extraction techniques. The performance of this irregular-case algorithm is evaluated in terms of accuracy improvement and execution time; its efficiency is investigated for different document lengths and is empirically found to be linear in time for document lengths less than about 8,000. Among the implemented root extraction methods, the rule-based technique improved the most when the irregular-case handling algorithm was included. This thesis validates that choosing roots or stems instead of words in document representations significantly improves single-label classification performance for most of the classifiers used, whereas extending such representations with their respective phrases yields no significant improvement. Many classifiers, such as the ripple-down rule classifier, had not previously been tested on Arabic. Comparing the classifiers' performances shows that the Bayesian network classifier is significantly the best in terms of accuracy, training time, and root mean square error for all proposed and implemented representations. (Petra University, Amman, Jordan)
Enhancing extremist data classification through textual analysis
The high volume of extremist material on the Internet has created the need for intelligence gathering via the Web and real-time monitoring of potential websites for evidence of extremist activity. However, manual classification of such content is practically difficult and time-consuming. In response to this challenge, the work reported here developed several classification frameworks, each providing a basis of text representation before being fed into a machine learning algorithm. The bases of text representation are Sentiment-rule, Posit textual analysis with word-level features, and an extension of Posit analysis, known as Extended-Posit, which adopts character-level as well as word-level data. Gaps identified in these techniques created avenues for further improvement, especially in handling larger datasets with better classification accuracy.
Consequently, a novel basis of text representation, known as the Composite-based method, was developed. This is a computational framework that combines the sentiment and syntactic features of the textual content of a Web page. These techniques were then applied to a dataset that had been manually classified and fed into machine learning algorithms, to measure how well each page could be classified into its appropriate class. The classifiers considered include neural networks (RNN and MLP) and machine learning classifiers such as J48, Random Forest, and KNN. In addition, feature selection and model optimisation were evaluated to assess the cost of creating the machine learning models.
Considering all of the results obtained from each framework, composite features proved preferable to solely syntactic or sentiment features, offering improved classification accuracy when used with machine learning algorithms. Furthermore, the extension of Posit analysis to include both word- and character-level data outperformed word-level features alone when applied to the assembled textual data. Moreover, the Random Forest classifier outperformed the other classifiers explored. Taking cost into account, feature selection improves classification accuracy and saves more time than hyperparameter tuning (model optimisation).
Language-independent pre-processing of large document bases for text classification
Text classification is a well-known topic in the research of knowledge discovery in
databases. Algorithms for text classification generally involve two stages. The first
is concerned with identification of textual features (i.e. words and/or phrases) that
may be relevant to the classification process. The second is concerned with
classification rule mining and categorisation of "unseen" textual data. The first
stage is the subject of this thesis and often involves an analysis of text that is both
language-specific (and possibly domain-specific), and that may also be
computationally costly especially when dealing with large datasets. Existing
approaches to this stage are not, therefore, generally applicable to all languages. In
this thesis, we examine a number of alternative keyword selection methods and
phrase generation strategies, coupled with two potential significant word list
construction mechanisms and two final significant word selection mechanisms, to
identify such words and/or phrases in a given textual dataset that are expected to
serve to distinguish between classes, by simple, language-independent statistical
properties. We present experimental results, using common (large) textual datasets
presented in two distinct languages, to show that the proposed approaches can
produce good performance with respect to both classification accuracy and
processing efficiency. In other words, the study presented in this thesis
demonstrates the possibility of efficiently solving the traditional text classification
problem in a language-independent (also domain-independent) manner.
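A language-independent keyword selection of the kind described above can be sketched with nothing but token counts; the frequency-contrast score below is one illustrative choice among possible statistical properties, not the thesis's exact mechanism.

```python
from collections import Counter

def significant_words(class_docs, top_k=2):
    """Rank words by how unevenly they are distributed across classes,
    using only token counts: no stemming, stop lists, or other
    language-specific resources."""
    counts = {c: Counter(tok for d in docs for tok in d)
              for c, docs in class_docs.items()}
    vocab = set().union(*counts.values())
    def contrast(w):
        freqs = [counts[c][w] for c in counts]
        return max(freqs) - min(freqs)  # big gap => class-discriminating
    return sorted(vocab, key=contrast, reverse=True)[:top_k]

# Tokenised documents grouped by class (illustrative data).
docs = {"sport": [["goal", "match", "team"], ["goal", "win"]],
        "finance": [["stock", "market"], ["stock", "profit", "match"]]}
```

Here `significant_words(docs)` selects "goal" and "stock", while "match", which appears evenly in both classes, scores zero; the same code runs unchanged on any language's tokens.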
Real Time Crime Prediction Using Social Media
There is no doubt that crime is on the increase and has a detrimental influence on a nation's economy, despite numerous studies on crime prediction aimed at minimising crime rates. Historically, data mining techniques for crime prediction have relied on historical information and are mostly country-specific; in fact, only a few of the earlier studies on crime prediction follow a standard data mining procedure. Given the current worldwide trend in which criminals routinely publish their criminal intent on social media and invite others to view and/or engage in different crimes, a more dynamic strategy is needed. The goal of this research is to improve the performance of crime prediction models. This thesis therefore explores the potential of using information from social media (Twitter) for crime prediction in combination with historical crime data. Using data mining techniques, it also identifies the feature engineering most relevant to the United Kingdom dataset that could improve crime prediction model performance. Additionally, this study presents a function that could be used by every state in the United Kingdom for data cleansing, pre-processing, and feature engineering, and a Shiny app was used to display tweet sentiment trends so that crime can be prevented in near-real time. Exploratory analysis is essential for revealing the data pre-processing and feature engineering needed before feeding the data into a machine learning model for efficient results. Based on earlier documented studies, this is the first research to conduct a full exploratory analysis of historical British crime statistics using the stop-and-search historical dataset. Based on the findings from the exploratory study, an algorithm was created to clean the data and prepare it for further analysis and model creation.
This is a significant contribution because it provides a ready dataset for future research, particularly for non-experts constructing models to forecast crime or conducting investigations across the roughly 32 police districts of the United Kingdom. Moreover, this is the first study to present a complete collection of geo-spatial parameters for training a crime prediction model by combining demographic data from the same source in the United Kingdom with hourly sentiment polarity that was not restricted to a Twitter keyword search. Six base models frequently mentioned in the previous literature were selected, trained on the stop-and-search historical crime dataset, evaluated on test data, and finally validated with the London and Kent crime datasets. Two datasets were created from Twitter and historical data (historical crime data with Twitter sentiment scores, and historical data without them). Six of the most prevalent machine learning classifiers (Random Forest, Decision Tree, K-nearest neighbour, support vector machine, neural network, and naïve Bayes) were trained and tested on these datasets, and the hyperparameters of each of the six models were tuned using random grid search. Voting classifiers and a logistic regression stacked ensemble of different models were also trained and tested on the same datasets to enhance the individual model performance. In addition, two combinations of stacked ensembles of multiple models were constructed to enhance and choose the most suitable models for crime prediction; based on their performance, the appropriate prediction model for the UK dataset would be selected.
In terms of interpretation, this research differs from most earlier studies that employed Twitter data in that several methodologies were used to show how each attribute contributed to the construction of the model, and the findings were discussed and interpreted in the context of the study. Further, a Shiny app visualisation tool was designed to display the tweets' sentiment scores, text, users' screen names, and vicinity, allowing the investigation of any criminal actions in near-real time. The evaluation of the models revealed that Random Forest, Decision Tree, and K-nearest neighbour outperformed the other models; however, Decision Tree and Random Forest performed better consistently when evaluated on test data.
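A hypothetical sketch of the joining step described above, attaching hourly sentiment polarity to historical crime records before training; the field names and the neutral default are illustrative assumptions, not taken from the UK datasets.

```python
def join_sentiment(crime_records, hourly_sentiment):
    """Attach the mean sentiment polarity for each record's district
    and hour; records with no matching tweets get a neutral 0.0.
    Keys and field names here are hypothetical."""
    rows = []
    for rec in crime_records:
        key = (rec["district"], rec["hour"])
        rows.append(dict(rec, sentiment=hourly_sentiment.get(key, 0.0)))
    return rows

crimes = [{"district": "Kent", "hour": 22, "type": "theft"}]
sent = {("Kent", 22): -0.4}
print(join_sentiment(crimes, sent))
# [{'district': 'Kent', 'hour': 22, 'type': 'theft', 'sentiment': -0.4}]
```

The two training datasets in the study then correspond to the rows with and without this `sentiment` column.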
Features, positions and affixes in autonomous morphological structure
Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Linguistics and Philosophy, 1992. Includes bibliographical references (leaves 307-319). By Robert Rolf Noyer.