30 research outputs found
Improving Kurdish Web Mining through Tree Data Structure and Porter’s Stemmer Algorithms
Stemming is one of the most important preprocessing techniques for improving the accuracy of text classification. Its key purpose is to conflate words that share the same stem, thereby reducing the high dimensionality of the feature space. A smaller feature space shortens the time needed to construct a model and reduces memory usage. In this paper, a new stemming approach is explored for enhancing Kurdish text classification performance. Tree data structures and Porter's stemmer algorithm are combined to build the proposed approach. The system is assessed using Support Vector Machine (SVM) and Decision Tree (C4.5) classifiers to illustrate performance before and after applying the suggested stemmer. Furthermore, the effect of stop-word removal is considered before and after implementing the suggested approach.
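The combination of a tree data structure with suffix-stripping rules can be sketched as follows. This is a minimal illustration, not the paper's actual Kurdish stemmer: the suffix rules and vocabulary here are placeholder English examples, and the trie simply validates that a stripped form is a known stem.

```python
# Minimal sketch: a trie of known stems validates the output of
# Porter-style suffix stripping. Rules and stems are illustrative
# placeholders, not the paper's Kurdish rules.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_stem = False

class StemTrie:
    """Stores known stems so stripped candidates can be validated quickly."""
    def __init__(self, stems):
        self.root = TrieNode()
        for s in stems:
            self.insert(s)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_stem = True

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_stem

SUFFIXES = ["ing", "ed", "s"]  # placeholder Porter-style rules

def stem(word, trie):
    """Strip the longest suffix that leaves a stem present in the trie."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and trie.contains(word[: -len(suf)]):
            return word[: -len(suf)]
    return word

trie = StemTrie(["walk", "talk"])
print(stem("walking", trie))  # walk
print(stem("talks", trie))    # talk
```

Validating against the trie is what keeps blind suffix stripping from producing non-words, which is the point of pairing the two structures.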
Central Kurdish Sentiment Analysis Using Deep Learning
Sentiment Analysis (SA), a type of opinion mining and a more general task than polarity detection, is widely used for analyzing users' reviews or comments expressed online. It is implemented using various techniques, among which Artificial Neural Networks (ANNs) are the most popular. This paper addresses the development of an SA system for the Central Kurdish language (CKB) using deep learning. The efficiency and robustness of an SA system rely on a strong language model, and creating and training such a model requires a large text corpus; we therefore created a corpus of 300 million tokens for CKB. To train the SA model, we collected 14,881 Facebook comments, which were then labeled manually. A combination of Word2Vec for the language model and Long Short-Term Memory (LSTM) for the classifier is used to build an SA model on the CKB SA dataset. These deep learning techniques are the best-known methods in this field and have achieved high performance in SA for various languages. The proposed method reaches 71.35% accuracy for three-class SA, which is superior to the best previously reported result for CKB.
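Before comments can reach a Word2Vec + LSTM pipeline like the one described, they must be tokenized and mapped to fixed-length index sequences. The following stdlib-only sketch shows that preprocessing step; the whitespace tokenizer, vocabulary scheme, and toy comments are assumptions for illustration, not the paper's actual CKB pipeline.

```python
# Stdlib sketch of the preprocessing that feeds a Word2Vec + LSTM
# sentiment pipeline: build a vocabulary, then encode each comment as
# a fixed-length sequence of word indices (padded for the LSTM).
from collections import Counter

PAD, UNK = 0, 1  # reserved indices for padding and unknown words

def build_vocab(comments, min_count=1):
    counts = Counter(tok for c in comments for tok in c.split())
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for tok, n in counts.most_common():
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def encode(comment, vocab, max_len=8):
    ids = [vocab.get(tok, UNK) for tok in comment.split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))  # right-pad to fixed length

comments = ["this film is great", "this film is bad"]  # toy stand-ins
vocab = build_vocab(comments)
x = [encode(c, vocab) for c in comments]
```

In the full pipeline, these index sequences would pass through an embedding layer initialized from Word2Vec vectors and then into the LSTM classifier.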
An extensive dataset of handwritten central Kurdish isolated characters
To collect handwritten samples of the isolated Kurdish characters, each character was printed on a 14 × 9 grid on A4 paper. Each sheet contained only one printed character, so that the volunteers knew which character to write on it. Each sheet was then scanned, sliced, and cropped with a macro in Photoshop to ensure the same process was applied to all characters. The grids were filled mainly by volunteer students from several universities in Erbil.
Improvement performance by using Machine learning algorithms for fake news detection
The prevalence of internet use and the volume of real-time data created and shared on social media sites and applications have raised the risk of spreading harmful or misleading content, engaging in unlawful activity, abusing others, and disseminating false information. To date, few studies have addressed fake news detection in the Kurdish language. For highly resourced languages such as Arabic, English, and other international languages, fake news detection is a well-researched subject. Less-resourced languages, however, remain out of focus because there is no labeled fake-news corpus, no fact-checking website, and no access to NLP tools. This paper illustrates the process of identifying fake news using a dataset with two components, fake news and real news. After feature selection, several classifiers were applied to the data. The results of the proposed study demonstrate that the Passive-Aggressive Classifier (PAC) outperformed the other classifiers on both parts of the dataset, with an accuracy of 93.0 percent; the remaining classifiers also achieved high accuracy, at around 90 percent.
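The Passive-Aggressive update rule behind the best-performing classifier is simple enough to sketch directly. This is a generic PA-I implementation on toy two-dimensional features, not the study's actual fake-news features or hyperparameters.

```python
# Pure-Python sketch of the Passive-Aggressive (PA-I) online update:
# stay "passive" when the hinge loss is zero, update "aggressively"
# (up to cap C) when a sample is misclassified or inside the margin.

def pa_update(w, x, y, C=1.0):
    """One PA-I step: y in {-1, +1}, x a dense feature vector."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)            # hinge loss
    norm_sq = sum(xi * xi for xi in x) or 1.0
    tau = min(C, loss / norm_sq)             # step size, capped by C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

def train(samples, dim, epochs=5):
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in samples:
            w = pa_update(w, x, y)
    return w

# toy data: feature 0 marks "fake" wording, feature 1 marks "real" wording
samples = [([1.0, 0.0], -1), ([0.0, 1.0], +1)]
w = train(samples, dim=2)
predict = lambda x: +1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
```

In practice the inputs would be TF-IDF-style text vectors rather than hand-made features, but the update rule is the same.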
Assessing relevance using automatically translated documents for cross-language information retrieval
This thesis focuses on the Relevance Feedback (RF) process, and the scenario considered is that of a Portuguese-English Cross-Language Information Retrieval (CLIR) system. CLIR deals with the retrieval of documents in one natural language in response to a query expressed in another language. RF is an automatic process for query reformulation. The idea behind it is that users are unlikely to produce perfect queries, especially if given just one attempt. The process aims at improving the query specification, which will lead to more relevant documents being retrieved. The method consists of asking the user to analyse an initial sample of documents retrieved in response to a query and judge them for relevance.
In that context, two main questions were posed. The first one relates to the user's ability in assessing the relevance of texts in a foreign language, texts hand translated into their language and texts automatically translated into their language. The second question concerns the relationship between the accuracy of the participant's judgements and the improvement achieved through the RF process.
In order to answer those questions, this work performed an experiment in which Portuguese speakers were asked to judge the relevance of English documents, documents hand-translated to Portuguese, and documents automatically translated to Portuguese. The results show that machine translation is as effective as hand translation in aiding users to assess relevance. In addition, the impact of misjudged
documents on the performance of RF is overall just moderate, and varies greatly for different query topics.
This work advances the existing research on RF by considering a CLIR scenario and carrying out user experiments, which analyse aspects of RF and CLIR that had remained unexplored until now. The contributions of this work also include: the investigation of CLIR using a new language pair; the design and implementation of a stemming algorithm for Portuguese; and several experiments using Latent Semantic Indexing, which contribute data points to CLIR theory.
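The query-reformulation step at the heart of RF is classically done with the Rocchio formula: move the query vector toward the centroid of documents judged relevant and away from those judged non-relevant. The sketch below uses toy term vectors and standard textbook weights; the thesis itself works in a Portuguese-English CLIR setting, not this toy vocabulary.

```python
# Minimal Rocchio relevance-feedback sketch:
# q' = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant)

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    def centroid(docs):
        if not docs:
            return [0.0] * len(query)
        return [sum(col) / len(docs) for col in zip(*docs)]
    rel_c, nonrel_c = centroid(relevant), centroid(nonrelevant)
    return [alpha * q + beta * r - gamma * s
            for q, r, s in zip(query, rel_c, nonrel_c)]

# term order: ["bank", "river", "money"] (toy vocabulary)
query = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0]]     # document the user judged relevant
nonrelevant = [[1.0, 0.0, 1.0]]  # document judged non-relevant
new_q = rocchio(query, relevant, nonrelevant)
```

After reformulation, the weight of "river" rises and "money" becomes negative, so the next retrieval round favours the relevant sense. This is also where misjudged documents hurt: a wrong relevance judgement pulls the query toward the wrong centroid.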
Automatic detection of clusters and switches in Turkish semantic verbal fluency data
Verbal fluency tests are popular measures of executive function. These tests involve listing as many words from a given category as possible in a short time, typically 60 seconds. In phonemic verbal fluency tests, these words should begin with the same letter; in semantic verbal fluency tests (SVF), they should belong to the same category, e.g., animals. SVF is quick to administer, amenable to semi-automated analysis, and can be used to screen for cognitive impairments such as dementia. Troyer and collaborators proposed a fine-grained analysis method for SVF sequences that divides them into clusters, i.e., sequences of more closely semantically related words. Useful metrics that can be derived from such an analysis include mean cluster size and the number of switches between clusters. The aim of this thesis is to develop semi-automated methods to extract cluster- and switch-related metrics from Turkish SVF sequences.
First, we conducted a systematic review of studies that report SVF performance of healthy adult native Turkish speakers, using international and Turkish databases, including unpublished theses. We particularly focused on normative data and commonly used methods for collecting and analysing SVF data. We found that all included papers reported SVF sequences using the animal category, followed by first names. Considering the size of the Turkish diaspora, there was a lack of studies comparing monolingual speakers to bilingual speakers. Detailed analyses beyond word count, such as perseverations, category violations, and clustering/switching, were only rarely reported. Semi-automatic and automatic approaches were almost never used. The thesis therefore fills a clear gap in the literature.
For our work on Turkish, we chose two computational approaches that can be easily adapted to languages with comparatively few corpus resources: a simple bigram method and a vector-space model (word2vec). We initially implemented and tested these methods on a Spanish dataset comprising 50 healthy participants and 14 participants diagnosed with familial AD. Both computational models positioned switches very similarly to manual annotations, achieving F1=0.756 for the bigram method and F1=0.8309 for word2vec. There was no difference in cluster sizes (p>0.01), but healthy participants produced significantly more switches (p<0.001). These findings hold for both the manual and the automatic analysis.
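The core idea shared by both approaches can be sketched as threshold-based switch detection: place a switch wherever the semantic relatedness of two consecutive words drops below a threshold, then derive cluster metrics from the switch positions. The toy vectors and threshold below are illustrative assumptions standing in for trained word2vec embeddings and the thesis's tuned settings.

```python
# Sketch of similarity-threshold switch detection for SVF sequences.
# Toy 2-d vectors stand in for word2vec embeddings.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def find_switches(words, vectors, threshold=0.5):
    """Indices i where a switch occurs between words[i-1] and words[i]."""
    return [i for i in range(1, len(words))
            if cosine(vectors[words[i - 1]], vectors[words[i]]) < threshold]

def mean_cluster_size(words, switches):
    """Switches partition the sequence into len(switches)+1 clusters."""
    return len(words) / (len(switches) + 1)

vectors = {  # toy embeddings: pets vs. farm animals
    "cat": [1.0, 0.1], "dog": [0.9, 0.2],
    "cow": [0.1, 1.0], "sheep": [0.2, 0.9],
}
seq = ["cat", "dog", "cow", "sheep"]
switches = find_switches(seq, vectors)  # one switch, between dog and cow
```

A bigram method fits the same frame by replacing the cosine score with a corpus-derived co-occurrence statistic for the word pair.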
Since there are no public datasets of Turkish SVF data, we collected SVF data online from native speakers of Turkish with no self-reported cognitive impairments, living both in Turkey and abroad. To the best of our knowledge, this is the first online spoken corpus of SVF for Turkish. The study used the three categories most frequently used in Turkish SVF data that have also been reported for other languages, namely animals, fruits and vegetables, and supermarket items. The study had two parts: an initial Qualtrics survey for screening and collecting relevant participant information, and a web-based app for collecting three SVF sequences. 286 participants consented to take part in the survey, and 137 (47.9%) continued on to the SVF app. In total, we collected 311 SVF sequences (Animals=105, Vegetables and Fruits=105, Supermarket Items=101) from 137 adults. The mean number of items produced per category is 25.04 (SD=X) for animals, 25.32 (SD=Y) for fruits and vegetables, and 25.97 (SD=Z) for supermarket items. Overall, the data quality of the recorded sequences was good. The reasons for the drop-off between the survey and SVF data collection need to be investigated in further work.
Finally, we adapted the computational techniques used for Spanish to the Turkish SVF data and assessed their ability to replicate clustering- and switching-based metrics. We found that both the bigram method and word2vec performed satisfactorily. There was no significant difference in cluster sizes, and switch numbers were highly correlated (p<0.001). In terms of predicting switch positions, word2vec reached F1=0.738 and the bigram method achieved F1=0.66. Next, we examined whether findings obtained from manual annotation of clusters and switches could be replicated using metrics derived from the two computational methods. Specifically, we investigated cluster size and switch numbers between male and female participants (sex) and between mono- and multilingual participants (multilinguality). Based on the manual analysis, we established that male participants created larger clusters than female participants but used a similar number of switches. There were no significant differences between monolingual and multilingual participants. Both findings are in line with the existing literature on Turkish SVF. While the bigram method and word2vec yielded similar results regarding the number of switches, only word2vec-derived metrics replicated the difference in cluster size between male and female participants.
In future work, other computational approaches, such as large language models, should be explored; automatic speech recognition should be integrated to eliminate the need for manual transcription; and additional speech-based features could be investigated. Finally, user experience research may help to improve online data collection and reduce the number of participants who drop out of the study before speech data collection.
Minimally-supervised Methods for Arabic Named Entity Recognition
Named Entity Recognition (NER) has attracted much attention over the past twenty years as a main task of Information Extraction. The currently dominant techniques for NER are supervised methods that can achieve high performance but require new manually annotated data for every change of domain and/or genre. Our work focuses on approaches that make it possible to tackle new domains with minimal human intervention to identify Named Entities (NEs) in Arabic text. Specifically, we investigate two minimally-supervised methods: semi-supervised learning and distant learning. Our semi-supervised algorithm for identifying NEs requires neither annotated training data nor gazetteers; it only requires, for each NE type, a seed list of a few instances to initiate the learning process. Novel aspects of our algorithm include (i) a new way to produce and generalise the extraction patterns, (ii) a new filtering criterion to remove noisy patterns, and (iii) a comparison of two ranking measures for determining the most reliable candidate NEs. Next, we present our methodology for exploiting Wikipedia structure to automatically develop an Arabic NE annotated corpus. A novel mechanism is introduced, based on the high coverage of Wikipedia, to address two challenges particular to tagging NEs in Arabic text: rich morphology and the absence of capitalisation. Neither technique has yet achieved performance levels comparable to those of supervised methods. Semi-supervised algorithms tend to have high precision but comparatively low recall, whereas distant learning tends to achieve higher recall but lower precision. Therefore, we present a novel approach to Arabic NER using a combination of semi-supervised and distant learning techniques. We used a variety of classifier combination schemes, including the Bayesian Classifier Combination (BCC) procedure, recently proposed for sentiment analysis. According to our results, the BCC model leads to an increase in performance of 8 percentage points over the best minimally-supervised classifier.
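The seed-based bootstrapping loop can be sketched in miniature: induce surface patterns from the contexts of seed entities, then use those patterns to harvest new candidates. English toy sentences stand in for Arabic text here, and the pattern generalisation, filtering, and ranking steps the abstract describes are greatly simplified away.

```python
# Toy bootstrapping sketch for minimally-supervised NER:
# seeds -> context patterns -> new candidate entities.

sentences = [
    "president Obama visited Cairo",
    "president Macron visited Berlin",
    "the river Nile is long",
]
seeds = {"Obama"}  # one seed instance of the PERSON type

def induce_patterns(sentences, entities):
    """A pattern is the (word before, word after) pair around a known entity."""
    patterns = set()
    for s in sentences:
        toks = s.split()
        for i, tok in enumerate(toks):
            if tok in entities and 0 < i < len(toks) - 1:
                patterns.add((toks[i - 1], toks[i + 1]))
    return patterns

def harvest(sentences, patterns):
    """Collect words whose context matches an induced pattern."""
    found = set()
    for s in sentences:
        toks = s.split()
        for i in range(1, len(toks) - 1):
            if (toks[i - 1], toks[i + 1]) in patterns:
                found.add(toks[i])
    return found

patterns = induce_patterns(sentences, seeds)
candidates = harvest(sentences, patterns) - seeds  # newly learned entities
```

In a full system this loop iterates, with the filtering criterion pruning noisy patterns and the ranking measure deciding which candidates are reliable enough to become seeds for the next round.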
Extraction of Arabic Word Roots: An Approach Based on a Computational Model and Multi-Backpropagation Neural Networks
Stemming is a process of extracting the root of a given word by stripping off the affixes attached to it. Many attempts have been made to address the problem of stemming Arabic words. The majority of existing Arabic stemming algorithms require a complete set of morphological rules and large vocabulary lookup tables. Furthermore, many of them give more than one potential stem or root for a given Arabic word. According to Ahmad [11], the Arabic stemming process based on the language's morphological rules is still a very difficult task due to the nature of the language itself. The limitations of current Arabic stemming methods have motivated this research, in which we investigate a novel approach to extracting the roots of Arabic words, named here the MUAIDI-STEMMER [2]. This approach attempts to exploit numerical relations between Arabic letters, avoiding the need for a list of the root and pattern of each word in the language, and giving a single root solution. The approach is composed of two phases. Phase I depends on basic calculations derived from a linguistic analysis of Arabic patterns and affixes. Phase II is based on an artificial neural network trained with the backpropagation learning rule. In this phase, we formulate root extraction as a classification problem and use the neural network as a classifier. This study demonstrates that a neural network can be effectively used to extract the roots of Arabic words.
The stemmer developed is tested on 46,895 Arabic word types [3]. Error-counting accuracy evaluation was employed to assess the performance of the stemmer. It successfully produced the stems of 44,107 Arabic words from the given test datasets, an accuracy of 94.81%.
[2] Muaidi is the author's father's name.
[3] Types are distinct or unique words.
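For contrast with the rule-table approaches the thesis criticises, a generic Arabic light-stemming baseline can be sketched in a few lines: strip one common prefix and one common suffix when enough letters remain. This is emphatically not the MUAIDI-STEMMER itself, and the affix lists are small illustrative samples.

```python
# Illustrative Arabic light-stemming baseline (not the MUAIDI-STEMMER):
# strip a common prefix and suffix if the remainder is long enough.

PREFIXES = ["وال", "بال", "ال", "و"]   # e.g. conjunction + definite article
SUFFIXES = ["ات", "ون", "ين", "ة"]     # common plural / feminine endings

def light_stem(word, min_len=3):
    # longest matching prefix first, then longest matching suffix
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word

print(light_stem("والكتاب"))  # كتاب ("and the book" -> "book")
```

Such surface stripping can return a stem rather than a true root and offers no single-solution guarantee, which is exactly the limitation the two-phase numerical/neural approach is designed to overcome.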