
    Improving Kurdish Web Mining through Tree Data Structure and Porter’s Stemmer Algorithms

    Stemming is one of the most important preprocessing techniques for enhancing the accuracy of text classification. Its key purpose is to merge words that share the same stem, thereby reducing the high dimensionality of the feature space. A smaller feature space shortens the time needed to construct a model and reduces memory usage. In this paper, a new stemming approach is explored for enhancing Kurdish text classification performance. A tree data structure and Porter's stemmer algorithm are combined to build the proposed approach. The system is assessed using Support Vector Machine (SVM) and Decision Tree (C4.5) classifiers to illustrate its performance before and after applying the suggested stemmer. Furthermore, the usefulness of stop-word removal is considered before and after implementing the suggested approach.
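
    The paper's implementation is not reproduced here, but a minimal sketch can illustrate the general idea of combining a trie (tree data structure) of known stems with Porter-style longest-suffix stripping. The stem lexicon and suffix inventory below are illustrative placeholders, not real Kurdish morphology.

```python
# Minimal sketch: a trie (tree data structure) of known stems plus
# Porter-style longest-suffix stripping. The stems and suffix list
# are illustrative placeholders, not real Kurdish morphology.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_stem = False

class StemTrie:
    def __init__(self, stems):
        self.root = TrieNode()
        for stem in stems:
            node = self.root
            for ch in stem:
                node = node.children.setdefault(ch, TrieNode())
            node.is_stem = True

    def longest_stem_prefix(self, word):
        """Longest prefix of `word` that is a known stem, or None."""
        node, best = self.root, None
        for i, ch in enumerate(word):
            node = node.children.get(ch)
            if node is None:
                break
            if node.is_stem:
                best = word[: i + 1]
        return best

SUFFIXES = sorted(["ekan", "eke", "an", "e"], key=len, reverse=True)

def stem(word, trie):
    # Prefer a trie hit; otherwise strip the longest matching suffix.
    known = trie.longest_stem_prefix(word)
    if known:
        return known
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return word[: -len(suf)]
    return word

trie = StemTrie(["ktaw", "mamosta"])  # placeholder stem lexicon
print(stem("ktaweke", trie))          # -> "ktaw" (trie hit)
```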

    Central Kurdish Sentiment Analysis Using Deep Learning

    Sentiment Analysis (SA), a type of opinion mining broader than polarity detection, is widely used for analysing users' reviews and comments in online text, and is implemented using various techniques, among which Artificial Neural Networks (ANNs) are the most popular. This paper addresses the development of an SA system for the Central Kurdish language (CKB) using deep learning. Increasing the efficiency and strength of an SA system relies on a robust language model, and creating and training such a model requires a large text corpus; we therefore created a corpus of 300 million tokens for CKB. To train the SA model, we also collected 14,881 Facebook comments, which were labelled manually. A combination of Word2Vec for the language model and Long Short-Term Memory (LSTM) for the classifier is used to create an SA model on the CKB SA dataset. These deep learning techniques are among the best-known methods in this field and have achieved high performance in SA for various languages. The proposed method reaches 71.35% accuracy for three-class SA, which is superior to the best previously reported result for CKB.
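
    A brief sketch of the described pipeline may be helpful: gensim's Word2Vec supplies pretrained embeddings that initialise the embedding layer of a Keras LSTM classifier with a three-class softmax output. The toy corpus and all hyperparameters below are placeholders, not the thesis's actual settings.

```python
# Sketch of the Word2Vec + LSTM pipeline described above. The toy
# corpus, vocabulary handling, and hyperparameters are placeholders;
# the thesis trains on a ~300M-token Central Kurdish corpus.
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras import layers, models, initializers

# Placeholder tokenised corpus (stand-in for the CKB corpus).
sentences = [["good", "movie", "great"], ["bad", "service", "awful"]]

w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Build an embedding matrix; index 0 is reserved for padding.
vocab = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}
emb = np.zeros((len(vocab) + 1, 100))
for w, i in vocab.items():
    emb[i] = w2v.wv[w]

model = models.Sequential([
    layers.Embedding(len(vocab) + 1, 100,
                     embeddings_initializer=initializers.Constant(emb),
                     mask_zero=True, trainable=False),
    layers.LSTM(128),
    layers.Dense(3, activation="softmax"),  # three sentiment classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) would then take comments mapped through `vocab`
# and padded to a fixed length, with integer class labels.
```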

    An extensive dataset of handwritten central Kurdish isolated characters

    To collect handwritten forms of individual Kurdish characters, each character was printed on a 14 × 9 grid on A4 paper. Each sheet contains only one printed character, so volunteers knew which character to write on each page. Each sheet was then scanned, spliced, and cropped with a Photoshop macro to ensure the same process was applied to all characters. The character grids were filled mainly by student volunteers from multiple universities in Erbil.
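
    The splicing was done with a Photoshop macro; a rough Python equivalent using Pillow, assuming an evenly spaced 14 × 9 grid with no page margins, might look like the following (filenames are illustrative).

```python
# Rough Python equivalent of the splicing step, assuming an evenly
# spaced 14 x 9 grid with no page margins. The real workflow used a
# Photoshop macro; filenames here are illustrative.
from PIL import Image

ROWS, COLS = 14, 9

def split_grid(page_path, out_prefix):
    page = Image.open(page_path)
    w, h = page.size
    cell_w, cell_h = w // COLS, h // ROWS
    for r in range(ROWS):
        for c in range(COLS):
            box = (c * cell_w, r * cell_h,
                   (c + 1) * cell_w, (r + 1) * cell_h)
            page.crop(box).save(f"{out_prefix}_r{r:02d}_c{c:02d}.png")

split_grid("scanned_page.png", "char")  # one file per handwritten cell
```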

    Improvement performance by using Machine learning algorithms for fake news detection

    The prevalence of internet use and the volume of real-time data created and shared on social media sites and applications have raised the risk of spreading harmful or misleading content, engaging in unlawful activity, abusing others, and disseminating false information. To date, only a few studies have addressed fake news recognition in the Kurdish language. For highly resourced languages like Arabic, English, and other international languages, false news detection is a well-researched subject; less resourced languages, however, receive little attention because there is no labelled fake-news corpus, no fact-checking website, and no access to NLP tools. This paper illustrates the process of identifying fake news using a dataset with two components, fake news and real news. Several classifiers were then applied to the data after feature selection. The results demonstrate that the Passive-Aggressive Classifier (PAC) outperformed the other classifiers on both datasets, with an accuracy score of 93.0 percent; the remaining classifiers also achieved high accuracy, at around 90 percent.
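
    The paper's exact feature extraction is not spelled out above; a common realisation of this kind of pipeline, sketched here under that assumption, feeds TF-IDF features into scikit-learn's PassiveAggressiveClassifier. The toy texts, labels, and split settings are placeholders, not the paper's datasets.

```python
# Sketch: TF-IDF features into a Passive-Aggressive Classifier, one
# common realisation of the pipeline above. Texts, labels, and split
# settings are toy placeholders, not the paper's datasets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

texts = ["miracle cure discovered overnight",
         "parliament passes annual budget",
         "celebrity secretly an alien",
         "court upholds appeal ruling"]
labels = [1, 0, 1, 0]  # 1 = fake, 0 = real

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=0)

vec = TfidfVectorizer()
pac = PassiveAggressiveClassifier(max_iter=50, random_state=0)
pac.fit(vec.fit_transform(X_tr), y_tr)
pred = pac.predict(vec.transform(X_te))
print("accuracy:", accuracy_score(y_te, pred))
```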

    Assessing relevance using automatically translated documents for cross-language information retrieval

    This thesis focuses on the Relevance Feedback (RF) process in the context of a Portuguese-English Cross-Language Information Retrieval (CLIR) system. CLIR deals with the retrieval of documents in one natural language in response to a query expressed in another language. RF is an automatic process for query reformulation. The idea behind it is that users are unlikely to produce perfect queries, especially if given just one attempt. The process aims at improving the query specification, leading to more relevant documents being retrieved. The method consists of asking the user to analyse an initial sample of documents retrieved in response to a query and judge them for relevance. In that context, two main questions were posed. The first relates to users' ability to assess the relevance of texts in a foreign language, texts hand-translated into their language, and texts automatically translated into their language. The second concerns the relationship between the accuracy of participants' judgements and the improvement achieved through the RF process. To answer these questions, an experiment was performed in which Portuguese speakers judged the relevance of English documents, documents hand-translated into Portuguese, and documents automatically translated into Portuguese. The results show that machine translation is as effective as hand translation in aiding users to assess relevance. In addition, the impact of misjudged documents on the performance of RF is overall only moderate and varies greatly across query topics. This work advances existing research on RF by considering a CLIR scenario and carrying out user experiments that analyse aspects of RF and CLIR unexplored until now. The contributions also include: the investigation of CLIR using a new language pair; the design and implementation of a stemming algorithm for Portuguese; and several experiments using Latent Semantic Indexing, which contribute data points to CLIR theory.
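
    The abstract does not specify the thesis's exact RF formulation; a classic choice, shown here purely to illustrate the mechanism, is Rocchio query reformulation over TF-IDF vectors, q' = a·q + b·mean(relevant) − c·mean(non-relevant). The documents, query, and weights below are placeholders.

```python
# Illustration of Rocchio-style relevance feedback over TF-IDF
# vectors: q' = a*q + b*mean(relevant) - c*mean(non-relevant).
# Documents, query, and weights are toy placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["retrieval of translated documents",
        "recipes for cooking pasta",
        "cross language document retrieval"]
vec = TfidfVectorizer()
D = vec.fit_transform(docs).toarray()
q = vec.transform(["document retrieval"]).toarray()[0]

relevant, nonrelevant = [0, 2], [1]   # user's relevance judgements
a, b, c = 1.0, 0.75, 0.15             # common default weights
q_new = (a * q
         + b * D[relevant].mean(axis=0)
         - c * D[nonrelevant].mean(axis=0))
q_new = np.clip(q_new, 0, None)       # keep term weights non-negative

# Top reweighted query terms after feedback:
terms = vec.get_feature_names_out()
print(sorted(zip(terms, q_new), key=lambda t: -t[1])[:5])
```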

    Automatic detection of clusters and switches in Turkish semantic verbal fluency data

    Verbal fluency tests are popular measures of executive function. These tests involve listing as many words from a given category as possible in a short time, typically 60 seconds. In phonemic verbal fluency tests, the words should begin with the same letter; in semantic verbal fluency (SVF) tests, they should belong to the same category, e.g., animals. SVF is quick to administer, amenable to semi-automated analysis, and can be used to screen for cognitive impairments such as dementia. Troyer and collaborators proposed a fine-grained analysis method for SVF sequences that divides them into clusters, i.e., subsequences of more closely semantically related words. Useful metrics that can be derived from such an analysis include mean cluster size and the number of switches between clusters. The aim of this thesis is to develop semi-automated methods to extract cluster- and switch-related metrics from Turkish SVF sequences.

    First, we conducted a systematic review of studies reporting the SVF performance of healthy adult native Turkish speakers, using international and Turkish databases including unpublished theses. We focused particularly on normative data and commonly used methods for collecting and analysing SVF data. All included papers reported SVF sequences using the animal category, followed by first names. Considering the size of the Turkish diaspora, there was a lack of studies comparing monolingual to bilingual speakers. Detailed analyses beyond word count, such as perseverations, category violations, and clustering/switching, were only rarely reported, and semi-automatic or automatic approaches were almost never used. The thesis therefore fills a clear gap in the literature.

    For our work on Turkish, we chose two computational approaches that can be easily adapted to languages with comparatively few corpus resources: a simple bigram method and a vector-space model (word2vec). We initially implemented and tested these methods on a Spanish dataset comprising 50 healthy participants and 14 participants diagnosed with familial AD. Both computational models positioned switches very similarly to the manual annotations, achieving F1=0.756 for the bigram model and F1=0.8309 for word2vec. There was no difference in cluster sizes (p>0.01), but healthy participants produced significantly more switches (p<0.001); these findings hold for both the manual and the automatic analysis.

    Since there are no public datasets of Turkish SVF data, we collected SVF data online from native speakers of Turkish with no self-reported cognitive impairments, living both in Turkey and abroad. To the best of our knowledge, this is the first online spoken SVF corpus for Turkish. The study used the three categories most frequently used in Turkish SVF data that have also been reported for other languages, namely animals, fruits and vegetables, and supermarket items. The study had two parts: an initial Qualtrics survey for screening and collecting relevant participant information, and a web-based app for collecting three SVF sequences. 286 participants consented to take part in the survey, and 137 (47.9%) continued on to the SVF app. In total, we collected 311 SVF sequences (animals=105, vegetables and fruits=105, supermarket items=101) from 137 adults. The mean number of items produced per category was 25.04 (SD=X) for animals, 25.32 (SD=Y) for fruits and vegetables, and 25.97 (SD=Z) for supermarket items. Overall, the quality of the recorded sequences was good. The reasons for the drop-off between the survey and the SVF data collection need to be investigated in further work.

    Finally, we adapted the computational techniques used for Spanish to the Turkish SVF data and assessed their ability to replicate clustering- and switching-based metrics. Both bigram and word2vec performed satisfactorily: there was no significant difference in cluster sizes, and switch numbers were highly correlated (p<0.001). In predicting switch position, word2vec reached F1=0.738 and the bigram model F1=0.66. Next, we examined whether findings obtained from manual annotation of clusters and switches could be replicated using metrics derived from the two computational methods. Specifically, we investigated cluster size and switch numbers between male and female participants (sex) and between mono- and multilingual participants (multilinguality). Based on the manual analysis, male participants created larger clusters than female participants but used a similar number of switches, and there were no significant differences between monolingual and multilingual participants; both findings are in line with the existing literature on Turkish SVF. While bigram and word2vec yielded similar results regarding the number of switches, only word2vec-derived metrics replicated the difference in cluster size between male and female participants. In future work, other computational approaches, such as large language models, should be explored; automatic speech recognition should be integrated to eliminate the need for manual transcription; and additional speech-based features could be investigated. Finally, user-experience research may help improve online data collection and reduce the number of participants who drop out of the study before speech data collection.
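
    A minimal sketch can make the word2vec-based switch detection concrete: a switch is marked whenever the cosine similarity between consecutive responses drops below a threshold. The tiny corpus and the threshold value here are placeholders, not the thesis's trained model or tuned cut-off.

```python
# Sketch of word2vec-based switch detection: mark a switch whenever
# the cosine similarity of consecutive responses falls below a
# threshold. The tiny corpus and threshold are placeholders.
from gensim.models import Word2Vec

corpus = [["kedi", "kopek", "aslan", "elma", "armut"],
          ["kaplan", "kedi", "muz", "elma"]]
w2v = Word2Vec(corpus, vector_size=50, min_count=1, seed=1)

def find_switches(sequence, model, threshold=0.3):
    """Return the indices at which a new cluster begins."""
    switches = []
    for i in range(1, len(sequence)):
        if model.wv.similarity(sequence[i - 1], sequence[i]) < threshold:
            switches.append(i)
    return switches

seq = ["kedi", "kopek", "elma", "armut"]  # animals, then fruits
print(find_switches(seq, w2v))
```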

    Minimally-supervised Methods for Arabic Named Entity Recognition

    Named Entity Recognition (NER) has attracted much attention over the past twenty years as a core task of Information Extraction. The currently dominant techniques for NER are supervised methods that can achieve high performance but require new manually annotated data for every change of domain and/or genre. Our work focuses on approaches that make it possible to tackle new domains with minimal human intervention when identifying Named Entities (NEs) in Arabic text. Specifically, we investigate two minimally-supervised methods: semi-supervised learning and distant learning. Our semi-supervised algorithm for identifying NEs requires neither annotated training data nor gazetteers; it needs only, for each NE type, a seed list of a few instances to initiate the learning process. Novel aspects of our algorithm include (i) a new way to produce and generalise the extraction patterns, (ii) a new filtering criterion to remove noisy patterns, and (iii) a comparison of two ranking measures for determining the most reliable candidate NEs. Next, we present our methodology for exploiting Wikipedia's structure to automatically develop an Arabic NE-annotated corpus. A novel mechanism, based on the high coverage of Wikipedia, is introduced to address two challenges particular to tagging NEs in Arabic text: rich morphology and the absence of capitalisation. Neither technique has yet achieved performance levels comparable to those of supervised methods: semi-supervised algorithms tend to have high precision but comparatively low recall, whereas distant learning tends to achieve higher recall but lower precision. We therefore present a novel approach to Arabic NER using a combination of semi-supervised and distant learning techniques. We used a variety of classifier combination schemes, including the Bayesian Classifier Combination (BCC) procedure recently proposed for sentiment analysis. According to our results, the BCC model yields an increase in performance of 8 percentage points over the best minimally-supervised classifier.
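
    The actual algorithm is considerably more elaborate, but a toy sketch of the seed-based bootstrapping idea, with simplified context patterns and no pattern scoring or filtering, conveys the mechanism; all tokens and names below are fabricated examples.

```python
# Toy sketch of seed-based bootstrapping for NER: learn simple
# (left, right) context patterns around seed entities, then harvest
# new candidates that occur in the same contexts. All tokens are
# fabricated; the real algorithm generalises and filters patterns.
from collections import Counter

def contexts(tokens, targets):
    """Collect (left, right) context patterns around known NEs."""
    pats = Counter()
    for i in range(1, len(tokens) - 1):
        if tokens[i] in targets:
            pats[(tokens[i - 1], tokens[i + 1])] += 1
    return pats

def harvest(tokens, patterns):
    """Propose candidate NEs that match a learned pattern."""
    return {tokens[i] for i in range(1, len(tokens) - 1)
            if (tokens[i - 1], tokens[i + 1]) in patterns}

tokens = "said Ahmed that , said Layla that , met Ahmed in".split()
seeds = {"Ahmed"}
print(harvest(tokens, contexts(tokens, seeds)))  # picks up 'Layla'
```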

    Extraction of Arabic word roots: An Approach Based on Computational Model and Multi-Backpropagation Neural Networks

    Stemming is the process of extracting the root of a given word by stripping off the affixes attached to it. Many attempts have been made to address the problem of stemming Arabic words. The majority of existing Arabic stemming algorithms require a complete set of morphological rules and large vocabulary lookup tables, and many of them yield more than one potential stem or root for a given Arabic word. According to Ahmad [11], the Arabic stemming process based on the language's morphological rules remains a very difficult task due to the nature of the language itself. The limitations of current Arabic stemming methods motivated this research, in which we investigate a novel approach to extracting Arabic word roots, named here MUAIDI-STEMMER (Muaidi is the author's father's name). This approach exploits numerical relations between Arabic letters, avoids listing the root and pattern of each word in the language, and gives a single root solution. The approach is composed of two phases. Phase I depends on basic calculations derived from a linguistic analysis of Arabic patterns and affixes. Phase II is based on an artificial neural network trained with the backpropagation learning rule; in this phase, we formulate root extraction as a classification problem, with the neural network as the classifier. This study demonstrates that a neural network can be effectively used to extract Arabic word roots. The developed stemmer was tested on 46,895 Arabic word types (i.e., distinct or unique words). Error-counting accuracy evaluation was employed to assess the stemmer's performance: it successfully produced the stems of 44,107 Arabic words from the given test datasets, an accuracy of 94.81%.
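
    Phase II's formulation can be illustrated with a toy backpropagation-trained classifier; the letter encoding, features, and labels below are fabricated placeholders standing in for the paper's actual representation of Arabic letters and patterns.

```python
# Toy illustration of Phase II's formulation: root extraction cast
# as classification, solved by a backpropagation-trained network.
# The per-position letter codes and class labels are fabricated
# placeholders for the paper's actual encoding of Arabic words.
import numpy as np
from sklearn.neural_network import MLPClassifier

# Rows: padded words as per-position letter codes (placeholder).
X = np.array([[3, 5, 7, 0],
              [3, 5, 7, 2],
              [4, 1, 9, 0],
              [4, 1, 9, 6]])
# Labels: class id identifying the tri-literal root (placeholder).
y = np.array([0, 0, 1, 1])

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0)  # trained via backpropagation
clf.fit(X, y)
print(clf.predict([[3, 5, 7, 1]]))  # expected: class 0
```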