1 research outputs found

    Compression versus Machine Learning for Classifying Modern Arabic Code-Switching in Social Media and Classical Arabic Hadith

    Get PDF
    This thesis aims to enrich Arabic resources by building several Arabic corpora and making them freely available to the Arabic research community. Therefore, the Bangor Arabic–English codeswitching (BAEC) corpus, the Saudi Dialect Corpus (SDC) and the Egyptian Dialect Corpus (EDC) and the Non-Authentic Hadith (NAH) corpus were built. This thesis carries out the detection of code-switching in Arabic varieties and dialects from social media platforms to evaluate the prediction by partial matching (PPM) compression approach, comparing it with a the support vector machine (SVM) classifier with character-based and wordbased approaches. The aim was to test the PPM compression on modern standard Arabic (MSA) and Arabic dialect before using it on Hadith.To the best of our knowledge, no previous study involving the detection of code-switching between Arabic and English using PPM compression has been published before. The experimental results show that PPM compression achieved a higher accuracy rate than the SVM classifier when the training corpus correctly represented the language or dialect being studied. Then, classifying experiments on Arabic Hadith to evaluate the PPM compression approach and compare it against machine learning and deep learning approaches was also performed. The aim was to classify Arabic Hadith into two main classification tasks: Hadith components classification and Hadith authenticity classification. For the former, the experimental results show that deep learning classifiers can achieve a higher classification accuracy than the other classifiers under study. However, the execution time for deep learning classifiers was high. For the latter, the experimental results showed that Isnad was the part of a Hadith resulting in the most effective automatic determination of authenticity. In addition, the results proved that Matan can be used to judge Hadiths with up to 85% accuracy. These experiments were novel in their approaches to Hadith authenticity classification because they investigated the use of the ii character-based text compression scheme PPM and DL classifiers. Finally, the current thesis also investigated the automatic segmentation of Arabic Hadith using PPM compression. The experiments showed that PPM was effective in segmenting Hadith into its two main components, having been tested on different Hadith corpora that have different structures. The main innovation in these experiments was their use of a character-based text compression method to segment the Hadiths
    corecore