10 research outputs found

    Corpora for sentiment analysis of Arabic text in social media

    Different Natural Language Processing (NLP) applications, such as text categorization and machine translation, need annotated corpora to check quality and performance. Similarly, sentiment analysis requires annotated corpora to test the performance of classifiers. Manual annotation performed by native speakers is used as a benchmark to measure how accurate a classifier is. In this paper we summarise currently available Arabic corpora and describe work in progress to build, annotate, and use Arabic corpora consisting of Facebook (FB) posts. The distinctive nature of these corpora is that they are based on posts written in Dialectal Arabic (DA), which follows no specific grammatical or spelling standards. The corpora are annotated with five labels (positive, negative, dual, neutral, and spam). In addition to building the corpus, the paper illustrates how manual tagging can be used to extract opinionated words and phrases for use in a lexicon-based classifier.
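    A lexicon-based classifier over the five labels described might be sketched as follows. The tiny word lists here are invented for illustration; the paper's actual lexicon is extracted from the manually tagged Facebook posts, and spam detection is omitted.

```python
# Minimal sketch of a lexicon-based sentiment classifier for short posts.
# POSITIVE/NEGATIVE are illustrative stand-ins for a real sentiment lexicon.

POSITIVE = {"excellent", "love", "great"}
NEGATIVE = {"bad", "hate", "awful"}

def classify(post: str) -> str:
    """Label a post as positive, negative, dual, or neutral."""
    tokens = post.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos and neg:
        return "dual"      # both polarities present in the same post
    if pos:
        return "positive"
    if neg:
        return "negative"
    return "neutral"
```

    A post matching lexemes of both polarities, e.g. "love it but bad battery", receives the dual label.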

    Hybrid model of post-processing techniques for Arabic optical character recognition

    Optical character recognition (OCR) is used to extract text contained in an image. One of the stages in OCR is post-processing, which corrects errors in the OCR output text. The OCR multiple-outputs approach consists of three processes: differentiation, alignment, and voting. Existing differentiation techniques suffer from the loss of important features because they use N versions of the input images. Alignment techniques in the literature, meanwhile, are based on approximation, and the voting process is not context-aware. These drawbacks lead to a high error rate in OCR. This research proposed three improved techniques for differentiation, alignment, and voting to overcome the identified drawbacks. These techniques were then combined into a hybrid model that can recognize optical characters in the Arabic language. Each of the proposed techniques was evaluated separately against three relevant existing techniques. The performance measurements used in this study were Word Error Rate (WER), Character Error Rate (CER), and Non-word Error Rate (NWER). Experimental results showed a relative decrease in error rate on all measurements for the evaluated techniques. The hybrid model likewise achieved lower WER, CER, and NWER, by 30.35%, 52.42%, and 47.86% respectively, when compared to the three relevant existing models. This study contributes to the OCR domain in that the proposed hybrid model of post-processing techniques could facilitate the automatic recognition of Arabic text, and hence better information retrieval.
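    The voting stage of a multiple-outputs pipeline can be sketched as character-level majority voting over pre-aligned outputs. This is a plain (non-context-aware) baseline for illustration, not the thesis's improved technique, and it assumes the alignment step has already padded the outputs to equal length.

```python
from collections import Counter

def vote(aligned_outputs):
    """Character-level majority voting over pre-aligned OCR outputs.
    Assumes alignment has padded all outputs to equal length, using '-'
    for gaps; ties fall back to the first-encountered character."""
    result = []
    for chars in zip(*aligned_outputs):
        winner, _ = Counter(chars).most_common(1)[0]
        if winner != "-":          # drop gap symbols from the final text
            result.append(winner)
    return "".join(result)
```

    For example, with three aligned outputs "kitab", "kitah", "kitab", the per-position majority recovers "kitab".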

    A prototype system for handwritten sub-word recognition: Toward Arabic-manuscript transliteration

    A prototype system is developed for the transliteration of diacritics-less Arabic manuscripts at the sub-word, or part-of-Arabic-word (PAW), level. The system is able to read sub-words of the input manuscript using a set of skeleton-based features. A variant of the system is also developed that reads archigraphemic Arabic manuscripts, which are dot-less, into archigrapheme transliterations. To reduce the complexity of the original, highly multiclass problem of sub-word recognition, it is redefined as a set of binary descriptor classifiers. The outputs of the trained binary classifiers are combined to generate the sequence of sub-word letters. SVMs are used to learn the binary classifiers. Two specific Arabic databases have been developed to train and test the system, one of them a database of the Naskh style. The initial results are promising. The system could be trained on other scripts found in Arabic manuscripts. Comment: 8 pages, 7 figures, 6 tables.
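    The idea of decomposing a multiclass letter problem into binary descriptors can be illustrated as follows. The descriptor names, codebook, and letter assignments are all invented for illustration; in the actual system the binary decisions are learned by SVMs over skeleton-based features.

```python
# Sketch: recast multiclass letter recognition as binary descriptor
# classifiers whose combined yes/no answers index a letter codebook.

DESCRIPTORS = ["has_loop", "has_vertical_stroke"]  # hypothetical descriptors

CODEBOOK = {          # bit pattern -> letter; purely illustrative mapping
    (1, 0): "ح",
    (0, 1): "ا",
    (1, 1): "ط",
    (0, 0): "ر",
}

def decode(features: dict) -> str:
    """Combine binary descriptor decisions into a single letter."""
    bits = tuple(int(bool(features.get(d))) for d in DESCRIPTORS)
    return CODEBOOK[bits]
```

    With k binary descriptors, up to 2^k classes can be separated, which is why the decomposition reduces the complexity of the original multiclass problem.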

    Offline printed Arabic character recognition

    Optical Character Recognition (OCR) shows great potential for rapid data entry, but has had limited success when applied to the Arabic language. The usual OCR problems are compounded by the right-to-left nature of Arabic and because the script is largely connected. This research investigates current approaches to the Arabic character recognition problem and innovates a new approach. The main work involves a Haar-Cascade Classifier (HCC) approach modified, for the first time, for Arabic character recognition. This technique eliminates problematic steps in the pre-processing and recognition phases, in addition to the character segmentation stage. A classifier was produced for each of the 61 Arabic glyphs that exist after the removal of diacritical marks. These 61 classifiers were trained and tested on an average of about 2,000 images each. A Multi-Modal Arabic Corpus (MMAC) has also been developed to support this work. MMAC makes innovative use of the new concept of connected segments of Arabic words (PAWs), with and without diacritical marks. These new tokens have significance for linguistic as well as OCR research and applications, and have been applied here in the post-processing phase. A complete Arabic OCR application has been developed to manipulate the scanned images and extract a list of detected words. It consists of the HCC to extract glyphs, systems for parsing and correcting these glyphs, and the MMAC to apply linguistic constraints. The HCC produces a recognition rate for Arabic glyphs of 87%. MMAC is based on 6 million words, is published on the web, and has been applied and validated in both research and commercial use.
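    A corpus-driven post-processing step of the kind MMAC enables might look like the following sketch, which snaps a recognised glyph sequence to the closest token in a word list. Python's difflib stands in for a real similarity measure, and the three-word vocabulary and cutoff are illustrative assumptions (MMAC itself covers 6 million words).

```python
import difflib

# Illustrative vocabulary standing in for a corpus-derived word list.
VOCAB = ["كتاب", "كاتب", "مكتبة"]

def correct(word: str, cutoff: float = 0.6) -> str:
    """Return the closest vocabulary entry, or the word unchanged if no
    candidate is similar enough."""
    matches = difflib.get_close_matches(word, VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else word
```

    A recognised sequence already in the vocabulary passes through untouched; a near-miss such as a one-glyph error is replaced by its closest listed neighbour.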

    Arabic Word Learning in Novice L1 English Speakers: Multi-modal Approaches and the Impact of Letter Training in the Target Language Script

    This thesis explores early Arabic word learning by beginner-level native English speakers who have no prior exposure to the language. While Arabic is considered a difficult language for English speakers to learn, very few studies focus on Arabic as a Foreign Language (AFL) vocabulary acquisition in beginners, despite vocabulary’s central role in language learning. The present research encompasses two separate word-learning studies which employed multi-modal learning tasks in a language-lab setting to show that novice adult learners can acquire Arabic vocabulary given minimal exposure to target-language input accompanied by images and audio. In the first study, response-time and accuracy data were used to explain performance on a word-learning task and probe difficulty drivers in the target-language word set. Findings suggest that the number of letters and syllables a word contains can explain response time, while the number of Arabic-only phonemes it contains can have a significant impact on accuracy. The second study provided a subset of participants with an Arabic letter-training session and utilized written word forms in a modified version of the target-language script. Results showed a significant advantage for the letter-training group across all measures of learning. Findings support the use of letter training at the introductory level and suggest novice learners can make use of the Arabic script to support form-meaning mapping in early vocabulary study. Results complement existing work on multi-modal learning paradigms and are discussed in the context of AFL research. They may be used to support future study design and inform stimulus selection for vocabulary researchers choosing to work with Arabic, and generally serve to advance our understanding of effective approaches to Teaching Arabic as a Foreign Language (TAFL) to native English speakers.

    Sentiment analysis and resources for informal Arabic text on social media

    Online content posted by Arab users on social networks does not generally abide by grammatical and spelling rules. These posts, or comments, are valuable because they contain users’ opinions towards different objects such as products, policies, institutions, and people. These opinions constitute important material for commercial and governmental institutions. Commercial institutions can use these opinions to steer marketing campaigns, optimize their products, and learn the weaknesses and/or strengths of their products. Governmental institutions can benefit from social network posts to detect public opinion before or after legislating a new policy or law and to learn about the main issues that concern citizens. However, the huge size of online data and its noisy nature can hinder the manual extraction and classification of opinions present in online comments. Given the irregularity of dialectal Arabic (or informal Arabic), tools developed for formally correct Arabic are of limited use. This is specifically the case in sentiment analysis (SA) where the target of the analysis is social media content. This research implemented a system that addresses this challenge. The work can be roughly divided into three blocks: building a corpus for SA and manually tagging it to check the performance of the constructed lexicon-based (LB) classifier; building a sentiment lexicon that consists of three different sets of patterns (negative, positive, and spam); and finally implementing a classifier that employs the lexicon to classify Facebook comments. In addition to providing resources for dialectal Arabic SA and classifying Facebook comments, this work categorises reasons behind incorrect classification, provides preliminary solutions for some of them with a focus on negation, and uses regular expressions to detect the presence of lexemes. This work also illustrates how the constructed classifier works, along with its different levels of reporting.
    Moreover, it compares the performance of the LB classifier against a Naïve Bayes classifier and addresses how NLP tools such as POS tagging and Named Entity Recognition can be employed in SA. In addition, the work studies the performance of the implemented LB classifier and the developed sentiment lexicon when used to classify other corpora from the literature, and the performance of lexicons from the literature when used to classify the corpora constructed in this research. With minor changes, the classifier can be used for domain classification of documents (sports, science, news, etc.). The work ends with a discussion of research questions arising from the research reported.
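    Regex-based lexeme detection with simple negation handling, as described above, can be sketched as follows. The negation particles shown are common in dialectal Arabic, but the single positive pattern and the flip-to-negative rule are naive illustrative stand-ins for the thesis's lexicon and negation treatment.

```python
import re

NEGATION = r"(?:ما|لا|مش)"   # common dialectal negation particles
POSITIVE_LEXEME = r"حلو"      # hypothetical positive lexeme pattern

def polarity(comment: str) -> str:
    """Naive sketch: a positive lexeme preceded by a negation particle
    flips the comment's polarity to negative."""
    if re.search(NEGATION + r"\s+" + POSITIVE_LEXEME, comment):
        return "negative"
    if re.search(POSITIVE_LEXEME, comment):
        return "positive"
    return "neutral"
```

    So "الجو حلو" is classified positive, while the negated "الجو مش حلو" is classified negative.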

    A framework for interactive end-user web automation

    This research investigates the feasibility and usefulness of a Web-based model for end-user Web automation. The aim is to empower end users to automate their Web interactions. Web automation is defined here as the study of theoretical and practical techniques for applying an end-user programming model to enable the automation of Web tasks, activities, or interactions. To date, few tools address the issue of Web automation; moreover, their functionality and usage are limited. A novel model is presented, which combines end-user programming techniques and the software-tools philosophy with the vision of the “Web as a platform.” The model provides a Web-based environment that enables the rapid creation of efficient and useful Web-oriented automation tools. It consists of a command line for the Web, a shell scripting language, and a repository of Web commands. A framework called Web2Sh (Web 2.0 Shell) has been implemented, which includes the design and implementation of a scripting language (WSh) enabling end users to create and customise Web commands. A number of Web2Sh-core Web commands were implemented. There are two techniques for extending the system: developers can implement new core Web commands, and end users can use WSh to connect, customise, and parameterise Web commands to create new ones. The feasibility and usefulness of the proposed model have been demonstrated by implementing several automation scripts using Web2Sh, and by a case-study-based experiment carried out by volunteer participants. The implemented Web2Sh framework provides a novel and realistic environment for creating, customising, and running Web-oriented automation tools.
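    The command-line-for-the-Web idea can be illustrated, in spirit only, as commands composed into a pipeline, UNIX-style. WSh's actual syntax is not given in the abstract, so the commands below are hypothetical Python stand-ins that merely mirror the pipeline concept of connecting Web commands.

```python
# Sketch: compose "web commands" so each command's output feeds the next,
# mimicking a shell pipeline. The commands themselves are hypothetical.

def pipeline(*commands):
    """Chain commands left to right, like cmd1 | cmd2 in a shell."""
    def run(data):
        for cmd in commands:
            data = cmd(data)
        return data
    return run

# Hypothetical web commands operating on a list of URLs:
basenames = lambda urls: [u.rsplit("/", 1)[-1] for u in urls]
keep_html = lambda names: [n for n in names if n.endswith(".html")]

job = pipeline(basenames, keep_html)
```

    Running job(["http://a/x.html", "http://a/y.txt"]) yields ["x.html"], in the way a shell user would filter a listing through a pipe.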
