10 research outputs found

    Corpora for sentiment analysis of Arabic text in social media

    Different Natural Language Processing (NLP) applications, such as text categorization and machine translation, need annotated corpora to check quality and performance. Similarly, sentiment analysis requires annotated corpora to test the performance of classifiers. Manual annotation performed by native speakers is used as a benchmark to measure how accurate a classifier is. In this paper we summarise currently available Arabic corpora and describe work in progress to build, annotate, and use Arabic corpora consisting of Facebook (FB) posts. The distinctive nature of these corpora is that they are based on posts written in Dialectal Arabic (DA), which follows no specific grammatical or spelling standards. The corpora are annotated with five labels (positive, negative, dual, neutral, and spam). In addition to building the corpus, the paper illustrates how manual tagging can be used to extract opinionated words and phrases for use in a lexicon-based classifier.
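    A lexicon-based classifier over the five labels described might be sketched as follows. The tiny word lists here are invented for illustration; the paper's actual lexicon is extracted from the manually tagged Facebook posts, and spam detection is omitted.

```python
# Minimal sketch of a lexicon-based sentiment classifier for short posts.
# POSITIVE/NEGATIVE are illustrative stand-ins for a real sentiment lexicon.

POSITIVE = {"excellent", "love", "great"}
NEGATIVE = {"bad", "hate", "awful"}

def classify(post: str) -> str:
    """Label a post as positive, negative, dual, or neutral."""
    tokens = post.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos and neg:
        return "dual"      # both polarities present in the same post
    if pos:
        return "positive"
    if neg:
        return "negative"
    return "neutral"
```

    A post matching lexemes of both polarities, e.g. "love it but bad battery", receives the dual label.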

    Hybrid model of post-processing techniques for Arabic optical character recognition

    Optical character recognition (OCR) is used to extract text contained in an image. One of the stages in OCR is post-processing, which corrects errors in the OCR output text. The OCR multiple-outputs approach consists of three processes: differentiation, alignment, and voting. Existing differentiation techniques suffer from the loss of important features because they use N versions of the input images. Alignment techniques in the literature, meanwhile, are based on approximation, and the voting process is not context-aware. These drawbacks lead to a high error rate in OCR. This research proposed three improved techniques for differentiation, alignment, and voting to overcome the identified drawbacks. These techniques were then combined into a hybrid model that can recognize optical characters in the Arabic language. Each of the proposed techniques was evaluated separately against three relevant existing techniques. The performance measurements used in this study were Word Error Rate (WER), Character Error Rate (CER), and Non-word Error Rate (NWER). Experimental results showed a relative decrease in error rate on all measurements for the evaluated techniques. The hybrid model likewise achieved lower WER, CER, and NWER, by 30.35%, 52.42%, and 47.86% respectively, when compared to the three relevant existing models. This study contributes to the OCR domain in that the proposed hybrid model of post-processing techniques could facilitate the automatic recognition of Arabic text, and hence better information retrieval.
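    The voting stage of a multiple-outputs pipeline can be sketched as character-level majority voting over pre-aligned outputs. This is a plain (non-context-aware) baseline for illustration, not the thesis's improved technique, and it assumes the alignment step has already padded the outputs to equal length.

```python
from collections import Counter

def vote(aligned_outputs):
    """Character-level majority voting over pre-aligned OCR outputs.
    Assumes alignment has padded all outputs to equal length, using '-'
    for gaps; ties fall back to the first-encountered character."""
    result = []
    for chars in zip(*aligned_outputs):
        winner, _ = Counter(chars).most_common(1)[0]
        if winner != "-":          # drop gap symbols from the final text
            result.append(winner)
    return "".join(result)
```

    For example, with three aligned outputs "kitab", "kitah", "kitab", the per-position majority recovers "kitab".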

    A prototype system for handwritten sub-word recognition: Toward Arabic-manuscript transliteration

    A prototype system is developed for the transliteration of diacritics-less Arabic manuscripts at the sub-word, or part-of-Arabic-word (PAW), level. The system is able to read sub-words of the input manuscript using a set of skeleton-based features. A variant of the system is also developed that reads archigraphemic Arabic manuscripts, which are dot-less, into archigrapheme transliterations. To reduce the complexity of the original, highly multiclass problem of sub-word recognition, it is redefined as a set of binary descriptor classifiers. The outputs of the trained binary classifiers are combined to generate the sequence of sub-word letters. SVMs are used to learn the binary classifiers. Two specific Arabic databases have been developed to train and test the system, one of them a database of the Naskh style. The initial results are promising. The system could be trained on other scripts found in Arabic manuscripts. Comment: 8 pages, 7 figures, 6 tables.
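    The idea of decomposing a multiclass letter problem into binary descriptors can be illustrated as follows. The descriptor names, codebook, and letter assignments are all invented for illustration; in the actual system the binary decisions are learned by SVMs over skeleton-based features.

```python
# Sketch: recast multiclass letter recognition as binary descriptor
# classifiers whose combined yes/no answers index a letter codebook.

DESCRIPTORS = ["has_loop", "has_vertical_stroke"]  # hypothetical descriptors

CODEBOOK = {          # bit pattern -> letter; purely illustrative mapping
    (1, 0): "ح",
    (0, 1): "ا",
    (1, 1): "ط",
    (0, 0): "ر",
}

def decode(features: dict) -> str:
    """Combine binary descriptor decisions into a single letter."""
    bits = tuple(int(bool(features.get(d))) for d in DESCRIPTORS)
    return CODEBOOK[bits]
```

    With k binary descriptors, up to 2^k classes can be separated, which is why the decomposition reduces the complexity of the original multiclass problem.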

    Offline printed Arabic character recognition

    Optical Character Recognition (OCR) shows great potential for rapid data entry, but has had limited success when applied to the Arabic language. The usual OCR problems are compounded by the right-to-left nature of Arabic and because the script is largely connected. This research investigates current approaches to the Arabic character recognition problem and innovates a new approach. The main work involves a Haar-Cascade Classifier (HCC) approach modified, for the first time, for Arabic character recognition. This technique eliminates problematic steps in the pre-processing and recognition phases, in addition to the character segmentation stage. A classifier was produced for each of the 61 Arabic glyphs that exist after the removal of diacritical marks. These 61 classifiers were trained and tested on an average of about 2,000 images each. A Multi-Modal Arabic Corpus (MMAC) has also been developed to support this work. MMAC makes innovative use of the new concept of connected segments of Arabic words (PAWs), with and without diacritical marks. These new tokens have significance for linguistic as well as OCR research and applications, and have been applied here in the post-processing phase. A complete Arabic OCR application has been developed to manipulate the scanned images and extract a list of detected words. It consists of the HCC to extract glyphs, systems for parsing and correcting these glyphs, and the MMAC to apply linguistic constraints. The HCC produces a recognition rate for Arabic glyphs of 87%. MMAC is based on 6 million words, is published on the web, and has been applied and validated in both research and commercial use.
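    A corpus-driven post-processing step of the kind MMAC enables might look like the following sketch, which snaps a recognised glyph sequence to the closest token in a word list. Python's difflib stands in for a real similarity measure, and the three-word vocabulary and cutoff are illustrative assumptions (MMAC itself covers 6 million words).

```python
import difflib

# Illustrative vocabulary standing in for a corpus-derived word list.
VOCAB = ["كتاب", "كاتب", "مكتبة"]

def correct(word: str, cutoff: float = 0.6) -> str:
    """Return the closest vocabulary entry, or the word unchanged if no
    candidate is similar enough."""
    matches = difflib.get_close_matches(word, VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else word
```

    A recognised sequence already in the vocabulary passes through untouched; a near-miss such as a one-glyph error is replaced by its closest listed neighbour.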

    Arabic Word Learning in Novice L1 English Speakers: Multi-modal Approaches and the Impact of Letter Training in the Target Language Script

    This thesis explores early Arabic word learning by beginner-level native English speakers who have no prior exposure to the language. While Arabic is considered a difficult language for English speakers to learn, very few studies focus on Arabic as a Foreign Language (AFL) vocabulary acquisition in beginners, despite vocabulary’s central role in language learning. The present research encompasses two separate word-learning studies which employed multi-modal learning tasks in a language-lab setting to show that novice adult learners can acquire Arabic vocabulary given minimal exposure to target-language input accompanied by images and audio. In the first study, response-time and accuracy data were used to explain performance on a word-learning task and probe difficulty drivers in the target-language word set. Findings suggest that the number of letters and syllables a word contains can explain response time, while the number of Arabic-only phonemes it contains can have a significant impact on accuracy. The second study provided a subset of participants with an Arabic letter-training session and utilized written word forms in a modified version of the target-language script. Results showed a significant advantage for the letter-training group across all measures of learning. Findings support the use of letter training at the introductory level and suggest novice learners can make use of the Arabic script to support form-meaning mapping in early vocabulary study. Results complement existing work on multi-modal learning paradigms and are discussed in the context of AFL research. They may be used to support future study design and inform stimulus selection for vocabulary researchers choosing to work with Arabic, and generally serve to advance our understanding of effective approaches to Teaching Arabic as a Foreign Language (TAFL) to native English speakers.

    Sentiment analysis and resources for informal Arabic text on social media

    Online content posted by Arab users on social networks does not generally abide by grammatical and spelling rules. These posts, or comments, are valuable because they contain users’ opinions towards different objects such as products, policies, institutions, and people. These opinions constitute important material for commercial and governmental institutions. Commercial institutions can use these opinions to steer marketing campaigns, optimize their products, and learn the weaknesses and/or strengths of their products. Governmental institutions can benefit from social network posts to detect public opinion before or after legislating a new policy or law and to learn about the main issues that concern citizens. However, the huge size of online data and its noisy nature can hinder the manual extraction and classification of opinions present in online comments. Given the irregularity of dialectal Arabic (or informal Arabic), tools developed for formally correct Arabic are of limited use. This is specifically the case in sentiment analysis (SA) where the target of the analysis is social media content. This research implemented a system that addresses this challenge. The work can be roughly divided into three blocks: building a corpus for SA and manually tagging it to check the performance of the constructed lexicon-based (LB) classifier; building a sentiment lexicon that consists of three different sets of patterns (negative, positive, and spam); and finally implementing a classifier that employs the lexicon to classify Facebook comments. In addition to providing resources for dialectal Arabic SA and classifying Facebook comments, this work categorises reasons behind incorrect classification, provides preliminary solutions for some of them with a focus on negation, and uses regular expressions to detect the presence of lexemes. This work also illustrates how the constructed classifier works, along with its different levels of reporting.
    Moreover, it compares the performance of the LB classifier against a Naïve Bayes classifier and addresses how NLP tools such as POS tagging and Named Entity Recognition can be employed in SA. In addition, the work studies the performance of the implemented LB classifier and the developed sentiment lexicon when used to classify other corpora from the literature, and the performance of lexicons from the literature when used to classify the corpora constructed in this research. With minor changes, the classifier can be used for domain classification of documents (sports, science, news, etc.). The work ends with a discussion of research questions arising from the research reported.
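    Regex-based lexeme detection with simple negation handling, as described above, can be sketched as follows. The negation particles shown are common in dialectal Arabic, but the single positive pattern and the flip-to-negative rule are naive illustrative stand-ins for the thesis's lexicon and negation treatment.

```python
import re

NEGATION = r"(?:ما|لا|مش)"   # common dialectal negation particles
POSITIVE_LEXEME = r"حلو"      # hypothetical positive lexeme pattern

def polarity(comment: str) -> str:
    """Naive sketch: a positive lexeme preceded by a negation particle
    flips the comment's polarity to negative."""
    if re.search(NEGATION + r"\s+" + POSITIVE_LEXEME, comment):
        return "negative"
    if re.search(POSITIVE_LEXEME, comment):
        return "positive"
    return "neutral"
```

    So "الجو حلو" is classified positive, while the negated "الجو مش حلو" is classified negative.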

    A framework for interactive end-user web automation

    This research investigates the feasibility and usefulness of a Web-based model for end-user Web automation. The aim is to empower end users to automate their Web interactions. Web automation is defined here as the study of theoretical and practical techniques for applying an end-user programming model to enable the automation of Web tasks, activities, or interactions. To date, few tools address the issue of Web automation; moreover, their functionality and usage are limited. A novel model is presented, which combines end-user programming techniques and the software-tools philosophy with the vision of the “Web as a platform.” The model provides a Web-based environment that enables the rapid creation of efficient and useful Web-oriented automation tools. It consists of a command line for the Web, a shell scripting language, and a repository of Web commands. A framework called Web2Sh (Web 2.0 Shell) has been implemented, which includes the design and implementation of a scripting language (WSh) enabling end users to create and customise Web commands. A number of Web2Sh-core Web commands were implemented. There are two techniques for extending the system: developers can implement new core Web commands, and end users can use WSh to connect, customise, and parameterise Web commands to create new ones. The feasibility and usefulness of the proposed model have been demonstrated by implementing several automation scripts using Web2Sh, and by a case-study-based experiment carried out by volunteer participants. The implemented Web2Sh framework provides a novel and realistic environment for creating, customising, and running Web-oriented automation tools.
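    The command-line-for-the-Web idea can be illustrated, in spirit only, as commands composed into a pipeline, UNIX-style. WSh's actual syntax is not given in the abstract, so the commands below are hypothetical Python stand-ins that merely mirror the pipeline concept of connecting Web commands.

```python
# Sketch: compose "web commands" so each command's output feeds the next,
# mimicking a shell pipeline. The commands themselves are hypothetical.

def pipeline(*commands):
    """Chain commands left to right, like cmd1 | cmd2 in a shell."""
    def run(data):
        for cmd in commands:
            data = cmd(data)
        return data
    return run

# Hypothetical web commands operating on a list of URLs:
basenames = lambda urls: [u.rsplit("/", 1)[-1] for u in urls]
keep_html = lambda names: [n for n in names if n.endswith(".html")]

job = pipeline(basenames, keep_html)
```

    Running job(["http://a/x.html", "http://a/y.txt"]) yields ["x.html"], in the way a shell user would filter a listing through a pipe.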
