
    A Computational Lexicon and Representational Model for Arabic Multiword Expressions

    The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these areas has emphasised the primary role of MWEs in the process of analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high-precision outputs. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language resource building, and lexical representation. This research aims to remedy this deficiency by extending knowledge of AMWEs and making noteworthy contributions to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of an adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, for the tasks of AMWE extraction and discovery, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, this thesis presents a representational system for AMWEs that consists of multilayer encoding of extensive linguistic descriptions. This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in standard Arabic.
The implications of this research are related to the vital role of the AMWE lexicon, as a new lexical resource, in the improvement of various ANLP tasks and the potential opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena.
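The data-driven side of such a hybrid extraction pipeline can be illustrated with a minimal association-measure sketch. Pointwise mutual information (PMI) is one standard statistic for ranking MWE candidates; the function and toy corpus below are illustrative, not the thesis's actual method, which combines statistical scoring with knowledge-based filters.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    A high PMI means the pair co-occurs more often than chance predicts,
    a common first filter for multiword-expression candidates.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:       # ignore rare, unreliable pairs
            continue
        p_pair = count / (n - 1)
        p1, p2 = unigrams[w1] / n, unigrams[w2] / n
        scores[(w1, w2)] = math.log2(p_pair / (p1 * p2))
    return scores

# toy corpus in which "hot dog" recurs as a unit
corpus = ("the hot dog stand sold a hot dog and a cold drink "
          "while the dog slept").split()
ranked = sorted(pmi_bigrams(corpus).items(), key=lambda kv: -kv[1])
print(ranked[0][0])   # the strongest candidate pair
```

In a real pipeline the same scoring would run over a large corpus and feed a knowledge-based filtering stage.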

    The Semantic Prosody of Natural Phenomena in the Qur’an: A Corpus-Based Study

    This thesis explores the Semantic Prosody (SP) of natural phenomena in the Qur’an and five of its prominent English translations [Pickthall (1930), Yusuf Ali (1939/revised edition 1987), Arberry (1957), Saheeh International (1997), and Abdel Haleem (2004)]. SP, scarcely explored in Qur’anic research, is defined as ‘a form of meaning established through the proximity of a consistent series of collocates’ (Louw 2000, p.50). Theoretically, it is both an evaluative prosody (i.e., lexical items collocating with semantic word classes that are positive, negative, or neutral) and a discourse prosody (i.e., having a communicative purpose). Given the stylistic uniqueness of the Qur’an and considering that SP can be examined empirically via corpora, the present study explores the SP of 154 words associated with nature referenced throughout the Qur’an using Corpus Linguistics techniques. Firstly, the Python-based Natural Language Toolkit was used to define nature terms via WordNet; to disambiguate their variant forms with stemmers; and to compute their frequencies. Once frequencies were found, Evert’s (2008) five-step statistical analysis was applied to the 30 most frequent terms to investigate their collocations and SPs. Following this, a qualitative analysis was conducted using the Extended Lexical Unit, via concordance analysis of collocations, and Lexical-Functional Grammar, to trace the variation of meanings produced by lexico-grammatical patterns. Finally, the resulting datasets were aligned to evaluate their congruency with the Qur’an. Findings of this research confirm that words referring to nature in the Qur’an do have semantic prosody. For example, astronomical bodies are primed to occur in predominantly positive collocations referring to glorifying God, while weather phenomena occur predominantly in negative ones referring to Day of Judgment calamities.
In addition, results show that Abdel Haleem’s translation can be considered the most congruent. This research develops an approach to explore themes (e.g., nature) via SP analysis in texts and their translations and provides several linguistic resources that can be used for future corpus-based studies on the language of the Qur’an.
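The first, frequency-gathering step of that pipeline can be sketched in plain Python. Here a naive suffix stripper stands in for the NLTK stemmers and a hand-listed term set stands in for the WordNet lookups; both are hypothetical simplifications of the study's actual tooling.

```python
import re
from collections import Counter

def naive_stem(word):
    """Crude suffix stripper standing in for an NLTK stemmer
    (illustrative only; the thesis uses NLTK's stemmers)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_frequencies(text, nature_terms):
    """Count occurrences of each nature term, conflating variant forms
    (e.g. 'stars' and 'star') under a shared stem."""
    tokens = re.findall(r"[a-z']+", text.lower())
    stem_counts = Counter(naive_stem(t) for t in tokens)
    return {term: stem_counts[naive_stem(term)] for term in nature_terms}

sample = "The stars and the star shine; winds and wind rage."
print(term_frequencies(sample, ["star", "wind"]))
```

The real analysis would run this over the full Qur'an corpus and its translations before the collocation statistics are computed.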

    Arabic Rule-Based Named Entity Recognition Systems Progress and Challenges

    Rule-based approaches use human-crafted rules to extract Named Entities (NEs) and, alongside machine learning, remain one of the most widely used ways to extract NEs. Named Entity Recognition (NER) is the task of identifying personal names, locations, organisations, and many other entity types. For the Arabic language, Big Data challenges have driven Arabic NER to develop rapidly and to extract useful information from texts. The current paper sheds light on research progress in rule-based Arabic NER via a diagnostic comparison of linguistic resources, entity types, domains, and performance. We also highlight the challenges of processing Arabic NEs through rule-based systems. Good NER performance is expected to benefit other modern fields such as semantic web search, question answering, machine translation, information retrieval, and abstracting systems.
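As a minimal illustration of the rule-based idea (not any surveyed system's actual grammar), trigger words such as honorifics or organisational suffixes can anchor extraction rules; Arabic systems key on analogous cues, e.g. الدكتور ('Dr.') before person names. The rules and sentence below are invented for the sketch.

```python
import re

# Toy rule set: each rule pairs an entity label with a trigger-anchored
# pattern, mirroring how rule-based NER grammars are typically written.
RULES = [
    ("PERSON", re.compile(r"\b(?:Dr|Mr|Mrs)\.\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)")),
    ("ORG", re.compile(r"\b([A-Z][a-z]+(?:\s[A-Z][a-z]+)*\s(?:University|Bank))")),
]

def extract_entities(text):
    """Apply every rule to the text and collect (label, entity) matches."""
    found = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            found.append((label, match.group(1)))
    return found

print(extract_entities("Dr. Ahmed Ali joined Cairo University last year."))
```

Real systems layer many such rules with gazetteers and morphological analysis, which is where the Arabic-specific challenges discussed in the paper arise.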

    A Machine Learning Approach For Opinion Holder Extraction In Arabic Language

    Opinion mining aims at extracting useful subjective information from large amounts of text. Opinion holder recognition is a task that has not yet been considered for the Arabic language. This task essentially requires a deep understanding of clause structures. Unfortunately, the lack of a robust, publicly available Arabic parser further complicates the research. This paper presents pioneering research on opinion holder extraction in Arabic news, independent of any lexical parsers. We investigate constructing a comprehensive feature set to compensate for the lack of parsing structural outcomes. The proposed feature set is adapted from previous work on English, coupled with our proposed semantic field and named entity features. Our feature analysis is based on Conditional Random Fields (CRF) and semi-supervised pattern recognition techniques. Different research models are evaluated via cross-validation experiments, achieving an F-measure of 54.03. We publicly release our own research outcome corpus and lexicon to the opinion mining community to encourage further research.
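The flavour of such a parser-free feature set can be sketched as token-level feature dictionaries of the kind fed to a CRF. The particular features below, a reporting-verb cue and a named-entity flag, are illustrative guesses at the paper's described feature families, not its exact set.

```python
def token_features(tokens, i, named_entities, report_verbs):
    """Build a CRF-style feature dict for token i without any parse tree.

    named_entities and report_verbs are lookup sets standing in for the
    paper's NE and semantic-field resources (hypothetical simplification).
    """
    w = tokens[i]
    return {
        "word": w.lower(),
        "is_title": w.istitle(),                # surface-shape feature
        "is_ne": w in named_entities,           # named-entity feature
        "prev_is_report_verb": i > 0 and tokens[i - 1].lower() in report_verbs,
    }

sentence = "Yesterday , said Ahmed that prices rose".split()
feats = token_features(sentence, 3, named_entities={"Ahmed"}, report_verbs={"said"})
print(feats["is_ne"], feats["prev_is_report_verb"])
```

A CRF trained on sequences of such dictionaries can then label each token as holder or non-holder, which is the structure the cross-validation experiments evaluate.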

    The role of terminology and local grammar in video annotation

    The linguistic annotation of video sequences is an intellectually challenging task involving the investigation of how images and words are linked together, a task that is ultimately financially rewarding in that the eventual automatic retrieval of video (sequences) can be much less time-consuming, subjective and expensive than manual retrieval. Much effort has been focused on automatic or semi-automatic annotation. Computational linguistic methods of video annotation rely on collections of collateral text in the form of keywords and proper nouns. Keywords are often used in a particular order, forming an identifiable pattern, which is often limited and can subsequently be used to annotate the portion of a video where such a pattern occurred. Once the relevant keywords and patterns have been stored, they can then be used to annotate the remainder of the video, excluding all collateral text which does not match the keywords or patterns. A new method of video annotation is presented in this thesis. The method facilitates a) the extraction of specialist terms within a corpus of collateral text; b) the identification of frequently used linguistic patterns marking recurring key events within the data-set. The use of the method has led to the development of a system that can automatically assign key words and key patterns to a number of frames, drawn from the commentary text approximately contemporaneous with the selected frames. The system does not perform video analysis; it only analyses the collateral text. The method is based on corpus linguistics and is mainly frequency based: frequency of occurrence of a key word or key pattern is taken as the basis of its representation. No assumptions are made about the grammatical structure of the language used in the collateral text, nor is a predefined lexicon of key words required.
Our system has been designed to annotate videos of football matches in English and Arabic, and also cricket videos in English. The system has also been designed to retrieve annotated clips. The system not only provides a simple search method for annotated clip retrieval, it also provides complex, more advanced search methods.
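A toy version of that frequency-based pairing of commentary text with frame spans might look like the sketch below; the function names, stopword list, and commentary data are all invented for illustration.

```python
from collections import Counter

STOPWORDS = frozenset({"the", "a", "to", "and", "it", "is"})

def key_terms(commentary, top_n=3):
    """Pick the most frequent content words from collateral commentary;
    frequency of occurrence stands in for a term's representativeness."""
    tokens = [w for w in commentary.lower().split() if w not in STOPWORDS]
    return [w for w, _ in Counter(tokens).most_common(top_n)]

def annotate(frames):
    """Assign each frame span the key terms of the commentary text
    approximately contemporaneous with it (no video analysis involved)."""
    return {span: key_terms(text) for span, text in frames.items()}

frames = {
    (0, 120): "corner kick and the corner is cleared",
    (121, 240): "goal goal a brilliant goal",
}
annotations = annotate(frames)
print(annotations[(121, 240)][0])
```

Retrieval then reduces to matching a query against these per-span key terms, which is why no grammar or predefined lexicon is needed.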

    The Israeli-Palestinian Conflict in American, Arab, and British Media: Corpus-Based Critical Discourse Analysis

    The Israeli-Palestinian conflict is one of the longest and most violent conflicts in modern history. The language used to represent this important conflict in the media is frequently commented on by scholars and political commentators (e.g., Ackerman, 2001; Fisk, 2001; Mearsheimer & Walt, 2007). To date, however, few studies in the field of applied linguistics have attempted a thorough investigation of the language used to represent the conflict in influential media outlets using systematic methods of linguistic analysis. The current study aims to partially bridge this gap by combining methods and analytical frameworks from Critical Discourse Analysis (CDA) and Corpus Linguistics (CL) to analyze the discursive representation of the Israeli-Palestinian conflict in American, Arab, and British media, represented by CNN, Al-Jazeera Arabic, and the BBC respectively. CDA, which is primarily interested in studying how power and ideology are enacted and resisted in the use of language in social and political contexts, has been frequently criticized mainly for the arbitrary selection of a small number of texts or text fragments to be analyzed. In order to strengthen CDA analysis, Stubbs (1997) suggested that CDA analysts should utilize techniques from CL, which employs computational approaches to perform quantitative and qualitative analysis of actual patterns of use occurring in a large and principled collection of natural texts. In this study, the corpus-based keyword technique is initially used to identify the topics that tend to be emphasized, downplayed, and/or left out in the coverage of the Israeli-Palestinian conflict in three corpora compiled from the news websites of Al-Jazeera, CNN, and the BBC.
Topics such as terrorism, occupation, settlements, and the recent Israeli disengagement plan, which were found to be key in the coverage of the conflict, are further studied in context using several other corpus tools, especially the concordancer and the collocation finder. The analysis reveals some of the strategies employed by each news website to control the positive or negative representations of the different actors involved in the conflict. The corpus findings are interpreted using informative CDA frameworks, especially Van Dijk’s (1998) ideological square framework.
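The keyword technique rests on a keyness statistic comparing a word's frequency in one corpus against a reference corpus; Dunning's log-likelihood is the measure most corpus tools use. The counts below are made up for illustration.

```python
import math

def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Dunning log-likelihood keyness of a word in corpus A vs reference B.

    A high score means the word's frequency in A departs strongly from
    what the pooled corpora would predict, marking it as a 'keyword'.
    """
    expected_a = total_a * (freq_a + freq_b) / (total_a + total_b)
    expected_b = total_b * (freq_a + freq_b) / (total_a + total_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:        # 0 * log(0) is taken as 0
            ll += observed * math.log(observed / expected)
    return 2 * ll

# hypothetical counts: a word appearing 150 times per million words in
# corpus A but only 20 times per million in the reference corpus B
score = log_likelihood(150, 1_000_000, 20, 1_000_000)
print(round(score, 2))
```

Running this over every word and ranking by score yields the keyword lists that the study then reads through concordances and collocations.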

    A Named Entity Recognition System Applied to Arabic Text in the Medical Domain

    Currently, 30-35% of the global population uses the Internet. Furthermore, there is a rapidly increasing number of non-English-language internet users, accompanied by a likewise increasing amount of unstructured text online. One area replete with underexploited online text is the Arabic medical domain, and one method that can be used to extract valuable data from Arabic medical texts is Named Entity Recognition (NER). NER is the process by which a system can automatically detect and categorise Named Entities (NEs). NER has numerous applications in many domains, and medical texts are no exception. NER applied to the medical domain could assist in detecting patterns in medical records, allowing doctors to make better diagnosis and treatment decisions, enabling medical staff to quickly assess a patient's records, and ensuring that patients are informed about their data, to name just a few examples. However, all these applications would require a very high level of accuracy. To improve the accuracy of NER in this domain, new approaches need to be developed that are tailored to the types of named entities to be extracted and categorised. In an effort to solve this problem, this research applied Bayesian Belief Networks (BBN) to the process. BBN, a probabilistic model for prediction of random variables and their dependencies, can be used to detect and predict entities. The aim of this research is to apply BBN to the NER task to extract relevant medical entities such as disease names, symptoms, treatment methods, and diagnosis methods from modern Arabic texts in the medical domain. To achieve this aim, a new corpus related to the medical domain has been built and annotated. Our BBN approach achieved a 96.60% precision, 90.79% recall, and 93.60% F-measure for the disease entity, while for the treatment method entity, it achieved 69.33%, 70.99%, and 70.15% for precision, recall, and F-measure, respectively.
For the diagnosis method and symptom categories, our system achieved precision of 84.91% and 71.34%, recall of 53.36% and 49.34%, and F-measure of 65.53% and 58.33%, respectively. Our BBN strategy achieved good accuracy for NEs in the categories of disease and treatment method. However, the average word length of the other two NE categories, diagnosis method and symptom, may have had a negative effect on their accuracy. Overall, the application of BBN to Arabic medical NER is successful, but more development is needed to raise accuracy to a standard at which the results can be applied to real medical systems.
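The reported figures combine precision and recall via the standard F-measure (their harmonic mean); the one-liner below reproduces the disease-entity score quoted above.

```python
def f_measure(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# the disease-entity precision and recall reported above
f1 = f_measure(0.9660, 0.9079)
print(round(100 * f1, 2))  # ~93.60, matching the reported F-measure
```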

    Acquisition of lexical collocations : a corpus-assisted contrastive analysis and translation approach

    Research from the past 20 years has indicated that much of natural language consists of formulaic sequences or chunks. It has been suggested that learning vocabulary as discrete items does not necessarily help L2 learners become successful communicators or fluent and accurate language users. Collocations, i.e. words that usually go together, as one form of formulaic sequence, constitute an inherent problem for ESL/EFL learners. Researchers have argued that non-congruent collocations, i.e. collocations that do not have corresponding L1 equivalents, are especially difficult for ESL/EFL learners to acquire. This study examines the effect of three Focus-on-Forms instructional approaches on the passive and active acquisition of non-congruent collocations: 1) the non-corpus-assisted contrastive analysis and translation (CAT) approach, 2) the corpus-assisted CAT approach, and 3) the corpus-assisted non-CAT approach. To fully assess the proposed combined condition (i.e. the corpus-assisted CAT) and its learning outcomes, a control group under no condition was included for a baseline comparison. Thirty collocations non-congruent with the learners’ L1 (Arabic) were chosen for this study. A total of 129 undergraduate EFL learners at a Saudi university participated in the study. The participants were assigned to the three experimental groups and to the control group following a cluster random sampling method. The corpus-assisted CAT group performed (L1/L2 and L2/L1) translation tasks with the help of bilingual English/Arabic corpus data. The non-corpus CAT group was assigned text-based translation tasks and received contrastive analysis of the target collocations and their L1 translation options from the teacher. The non-contrastive group performed multiple-choice/gap-filling tasks with the help of monolingual corpus data, focusing on the target items.
Immediately after the intervention stage, the three groups were tested on retention of the target collocations by two tests: active recall and passive recall. The same tests were administered to the participants three weeks later. The corpus-assisted CAT group significantly outperformed the other two groups on all the tests. These results were discussed in light of the ‘noticing’, ‘task-induced involvement load’, and ‘pushed output’ hypotheses and the influence that L1 exerts on the acquisition of L2 vocabulary. The discussion includes an evaluation of the three instructional conditions in relation to different determinants, dimensions and functions within the hypotheses. Funded by the Saudi Ministry of Higher Education and King Saud University.

    Corpus and sentiment analysis
