
    An empirical study of Arabic formulaic sequence extraction methods

    This paper implements what is referred to as the collocation of Arabic keywords approach for extracting formulaic sequences (FSs) in the form of high-frequency but semantically regular formulas that are not restricted to any syntactic construction or semantic domain. The study applies several distributional semantic models in order to automatically extract relevant FSs related to Arabic keywords. The data sets used in this experiment are drawn from a newly developed corpus-based Arabic wordlist of 5,189 lexical items representing a variety of Modern Standard Arabic (MSA) genres and regions. The wordlist is based on overlapping frequency, established through a comprehensive comparison of four large Arabic corpora with a total size of over 8 billion running words. Empirical n-best precision evaluation methods are used to determine the best association measures (AMs) for extracting high-frequency and meaningful FSs. The gold standard reference FS list was developed in previous studies and manually evaluated against well-established quantitative and qualitative criteria. The results demonstrate that the MI.log_f AM achieved the highest results in extracting significant FSs from the large MSA corpus, while the T-score association measure achieved the worst results
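
The abstract names two association measures and an n-best precision evaluation without giving formulas. The sketch below shows how such scores are commonly computed from co-occurrence counts. The formulas follow the widely cited definitions of MI, MI.log_f, and T-score and are an assumption here, not necessarily the paper's exact formulation. The function names and the toy counts are hypothetical.

```python
import math


def mi_score(f_xy, f_x, f_y, n):
    """Pointwise mutual information (base 2) of a word pair.

    f_xy: co-occurrence frequency, f_x / f_y: word frequencies, n: corpus size.
    """
    return math.log2(f_xy * n / (f_x * f_y))


def mi_log_f(f_xy, f_x, f_y, n):
    """MI weighted by the natural log of the co-occurrence frequency (assumed definition)."""
    return mi_score(f_xy, f_x, f_y, n) * math.log(f_xy + 1)


def t_score(f_xy, f_x, f_y, n):
    """T-score: observed minus expected co-occurrences, scaled by sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)


def n_best_precision(ranked_candidates, gold, n):
    """Share of the top-n ranked candidates that appear in the gold standard list."""
    top = ranked_candidates[:n]
    if not top:
        return 0.0
    return sum(1 for candidate in top if candidate in gold) / len(top)


if __name__ == "__main__":
    # Toy counts, invented purely for illustration.
    print(mi_score(10, 100, 100, 10_000))
    print(mi_log_f(10, 100, 100, 10_000))
    print(t_score(10, 100, 100, 10_000))
    print(n_best_precision(["a", "b", "c"], {"a", "c"}, 2))
```

Ranking candidates by each measure and comparing the n-best precision of the resulting lists is one way to obtain the kind of comparison the abstract reports.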

    A Computational Lexicon and Representational Model for Arabic Multiword Expressions

    The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these many areas has emphasised the primary role of MWEs in the process of analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high precision outputs. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language resource building, and lexical representations. This research aims to remedy this deficiency by extending knowledge of AMWEs and making noteworthy contributions to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of an adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, in the use of AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, this thesis presents a representative system for AMWEs which consists of multilayer encoding of extensive linguistic descriptions. This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in standard Arabic. 
The implications of this research are related to the vital role of the AMWE lexicon, as a new lexical resource, in the improvement of various ANLP tasks and the potential opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena

    Towards Comprehensive Computational Representations of Arabic Multiword Expressions

    A successful computational treatment of multiword expressions (MWEs) in natural languages leads to a robust NLP system which considers the long-standing problem of language ambiguity caused primarily by this complex linguistic phenomenon. The first step in addressing this challenge is building an extensive, reliable MWE language resource (LR) with comprehensive computational representations across all linguistic levels. This forms the cornerstone in understanding the heterogeneous linguistic behaviour of MWEs in their various manifestations. This paper presents a detailed framework for computational representations of Arabic MWEs (ArMWEs) across all linguistic levels based on the state-of-the-art Lexical Markup Framework (LMF) with the necessary modifications to suit the distinctive properties of Modern Standard Arabic (MSA). This work forms part of a larger project that aims to develop a comprehensive computational lexicon of ArMWEs for NLP and language pedagogy (LP), the JOMAL project

    Compilation of an Arabic Children’s Corpus

    Inspired by the Oxford Children's Corpus, we have developed a prototype corpus of Arabic texts written and/or selected for children. Our Arabic Children's Corpus of 2950 documents and nearly 2 million words has been collected manually from the web during a 3-month project. It is of high quality, and contains a range of different children's genres based on sources located, including classic tales from The Arabian Nights, and popular fictional characters such as Goha. We anticipate that the current and subsequent versions of our corpus will lead to interesting studies in text classification, language use, and ideology in children's texts

    Corpus approaches to issues in second language acquisition: three studies

    This dissertation demonstrates how advancements in corpus approaches to linguistic inquiry can be used to improve the methodological rigour, reliability, and general usefulness of findings in various areas of Second Language Acquisition (SLA) research. Although these studies primarily focus on improvements in areas where corpus approaches are already commonplace, this dissertation also demonstrates how corpus methods can be usefully applied to new areas. Through the use of these methods, the presented studies highlight issues learners face when attempting to gain proficiency in second language (L2) English. Study 1 investigated the usefulness of transitional probability as a way of improving the extraction of formulaic sequences (e.g., on the other hand) from large scale corpora. Since current methods of identification often lead to lists of overlapping structures that lack psycholinguistic validity and pedagogical usefulness (Liu, 2012; Nekrasova, 2009; Simpson-Vlach & Ellis, 2010), this study evaluated the effectiveness of a new statistical measure in this area, transitional probability, as a way of improving the psycholinguistic status of corpus derived formulaic sequences. Using a sequence completion task, results revealed that corpus derived formulaic sequences with higher transitional probabilities were more accurately completed by first language (L1) and L2 English users, leading to the conclusion that these sequences are more likely to be stored as prefabricated units. Study 2 used a corpus approach to investigate the relationship between L1 background and the lexical choices made by L2 English writers. Looking specifically at L2 English writers of L1 Arabic, Chinese, and French backgrounds, a corpus of 150 argumentative essays written as part of an English for Academic Purposes program at a large English-medium university in North America was used to identify production tendencies in the use of linking adverbials by each L1 group. 
Results revealed important L1 differences for the use of specific linking adverbials and broader functional categories. Study 3 investigated lexical dimensions of L2 English speech associated with differences in perceived linguistic ability as judged by naïve L1 English raters. Using a corpus of transcribed speech samples from 97 L2 English users across two tasks (194 speech samples), naïve L1 English raters evaluated each sample for perceived comprehensibility and nativeness. Variables associated with factors related to dimensions of lexical density, sophistication, and diversity were targeted for potential correlations with L1 rater judgements of each construct. Results indicated important linguistic measures significantly correlated with each construct as well as task-based differences
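
Study 1 relies on transitional probability, which the abstract does not define. A minimal sketch of the forward bigram version, P(w2 | w1) = count(w1 w2) / count(w1), is given below. The dissertation may use a different formulation (for example, one defined over longer sequences). The function name and the toy sentence are hypothetical.

```python
from collections import Counter


def forward_transitional_probabilities(tokens):
    """Return P(next word | current word) for every adjacent word pair.

    The denominator counts each word only in positions where it has a
    successor, so the probabilities for a given first word sum to 1.
    """
    first_word_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        pair: count / first_word_counts[pair[0]]
        for pair, count in bigram_counts.items()
    }


if __name__ == "__main__":
    # Toy token list, invented for illustration only.
    toy_tokens = "on the other hand on the table".split()
    for pair, probability in forward_transitional_probabilities(toy_tokens).items():
        print(pair, probability)
```

In this toy example "on" is always followed by "the", so that pair receives the maximum probability, while "the" splits evenly between "other" and "table".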

    Learning collocations through interaction: The effects of the quality and quantity of encounters

    This study examined how short-term and long-term retention of two types of collocations (verb-noun and adjective-noun) was affected by the learning context. The experimental research design of the study involved two major experiments. In Experiment 1 (EX1), 109 male Emirati college students were randomly assigned to an experimental group (task-based activities) or a control group (mainstream exercises). EX1 involved 20 verb-noun collocations and consisted of two sub-experiments. In Experiment 1a (EX1a), both the control and experimental groups were exposed to the 20 verb-noun collocations four times. To clarify the effects of the instructional context, a second sub-experiment (EX1b) was conducted in which the experimental group encountered the same collocations four times and the control group eight times. Experiment 2 (EX2) involved 108 male Emirati college students and targeted 20 adjective-noun collocations. Similarly, in Experiment 2a both the control and experimental groups encountered the adjective-noun collocations four times, whereas Experiment 2b offered the experimental and control groups four and eight collocation encounters, respectively. The treatment consisted of exposing participants in both experiments to the target collocations using two different teaching methods. The experimental groups used four task-based activities that presented collocations as whole units (Ellis’s 2003 chunking principle), while the control groups used mainstream textbook exercises to learn these sequences, breaking them down into their two constituents (verb + noun and adjective + noun). The experiment was carried out over a two-hour period during students’ regular English classes. The results showed that the experimental group learners in both EX1 and EX2, who used task-based activities to learn the collocations and were exposed to these sequences only four times as whole units, outscored their control group peers in all collocation measurements. 
Statistical analysis of participants’ test responses also showed that long-term receptive knowledge of the target verb-noun and adjective-noun collocations in both experiments was higher than productive knowledge for all experimental groups. This study fills a gap in the research about the importance of the quality versus the quantity of encounters in collocation learning and identifies an instructional method that is optimal for learning. The overall results suggest that task-based activities were superior to mainstream exercises and that the quality of encounters appears to be more important than the number of encounters in collocation learning; four highly interactive tasks presenting collocations as whole units, with only four encounters, could be more effective for retaining unknown collocations than mainstream exercises (e.g., matching and fill-in) that offered learners eight encounters with the collocations broken down into their constituents. The implications for teachers may be that task-based activities, exposing learners to collocations as whole units, should be part of their language instructional pedagogy if they want learners to retain collocations in their long-term memory. For material designers, a well-balanced course would be one that prioritises collocations as chunks through interactive task-based activities