
    MultiMWE: building a multi-lingual multi-word expression (MWE) parallel corpora

    Multi-word expressions (MWEs) are a hot topic in natural language processing (NLP) research, covering MWE detection, MWE decomposition, and the exploitation of MWEs in other NLP fields such as Machine Translation (MT). However, the availability of bilingual or multi-lingual MWE corpora is very limited. The only bilingual MWE corpus that we are aware of is from the PARSEME (PARSing and Multi-word Expressions) EU project. This is a small collection of only 871 pairs of English-German MWEs. In this paper, we present multi-lingual and bilingual MWE corpora that we have extracted from root parallel corpora. After filtering, our collections contain 3,159,226 and 143,042 bilingual MWE pairs for German-English and Chinese-English respectively. We examine the quality of these extracted bilingual MWEs in MT experiments. Our initial experiments applying MWEs in MT show improved translation performance on MWE terms in qualitative analysis and better general evaluation scores in quantitative analysis, on both German-English and Chinese-English language pairs. We follow a standard experimental pipeline to create our MultiMWE corpora, which are available online. Researchers can use these free corpora for their own models or use them in a knowledge base as model features.

    The lexico-phraseology of THE and A/AN in spoken English: a corpus-based study

    The English articles (THE, A, AN) are normally described in terms of the grammar of the language. This is only natural, since they are extremely frequent, fit into certain well-defined syntactic slots, and usually help to communicate only very broad aspects of textual meaning. However, as John Sinclair has pointed out (1999, pp.160-161), the articles are also found as components of many lexico-phraseological units, and in such cases a normal grammatical description may not be of relevance. An example he gives is the presence of A in the phrase 'come to a head', where ‘A has little more status than that of a letter of the alphabet’ (p.161). Sinclair also makes the observation that, ‘I do not know of an estimate of the proportion of instances of A, for example, that are not a realisation of the choice of article but of the realisation of part of a multi-word expression.’ (p.161). The present paper addresses the questions raised by Sinclair, and does so with reference to both the definite and the indefinite article. It focuses, in particular, on the spoken language, and presents the results of analyses of random samples of the articles in the spoken component of the British National Corpus (hereafter BNC-spkn). According to the data in Leech et al (2001, p.144), THE is the most frequent word in BNC-spkn and A is the sixth most frequent (a rank position which remains unaltered when the frequencies of A and AN are combined). Using the BNCweb interface, and specifying that the relevant word forms should be ‘articles’, the total numbers of tokens are: an 19,049; a 200,004; the 409,060. Since the numbers are very high, the samples investigated also contained a reasonably large number of tokens (500). The relative samples corresponded to the following proportions of tokens in BNC-spkn: an 2.62%, a 0.25%, the 0.12%. 
The latter two are very low percentages, and for this reason three separate samples of each were investigated, in order to see the extent to which the samples differed. Analysis of article usage was carried out in the first instance by reading right-sorted concordance lines. Whenever doubts arose, larger contexts were retrieved from the corpus. Various reference works were also consulted, including Berry (1993), Francis et al. (1998), and various corpus-based dictionaries and grammars. The data presented includes: a description of the various types of lexico-phraseological unit found; the proportions of the samples judged to involve the different lexico-phraseological phenomena identified; the problems encountered when deciding whether or not phraseology is an important factor in specific instances of article usage; and the number of tokens in each sample which were in some way irrelevant, for example because they involved speaker repetition of the article, or the non-completion of a noun phrase.
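
As a quick check of the sampling figures quoted in this abstract, the proportions can be recomputed from the stated token counts and the 500-token sample size. The sketch below is illustrative only: the names `token_counts`, `SAMPLE_SIZE` and `sample_share` are assumptions introduced here, not part of the study.

```python
# Token counts for each article in BNC-spkn, as quoted in the abstract.
token_counts = {"an": 19_049, "a": 200_004, "the": 409_060}
SAMPLE_SIZE = 500  # tokens per sample, as stated in the abstract


def sample_share(sample_size: int, total: int) -> float:
    """Return the sample size as a percentage of all tokens of that word."""
    return 100 * sample_size / total


# Percentage of all tokens covered by one 500-token sample, rounded to 2 places.
shares = {
    word: round(sample_share(SAMPLE_SIZE, total), 2)
    for word, total in token_counts.items()
}

for word, share in shares.items():
    print(f"{word}: {share:.2f}%")
```

The rounded values agree with the 2.62%, 0.25% and 0.12% given in the abstract.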

    A Computational Lexicon and Representational Model for Arabic Multiword Expressions

    The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these many areas has emphasised the primary role of MWEs in the process of analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high precision outputs. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language resource building, and lexical representations. This research aims to remedy this deficiency by extending knowledge of AMWEs and making noteworthy contributions to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of an adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, in the use of AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, this thesis presents a representative system for AMWEs which consists of multilayer encoding of extensive linguistic descriptions. This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in standard Arabic. 
The implications of this research relate to the vital role of the AMWE lexicon, as a new lexical resource, in improving various Arabic NLP (ANLP) tasks, and to the opportunities this lexicon offers linguists to analyse and explore AMWE phenomena.

    Multiword expressions at length and in depth

    The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point of reference for future work.

    Automatic Acquisition of Knowledge About Multiword Predicates

    PACLIC 19 / Taipei, Taiwan / December 1-3, 200

    Representation and parsing of multiword expressions

    This book consists of contributions related to the definition, representation and parsing of MWEs. These reflect current trends in the representation and processing of MWEs. They cover various categories of MWEs (such as verbal, adverbial and nominal MWEs), various linguistic frameworks (e.g. tree-based and unification-based grammars), various languages (including English, French, Modern Greek, Hebrew and Norwegian), and various applications (namely MWE detection, parsing and automatic translation), using both symbolic and statistical approaches.

    Current trends

    Deep parsing is the fundamental process that aims to represent the syntactic structure of phrases and sentences. In the traditional methodology, this process is based on lexicons and grammars that roughly represent the properties of words and the interactions of words and structures in sentences. Several linguistic frameworks, such as Head-driven Phrase Structure Grammar (HPSG), Lexical Functional Grammar (LFG), Tree Adjoining Grammar (TAG) and Combinatory Categorial Grammar (CCG), offer different structures and combining operations for building grammar rules. These already contain mechanisms for expressing properties of Multiword Expressions (MWEs), which, however, need improvement in how they account for the idiosyncrasies of MWEs on the one hand and their similarities to regular structures on the other. This collaborative book constitutes a survey of various attempts at representing and parsing MWEs in the context of linguistic theories and applications.

    Exploring figurative language recognition: a comprehensive study of human and machine approaches

    Final degree project (Treballs Finals de Grau) in Modern Languages and Literatures. Faculty of Philology. Universitat de Barcelona. Academic year: 2022-2023. Supervisor: Elisabet Comelles Pujadas. [eng] Figurative language (FL) plays a significant role in human communication. Understanding and interpreting FL is essential for humans to fully grasp the intended message, appreciate cultural nuances, and engage in effective interaction. For machines, comprehending FL presents a challenge due to its complexity and ambiguity. Enabling machines to understand FL has become increasingly important: sentiment analysis, text classification, and social media monitoring, for instance, benefit from accurately recognizing figurative expressions to capture subtle emotions and extract meaningful insights. Machine translation also requires the ability to accurately convey FL to ensure translations reflect the intended meaning and cultural nuances. Therefore, developing computational methods to enable machines to understand and interpret FL is crucial. By bridging the gap between human and machine understanding of FL, we can enhance communication, improve language-based applications, and unlock new possibilities in human-machine interactions. Keywords: figurative language, NLP, human-machine communication. [cat] Figurative language (FL) plays an important role in human communication. The ability to interpret FL is necessary to fully understand messages, appreciate cultural nuances, and interact effectively. However, computers have difficulty understanding FL because of its complexity and ambiguity. It is critical that computers be able to recognize FL, especially in areas such as sentiment analysis, text classification, and social media monitoring. Accurate recognition of FL makes it possible to capture emotions and extract semantic insights.
Machine translation also requires an accurate representation of FL to reflect the intended meaning and cultural nuances. It is therefore important to develop computational methods that help computers understand and interpret FL. Bridging human and machine understanding of FL can improve communication, advance language applications, and open new possibilities for human-machine interaction. Keywords: figurative language, natural language processing, human-machine interaction.