22 research outputs found

    A Matrix-Based Heuristic Algorithm for Extracting Multiword Expressions from a Corpus

    This paper describes an algorithm for automatically extracting multiword expressions (MWEs) from a corpus. The algorithm is node-based, i.e., it extracts MWEs that contain the item specified by the user, using a fixed window size around the node. The main idea is to detect the frequency anomalies that occur at the starting and ending points of an n-gram that constitutes an MWE. This is achieved by locally comparing matrices of observed frequencies to matrices of expected frequencies, and determining, for each individual input, one or more sub-sequences that have the highest probability of being an MWE. Top-performing sub-sequences are then combined in a score-aggregation and ranking stage, thus producing a single list of score-ranked MWE candidates, without having to indiscriminately generate all possible sub-sequences of the input strings. The knowledge-poor and computationally efficient algorithm attempts to solve certain recurring problems in MWE extraction, such as the inability to deal with MWEs of arbitrary length, the repetitive counting of nested n-grams, and excessive sensitivity to frequency. Evaluation results show that the best-performing version generates top-50 precision values between 0.71 and 0.88 on Turkish and English data, and performs better than the baseline method even at n = 1000.

    Multiword expressions at length and in depth

    The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point of reference for future work.

    A metáfora na adquisición de expresións idiomáticas. Un análise empírico nas variantes do inglés nativo e inglés como lingua estranxeira

    Bachelor's thesis (Traballo Fin de Grao) in English Language and Literature. Academic year 2018-2019. The teaching and learning of idioms have recently received attention in Cognitive Linguistics literature, most notably in Cognitive Semantics (cf. Lakoff & Johnson, 1980; Gibbs, 1993, 1994; Kövecses & Szabó, 1999, 2001; Boers, 2011; Ruiz de Mendoza & Pérez-Hernández, 2011). Broadly, cognitive accounts argue that the meaning of most idioms is systematically motivated by underlying conceptual mechanisms such as metaphor and metonymy. The goal of this BA thesis is to explore idioms from a metaphoric perspective and to show how they are acquired and used across L1 and L2 varieties of English, highlighting the importance of metaphoric competence (MC) in the processes of teaching and learning. Particular attention will be paid to metaphorical idioms, comparing traditional and cognitive semantic approaches and tackling such issues as the development of metaphor interpretation and production skills in L1, the relationship between metaphor interpretation and production skills in L1 and L2, as well as the MC contribution to the understanding and interpretation of idioms and the problems involved in L1-L2 transfer. In addition, an empirical analysis will be provided examining parallelisms and divergences in the use of metaphorical idioms across L1 and L2 varieties of English. Ultimately, this study is intended to shed light on the active metaphorical thinking process that takes place in L1 to facilitate both the learning and use of metaphor-based idioms in L

    Extended papers from the MWE 2017 workshop

    The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point of reference for future work.

    The automatic processing of multiword expressions in Irish

    It is well documented that Multiword Expressions (MWEs) pose a unique challenge to a variety of NLP tasks such as machine translation, parsing, information retrieval, and more. For low-resource languages such as Irish, these challenges can be exacerbated by the scarcity of data and a lack of research on this topic. In order to improve handling of MWEs in various NLP tasks for Irish, this thesis will address the lack of resources specifically targeting MWEs in Irish, and examine how these resources can be applied to said NLP tasks. We report on the creation and analysis of a number of lexical resources as part of this PhD research. Ilfhocail, a lexicon of Irish MWEs, is created through extracting MWEs from other lexical resources such as dictionaries. A corpus annotated with verbal MWEs in Irish is created for the inclusion of Irish in the PARSEME Shared Task 1.2. Additionally, MWEs were tagged in a bilingual EN-GA corpus for inclusion in experiments in machine translation. For the purposes of annotation, a categorisation scheme for nine categories of MWEs in Irish is created, based on combining linguistic analysis on these types of constructions and cross-lingual frameworks for defining MWEs. A case study in applying MWEs to NLP tasks is undertaken, with the exploration of incorporating MWE information while training Neural Machine Translation systems. Finally, the topic of automatic identification of Irish MWEs is explored, documenting the training of a system capable of automatically identifying Irish MWEs from a variety of categories, and the challenges associated with developing such a system. This research contributes towards a greater understanding of Irish MWEs and their applications in NLP, and provides a foundation for future work in exploring other methods for the automatic discovery and identification of Irish MWEs, and further developing the MWE resources described above.

    Explaining Russian-German code-mixing

    The study of grammatical variation in language mixing has been at the core of research into bilingual language practices. Although various motivations have been proposed in the literature to account for possible mixing patterns, some of them are either controversial or remain untested. Little is still known about whether and how frequency of use of linguistic elements can contribute to the patterning of bilingual talk. This book is the first to systematically explore the factor of usage frequency in a corpus of bilingual speech. The two aims are (i) to describe and analyze the variation in mixing patterns in the speech of Russia German adolescents and young adults in Germany, and (ii) to propose and test usage-based explanations of variation in mixing patterns in three morphosyntactic contexts: the adjective-modified noun phrase, the prepositional phrase, and the plural marking of German noun insertions in bilingual sentences. In these contexts, German noun insertions combine with either Russian or German words and grammatical markers, thus yielding mixed bilingual and German monolingual constituents in otherwise Russian sentences, the latter also labelled as embedded-language islands. The results suggest that the frequency with which words are used together mediates the distribution of mixing patterns in each of the examined contexts. The differing impacts of co-occurrence frequency are attributed to the distributional and semantic specifics of the analyzed morphosyntactic configurations. Lexical frequency has been found to be another important determinant in this variation. Other factors include recency, or lexical priming, in discourse in the case of prepositional phrases, and phonological and structural similarities and differences in the inflectional systems of the contact languages in the case of plural marking.

    English phrasal verbs: usage, knowledge, acquisition

    Formulaic language constitutes an essential part of English vocabulary and is necessary for performing a wide range of communicative functions, but knowledge and acquisition of formulaic sequences is typically found to be lacking and problematic for L2 learners. Whilst much research has been carried out on formulaic sequences such as idioms and collocations, comparatively little has been done on phrasal verbs which are nonetheless commonly perceived as one of the most challenging aspects of English vocabulary. This thesis attempts to contribute to filling this gap by exploring the usage, knowledge and acquisition of phrasal verbs by native and non-native speakers of English. Study 1 explores the semantic frequencies of the 150 most frequently used phrasal verbs using the Corpus of Contemporary American English (COCA). Results show that, whilst the vast majority of these phrasal verbs are polysemous, only two meaning senses on average are enough to cover three-quarters of the occurrences of each of them. The most frequent meaning senses of all 150 phrasal verbs are listed in the PHrasal VErb pedagogical List (PHaVE List), in frequency ranking order along with frequency percentages. The list thus offers teachers and learners the possibility of prioritising these most frequent, and thus most important, meaning senses, thereby allowing for a more systematic approach to tackling phrasal verbs. Study 2 explores L2 learners’ knowledge of a sample of phrasal verbs and meaning senses on the PHaVE List at a form-recall level of mastery, and the effect of a number of factors on this knowledge. Results show that only about 40% of meaning senses were known, with a 20% chance that all the various meaning senses attached to a given phrasal verb would be known. A mixed-effect modelling analysis reveals a significant effect of two factors on scores: item frequency and learner engagement in leisure activities in the L2 such as reading and social networking. This is consistent with previous research showing the robust effect of frequency for L2 knowledge of individual words and formulaic sequences, and the benefits of reading for L2 language acquisition. Study 3 investigates L2 learners’ acquisition of novel phrasal verbs through three intentional, word-focused learning activities: rote memorisation, textbook exercises, and guessing from context. Knowledge of the items was measured both immediately and one week after the teaching treatment at meaning-recall and meaning-recognition levels of mastery. Results show encouraging learning gains, similar to those found by previous research for individual words and idioms, with higher L2 proficiency and general vocabulary knowledge leading to significantly higher scores. A Friedman test reveals no significant difference in learning gains between the three activities. Taken together, these studies provide empirical evidence for the gap in L2 learners’ knowledge of phrasal verbs, but suggest that a restricted number of phrasal verbs and meaning senses can go a long way and be effectively learned using the same explicit activities commonly used for learning single words. Overall, they offer useful insights for learning and teaching English phrasal verbs in a more systematic and efficient manner.

    K + K = 120 : Papers dedicated to László Kálmán and András Kornai on the occasion of their 60th birthdays
