71 research outputs found

    Automatic identification and translation of multiword expressions

    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy. Multiword Expressions (MWEs) belong to a class of phraseological phenomena that is ubiquitous in the study of language. They are heterogeneous lexical items consisting of more than one word and feature lexical, syntactic, semantic and pragmatic idiosyncrasies. Scholarly research on MWEs benefits both natural language processing (NLP) applications and end users. This thesis involves designing new methodologies to identify and translate MWEs. In order to deal with MWE identification, we first develop datasets of annotated verb-noun MWEs in context. We then propose a method which employs word embeddings to disambiguate between literal and idiomatic usages of the verb-noun expressions. The existence of expression types with various idiomatic and literal distributions leads us to re-examine their modelling and evaluation. We propose a type-aware train and test splitting approach to prevent models from overfitting and avoid misleading evaluation results. Identification of MWEs in context can be modelled with sequence tagging methodologies. To this end, we devise a new neural network architecture, which is a combination of convolutional neural networks and long short-term memory networks with an optional conditional random field layer on top. We conduct extensive evaluations on several languages demonstrating a better performance compared to the state-of-the-art systems. Experiments show that the generalisation power of the model in predicting unseen MWEs is significantly better than previous systems. In order to find translations for verb-noun MWEs, we propose a bilingual distributional similarity approach derived from a word embedding model that supports arbitrary contexts. The technique is devised to extract translation equivalents from comparable corpora which are an alternative resource to costly parallel corpora. 
We finally conduct a series of experiments to investigate the effects of size and quality of comparable corpora on automatic extraction of translation equivalents

    A Bigger Fish to Fry:Scaling up the Automatic Understanding of Idiomatic Expressions

    In this thesis, we are concerned with idiomatic expressions and how to handle them within NLP. Idiomatic expressions are a type of multiword phrase whose meaning is not a direct combination of the meanings of its parts, e.g. 'at a crossroads' and 'move the goalposts'. In Part I, we provide a general introduction to idiomatic expressions and an overview of observations regarding idioms based on corpus data. In addition, we discuss existing research on idioms from an NLP perspective, providing an overview of existing tasks, approaches, and datasets. In Part II, we focus on the building of a large idiom corpus, which consists of developing a system for the automatic extraction of potentially idiomatic expressions (PIEs) and building a large corpus of idioms using crowdsourced annotation. Finally, in Part III, we improve an existing unsupervised classifier and compare it to other existing classifiers. Given the relatively poor performance of this unsupervised classifier, we also develop a supervised deep neural network-based system and find that a model involving two separate modules looking at different information sources yields the best performance, surpassing previous state-of-the-art approaches. In conclusion, this work shows the feasibility of building a large corpus of sense-annotated potentially idiomatic expressions, and the benefits such a corpus provides for further research. It provides the possibility for quick testing of hypotheses about the distribution and usage of idioms, it enables the training of data-hungry machine learning methods for PIE disambiguation systems, and it permits fine-grained, reliable evaluation of such systems

    Multiword expressions at length and in depth

    The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point of reference for future work

    Current trends

    Deep parsing is the fundamental process aiming at the representation of the syntactic structure of phrases and sentences. In the traditional methodology this process is based on lexicons and grammars that roughly represent the properties of words and the interactions of words and structures in sentences. Several linguistic frameworks, such as Head-driven Phrase Structure Grammar (HPSG), Lexical Functional Grammar (LFG), Tree Adjoining Grammar (TAG) and Combinatory Categorial Grammar (CCG), offer different structures and combining operations for building grammar rules. These already contain mechanisms for expressing properties of Multiword Expressions (MWEs), which, however, need improvement in how they account for idiosyncrasies of MWEs on the one hand and their similarities to regular structures on the other hand. This collaborative book constitutes a survey on various attempts at representing and parsing MWEs in the context of linguistic theories and applications

    Representation and parsing of multiword expressions

    This book consists of contributions related to the definition, representation and parsing of MWEs. These reflect current trends in the representation and processing of MWEs. They cover various categories of MWEs (such as verbal, adverbial and nominal MWEs), various linguistic frameworks (e.g. tree-based and unification-based grammars), various languages (including English, French, Modern Greek, Hebrew and Norwegian), and various applications (namely MWE detection, parsing and automatic translation), using both symbolic and statistical approaches

    Quantitative determinants of prefabs: A corpus-based, experimental study of multiword units in the lexicon

    In recent years many researchers have been rethinking the 'Words and Rules' model of syntax (Pinker 1999), instead arguing that language processing relies on a large number of preassembled multiword units, or 'prefabs' (Bolinger 1976). A usage-based perspective predicts that linguistic units, including prefabs, arise via repeated use, and prefabs should thus be associated with the frequency with which words co-occur (Langacker 1987). Indeed, in several recent experiments, corpus analysis is found to be associated with behavioral measures for multiword sequences (Kapatsinski and Radicke 2009, Ellis and Simpson-Vlach 2009). This dissertation supplements such findings with two new psycholinguistic investigations of prefabs. Study 1 revisits a dictation experiment by Schmitt et al. (2004), in which participants are asked to listen to stretches of speech and repeat the input verbatim, after performing a distractor task intended to encourage reliance on prefabs. I describe the results of an updated experiment which demonstrates that participants are less likely to interrupt or partially alter high-frequency multiword sequences. Although the original study by Schmitt et al. (2004) reported null findings, the revised methodology suggests that frequency indeed plays a role in the creation of prefabs. Study 2 investigates the distribution of affix positioning errors (he go aheads) which give evidence that some multiword sequences (e.g., go ahead) are retrieved from memory as a unit. As part of this study, I describe a novel methodology which elicits the errors of interest in an experimental setting. Errors evincing holistic retrieval are induced more often among multiword sequences that are high in Mutual Dependency, a corpus measure that weighs a sequence's frequency against the frequencies of its component words. 
Follow-up analyses indicate that sequence frequency is positively associated with affix errors, but only if component-word frequencies are included as variables in the model. In sum, the studies in this dissertation provide evidence that prefabricated, multiword units are associated with high frequency of a sequence, in addition to statistical measures that take component words' frequency into account. These findings provide further support for a usage-based model of the lexicon, in which linguistic units are both gradient and changeable with experience

    An analysis of the pragmatic functions of idiomatic expressions in the Egyptian novel ‘Taxi’

    The purpose of the study is to investigate idiomatic expressions and their pragmatic functions in the conversations of the novel ‘Taxi’ in the light of Speech Act Theory. The study adopts a qualitative linguistic analysis method of research. After analyzing the 58 episodes of the novel ‘Taxi’, the study reveals 80 idiomatic expressions fulfilling 13 pragmatic functions: describing (with six subcategories), complaining, stating, concluding, swearing, thanking, condoling, sympathizing, deploring, excusing, agreeing, opposing and advising. These pragmatic functions have been classified based on four of Searle’s speech acts: (1) representatives, (2) expressives, (3) commissives and (4) directives. Hence, the study shows that idiomatic expressions fulfill a considerable number of pragmatic functions, which in turn facilitate conversations among speakers, as the expressions are stored in memory and easily retrieved in diverse contexts. In addition, the study shows that negative pragmatic functions such as complaining, deploring and describing negative issues are used more often than positive ones in the conversations of taxi drivers. It has also been observed that these positive and negative functions shed light on a multitude of cultural aspects in Egyptian society. The study suggests a pedagogical implication: the finding that idiomatic expressions perform various pragmatic functions and reflect cultural aspects provides a rationale for including them in Arabic as a foreign language classes, whose main aim is to use the language appropriately and achieve cultural competence as well

    A Computational Lexicon and Representational Model for Arabic Multiword Expressions

    The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these many areas has emphasised the primary role of MWEs in the process of analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high precision outputs. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language resource building, and lexical representations. This research aims to remedy this deficiency by extending knowledge of AMWEs and making noteworthy contributions to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of an adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, in the use of AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, this thesis presents a representative system for AMWEs which consists of multilayer encoding of extensive linguistic descriptions. This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in standard Arabic. 
The implications of this research are related to the vital role of the AMWE lexicon, as a new lexical resource, in the improvement of various ANLP tasks and the potential opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena

    Extended papers from the MWE 2017 workshop

    The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point of reference for future work