202 research outputs found

    Multiword expressions at length and in depth

    The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point of reference for future work

    Learning multiword expressions from corpora and dictionaries

    The purpose of the present thesis is to examine Spanish as a foreign language (SFL) learners’ needs when it comes to enhancing their collocation competence and use, with a view to designing an online collocation learning tool aimed at learners of Spanish. Accordingly, the research presented here corresponds to the following three aims. Firstly, SFL learners’ collocation use is explored through a learner corpus study carried out using material from the CEDEL2 corpus. SFL learners’ collocation use is compared to that of native speakers, while learner collocation errors are also examined. Secondly, the thesis examines the design and functionalities of existing learning tools that can support collocation learning, such as collocation dictionaries and corpus-based tools. More specifically, it describes a usability experiment focusing on the interface of the Diccionario de colocaciones del español, as well as a study testing SFL learners’ ability to autonomously correct collocation errors with the help of concordance data obtained from a corpus. Thirdly, taking into account the findings of these studies, the design of an online collocation learning tool aimed at SFL learners is described

    Representation and Processing of Composition, Variation and Approximation in Language Resources and Tools

    In my habilitation dissertation, meant to validate my capacity and maturity for directing research activities, I present a panorama of several topics in computational linguistics, linguistics and computer science. Over the past decade, I was notably concerned with the phenomena of compositionality and variability of linguistic objects. I illustrate the advantages of a compositional approach to language in the domain of emotion detection, and I explain how some linguistic objects, most prominently multi-word expressions (MWEs), defy the compositionality principles. I demonstrate that the complex properties of MWEs, notably variability, are partially regular and partially idiosyncratic. This fact places MWEs on the frontiers between different levels of linguistic processing, such as lexicon and syntax. I show the highly heterogeneous nature of MWEs by citing their two existing taxonomies. After an extensive state-of-the-art study of MWE description and processing, I summarize Multiflex, a formalism and a tool for high-quality lexical morphosyntactic description of MWUs. It uses a graph-based approach in which the inflection of an MWU is expressed as a function of the morphology of its components and of morphosyntactic transformation patterns. Thanks to unification, the inflection paradigms are represented compactly. Orthographic, inflectional and syntactic variants are treated within the same framework. The proposal is multilingual: it has been tested on six European languages of three different origins (Germanic, Romance and Slavic), and I believe that many others can also be successfully covered. Multiflex proves interoperable. It adapts to different morphological language models, token boundary definitions, and underlying modules for the morphology of single words. It has been applied to the creation and enrichment of linguistic resources, as well as to morphosyntactic analysis and generation. It can be integrated into other NLP applications requiring the conflation of different surface realizations of the same concept. Another chapter of my activity concerns named entities, most of which are particular types of MWEs. Their rich semantic load turned them into a hot topic in the NLP community, which is documented in my state-of-the-art survey. I present the main assumptions, processes and results of large annotation tasks at two levels (for named entities and for coreference), carried out as part of the construction of the National Corpus of Polish. I have also contributed to the development of both rule-based and probabilistic named entity recognition tools, and to an automated enrichment of Prolexbase, a large multilingual database of proper names, from open sources. With respect to multi-word expressions, named entities and coreference mentions, I pay special attention to nested structures. This problem sheds new light on the treatment of complex linguistic units in NLP. When these units start being modeled as trees (or, more generally, as acyclic graphs) rather than as flat sequences of tokens, long-distance dependencies, discontinuities, overlapping and other frequent linguistic properties become easier to represent. This calls for more complex processing methods which control larger contexts than is usual in sequential processing. Thus, both named entity recognition and coreference resolution come very close to parsing, and named entities or mentions with their nested structures are analogous to multi-word expressions with embedded complements. My parallel activity concerns finite-state methods for natural language and XML processing. My main contribution in this field, co-authored with two colleagues, is the first full-fledged method for tree-to-language correction, and more precisely for correcting XML documents with respect to a DTD. We have also produced interesting results in incremental finite-state algorithmics, particularly relevant to data evolution contexts such as dynamic vocabularies or user updates. Multilingualism is the leitmotif of my research. I have applied my methods to several natural languages, most importantly to Polish, Serbian, English and French. I have been among the initiators of a highly multilingual European scientific network dedicated to parsing and multi-word expressions. I have used multilingual linguistic data in experimental studies. I believe that it is particularly worthwhile to design NLP solutions taking declension-rich (e.g. Slavic) languages into account, since this leads to more universal solutions, at least as far as nominal constructions (MWUs, NEs, mentions) are concerned. For instance, once Multiflex had been developed with Polish in mind, it could be applied as such to French, English, Serbian and Greek. Also, a French-Serbian collaboration led to substantial modifications in morphological modeling in Prolexbase in its early development stages. This allowed for its later application to Polish with very few adaptations of the existing model. Other researchers also stress the advantages of NLP studies on highly inflected languages, since their morphology encodes much more syntactic information than is the case, e.g., in English. In this dissertation I am also supposed to demonstrate my ability to play an active role in shaping the scientific landscape on a local, national and international scale. I describe my: (i) various scientific collaborations and supervision activities, (ii) roles in over 10 regional, national and international projects, (iii) responsibilities in collective bodies such as program and organizing committees of conferences and workshops, PhD juries, and the National University Council (CNU), (iv) activity as an evaluator and a reviewer of European collaborative projects. The issues addressed in this dissertation open interesting scientific perspectives, in which special emphasis is placed on links among various domains and communities. These perspectives include: (i) integrating fine-grained language data into the linked open data, (ii) deep parsing of multi-word expressions, (iii) modeling multi-word expression identification in a treebank as a tree-to-language correction problem, and (iv) a taxonomy and an experimental benchmark for tree-to-language correction approaches
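
The abstract's point about modeling units as trees rather than flat token sequences can be illustrated with a minimal sketch. Everything here is hypothetical and chosen only for illustration: the `Unit` class, its fields, the labels and the example sentence are not taken from the dissertation or from any of the tools it describes. Each unit stores the indices of its tokens (so gaps are allowed) and a list of nested units, which a single flat label per token could not express.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Unit:
    """A linguistic unit (MWE, named entity, mention); hypothetical structure."""
    label: str
    tokens: Tuple[int, ...]                 # sorted token indices; gaps are allowed
    children: List["Unit"] = field(default_factory=list)

    def is_discontinuous(self) -> bool:
        # A gap between two consecutive indices means other tokens intervene.
        return any(b - a > 1 for a, b in zip(self.tokens, self.tokens[1:]))

    def depth(self) -> int:
        # 1 for a unit without nested units, plus the deepest nested level.
        return 1 + max((c.depth() for c in self.children), default=0)

    def text(self, sentence: List[str]) -> str:
        return " ".join(sentence[i] for i in self.tokens)


# Illustrative sentence (an assumption, not an example from the source).
sentence = ["She", "took", "the", "University", "of", "Warsaw", "into", "account"]

# A verbal MWE whose components are separated by its object.
mwe = Unit("VMWE", (1, 6, 7))

# A named entity with another named entity nested inside it.
org = Unit("ORG", (3, 4, 5), children=[Unit("LOC", (5,))])
```

With this representation, `mwe.text(sentence)` recovers "took into account" despite the intervening noun phrase, and `org.depth()` reports the nesting level, two properties that a flat one-label-per-token encoding would have to approximate.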

    Current trends

    Deep parsing is the fundamental process aiming at the representation of the syntactic structure of phrases and sentences. In the traditional methodology this process is based on lexicons and grammars that roughly represent the properties of words and the interactions of words and structures in sentences. Several linguistic frameworks, such as Head-driven Phrase Structure Grammar (HPSG), Lexical Functional Grammar (LFG), Tree Adjoining Grammar (TAG), Combinatory Categorial Grammar (CCG), etc., offer different structures and combining operations for building grammar rules. These already contain mechanisms for expressing properties of Multiword Expressions (MWEs), which, however, need improvement in how they account for idiosyncrasies of MWEs on the one hand and their similarities to regular structures on the other hand. This collaborative book constitutes a survey of various attempts at representing and parsing MWEs in the context of linguistic theories and applications

    Representation and parsing of multiword expressions

    This book consists of contributions related to the definition, representation and parsing of MWEs. These reflect current trends in the representation and processing of MWEs. They cover various categories of MWEs such as verbal, adverbial and nominal MWEs, various linguistic frameworks (e.g. tree-based and unification-based grammars), various languages (including English, French, Modern Greek, Hebrew and Norwegian), and various applications (namely MWE detection, parsing and automatic translation) using both symbolic and statistical approaches

    A Computational Lexicon and Representational Model for Arabic Multiword Expressions

    The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these many areas has emphasised the primary role of MWEs in the process of analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high precision outputs. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language resource building, and lexical representations. This research aims to remedy this deficiency by extending knowledge of AMWEs and making noteworthy contributions to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of an adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, in the use of AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, this thesis presents a representative system for AMWEs which consists of multilayer encoding of extensive linguistic descriptions. This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in standard Arabic. 
The implications of this research are related to the vital role of the AMWE lexicon, as a new lexical resource, in the improvement of various ANLP tasks and the potential opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena

    The TXM Portal Software giving access to Old French Manuscripts Online

    Full text online: http://www.lrec-conf.org/proceedings/lrec2012/workshops/13.ProceedingsCultHeritage.pdf This paper presents the new TXM software platform, which gives online access to images of Old French text manuscripts and to tagged transcriptions for concordancing and text mining. The platform is able to import medieval sources encoded in XML according to the TEI Guidelines for linking manuscript images to transcriptions, and to encode several diplomatic levels of transcription, including abbreviations and word-level corrections. It includes a sophisticated tokenizer able to deal with TEI tags at different levels of the linguistic hierarchy. Words are tagged on the fly during the import process using the IMS TreeTagger tool with a specific language model. Synoptic editions displaying manuscript images and text transcriptions side by side are automatically produced during the import process. Texts are organized in a corpus with their own metadata (title, author, date, genre, etc.), and several word property indexes are produced for the CQP search engine to allow efficient searches for word patterns and to build different types of frequency lists or concordances. For syntactically annotated texts, special indexes are produced for the Tiger Search engine to allow efficient building of syntactic concordances. The platform has also been tested on classical Latin, ancient Greek, Old Slavonic and Old Hieroglyphic Egyptian corpora (including various types of encoding and annotations)

    Baltic Journal of English Language, Literature and Culture, Vol.10

    Office premises are generally in use for about 2,500 of the year's 8,760 hours. A common problem with office premises is the thermal climate: it is either too warm, too cold, or draughty. High temperatures, above about 26°C, contribute to fatigue and impaired concentration and make the air feel less fresh. The large variation in load between daytime and night-time can also result in the premises being over-ventilated at night and under-ventilated during the day. The aim of the degree project was to investigate and compare Ecoclime's comfort ceiling solution with various other heating and cooling systems in office premises, and to investigate what advantages, if any, Ecoclime's comfort ceiling has in terms of comfort, cooling, ventilation and energy use. The simulation program IDA ICE was used to simulate comfort and room temperatures for an office and a conference room in a building located in central Umeå. The simulation results indicate that Ecoclime's comfort ceiling lowers the operative temperature and raises comfort, with a smaller share of dissatisfied occupants in the room compared with other systems despite the same room temperature. To assess the share of dissatisfied occupants in a room, the comfort indices PMV (predicted mean vote) and PPD (predicted percentage dissatisfied) were used. The high passive capacity also contributes to lower energy use by ventilation fans if a VAV system with room temperature control is used. In addition, a sensitivity analysis was carried out on the comfort ceilings, examining how the cooling capacity affects cooling times, temperature and comfort. The sensitivity analysis shows that a 10% increase or decrease in cooling capacity affects the results most on a very hot day compared with a normally warm one. The difference in comfort was small, however: only 0.2 percentage points from the base case
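
The abstract names the PMV and PPD indices without giving the relation between them. As a minimal sketch, the function below uses the standard ISO 7730 formula that maps a PMV value to a PPD percentage; whether IDA ICE or the thesis applies exactly this form is an assumption, and the function name `ppd_from_pmv` is hypothetical.

```python
import math


def ppd_from_pmv(pmv: float) -> float:
    """Predicted percentage dissatisfied (%) for a given predicted mean vote.

    Standard ISO 7730 relation; assumed here, not quoted from the thesis.
    """
    return 100.0 - 95.0 * math.exp(-0.03353 * pmv**4 - 0.2179 * pmv**2)


# PMV runs from -3 (cold) through 0 (neutral) to +3 (hot).
for pmv in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"PMV {pmv:+.1f} -> PPD {ppd_from_pmv(pmv):.1f} %")
```

Under this relation the PPD never drops below 5 %, even at a neutral PMV of 0, and the curve is symmetric around neutral, which is one reason a difference of 0.2 percentage points counts as small.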

    Extended papers from the MWE 2017 workshop

    The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point of reference for future work