19 research outputs found

    Un environnement générique et ouvert pour le traitement des expressions polylexicales

    The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt a generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is the proposal of an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language independent, integrated and contains a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled using four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., nature and size of corpora, language and type of MWE, analysis depth, and existing resources. The second main contribution of this thesis is the application-oriented evaluation of our methodology proposal in two applications: computer-assisted lexicography and statistical machine translation. For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment given to English phrasal verbs when translated by a standard statistical MT system into Portuguese. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation into the optimal way to integrate MWE treatment into other applications. Thus, we conclude the thesis with an overview of the past, ongoing and future work.

    Can human association norm evaluate latent semantic analysis?

    This paper compares a word association norm created by a psycholinguistic experiment with association lists generated by algorithms operating on text corpora. We compare lists generated by the Church and Hanks algorithm with lists generated by the LSA algorithm. An argument is presented on how those automatically generated lists reflect real semantic relations.

    D7.4 Third evaluation report. Evaluation of PANACEA v3 and produced resources

    D7.4 reports on the evaluation of the different components integrated in the PANACEA third cycle of development as well as the final validation of the platform itself. All validation and evaluation experiments follow the evaluation criteria already described in D7.1. The main goal of WP7 tasks was to test the (technical) functionalities and capabilities of the middleware that allows the integration of the various resource-creation components into an interoperable distributed environment (WP3) and to evaluate the quality of the components developed in WP5 and WP6. The content of this deliverable is thus complementary to D8.2 and D8.3, which tackle advantages and usability in industrial scenarios. It has to be noted that the PANACEA third cycle of development addressed many components that are still under research. The main goal for this evaluation cycle is thus to assess the methods experimented with and their potential for becoming actual production tools to be exploited outside research labs. For most of the technologies, an attempt was made to re-interpret standard evaluation measures, usually in terms of accuracy, precision and recall, as measures related to a reduction of costs (time and human resources) in the current practices based on the manual production of resources. In order to do so, the different tools had to be tuned and adapted to maximize precision, and for some tools an attempt was made to offer confidence measures that could separate out the resources that still needed manual revision. Furthermore, the extension to other languages in addition to English, also a PANACEA objective, has been evaluated. The main facts about the evaluation results are now summarized.

    Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme

    Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies

    Investigating Frequency and Type of Lexical Collocations in Applied Linguistics Journal Articles Written in English by Iranian and Norwegian Scholars

    Master's thesis in Literacy Studies. In today’s academic world, the research interest in corpus linguistics has shifted towards word co-occurrence rather than single words. Accordingly, a great body of literature has been devoted to investigations of recurrent word combinations in academic prose using frequency and dispersion parameters. This has resulted in the analysis of corpora in different fields of study to compile comprehensive lists of academic collocations. Moreover, many contrastive studies have been conducted to compare the collocations used by native and non-native speakers of English. However, to the author’s knowledge, few studies have been conducted to compare the most frequent collocations in two corpora of research articles written by non-native speakers of English published in international journals in the field of applied linguistics. To fill this gap in the literature, the current study investigated the most frequent collocations used by Iranian and Norwegian scholars in a corpus of 17 articles published in the Journal of Pragmatics through a frequency-based approach. Nine of the 17 articles, totalling 67,673 words, were written by Iranian scholars, and eight, totalling 64,682 words, were written by Norwegian scholars. The data of this study were collected using Collocation Extract software. The results of the study were presented in three phases. In the first phase, the 15 most frequent lexical collocations in both corpora were identified and classified under three types. The Adj+N type accounted for the largest proportion in the corpora and the Adv+Adj type for the smallest. In the second phase, the lexical collocations of the Iranian corpus were presented: a total of 818 collocations classified under five types. According to the results, Adj+N was the most frequent type while N+V was the least frequent one. The lexical collocations of the Norwegian corpus were identified in the same way: a total of 462 collocations classified under four types, among which Adj+N was the most frequent type while Adv+Adj was the least frequent one. In the third phase, frequencies of lexical collocations were compared in the two corpora. According to the obtained results, the two corpora did not differ significantly in their use of any collocation type except Adj+N.

    Acquiring translation equivalences of multiword expressions by normalized correlation frequencies

    No full text

    Acquiring Translation Equivalences of Multiword Expressions by Normalized Correlation Frequencies

    No full text
    Sponsorship: Institute of Information Science; Research Center for Information Technology Innovation. Note: published; peer-reviewed; representative

    Weathered Words : Formulaic Language and Verbal Art

    Formulaic phraseology presents the epitome of words worn and weathered by trial and the tests of time. Scholarship on weathered words is exceptionally diverse and interdisciplinary. This volume focuses on verbal art, which makes Oral-Formulaic Theory (OFT) a major point of reference. Yet weathered words are but a part of OFT, and OFT is only a part of scholarship on weathered words. Each of the eighteen essays gathered here brings particular aspects of formulaic language into focus. No volume on such a diverse topic can be all-encompassing, but the essays highlight aspects of the phenomenon that may be eclipsed elsewhere: they diverge not only in style, but sometimes even in how they choose to define “formula.” As such, they offer overlapping frames that complement one another both in their convergences and their contrasts. While they view formulaicity from multifarious angles, they unite in a Picasso of perspectives on which the reader can reflect and draw insight. Peer reviewed

    Visualización del lenguaje a través de corpus

    Digital version of the print publication, published in A Coruña: Universidade da Coruña, Servizo de Publicacións, 2010 (ISBN 978-84-9749-401-4). This book contains the papers presented at the Second International Conference on Corpus Linguistics held at the University of A Coruña in 2010 and organised by the MuStE group. The essays deal with different aspects of corpus linguistics both as a methodology and as a branch of Linguistics. [Abstract] The collection of essays presented here is a mere sample of the interest that topics relating to Corpus Linguistics have aroused everywhere. Topics as different as those related to Computational Linguistics, found in “Obtaining computational resources for languages with scarce resources from closely related computationally-developed languages. The Galician and Portuguese case”, or “Corpus-Based Modelling of Lexical Changes in Manic Depression Disorders: The Case of Edgar Allan Poe”, belonging to the field of Corpus and Literary Studies, can be found in the ensuing pages. Almost all research areas can nowadays be investigated using Corpus Linguistics as a valid methodology. This is the reason why Language Windowing through Corpora gathers papers dealing with discourse, variation and change, grammatical studies, lexicology and lexicography, corpus design, contrastive analyses, language acquisition and learning or translation. This work’s title aims at reflecting not only the great variety of topics gathered in it but also the worldwide interest awakened by the computer processing of language. In fact, researchers from many different institutions all over the world have contributed to this book. Apart from the twenty-two Spanish Universities, people from Higher Education Institutions in other countries have authored and co-authored the essays contained here, namely Russia, Venezuela, Brazil, UK, Finland, Portugal, Poland, Austria, Mexico, Thailand, Iran, the Netherlands, Belgium, Japan, Turkey, China, Italy, Malaysia, Romania and Sweden. All these essays have been arranged alphabetically, by the names of their authors, in two parts. Part 1 contains the papers by authors from A to K and Part 2 those by authors from L to Z.