
    A corpus study of verbal multiword expressions in Brazilian Portuguese

    Verbal multiword expressions (VMWEs) such as "to make ends meet" require special attention in NLP and linguistic research, and annotated corpora are valuable resources for studying them. Corpora annotated with VMWEs in several languages, including Brazilian Portuguese, were made freely available in the PARSEME shared task. The goal of this paper is to describe and analyze the Brazilian Portuguese corpus in terms of the characteristics of its annotated VMWEs. First, we summarize and exemplify the criteria used to annotate VMWEs. Then, we analyze their frequency, average length, discontinuities and variability. We further discuss challenging constructions and borderline cases. We believe that this analysis can improve the annotated corpus and that its results can be used to develop systems for automatic VMWE identification.

    Collocations in Portuguese: A corpus-based approach to lexical patterns

    Collocations and, more generally, multiword expressions have been extensively studied for the English language, and a large set of resources is available in terms of linguistic description and tools for language learning. By contrast, combinatorial resources for Portuguese are scarce, although specific types of collocations, such as light verb constructions, nominal compounds and proverbs, have been the topic of many studies. This chapter reviews different theoretical perspectives on multiword expressions and collocations in Portuguese and presents in more detail the results of the COMBINA-PT project, a corpus-based approach to the study of collocations.

    Automatic Detection of Proverbs and their Variants

    This article presents the task of automatic detection of proverbs in Brazilian Portuguese, based on the intersection of the regular syntactic structure of proverbs and their core elements. We created finite-state automata that enabled us to look for these word combinations in running texts. The rationale behind this method is that, although proverbs may have a normal sentence structure and often a very commonly used lexicon, their specific word combinations may enable us to identify them and their variants irrespective of the syntactic or structural changes the proverb may undergo. The goal of this task is to gather the largest number of proverbs and their variants. The results showed precision 60.15%
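The idea of matching a proverb through its core elements, regardless of what is inserted between them, can be illustrated with a very small automaton. The sketch below is only an assumption about how such a matcher might look, not the automata built by the authors: the function names (`tokenize`, `find_proverb`), the `max_gap` parameter and the choice of core elements for the proverb "Quem tudo quer, tudo perde" are all hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def find_proverb(tokens, core, max_gap=3):
    """Return (start, end) token indices of the first match, or None.

    Hypothetical matcher: one state per core element. From state i the
    automaton advances on core[i] and tolerates up to max_gap other
    tokens between consecutive core elements before rejecting.
    """
    for start, tok in enumerate(tokens):
        if tok != core[0]:
            continue
        state, gap, pos = 1, 0, start + 1
        while state < len(core) and pos < len(tokens) and gap <= max_gap:
            if tokens[pos] == core[state]:
                state, gap = state + 1, 0   # advance to the next state
            else:
                gap += 1                    # skip an intervening token
            pos += 1
        if state == len(core):
            return (start, pos - 1)
    return None

# Illustrative core elements (an assumption, not taken from the article).
CORE = ["quem", "quer", "perde"]

canonical = tokenize("Quem tudo quer, tudo perde")
variant = tokenize("Quem tudo quer, no fim tudo perde")
unrelated = tokenize("Ele quer tudo")

print(find_proverb(canonical, CORE))
print(find_proverb(variant, CORE))
print(find_proverb(unrelated, CORE))
```

Because only the core elements and their order are fixed, the same automaton accepts both the canonical form and the variant with extra words, while text lacking the first core element is rejected. A realistic system would also need lemmatisation and reordering, which this sketch leaves out.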

    The Reference Corpus of Contemporary Portuguese and related resources

    The extraordinary growth of computer applications, particularly over the last two decades, has enabled the easy compilation and exploration of large corpora and lexica. These linguistic resources play a fundamental role in the areas of theoretical linguistics and natural language engineering. Combining these two areas of knowledge can, in fact, result in the development of a large number of applications, such as new and straightforward descriptions of languages based on real data; contrastive studies between varieties of a particular language aiming at finding factors of unity and diversity; cross-linguistic contrastive studies; grammars; lexica and dictionaries; terminologies; assisted translation materials; language teaching materials; and computer tools and applications for processing natural language. With this principle in mind, and following the tradition at the Centre of Linguistics of the University of Lisbon (CLUL) of collecting and studying real language data, a large electronic corpus, the Corpus de Referência do Português Contemporâneo (Reference Corpus of Contemporary Portuguese, CRPC), has been compiled at CLUL since 1988. The CRPC currently contains approximately 310 million words, searchable through a user-friendly interface, and it is envisaged as a monitor corpus (from which one can extract balanced subcorpora) that can serve as a sample of the Portuguese language (both in its written and spoken varieties). In the next sections, we will describe the CRPC and how it forms the basis for important resources developed at CLUL.

    One book, two language varieties

    This paper presents a comparative study of alignment pairs, either contrasting expressions or stylistic variants of the same expression, in the European (EP) and Brazilian (BP) varieties of Portuguese. The alignments were collected semi-automatically using the CLUE-Aligner tool, which records in a database all pairs of paraphrastic units resulting from the alignment task. The corpus was the children's book Os livros que devoraram o meu pai (The Books that Devoured My Father) by the Portuguese author Afonso Cruz, together with the Brazilian adaptation of this book. The main goal of the work presented here is to gather equivalent phrasal expressions and different syntactic constructions that convey the same meaning in EP and BP, and thereby to contribute to the optimisation of the editorial processes that are compulsory in the adaptation of texts, although the results are suitable for any type of editorial process. This study provides a scientific basis for future computational work on editing, proofreading and converting text to and from any variety of Portuguese, namely for use in a paraphrasing system with a variety adaptation functionality, even in the case of a literary text. We consider "challenging" cases, from a literary point of view, looking for alternatives that do not tamper with the rich imagery of the original version.

    The lexico-phraseology of THE and A/AN in spoken English: a corpus-based study

    The English articles (THE, A, AN) are normally described in terms of the grammar of the language. This is only natural, since they are extremely frequent, fit into certain well-defined syntactic slots, and usually help to communicate only very broad aspects of textual meaning. However, as John Sinclair has pointed out (1999, pp.160-161), the articles are also found as components of many lexico-phraseological units, and in such cases a normal grammatical description may not be of relevance. An example he gives is the presence of A in the phrase 'come to a head', where ‘A has little more status than that of a letter of the alphabet’ (p.161). Sinclair also makes the observation that, ‘I do not know of an estimate of the proportion of instances of A, for example, that are not a realisation of the choice of article but of the realisation of part of a multi-word expression.’ (p.161). The present paper addresses the questions raised by Sinclair, and does so with reference to both the definite and the indefinite article. It focuses, in particular, on the spoken language, and presents the results of analyses of random samples of the articles in the spoken component of the British National Corpus (hereafter BNC-spkn). According to the data in Leech et al (2001, p.144), THE is the most frequent word in BNC-spkn and A is the sixth most frequent (a rank position which remains unaltered when the frequencies of A and AN are combined). Using the BNCweb interface, and specifying that the relevant word forms should be ‘articles’, the total numbers of tokens are: an 19,049; a 200,004; the 409,060. Since the numbers are very high, the samples investigated also contained a reasonably large number of tokens (500). The relative samples corresponded to the following proportions of tokens in BNC-spkn: an 2.62%, a 0.25%, the 0.12%. 
The latter two are very low percentages, and for this reason, three separate samples of each were investigated, in order to see the extent to which the samples differed. Analysis of article usage was carried out in the first instance by reading right-sorted concordance lines. Whenever doubts arose, larger contexts were retrieved from the corpus. Various reference works were also consulted, including Berry (1993), Francis et al. (1998), and various corpus-based dictionaries and grammars. The data presented include: a description of the various types of lexico-phraseological unit found; the proportions of the samples judged to involve the different lexico-phraseological phenomena identified; the problems encountered when deciding whether or not phraseology is an important factor in specific instances of article usage; and the number of tokens in each sample which were in some way irrelevant, for example because they involved speaker repetition of the article, or the non-completion of a noun phrase.
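The sample proportions quoted in this abstract follow directly from the token counts and the sample size of 500. The short check below reproduces that arithmetic; the variable and function names are illustrative only, and the figures are those given in the abstract.

```python
# Token counts for each article in BNC-spkn, as reported in the abstract.
TOKENS_IN_BNC_SPKN = {"an": 19_049, "a": 200_004, "the": 409_060}
SAMPLE_SIZE = 500  # tokens per sample, as reported in the abstract

def sample_share(total, sample=SAMPLE_SIZE):
    """Percentage of all tokens covered by one sample, to 2 decimals."""
    return round(100 * sample / total, 2)

shares = {word: sample_share(n) for word, n in TOKENS_IN_BNC_SPKN.items()}
print(shares)  # {'an': 2.62, 'a': 0.25, 'the': 0.12}
```

The much smaller shares for A and THE are what motivate drawing three separate samples of each.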

    Multiword expressions at length and in depth

    The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point of reference for future work.

    A generic and open environment for the treatment of multiword expressions (Un environnement générique et ouvert pour le traitement des expressions polylexicales)

    The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt a generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is the proposal of an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language independent and integrated, and it comes with a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled using four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., nature and size of corpora, language and type of MWE, analysis depth, and existing resources. The second main contribution of this thesis is the application-oriented evaluation of our methodology proposal in two applications: computer-assisted lexicography and statistical machine translation. For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment given to English phrasal verbs when translated by a standard statistical MT system into Portuguese. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up the work and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation into the optimal way to integrate MWE treatment into other applications. Thus, we conclude the thesis with an overview of the past, ongoing and future work.