221 research outputs found

    The Reference Corpus of Contemporary Portuguese and related resources

    The extraordinary growth of computer applications, particularly over the last two decades, has enabled the easy compilation and exploration of large corpora and lexica. These linguistic resources play a fundamental role in theoretical linguistics and natural language engineering. Combining these two areas of knowledge can result in a large number of applications, such as new and straightforward descriptions of languages based on real data; contrastive studies between varieties of a particular language, aimed at finding factors of unity and diversity; cross-linguistic contrastive studies; grammars; lexica and dictionaries; terminologies; assisted translation materials; language teaching materials; and computer tools and applications for processing natural language. With this principle in mind, and following the tradition at the Centre of Linguistics of the University of Lisbon (CLUL) of collecting and studying real language data, a large electronic corpus, the Corpus de Referência do Português Contemporâneo (Reference Corpus of Contemporary Portuguese, CRPC), has been under compilation at CLUL since 1988. The CRPC currently contains approximately 310 million words, searchable through a user-friendly interface, and it is envisaged as a monitor corpus (from which one can extract balanced subcorpora) that can serve as a sample of the Portuguese language (in both its written and spoken varieties). In the next sections, we will describe the CRPC and how it forms the basis for important resources developed at CLUL.

    CEPLEXicon - A Lexicon of Child European Portuguese

    CEPLEXicon (version 1.1) is a child lexicon resulting from the automatic tagging of two child corpora: the corpus Santos (Santos, 2006; Santos et al. 2014) and the corpus Child – Adult Interaction (Freitas et al. 2012), which integrates information from the corpus Freitas (Freitas, 1997). This lexicon includes spontaneous speech produced by seven children (1;02.00 to 3;11.12) during approximately 86 hours of child-adult interaction. The automatic tagging comprised the lemmatization and morphosyntactic classification of the speech produced by the seven children included in the two child corpora; the lexicon contains information pertaining to lemmas and syntactic categories, as well as absolute numbers of occurrences and frequencies in three age intervals: < 2 years; ≥ 2 years and < 3 years; ≥ 3 years. The information included in this lexicon and the format in which it is presented enable research in different areas and allow researchers to obtain measures of lexical growth. CEPLEXicon is available through the ELRA catalogue.
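
    The three age intervals mentioned in this abstract can be illustrated with a minimal sketch. It assumes that ages such as 1;02.00 follow the years;months.days notation common in child-language transcripts, and that the input is a list of (age, lemma) pairs; the function names and the input layout are hypothetical and are not taken from CEPLEXicon itself.

```python
from collections import Counter, defaultdict


def parse_age(age):
    """Split an age string such as '1;02.00' into (years, months, days).

    Assumption: the notation is years;months.days.
    """
    years, rest = age.split(";")
    months, days = rest.split(".")
    return int(years), int(months), int(days)


def age_band(age):
    """Map an age string to one of the three intervals named in the abstract."""
    years, _, _ = parse_age(age)
    if years < 2:
        return "< 2 years"
    if years < 3:
        return ">= 2 and < 3 years"
    return ">= 3 years"


def lemma_counts_by_band(records):
    """Count lemma occurrences per age interval.

    `records` is a hypothetical iterable of (age, lemma) pairs.
    """
    counts = defaultdict(Counter)
    for age, lemma in records:
        counts[age_band(age)][lemma] += 1
    return counts
```

    Dividing each count by the total number of tokens in its interval would give relative frequencies that can be compared across intervals.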

    Providing on-line access to Portuguese language resources: corpora and lexicons

    Several Language Resources (LRs) for Portuguese, developed at the Center of Linguistics of the University of Lisbon (CLUL), are available on-line at CLUL’s webpage: www.clul.ul.pt/english/sectores/projecto_rld.html. These LRs have been extracted from, or developed on the basis of, the Reference Corpus of Contemporary Portuguese (CRPC), a monitor corpus currently containing more than 350 million words, sampled from several types of written text (literary, newspaper, technical, didactic, juridical, parliamentary, etc.) and spoken text (informal and formal), pertaining to national and regional varieties of Portuguese (including European, Brazilian, African and Asian Portuguese). The LRs available for on-line queries include: a) several subcorpora (written and spoken, tagged and untagged) compiled and extracted from the CRPC for specific CLUL projects; b) a published sample of “Português Fundamental”, a spoken CRPC subcorpus, available for text download; c) a frequency lexicon extracted from a CRPC subcorpus, available for both on-line queries and download. Other LRs available for Portuguese are also mentioned: C-ORAL-ROM (Integrated Reference Corpora for Spoken Romance Languages), a CD-ROM edition of a spoken corpus with text-to-sound alignment; the LE-PAROLE corpus; the LE-PAROLE Lexicon; and the SIMPLE Lexicon.

    Basic research in phonology, resources and applications–the case of frequency

    On the use of comparable corpora of African varieties of Portuguese for linguistic description and teaching/learning applications

    This presentation focuses on the use of five comparable corpora of African varieties of Portuguese (AVP), namely those of Angola, Cape Verde, Guinea-Bissau, Mozambique and Sao Tome and Principe, for multiple contrastive linguistic analyses and for the production of teaching and learning applications. Five contrastive lexicons have been extracted from the corpora and further annotated with POS and lemma information; they have been crucial for establishing a core and a peripheral vocabulary for each variety. Studies on AVP-specific morphological processes and on variation in verb complementation will also be discussed. These are first steps towards an integrated description of the five varieties and towards the elaboration of teaching and learning materials to be used by teachers of students from those five African countries, where Portuguese is an official language.

    A Lexical Database of Portuguese Multiword Expressions

    This presentation focuses on an ongoing project which aims at the creation of a large lexical database of Portuguese multiword (MW) units, automatically extracted through the analysis of a balanced 50 million word corpus, statistically interpreted with lexical association measures and validated by hand. This database covers different types of MW units, such as named entities, and lexical associations ranging from sets of favoured co-occurring forms with high corpus frequency and low cohesion to strongly lexicalized expressions with no, or minimal, variation. This new resource has a two-fold objective: to be an important research tool which supports the development of collocation typologies and their integration in a larger theory of MW units; and to be of major help in developing and evaluating language processing tools capable of dealing with MW expressions.
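
    The abstract does not say which lexical association measures the project uses. As one common example of such a measure, the sketch below scores adjacent word pairs with pointwise mutual information (PMI); the choice of PMI, the function name and the `min_count` threshold are assumptions made purely for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Return {(w1, w2): PMI} for adjacent pairs seen at least `min_count` times.

    PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), estimated from raw counts.
    This is an illustrative measure, not necessarily the one used in the project.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores
```

    High scores point to pairs that co-occur more often than their individual frequencies would predict; candidates ranked this way would still need manual validation, as the abstract describes.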

    Lexical analysis of pre and post revolution discourse in Portugal

    This paper presents a lexical comparison of pre (1954-74) and post (1974-94) revolution parliamentary discourse in four comparable sub-corpora extracted from the Reference Corpus of Contemporary Portuguese (CRPC). After introducing the CRPC, including annotation and meta-data, we focus on a subset of the corpus dealing with parliamentary discourses, more particularly a time frame of forty years divided into four comparable sub-corpora, each covering a ten-year period, two pre revolution and two post revolution. We extract lexical density information as well as salient terms pertaining to each period to make a comparative evaluation of the periods. Our results show how a linguistic analysis essentially based on the use of simple n-gram statistics can produce key insights into the use, change and evolution of the Portuguese language around a critical time period in its history.
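
    The two statistics named in this abstract, lexical density and salient terms, can be sketched as follows. The definitions used here are assumptions: lexical density is taken as the share of tokens outside a small function-word list, and salience as a smoothed ratio of relative frequencies between a target and a reference sub-corpus. The toy word list and all function names are hypothetical; the paper's own definitions may differ.

```python
from collections import Counter

# Toy list of Portuguese function words; a real study would use a fuller list
# or POS tags. This set is an assumption for illustration only.
FUNCTION_WORDS = {"a", "o", "de", "e", "que", "em", "para"}


def lexical_density(tokens, function_words=FUNCTION_WORDS):
    """Share of tokens that are not in the function-word list."""
    if not tokens:
        return 0.0
    content = sum(1 for t in tokens if t.lower() not in function_words)
    return content / len(tokens)


def salient_terms(target, reference, top_n=5):
    """Rank target words by add-one-smoothed relative-frequency ratio."""
    t_counts, r_counts = Counter(target), Counter(reference)
    vocab_size = len(set(target) | set(reference))

    def ratio(word):
        t_rel = (t_counts[word] + 1) / (len(target) + vocab_size)
        r_rel = (r_counts[word] + 1) / (len(reference) + vocab_size)
        return t_rel / r_rel

    return sorted(t_counts, key=ratio, reverse=True)[:top_n]
```

    Applied to two ten-year sub-corpora, `salient_terms` would list the words that are proportionally most over-represented in one period relative to the other.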

    Procura-PALavras (P-PAL): a new measure of word frequency for contemporary European Portuguese

    In this paper we present the strategies and procedures adopted in the development of a new measure of lexical frequency for contemporary European Portuguese, Procura-PALavras (P-PAL). Based on a corpus of over 227 million words, P-PAL is a web application that offers, by default, frequency per million words for all its lexical entries (lemmas and wordforms) and supports the computation of several other objective (lexical and sublexical) and subjective word metrics. We also describe how its lexical entries were defined and how their frequency values were extracted. The high number of indices and lexical entries makes P-PAL an advanced and indispensable web application for the promotion and internationalization of Portuguese research. P-PAL is available at http://p-pal.di.uminho.pt/tools. Funding: Fundação para a Ciência e a Tecnologia (FCT).
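
    Frequency per million words, the default value mentioned in this abstract, is a simple normalization of raw counts by corpus size. The sketch below shows the arithmetic under that standard definition; the function name and the input format are hypothetical and do not describe P-PAL's actual implementation.

```python
from collections import Counter


def per_million(counts):
    """Convert raw counts to occurrences per million tokens.

    `counts` maps each wordform (or lemma) to its raw corpus count;
    the corpus size is taken to be the sum of all counts.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count * 1_000_000 / total for word, count in counts.items()}
```

    For example, a word occurring 227 times in a corpus of 227 million tokens would have a frequency of 1 per million, which makes values comparable across corpora of different sizes.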

    Morphological Tagging of a Spoken Portuguese Corpus Using Available Resources

    This paper discusses the experience of reusing annotation tools developed for written corpora to tag a spoken corpus with POS information. Eric Brill’s tagger, initially trained on a written and tagged corpus of 250,000 words, is being used to tag the Portuguese C-ORAL-ROM spoken corpus of 300,000 words. First, we address issues related to the tagset definition as well as the tagger's performance on the written corpus. We discuss important options concerning the transcription of the spoken corpus, which have a direct impact on the tagging task, as well as the additional tags required. Transcription options allow, in some cases, for automatic tag identification and replacement through a post-tagging process. Other cases, like the annotation of discourse markers, are more complex and require manual revision (and possibly listening to the recordings). Since the final annotation will include not only the POS tag but also the wordform lemma, the paper also addresses issues related to the lemmatisation task. The positive results obtained show that tagging and lemmatising a spoken Portuguese corpus by reusing already available resources may serve as an example of how to minimize the costs of such a task without compromising the results. Finally, we discuss some possible developments to improve the tagger’s performance.