221 research outputs found

The Reference Corpus of Contemporary Portuguese and related resources

Author: Antunes Sandra
Mendes Amália
Nascimento Maria Fernanda Bacelar do
Pereira Luísa
Publication venue: Bloomsbury Publishing
Publication date: 01/01/2014

The extraordinary growth of computer applications, particularly over the last two decades, has enabled the easy compilation and exploration of large corpora and lexica. These linguistic resources play a fundamental role in theoretical linguistics and natural language engineering. Combining these two areas of knowledge can result in a large number of applications, such as new and straightforward descriptions of languages based on real data; contrastive studies between varieties of a particular language, aimed at finding factors of unity and diversity; cross-linguistic contrastive studies; grammars; lexica and dictionaries; terminologies; assisted translation materials; language teaching materials; and computer tools and applications for processing natural language. With this principle in mind, and following the tradition at the Centre of Linguistics of the University of Lisbon (CLUL) of collecting and studying real language data, a large electronic corpus – the Corpus de Referência do Português Contemporâneo (Reference Corpus of Contemporary Portuguese, CRPC) – has been under compilation at CLUL since 1988. The CRPC currently contains approximately 310 million words, searchable through a user-friendly interface, and it is envisaged as a monitor corpus (from which one can extract balanced subcorpora) that can serve as a sample of the Portuguese language in both its written and spoken varieties. In the next sections, we describe the CRPC and how it forms the basis for important resources developed at CLUL.

Universidade de Lisboa: Repositório.UL

CEPLEXicon - A Lexicon of Child European Portuguese

Author: Cardoso Aida
Freitas Maria João
Santos Ana Lúcia
Publication venue: European Language Resources Association
Publication date: 01/01/2016

CEPLEXicon (version 1.1) is a child lexicon resulting from the automatic tagging of two child corpora: the corpus Santos (Santos, 2006; Santos et al. 2014) and the corpus Child – Adult Interaction (Freitas et al. 2012), which integrates information from the corpus Freitas (Freitas, 1997). This lexicon includes spontaneous speech produced by seven children (1;02.00 to 3;11.12) during approximately 86h of child-adult interaction. The automatic tagging comprised the lemmatization and morphosyntactic classification of the speech produced by the seven children included in the two child corpora; the lexicon contains information pertaining to lemmas and syntactic categories, as well as the absolute number of occurrences and frequencies in three age intervals: < 2 years; ≥ 2 years and < 3 years; ≥ 3 years. The information included in this lexicon and the format in which it is presented enable research in different areas and allow researchers to obtain measures of lexical growth. CEPLEXicon is available through the ELRA catalogue.
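The three age intervals described in this abstract can be illustrated with a minimal Python sketch. It assumes ages are written in the `years;months.days` notation used in the abstract (e.g. `1;02.00`) and that tagged tokens arrive as `(age, lemma, category)` tuples. The function names, the interval labels and the input layout are hypothetical and are not taken from CEPLEXicon itself.

```python
from collections import Counter, defaultdict


def age_interval(age: str) -> str:
    """Map a 'years;months.days' age string (e.g. '1;02.00') to one of three intervals."""
    years = int(age.split(";")[0])
    if years < 2:
        return "<2"
    if years < 3:
        return "2-3"
    return ">=3"


def lemma_counts_by_interval(tokens):
    """Count (lemma, category) pairs per age interval.

    tokens: iterable of (age, lemma, category) tuples (hypothetical layout).
    """
    counts = defaultdict(Counter)
    for age, lemma, category in tokens:
        counts[age_interval(age)][(lemma, category)] += 1
    return counts


def relative_frequencies(counter):
    """Turn absolute counts into relative frequencies within one interval."""
    total = sum(counter.values())
    return {key: n / total for key, n in counter.items()}


# Hypothetical sample data, for illustration only.
sample = [
    ("1;02.00", "querer", "V"),
    ("1;10.15", "querer", "V"),
    ("2;05.03", "bola", "N"),
    ("3;11.12", "bola", "N"),
]
counts = lemma_counts_by_interval(sample)
```

Comparing the per-interval counts or relative frequencies of a lemma is one simple way to approximate the lexical growth measures the abstract mentions.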

Universidade de Lisboa: Repositório.UL

Providing on-line access to Portuguese language resources: corpora and lexicons

Author: Mendes Amália
Nascimento Maria Fernanda Bacelar do
Pereira Luísa
Publication venue: European Language Resources Association
Publication date: 01/01/2004

Several Language Resources (LRs) for Portuguese, developed at the Center of Linguistics of the Lisbon University (CLUL), are available on-line at CLUL’s webpage: www.clul.ul.pt/english/sectores/projecto_rld.html. These LRs have been extracted from, or developed on the basis of, the Reference Corpus of Contemporary Portuguese (CRPC), a monitor corpus that at present contains more than 350 million words, sampled from several types of written text (literary, newspaper, technical, didactic, juridical, parliamentary, etc.) and spoken text (informal and formal), pertaining to national and regional varieties of Portuguese (including European, Brazilian, African and Asian Portuguese). The LRs available for on-line queries include: a) several subcorpora (written and spoken, tagged and untagged) compiled and extracted from the CRPC for specific CLUL projects and now available for on-line queries; b) a published sample of “Português Fundamental”, a spoken CRPC subcorpus, available for text download; c) a frequency lexicon extracted from a CRPC subcorpus, available for both on-line queries and download. Other LRs available for Portuguese are also mentioned: C-ORAL-ROM - Integrated Reference Corpora for Spoken Romance Languages, a CD-ROM edition of a spoken corpus with text-to-sound alignment; the LE-PAROLE corpus; the LE-PAROLE Lexicon; and the SIMPLE Lexicon.

Universidade de Lisboa: Repositório.UL

Basic research in phonology, resources and applications–the case of frequency

Publication venue: Universidade Estadual de Campinas

Crossref

On the use of comparable corpora of African varieties of Portuguese for linguistic description and teaching/learning applications

Author: Estrela Antónia
Mendes Amália
Nascimento Maria Fernanda Bacelar do
Pereira Luísa
Publication venue: European Language Resources Association
Publication date: 01/01/2008

This presentation focuses on the use of five comparable corpora of African varieties of Portuguese (AVP), namely those of Angola, Cape Verde, Guinea-Bissau, Mozambique and Sao Tome and Principe, for multiple contrastive linguistic analyses and for the production of teaching and learning applications. Five contrastive lexicons have been extracted from the corpora and further annotated with POS and lemma information; they have been crucial in establishing a core and a peripheral vocabulary for each variety. Studies on AVP-specific morphological processes and on variation in verb complementation will also be discussed. These are first steps towards an integrated description of the five varieties and towards the elaboration of teaching and learning materials to be used by teachers of students from those five African countries, where Portuguese is an official language.

Universidade de Lisboa: Repositório.UL

A Lexical Database of Portuguese Multiword Expressions

Author: Antunes Sandra
Nascimento Maria Fernanda Bacelar do
Casteleiro João Miguel
Mendes Amália
Pereira Luísa
Sá Tiago
Publication venue: Springer-Verlag
Publication date: 01/01/2006

This presentation focuses on an ongoing project which aims at the creation of a large lexical database of Portuguese multiword (MW) units, automatically extracted through the analysis of a balanced 50 million word corpus, statistically interpreted with lexical association measures and validated by hand. This database covers different types of MW units, like named entities, and lexical associations ranging from sets of favoured co-occurring forms with high corpus frequency and low cohesion to strongly lexicalized expressions with no, or minimal, variation. This new resource has a two-fold objective: to be an important research tool which supports the development of collocation typologies and their integration in a larger theory of MW units; and to be of major help in developing and evaluating language processing tools capable of dealing with MW expressions.
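The abstract does not say which lexical association measures were used. As an illustration only, the sketch below scores adjacent word pairs with pointwise mutual information (PMI), one widely used association measure; the choice of PMI, the function name `bigram_pmi` and the `min_count` threshold are assumptions, not the project's documented method.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (base 2).

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated from raw corpus counts. Pairs seen fewer than min_count
    times are skipped, since PMI is unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bigrams
        p_x = unigrams[w1] / n_unigrams
        p_y = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy input, for illustration only.
toy_scores = bigram_pmi("a b a b c d".split())
```

High-scoring pairs are candidates for manual validation, which matches the extract, score and hand-validate workflow the abstract outlines.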

Universidade de Lisboa: Repositório.UL

OpenEdition

Lexical analysis of pre and post revolution discourse in Portugal

Author: Bacelar do Nascimento Maria Fernanda
Généreux Michel
Mendes Amália
Santos Pereira Luísa Alice
Publication venue: European Language Resources Association
Publication date: 01/01/2010

This paper presents a lexical comparison of pre (1954-74) and post (1974-94) revolution parliamentary discourse in four comparable sub-corpora extracted from the Reference Corpus of Contemporary Portuguese (CRPC). After introducing the CRPC, including annotation and meta-data, we focus on a subset of the corpus dealing with parliamentary discourses, more particularly a time frame of forty years divided into four comparable sub-corpora, each covering a ten-year period, two pre revolution and two post revolution. We extract lexical density information as well as salient terms pertaining to each period to make a comparative evaluation of the periods. Our results show how a linguistic analysis essentially based on the use of simple n-gram statistics can produce key insights into the use, change and evolution of the Portuguese language around a critical time period in its history.
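Lexical density is commonly defined as the share of content words among all tokens. The sketch below uses that common definition; the paper's exact formula is not given in the abstract, so the definition, the tag names in `CONTENT_TAGS` and the `(word, pos)` input layout are all assumptions.

```python
# Assumed set of content-word tags; a real tagset may differ.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def lexical_density(tagged_tokens):
    """Return the proportion of content words in a list of (word, pos) pairs."""
    tagged = list(tagged_tokens)
    if not tagged:
        return 0.0
    content = sum(1 for _, pos in tagged if pos in CONTENT_TAGS)
    return content / len(tagged)


# Hypothetical tagged sentence, for illustration only.
sentence = [
    ("o", "DET"),
    ("governo", "NOUN"),
    ("aprovou", "VERB"),
    ("a", "DET"),
    ("lei", "NOUN"),
]
density = lexical_density(sentence)
```

Computing this value separately for each ten-year sub-corpus gives one number per period that can be compared directly.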

Universidade de Lisboa: Repositório.UL

A Lexical Database of Portuguese Multiword Expressions

Author: G.A. Kjellmer
I. Sag
J. Bahns
J. Clear
J. Firth
J. Sinclair
K.W. Church
M.F. Bacelar do Nascimento
R. Mackin
T. Dunning
Publication venue: Springer-Verlag
Publication date: 01/01/2006

Crossref

Universidade de Lisboa: Repositório.UL

Procura-PALavras (P-PAL): a new measure of word frequency for contemporary European Portuguese

Author: Almeida J. J.
Comesaña Montserrat
Costa Ana
França Patrícia das Dores Marques da Silva Cunha
Iriarte Sanromán Álvaro
Machado João F.
Simões Alberto
Soares Ana Paula
Publication venue: FapUNIFESP (SciELO)
Publication date: 01/01/2014

In this paper we present the strategies and procedures adopted in developing a new measure of lexical frequency for contemporary European Portuguese, Procura-PALavras (P-PAL). Based on a corpus of over 227 million words, P-PAL is a web application that offers, by default, frequency values per million words for all its lexical entries (lemmas and wordforms) and allows the computation of a wide range of other objective (lexical and sublexical) and subjective word metrics. We also describe how the lexical entries were defined and how their frequency values were extracted. The high number of indices and lexical entries makes P-PAL an advanced and indispensable web application for the promotion and internationalization of Portuguese research. P-PAL is available at http://p-pal.di.uminho.pt/tools

Fundação para a Ciência e a Tecnologia (FCT)
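Frequency per million words is a standard normalization that makes counts comparable across corpora of different sizes. The sketch below shows only the arithmetic; the function name `per_million` and the example counts are hypothetical and do not reproduce P-PAL's implementation or data.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw occurrence count to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


# Hypothetical example: a form occurring 227 times in a 227-million-word corpus.
example = per_million(227, 227_000_000)
```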

Universidade do Minho: RepositoriUM

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Directory of Open Access Journals

Morphological Tagging of a Spoken Portuguese Corpus Using Available Resources

Author: Amaro Raquel
Mendes Amália
Nascimento Maria Fernanda Bacelar do
Publication venue: Colibri
Publication date: 01/01/2004

This paper discusses the experience of reusing annotation tools developed for written corpora to tag a spoken corpus with POS information. Eric Brill’s tagger, initially trained on a tagged written corpus of 250,000 words, is being used to tag the Portuguese C-ORAL-ROM spoken corpus of 300,000 words. First, we address issues related to the tagset definition as well as the tagger's performance on the written corpus. We discuss important options concerning the transcription of the spoken corpus, which have a direct impact on the tagging task, as well as the additional tags required. In some cases, transcription options allow for automatic tag identification and replacement through a post-tagging process. Other cases, like the annotation of discourse markers, are more complex and require manual revision (and possibly listening to the recording). Since the final annotation will include not only the POS tag but also the wordform lemma, the paper also addresses issues related to the lemmatisation task. The positive results obtained show that tagging and lemmatising a spoken Portuguese corpus by reusing already available resources may serve as an example of how to minimize the costs of such a task without compromising the results. Finally, we discuss some possible developments to improve the tagger’s performance.

Universidade de Lisboa: Repositório.UL