The Reference Corpus of Contemporary Portuguese and related resources
The extraordinary growth of computer applications, particularly over the last two decades, has enabled the easy compilation and exploration of large corpora and lexica. These linguistic resources play a fundamental role in the areas of theoretical linguistics and natural language engineering. Combining these two areas of knowledge can, in fact, result in the development of a large number of applications, such as new and straightforward descriptions of languages based on real data; contrastive studies between varieties of a particular language aiming at finding factors of unity and diversity; cross-linguistic contrastive studies; grammars; lexica and dictionaries; terminologies; assisted translation materials; language teaching materials; and computer tools and applications for processing natural language. With this principle in mind, and following the tradition at the Centre of Linguistics of the University of Lisbon (CLUL) of collecting and studying real language data, a large electronic corpus, the Corpus de Referência do Português Contemporâneo (Reference Corpus of Contemporary Portuguese, CRPC), has been under compilation at CLUL since 1988. The CRPC currently contains approximately 310 million words, searchable through a user-friendly interface, and it is envisaged as a monitor corpus (from which one can extract balanced subcorpora) that can serve as a sample of the Portuguese language (both in its written and spoken varieties). In the next sections, we will describe the CRPC and how it forms the basis for important resources developed at CLUL.
Morphological Tagging of a Spoken Portuguese Corpus Using Available Resources
This paper discusses the experience of reusing annotation tools developed for written corpora to tag a spoken corpus with POS information. Eric Brill’s tagger, initially trained over a written and tagged corpus of 250,000 words, is being used to tag the Portuguese C-ORAL-ROM spoken corpus of 300,000 words. First, we address issues related to the tagset definition as well as the tagger's performance over the written corpus. We discuss important options concerning the spoken corpus transcription, with direct impact on the tagging task, as well as the additional tags required. Transcription options allow in some cases for automatic tag identification and replacement through a post-tagger process. Other cases, like the annotation of discourse markers, are more complex and require manual revision (and possibly listening to the recordings). Since the final annotation will include not only the POS tag but also the wordform lemma, the paper also addresses issues related to the lemmatisation task. The positive results obtained show that tagging and lemmatising a spoken Portuguese corpus by reusing already available resources may serve as an example of how to minimize the costs of such a task without compromising the results. Finally, we discuss some possible developments to improve the tagger’s performance.
D3.8 Lexical-semantic analytics for NLP
UIDB/03213/2020
UIDP/03213/2020
The present document illustrates the work carried out in task 3.3 (work package 3) of the ELEXIS project, focused on lexical-semantic analytics for Natural Language Processing (NLP). This task aims at computing analytics for lexical-semantic information such as words, senses and domains in the available resources, investigating their role in NLP applications. Specifically, this task concentrates on three research directions, namely i) sense clustering, in which grouping senses based on their semantic similarity improves the performance of NLP tasks such as Word Sense Disambiguation (WSD), ii) domain labeling of text, in which the lexicographic resources made available by the ELEXIS project for research purposes allow better performance to be achieved, and finally iii) analysing the diachronic distribution of senses, for which a software package is made available.
Minority language Twitter: part-of-speech tagging and analysis of Irish Tweets
Noisy user-generated text poses problems for natural language processing.
In this paper, we show that this statement also holds true for the Irish
language. Irish is regarded as a low-resourced language, with limited
annotated corpora available to NLP researchers and linguists to fully
analyse the linguistic patterns in language use in social media. We
contribute to recent advances in this area of research by reporting on the
development of a part-of-speech annotation scheme and an annotated corpus for
Irish language tweets. We also report on state-of-the-art tagging results of
training and testing three existing POS taggers on our new dataset.
Establishing a New State-of-the-Art for French Named Entity Recognition
The French TreeBank developed at the University Paris 7 is the main source of
morphosyntactic and syntactic annotations for French. However, it does not
include explicit information related to named entities, which are among the
most useful pieces of information for several natural language processing tasks
and applications. Moreover, no large-scale French corpus with named entity
annotations contains referential information, which complements the type and the
span of each mention with an indication of the entity it refers to. We have
manually annotated the French TreeBank with such information, after an
automatic pre-annotation step. We sketch the underlying annotation guidelines
and we provide a few figures about the resulting annotations.
A 38 million words Dutch text corpus and its users
The use of text corpora has increased considerably in the past few years, not only in the field of lexicography but also in computational linguistics and language technology. Consequently, corpus data and expertise developed by lexicographical institutions have gained a broader scope of application. In the European context this has led to a revised view of corpus design. In line with these developments, the Institute for Dutch Lexicology (INL) has since 1994 been providing external access to steadily improving corpora via the Internet. In August 1996, the 38 Million Words Corpus was made available for consultation by the international research community. The present paper reports on the characteristics of this corpus (design, text classification, linguistic annotation) and on its use, both in dictionary projects and in linguistic research. In spite of limitations with respect to corpus design, the INL corpora accessible via the Internet have proved to meet external needs. By providing these facilities, the INL has acquired a much broader experience in corpus-building than before, which is essential for new, internal dictionary projects. Giving external access to corpus data which was developed primarily for internal purposes may be profitable for all parties involved.
Keywords: large electronic Dutch text corpus, corpus design, text classification, topic, publication medium, linguistic annotation, on-line access via Internet, corpus user
Web 2.0, language resources and standards to automatically build a multilingual named entity lexicon
This paper proposes to advance the current state of the art in automatic Language Resource (LR) building by taking into consideration three elements: (i) the knowledge available in existing LRs, (ii) the vast amount of information available from the collaborative paradigm that has emerged from Web 2.0 and (iii) the use of standards to improve interoperability. We present a case study in which a set of LRs for different languages (WordNet for English and Spanish and Parole-Simple-Clips for Italian) are extended with Named Entities (NE) by exploiting Wikipedia and the aforementioned LRs. The practical result is a multilingual NE lexicon connected to these LRs and to two ontologies: SUMO and SIMPLE. Furthermore, the paper addresses interoperability, an important problem that currently affects the Computational Linguistics area, by making use of the ISO LMF standard to encode this lexicon. The different steps of the procedure (mapping, disambiguation, extraction, NE identification and postprocessing) are comprehensively explained and evaluated. The resulting resource contains 974,567, 137,583 and 125,806 NEs for English, Spanish and Italian respectively. Finally, in order to check the usefulness of the constructed resource, we apply it in a state-of-the-art Question Answering system and evaluate its impact; the NE lexicon improves the system’s accuracy by 28.1%. Compared to previous approaches to building NE repositories, the current proposal represents a step forward in terms of automation, language independence, amount of NEs acquired and richness of the information represented.
“The LIPS Corpus (Lexicon of Spoken Italian by Foreigners) and the Acquisition of Vocabulary by Learners of Italian as L2”
The aim of this paper is to present corpus-based research on the acquisition of the vocabulary of Italian as L2. The goal of the research was to study the lexical uses of non-native speakers and the processes of lexical acquisition underlying these uses. The informants of the corpus were non-native speakers learning Italian both within Italy and outside of it, so that the development of lexical competence could be compared across different learning contexts. The main results show that lexical competence develops above all quantitatively at the beginning and intermediate levels, and qualitatively at the more advanced levels in particular. Different learning inputs greatly affect the development of lexical competence: learners acquiring Italian in Italy have a deeper knowledge of the Italian lexicon than learners learning Italian outside of Italy.