
    Part of Speech Tagging and Lemmatisation for the Spoken Dutch Corpus

    This paper describes the lemmatisation and tagging guidelines developed for the "Spoken Dutch Corpus" and lays out the philosophy behind the high-granularity tagset designed for the project. To bootstrap the annotation of large quantities of material (10 million words) with this new tagset, we tested several existing taggers and tagger generators on initial samples of the corpus. The results show that the most effective method, when trained on the small samples, is a high-quality implementation of a Hidden Markov Model tagger generator.
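
    A minimal sketch of the bootstrapping setup described above, assuming NLTK is available: an HMM tagger is trained on a small hand-annotated seed sample and then used to pre-tag new material for manual correction. The toy Dutch sentences and coarse tag labels are illustrative, not the corpus tagset.

```python
from nltk.tag.hmm import HiddenMarkovModelTagger

# Small seed sample: sentences as lists of (word, tag) pairs.
seed_sample = [
    [("de", "DET"), ("hond", "NOUN"), ("blaft", "VERB")],
    [("de", "DET"), ("kat", "NOUN"), ("slaapt", "VERB")],
]

# Train the HMM tagger on the seed sample.
tagger = HiddenMarkovModelTagger.train(seed_sample)

# Pre-tag unannotated material; in a bootstrapping loop the output would be
# corrected by hand and fed back into the training data.
print(tagger.tag(["de", "hond", "slaapt"]))
```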

    Morphological Tagging of a Spoken Portuguese Corpus Using Available Resources

    This paper discusses the experience of reusing annotation tools developed for written corpora to tag a spoken corpus with POS information. Eric Brill's tagger, initially trained on a written, tagged corpus of 250,000 words, is being used to tag the Portuguese C-ORAL-ROM spoken corpus of 300,000 words. First, we address issues related to the tagset definition as well as the tagger's performance on the written corpus. We then discuss important choices concerning the transcription of the spoken corpus, which have a direct impact on the tagging task, as well as the additional tags required. In some cases the transcription conventions allow tags to be identified and replaced automatically in a post-tagging step. Other cases, such as the annotation of discourse markers, are more complex and require manual revision (and, where necessary, listening to the recordings). Since the final annotation will include not only the POS tag but also the lemma of each word form, the paper also addresses issues related to the lemmatisation task. The positive results obtained show that tagging and lemmatising a spoken Portuguese corpus by reusing already available resources can serve as an example of how to minimize the costs of such a task without compromising the results. Finally, we discuss some possible developments to improve the tagger's performance.
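
    A minimal sketch of reusing an existing transformation-based (Brill-style) tagger with NLTK: a simple baseline trained on written material is refined with Brill rules, and a small post-tagging pass rewrites transcription-specific tokens, in the spirit of the automatic tag replacement described above. The training sentences, the marker token and the replacement table are toy examples, not the C-ORAL-ROM conventions.

```python
from nltk.tag import UnigramTagger, DefaultTagger
from nltk.tag.brill import fntbl37
from nltk.tag.brill_trainer import BrillTaggerTrainer

# Tiny stand-in for the written training corpus (lists of (word, tag) pairs).
train_sents = [
    [("o", "DET"), ("livro", "NOUN"), ("caiu", "VERB")],
    [("a", "DET"), ("casa", "NOUN"), ("caiu", "VERB")],
]

baseline = UnigramTagger(train_sents, backoff=DefaultTagger("NOUN"))
brill_tagger = BrillTaggerTrainer(baseline, fntbl37()).train(train_sents, max_rules=20)

# Post-tagger pass: transcription conventions let some tags be rewritten
# automatically, e.g. a (hypothetical) marker token for filled pauses.
REWRITES = {"&eh": "FILLER"}

tagged = brill_tagger.tag(["o", "livro", "&eh", "caiu"])
tagged = [(w, REWRITES.get(w, t)) for w, t in tagged]
print(tagged)
```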

    Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers

    This paper describes a new method, Combi-bootstrap, to exploit existing taggers and lexical resources for the annotation of corpora with new tagsets. Combi-bootstrap uses existing resources as features for a second-level machine learning module that is trained to make the mapping to the new tagset on a very small sample of annotated corpus material. Experiments show that Combi-bootstrap: i) can integrate a wide variety of existing resources, and ii) achieves much higher accuracy (up to 44.7% error reduction) than both the best single tagger and an ensemble tagger constructed from the same small training sample.
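
    A minimal sketch of the stacking idea, assuming scikit-learn: the tags proposed by several existing taggers (plus a lexicon lookup) serve as features for a second-level classifier that learns the mapping to the new tagset from a very small annotated sample. A logistic-regression classifier stands in for the paper's second-level learner, and the feature names and tag labels are invented for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each training instance: the outputs of existing resources for one token.
X = [
    {"tagger_a": "N", "tagger_b": "NN", "lexicon": "noun"},
    {"tagger_a": "V", "tagger_b": "VBZ", "lexicon": "verb"},
    {"tagger_a": "N", "tagger_b": "NNS", "lexicon": "noun"},
]
# Gold labels in the new, fine-grained tagset (tiny annotated sample).
y = ["N(soort,ev)", "WW(pv,tgw)", "N(soort,mv)"]

model = make_pipeline(DictVectorizer(sparse=False), LogisticRegression())
model.fit(X, y)

# Map a new token's heterogeneous tagger outputs to the new tagset.
print(model.predict([{"tagger_a": "N", "tagger_b": "NN", "lexicon": "noun"}]))
```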

    Multilingual domain modeling in Twenty-One: automatic creation of a bi-directional translation lexicon from a parallel corpus

    Within the project Twenty-One, which aims at the effective dissemination of information on ecology and sustainable development, a system is developed that supports cross-language information retrieval in any of the four languages Dutch, English, French and German. Knowledge of this application domain is needed to enhance existing translation resources for the purpose of lexical disambiguation. This paper describes an algorithm for the automated acquisition of a translation lexicon from a parallel corpus. What is new about the presented algorithm is the statistical language model it uses. Because the algorithm is based on a symmetric translation model, it becomes possible to identify one-to-many and many-to-one relations between words of a language pair. We claim that the presented method has two advantages over previously published algorithms. Firstly, because the translation model is more powerful, the resulting bilingual lexicon is more accurate. Secondly, the resulting bilingual lexicon can be used to translate in both directions between a language pair. Different versions of the algorithm were evaluated on the Dutch and English versions of the Agenda 21 corpus, a UN document in the application domain of sustainable development.
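
    A minimal sketch of extracting a bilingual lexicon from a sentence-aligned corpus, assuming NLTK: plain IBM Model 1 trained in both directions stands in for the paper's symmetric translation model, but it shows how word-level translation probabilities for both directions can be read off a parallel sample. The three sentence pairs are toy data, not the Agenda 21 corpus.

```python
from nltk.translate import AlignedSent, IBMModel1

# Toy Dutch-English parallel fragment (words = Dutch side, mots = English side).
bitext = [
    AlignedSent(["het", "huis"], ["the", "house"]),
    AlignedSent(["het", "boek"], ["the", "book"]),
    AlignedSent(["een", "boek"], ["a", "book"]),
]

model_fwd = IBMModel1(bitext, 10)                        # P(Dutch word | English word)
bitext_rev = [AlignedSent(s.mots, s.words) for s in bitext]
model_bwd = IBMModel1(bitext_rev, 10)                    # P(English word | Dutch word)

# Lexicon entries in both translation directions for one word pair.
print(round(model_fwd.translation_table["huis"]["house"], 3))
print(round(model_bwd.translation_table["house"]["huis"], 3))
```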

    D3.8 Lexical-semantic analytics for NLP

    The present document illustrates the work carried out in task 3.3 (work package 3) of the ELEXIS project, focused on lexical-semantic analytics for Natural Language Processing (NLP). This task aims at computing analytics for lexical-semantic information such as words, senses and domains in the available resources, and at investigating their role in NLP applications. Specifically, the task concentrates on three research directions: i) sense clustering, in which grouping senses based on their semantic similarity improves the performance of NLP tasks such as Word Sense Disambiguation (WSD); ii) domain labeling of text, in which the lexicographic resources made available by the ELEXIS project for research purposes allow better performance to be achieved; and iii) analysing the diachronic distribution of senses, for which a software package is made available.
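
    A minimal sketch of the sense-clustering direction (i), assuming scikit-learn: dictionary senses whose definitions are semantically similar are grouped together, here approximated with TF-IDF gloss vectors and agglomerative clustering rather than the project's own similarity measures. The glosses are invented examples for the word "bank".

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

glosses = [
    "institution for depositing and borrowing money",  # bank (finance)
    "financial institution that handles money",        # bank (finance)
    "sloping land along the side of a river",          # bank (river)
    "raised land along the edge of a river",            # bank (river)
]

# Represent each sense by its gloss and cluster the senses into two groups.
vectors = TfidfVectorizer().fit_transform(glosses).toarray()
labels = AgglomerativeClustering(n_clusters=2).fit_predict(vectors)
print(labels)  # expect the two financial senses and the two river senses in separate clusters
```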

    Transitive probabilistic CLIR models.

    Transitive translation could be a useful technique to enlarge the number of supported language pairs for a cross-language information retrieval (CLIR) system in a cost-effective manner. The paper describes several setups for transitive translation based on probabilistic translation models. The transitive CLIR models were evaluated on the CLEF test collection and yielded a retrieval effectiveness of up to 83% of monolingual performance, which is significantly better than a baseline using the synonym operator.
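
    A minimal sketch of one simple transitive setup with probabilistic dictionaries: a source-to-pivot lexicon is composed with a pivot-to-target lexicon by summing over pivot words, so a language pair without a direct translation resource can still be covered. The toy entries and probabilities are invented.

```python
from collections import defaultdict

# P(pivot | source) and P(target | pivot) as nested dictionaries.
src_to_pivot = {"maison": {"house": 0.8, "home": 0.2}}   # French -> English
pivot_to_trg = {"house": {"huis": 0.9, "woning": 0.1},   # English -> Dutch
                "home":  {"thuis": 0.7, "huis": 0.3}}

def compose(src_word):
    """P(target | source) = sum over pivot of P(target | pivot) * P(pivot | source)."""
    probs = defaultdict(float)
    for pivot, p_pivot in src_to_pivot.get(src_word, {}).items():
        for target, p_target in pivot_to_trg.get(pivot, {}).items():
            probs[target] += p_pivot * p_target
    return dict(probs)

print({t: round(p, 2) for t, p in compose("maison").items()})
# {'huis': 0.78, 'woning': 0.08, 'thuis': 0.14}
```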

    The ParlaMint corpora of parliamentary proceedings

    This paper presents the ParlaMint corpora, containing transcriptions of the sessions of 17 European national parliaments with half a billion words. The corpora are uniformly encoded, contain rich metadata about 11 thousand speakers, and are linguistically annotated following the Universal Dependencies formalism and with named entities. Samples of the corpora and conversion scripts are available from the project's GitHub repository, and the complete corpora are openly available via the CLARIN.SI repository for download, as well as through the NoSketch Engine and KonText concordancers and the Parlameter interface for on-line exploration and analysis.
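
    A minimal sketch of reading the linguistically annotated distribution, assuming the `conllu` package and a locally downloaded CoNLL-U sample (the file name below is a placeholder): the Universal Dependencies annotation makes lemmas, POS tags and dependency information directly accessible.

```python
from conllu import parse_incr

# Placeholder file name for a downloaded ParlaMint CoNLL-U sample.
with open("ParlaMint-sample.conllu", encoding="utf-8") as f:
    for sentence in parse_incr(f):
        # Each token carries UD fields such as form, lemma and UPOS tag.
        for token in sentence:
            print(token["form"], token["lemma"], token["upos"])
        break  # only the first sentence of the sample
```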