Error Analysis in Croatian Morphosyntactic Tagging
In this paper, we provide detailed insight into the properties of errors generated by a stochastic morphosyntactic tagger assigning Multext-East morphosyntactic descriptions to Croatian texts. Tagging the Croatia Weekly newspaper corpus with the CroTag tagger in stochastic mode revealed that approximately 85 percent of all tagging errors occur on nouns, adjectives, pronouns and verbs. Moreover, approximately 50 percent of these are shown to be incorrect assignments of case values. We provide various other distributional properties of errors in assigning morphosyntactic descriptions for these and other parts of speech. On the basis of these properties, we propose rule-based and stochastic strategies which could be integrated into the tagging module, creating a hybrid procedure in order to raise overall tagging accuracy for Croatian.
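To make the kind of error breakdown described above concrete, here is a minimal sketch (not taken from the paper) of how tagging errors could be aggregated by part of speech, and how pure case-value confusions could be counted, from pairs of gold and predicted Multext-East MSD tags. The tag layout assumed here (position 0 = POS letter, position 4 = case for nominal tags) and the toy data are illustrative assumptions.

```python
from collections import Counter

def error_breakdown(pairs):
    """Aggregate tagging errors by part of speech and count errors that are
    pure case-value mismatches on nominal tags.

    `pairs` is an iterable of (gold_msd, predicted_msd) strings, e.g.
    ("Ncmsn", "Ncmsa").  Assumes position 0 is the POS letter and position 4
    holds the case value for N/A/P tags (a simplified, illustrative layout).
    """
    errors_by_pos = Counter()
    case_errors = 0
    total_errors = 0
    for gold, pred in pairs:
        if gold == pred:
            continue
        total_errors += 1
        errors_by_pos[gold[0]] += 1
        # Count errors where everything except the case value agrees.
        if gold[0] in "NAP" and len(gold) > 4 and len(pred) > 4:
            if gold[:4] == pred[:4] and gold[5:] == pred[5:] and gold[4] != pred[4]:
                case_errors += 1
    return errors_by_pos, case_errors, total_errors

# Toy usage: two errors, one of them a pure case confusion on a noun.
ex = [("Ncmsn", "Ncmsn"), ("Ncmsn", "Ncmsa"), ("Vmip3s", "Ncmsn")]
by_pos, case_err, total = error_breakdown(ex)
print(by_pos, case_err, total)   # Counter({'N': 1, 'V': 1}) 1 2
```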
Tagset Reductions in Morphosyntactic Tagging of Croatian Texts
Morphosyntactic tagging of Croatian texts is performed with stochastic taggers using a language model built on a manually annotated corpus implementing the Multext-East version 3 specifications for Croatian. Tagging accuracy in this framework is essentially predefined, i.e. proportionally dependent on two things: the size of the training corpus and the number of different morphosyntactic tags encompassed by that corpus. Since the 100 kw Croatia Weekly newspaper corpus by definition yields a rather small language model for stochastic tagging of free-domain texts, the paper presents an approach based on tagset reductions. Several meaningful subsets of the Croatian Multext-East version 3 morphosyntactic tagset specifications are created and applied to Croatian texts with the CroTag stochastic tagger, measuring overall tagging accuracy and F1-measures. The obtained results are discussed in terms of applying different reductions in different natural language processing systems and specific tasks defined by specific user requirements.
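As a rough illustration of what evaluating a tagset reduction might look like, the sketch below maps full MSD tags onto coarser tags by truncating them to a fixed number of positions and then scores predicted tags against gold tags at both granularities. The plain-truncation scheme is an assumption for illustration; the paper's subsets are defined linguistically, and this is not the evaluation code used with CroTag.

```python
def reduce_tag(msd, keep_positions=3):
    """Reduce a full MSD tag (e.g. 'Ncmsn') to its first `keep_positions`
    characters (e.g. 'Ncm').  A deliberately simple reduction scheme."""
    return msd[:keep_positions]

def accuracy(gold_tags, predicted_tags, keep_positions=3):
    """Token-level accuracy after both gold and predicted tags are reduced."""
    assert len(gold_tags) == len(predicted_tags)
    correct = sum(
        reduce_tag(g, keep_positions) == reduce_tag(p, keep_positions)
        for g, p in zip(gold_tags, predicted_tags)
    )
    return correct / len(gold_tags)

gold = ["Ncmsn", "Afpmsn", "Vmip3s"]
pred = ["Ncmsa", "Afpmsn", "Vmip3p"]
print(accuracy(gold, pred, keep_positions=6))  # full-length tags: 1/3
print(accuracy(gold, pred, keep_positions=3))  # reduced tags: 3/3
```

The point of such a reduction is visible even on this toy example: errors in fine-grained attributes (here case and person) disappear once the tagset is collapsed, which is why accuracy rises as the tagset shrinks.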
The reliability of statistics in linguistics: marginal notes on the occasion of a dictionary expansion
Nowadays, natural language texts are available in virtually unlimited quantities thanks to the web. As a result, linguistic research and the development of language tools rely heavily on language statistics. Few, however, address the question of reliability, even though it is a key issue for the usability of mass data. This article examines what kinds of objective limits such statistics have and how their reliability can be estimated.
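The abstract does not spell out the estimation method, but one standard way to quantify the reliability of a corpus-derived relative frequency is a binomial confidence interval. The sketch below, an assumption about the general approach rather than the article's own procedure, uses the normal approximation to the binomial.

```python
import math

def frequency_confidence_interval(hits, corpus_size, z=1.96):
    """Approximate 95% confidence interval for the relative frequency of an
    item observed `hits` times in a corpus of `corpus_size` tokens, using the
    normal approximation to the binomial (reasonable when hits is not tiny)."""
    p = hits / corpus_size
    half_width = z * math.sqrt(p * (1 - p) / corpus_size)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# A word seen 120 times in a 100,000-token corpus:
low, high = frequency_confidence_interval(120, 100_000)
print(f"{low:.5f} .. {high:.5f}")   # roughly 0.00099 .. 0.00141
```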
Standardization of the formal representation of lexical information for NLP
A survey of dictionary models and formats is presented, together with an overview of corresponding recent standardisation activities.
TALC-sef, a tagged corpus of literary translations in Serbian, English and French
The TALC-sef corpus (TAgged Literary Corpus in Serbian, English, French) is a parallel corpus of literary works in Serbian, English and French, tagged for parts of speech and freely searchable through an online interface. It was built by the Université d'Arras, in collaboration with the Université Lille 3 and the University of Belgrade, with a view to comparative studies in stylistics and linguistics. The TALC-sef corpus totals more than 2 million words and notably includes a manually corrected tagged corpus of 150,000 words for Serbian. In this article, we first present how the parallel corpus as a whole was built, and then focus more specifically on the construction of the tagged Serbian sub-corpus. We detail the linguistic and technical choices underlying this sub-corpus, which complements the resources available for corpus linguistics in Serbian: to date, the only freely available corpus is a translation of G. Orwell's novel 1984 (100,000 words), whereas we offer a 150,000-word corpus of works originally written in Serbian. Building this sub-corpus made it possible to train automatic tagging models for three part-of-speech taggers, namely Treetagger, TnT and BTagger, the most accurate of the three. Finally, we present prospects for extending the existing corpus, in terms of enriched syntactic annotation (parallel dependency analyses for the three languages), as well as the benefits such a tagged parallel corpus brings to French linguistics.
Dependency-based translation equivalents for factored machine translation
One of the major concerns of machine translation practitioners is to create good translation models: correctly extracted translation equivalents and a reduced size of the translation table are the most important evaluation criteria. This paper presents a method for extracting translation examples using the dependency linkage of both the source and the target sentence. To decompose the source/target sentence into fragments, we identified two types of dependency link structures, super-links and chains, and used these structures to set the borders of translation examples. The choice of the dependency-linked n-grams approach is based on the assumption that decomposing the sentence into coherent segments with complete syntactic structure, which accounts for extra-phrasal syntactic dependency, would guarantee "better" translation examples and would make better use of the storage space. The performance of the dependency-based approach is measured with the BLEU-NIST score, in comparison with a baseline system.
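As a rough illustration of the general idea, the sketch below extracts, from a toy dependency parse, the contiguous word spans formed by a head together with its direct dependents and treats them as candidate translation-example fragments. The representation (one head index per token) and the contiguity criterion are simplifying assumptions, not the paper's exact definition of super-links and chains.

```python
def dependency_fragments(tokens, heads):
    """Return contiguous token spans formed by a head and its direct
    dependents.  `heads[i]` is the 0-based index of token i's head, or -1
    for the root.  Only groups covering a contiguous span are kept, which is
    a simplification of the chain/super-link structures described above."""
    fragments = []
    for head in range(len(tokens)):
        group = [head] + [i for i, h in enumerate(heads) if h == head]
        group.sort()
        # Keep the group only if it covers a contiguous span of the sentence.
        if group[-1] - group[0] + 1 == len(group) and len(group) > 1:
            fragments.append(" ".join(tokens[group[0]:group[-1] + 1]))
    return fragments

# Toy parse of "the black cat sleeps": 'cat' heads 'the' and 'black',
# 'sleeps' is the root and heads 'cat'.
tokens = ["the", "black", "cat", "sleeps"]
heads = [2, 2, 3, -1]
print(dependency_fragments(tokens, heads))
# ['the black cat', 'cat sleeps']
```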
Slavic corpus and computational linguistics
In this paper, we focus on corpus-linguistic studies that address theoretical questions and on computational linguistic work on corpus annotation, which makes corpora useful for linguistic work. First, we discuss why the corpus-linguistic approach was discredited by generative linguists in the second half of the 20th century, how it made a comeback through advances in computing, and how it was adopted by usage-based linguistics at the beginning of the 21st century. Then, we move on to an overview of necessary and common annotation layers and the issues encountered when performing automatic annotation, with special emphasis on Slavic languages. Finally, we survey the types of research requiring corpora that Slavic linguists are involved in world-wide, and the resources they have at their disposal.
Implementation of an efficient system for building, using and evaluating RDR-type lemmatizers
Lemmatization is the process of determining the canonical form of a word, called the lemma, from its inflectional variants. We have developed a language-independent system, LemmaGen, consisting of a set of tools for automatically learning lemmatizers from lexicons of pre-lemmatized words. The system consists of three modules that can be used independently or sequentially. The input to the first module is a lexicon of lemmatized words from which it learns Ripple Down Rules that best describe word lemmatization. The next module takes these rules, which are in the form of RDR trees, and produces an efficient structure for fast lemmatization: the actual lemmatizer. In the last step we use the lemmatizer to transform the original input text into a set of lemmatized words. LemmaGen was applied to 14 different Multext and Multext-East lexicons and produced efficient lemmatizers for the corresponding languages. Its evaluation on the 14 lexicons shows that LemmaGen considerably outperforms the lemmatizers generated by the previously developed RDR learning algorithm, both in terms of accuracy and efficiency. We also used lemmatization as a step in the analysis of a corpus of press-agency news and show improved result interpretation, achieved by using LemmaGen in news preprocessing.
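To give an idea of what applying suffix-based lemmatization rules of this general kind looks like, the sketch below encodes a handful of hand-written ripple-down-style rules (the longest matching suffix wins, with an unconditional default) and applies them to word forms. The rule format and the toy English rules are illustrative assumptions, not LemmaGen's learned rules or its actual data structures.

```python
# Each rule maps a word-form suffix to (strip, append).  Rules learned by an
# RDR-style algorithm act as exceptions layered on a default rule, so at
# apply time the most specific (longest) matching suffix wins.
RULES = {
    "": ("", ""),          # default: return the word unchanged
    "s": ("s", ""),        # cats -> cat
    "ies": ("ies", "y"),   # ladies -> lady
    "sses": ("es", ""),    # glasses -> glass
}

def lemmatize(word):
    """Apply the longest-suffix rule to a lowercase word form."""
    for suffix in sorted(RULES, key=len, reverse=True):
        if word.endswith(suffix):
            strip, append = RULES[suffix]
            return word[: len(word) - len(strip)] + append
    return word

for w in ["cats", "ladies", "glasses", "dog"]:
    print(w, "->", lemmatize(w))
# cats -> cat, ladies -> lady, glasses -> glass, dog -> dog
```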