Error Analysis in Croatian Morphosyntactic Tagging
In this paper, we provide detailed insight into the properties of errors generated by a stochastic morphosyntactic tagger assigning Multext-East morphosyntactic descriptions to Croatian texts. Tagging the Croatia Weekly newspaper corpus with the CroTag tagger in stochastic mode revealed that approximately 85 percent of all tagging errors occur on nouns, adjectives, pronouns and verbs. Moreover, approximately 50 percent of these are shown to be incorrect assignments of case values. We provide various other distributional properties of errors in assigning morphosyntactic descriptions for these and other parts of speech. On the basis of these properties, we propose rule-based and stochastic strategies which could be integrated into the tagging module, creating a hybrid procedure in order to raise overall tagging accuracy for Croatian.
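To make the kind of error breakdown described above concrete, here is a minimal sketch (not taken from the paper) of how tagging errors could be aggregated by part of speech, and how pure case-value confusions could be counted, from pairs of gold and predicted Multext-East MSD tags. The tag layout assumed here (position 0 = POS letter, position 4 = case for nominal tags) and the toy data are illustrative assumptions.

```python
from collections import Counter

def error_breakdown(pairs):
    """Aggregate tagging errors by part of speech and count errors that are
    pure case-value mismatches on nominal tags.

    `pairs` is an iterable of (gold_msd, predicted_msd) strings, e.g.
    ("Ncmsn", "Ncmsa").  Assumes position 0 is the POS letter and position 4
    holds the case value for N/A/P tags (a simplified, illustrative layout).
    """
    errors_by_pos = Counter()
    case_errors = 0
    total_errors = 0
    for gold, pred in pairs:
        if gold == pred:
            continue
        total_errors += 1
        errors_by_pos[gold[0]] += 1
        # Count errors where everything except the case value agrees.
        if gold[0] in "NAP" and len(gold) > 4 and len(pred) > 4:
            if gold[:4] == pred[:4] and gold[5:] == pred[5:] and gold[4] != pred[4]:
                case_errors += 1
    return errors_by_pos, case_errors, total_errors

# Toy usage: two errors, one of them a pure case confusion on a noun.
ex = [("Ncmsn", "Ncmsn"), ("Ncmsn", "Ncmsa"), ("Vmip3s", "Ncmsn")]
by_pos, case_err, total = error_breakdown(ex)
print(by_pos, case_err, total)   # Counter({'N': 1, 'V': 1}) 1 2
```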
Tagset Reductions in Morphosyntactic Tagging of Croatian Texts
Morphosyntactic tagging of Croatian texts is performed with stochastic taggers using a language model built on a manually annotated corpus implementing the Multext-East version 3 specifications for Croatian. Tagging accuracy in this framework is essentially predefined, i.e. proportionally dependent on two things: the size of the training corpus and the number of different morphosyntactic tags encompassed by that corpus. Since the 100 kw Croatia Weekly newspaper corpus by definition yields a rather small language model for stochastic tagging of free-domain texts, the paper presents an approach based on tagset reductions. Several meaningful subsets of the Croatian Multext-East version 3 morphosyntactic tagset specifications are created and applied to Croatian texts with the CroTag stochastic tagger, measuring overall tagging accuracy and F1-measures. The obtained results are discussed in terms of applying different reductions in different natural language processing systems and specific tasks defined by specific user requirements.
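As a rough illustration of what evaluating a tagset reduction might look like, the sketch below maps full MSD tags onto coarser tags by truncating them to a fixed number of positions and then scores predicted tags against gold tags at both granularities. The plain-truncation scheme is an assumption for illustration; the paper's subsets are defined linguistically, and this is not the evaluation code used with CroTag.

```python
def reduce_tag(msd, keep_positions=3):
    """Reduce a full MSD tag (e.g. 'Ncmsn') to its first `keep_positions`
    characters (e.g. 'Ncm').  A deliberately simple reduction scheme."""
    return msd[:keep_positions]

def accuracy(gold_tags, predicted_tags, keep_positions=3):
    """Token-level accuracy after both gold and predicted tags are reduced."""
    assert len(gold_tags) == len(predicted_tags)
    correct = sum(
        reduce_tag(g, keep_positions) == reduce_tag(p, keep_positions)
        for g, p in zip(gold_tags, predicted_tags)
    )
    return correct / len(gold_tags)

gold = ["Ncmsn", "Afpmsn", "Vmip3s"]
pred = ["Ncmsa", "Afpmsn", "Vmip3p"]
print(accuracy(gold, pred, keep_positions=6))  # full-length tags: 1/3
print(accuracy(gold, pred, keep_positions=3))  # reduced tags: 3/3
```

The point of such a reduction is visible even on this toy example: errors in fine-grained attributes (here case and person) disappear once the tagset is collapsed, which is why accuracy rises as the tagset shrinks.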
The reliability of statistics in linguistics: marginal notes on the occasion of a dictionary expansion
Nowadays, natural language texts are available in virtually unlimited quantities thanks to the web. As a result, linguistic research and the development of language tools rely heavily on language statistics. Few, however, address the question of reliability, even though it is a key issue for the usability of mass data. This article examines what kinds of objective limits such statistics have and how their reliability can be estimated.
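The abstract does not spell out the estimation method, but one standard way to quantify the reliability of a corpus-derived relative frequency is a binomial confidence interval. The sketch below, an assumption about the general approach rather than the article's own procedure, uses the normal approximation to the binomial.

```python
import math

def frequency_confidence_interval(hits, corpus_size, z=1.96):
    """Approximate 95% confidence interval for the relative frequency of an
    item observed `hits` times in a corpus of `corpus_size` tokens, using the
    normal approximation to the binomial (reasonable when hits is not tiny)."""
    p = hits / corpus_size
    half_width = z * math.sqrt(p * (1 - p) / corpus_size)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# A word seen 120 times in a 100,000-token corpus:
low, high = frequency_confidence_interval(120, 100_000)
print(f"{low:.5f} .. {high:.5f}")   # roughly 0.00099 .. 0.00141
```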
Standardization of the formal representation of lexical information for NLP
A survey of dictionary models and formats is presented, together with an overview of corresponding recent standardisation activities.
TALC-sef, a tagged corpus of literary translations in Serbian, English and French
The TALC-sef corpus (TAgged Literary Corpus in Serbian, English, French) is a parallel corpus of literary works in Serbian, English and French, tagged for parts of speech and freely searchable through an online interface. It was built by the Université d'Arras, in collaboration with the Université Lille 3 and the University of Belgrade, with a view to comparative studies in stylistics and linguistics. The TALC-sef corpus totals more than 2 million words and notably includes a manually corrected tagged corpus of 150,000 words for Serbian. In this article, we first present how the parallel corpus as a whole was built, and then focus more specifically on the construction of the tagged Serbian sub-corpus. We detail the linguistic and technical choices underlying this sub-corpus, which complements the resources available for corpus linguistics in Serbian: to date, the only freely available corpus is a translation of G. Orwell's novel 1984 (100,000 words), whereas we offer a 150,000-word corpus of works originally written in Serbian. Building this sub-corpus made it possible to train automatic tagging models for three part-of-speech taggers, namely Treetagger, TnT and BTagger, the most accurate of the three. Finally, we present prospects for extending the existing corpus, in terms of enriched syntactic annotation (parallel dependency analyses for the three languages), as well as the benefits such a tagged parallel corpus brings to French linguistics.
Dependency-based translation equivalents for factored machine translation
One of the major concerns of machine translation practitioners is to create good translation models: correctly extracted translation equivalents and a reduced size of the translation table are the most important evaluation criteria. This paper presents a method for extracting translation examples using the dependency linkage of both the source and the target sentence. To decompose the source/target sentence into fragments, we identified two types of dependency link structures, super-links and chains, and used these structures to set the borders of translation examples. The choice of the dependency-linked n-grams approach is based on the assumption that decomposing the sentence into coherent segments with complete syntactic structure, which accounts for extra-phrasal syntactic dependency, would guarantee "better" translation examples and would make better use of the storage space. The performance of the dependency-based approach is measured with the BLEU-NIST score, in comparison with a baseline system.
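As a rough illustration of the general idea, the sketch below extracts, from a toy dependency parse, the contiguous word spans formed by a head together with its direct dependents and treats them as candidate translation-example fragments. The representation (one head index per token) and the contiguity criterion are simplifying assumptions, not the paper's exact definition of super-links and chains.

```python
def dependency_fragments(tokens, heads):
    """Return contiguous token spans formed by a head and its direct
    dependents.  `heads[i]` is the 0-based index of token i's head, or -1
    for the root.  Only groups covering a contiguous span are kept, which is
    a simplification of the chain/super-link structures described above."""
    fragments = []
    for head in range(len(tokens)):
        group = [head] + [i for i, h in enumerate(heads) if h == head]
        group.sort()
        # Keep the group only if it covers a contiguous span of the sentence.
        if group[-1] - group[0] + 1 == len(group) and len(group) > 1:
            fragments.append(" ".join(tokens[group[0]:group[-1] + 1]))
    return fragments

# Toy parse of "the black cat sleeps": 'cat' heads 'the' and 'black',
# 'sleeps' is the root and heads 'cat'.
tokens = ["the", "black", "cat", "sleeps"]
heads = [2, 2, 3, -1]
print(dependency_fragments(tokens, heads))
# ['the black cat', 'cat sleeps']
```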
Slavic corpus and computational linguistics
In this paper, we focus on corpus-linguistic studies that address theoretical questions and on computational linguistic work on corpus annotation, which makes corpora useful for linguistic work. First, we discuss why the corpus-linguistic approach was discredited by generative linguists in the second half of the 20th century, how it made a comeback through advances in computing, and how it was adopted by usage-based linguistics at the beginning of the 21st century. Then, we move on to an overview of necessary and common annotation layers and the issues encountered when performing automatic annotation, with special emphasis on Slavic languages. Finally, we survey the types of research requiring corpora that Slavic linguists are involved in world-wide, and the resources they have at their disposal.
Implementation of an efficient system for building, using and evaluating RDR-type lemmatizers
Lemmatization is the process of determining the canonical form of a word, called the lemma, from its inflectional variants. We have developed a language-independent system, LemmaGen, consisting of a set of tools for automatically learning lemmatizers from lexicons of pre-lemmatized words. The system consists of three modules that can be used independently or sequentially. The input to the first module is a lexicon of lemmatized words from which it learns Ripple Down Rules that best describe word lemmatization. The next module takes these rules, which are in the form of RDR trees, and produces an efficient structure for fast lemmatization: the actual lemmatizer. In the last step we use the lemmatizer to transform the original input text into a set of lemmatized words. LemmaGen was applied to 14 different Multext and Multext-East lexicons and produced efficient lemmatizers for the corresponding languages. Its evaluation on the 14 lexicons shows that LemmaGen considerably outperforms the lemmatizers generated by the previously developed RDR learning algorithm, both in terms of accuracy and efficiency. We also used lemmatization as a step in the analysis of a corpus of press-agency news and show improved result interpretation, achieved by using LemmaGen in news preprocessing.
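To give an idea of what applying suffix-based lemmatization rules of this general kind looks like, the sketch below encodes a handful of hand-written ripple-down-style rules (the longest matching suffix wins, with an unconditional default) and applies them to word forms. The rule format and the toy English rules are illustrative assumptions, not LemmaGen's learned rules or its actual data structures.

```python
# Each rule maps a word-form suffix to (strip, append).  Rules learned by an
# RDR-style algorithm act as exceptions layered on a default rule, so at
# apply time the most specific (longest) matching suffix wins.
RULES = {
    "": ("", ""),          # default: return the word unchanged
    "s": ("s", ""),        # cats -> cat
    "ies": ("ies", "y"),   # ladies -> lady
    "sses": ("es", ""),    # glasses -> glass
}

def lemmatize(word):
    """Apply the longest-suffix rule to a lowercase word form."""
    for suffix in sorted(RULES, key=len, reverse=True):
        if word.endswith(suffix):
            strip, append = RULES[suffix]
            return word[: len(word) - len(strip)] + append
    return word

for w in ["cats", "ladies", "glasses", "dog"]:
    print(w, "->", lemmatize(w))
# cats -> cat, ladies -> lady, glasses -> glass, dog -> dog
```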