17 research outputs found

    Error Analysis in Croatian Morphosyntactic Tagging

    In this paper, we provide detailed insight into the properties of errors generated by a stochastic morphosyntactic tagger assigning Multext-East morphosyntactic descriptions to Croatian texts. Tagging the Croatia Weekly newspaper corpus with the CroTag tagger in stochastic mode revealed that approximately 85 percent of all tagging errors occur on nouns, adjectives, pronouns and verbs. Moreover, approximately 50 percent of these are shown to be incorrect assignments of case values. We provide various other distributional properties of errors in assigning morphosyntactic descriptions for these and other parts of speech. On the basis of these properties, we propose rule-based and stochastic strategies which could be integrated into the tagging module, creating a hybrid procedure in order to raise overall tagging accuracy for Croatian.
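    The kind of error breakdown described above can be sketched as a simple tally over gold/predicted tag pairs: in Multext-East MSDs the first character encodes the part of speech, so counting mismatches by that character gives a per-POS error distribution. The tag pairs below are invented for illustration, not drawn from the Croatia Weekly corpus:

    ```python
    from collections import Counter

    # Hypothetical gold/predicted Multext-East MSD tag pairs; the first
    # character of an MSD encodes part of speech (N=noun, A=adjective,
    # P=pronoun, V=verb).
    pairs = [
        ("Ncmsn", "Ncmsg"),  # noun: nominative case tagged as genitive
        ("Afpms", "Afpms"),  # adjective: correct
        ("Vmr3s", "Vmn"),    # verb: finite form tagged as infinitive
        ("Ncfsa", "Ncfsn"),  # noun: accusative case tagged as nominative
    ]

    # Count errors by the part-of-speech letter of the gold tag.
    errors_by_pos = Counter(gold[0] for gold, pred in pairs if gold != pred)
    print(errors_by_pos)  # Counter({'N': 2, 'V': 1})
    ```

    A deeper analysis of the kind reported in the paper would further inspect which MSD position (e.g. the case slot) differs within each mismatched pair.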

    Tagset Reductions in Morphosyntactic Tagging of Croatian Texts

    Morphosyntactic tagging of Croatian texts is performed with stochastic taggers using a language model built on a manually annotated corpus implementing the Multext-East version 3 specifications for Croatian. Tagging accuracy in this framework is essentially predefined, i.e. proportionally dependent on two things: the size of the training corpus and the number of different morphosyntactic tags encompassed by that corpus. Since the 100 kw Croatia Weekly newspaper corpus by definition makes a rather small language model in terms of stochastic tagging of free-domain texts, the paper presents an approach dealing with tagset reductions. Several meaningful subsets of the Croatian Multext-East version 3 morphosyntactic tagset specifications are created and applied to Croatian texts with the CroTag stochastic tagger, measuring overall tagging accuracy and F1-measures. The obtained results are discussed in terms of applying different reductions in different natural language processing systems and specific tasks defined by specific user requirements.
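    A tagset reduction of the kind evaluated above can be sketched as truncating MSD strings to their first few positions and re-scoring agreement; the tags and counts here are invented for illustration, not taken from the paper:

    ```python
    # Minimal sketch: reduce full MSD tags to their leading positions and
    # compare tagging accuracy under the full vs. reduced tagset.
    def reduce_tag(msd, keep):
        """Keep only the first `keep` positions of an MSD string."""
        return msd[:keep]

    gold = ["Ncmsn", "Afpms", "Vmr3s", "Ncfsa"]  # illustrative gold tags
    pred = ["Ncmsg", "Afpms", "Vmr3s", "Ncfsn"]  # illustrative tagger output

    def accuracy(gold, pred, keep):
        hits = sum(reduce_tag(g, keep) == reduce_tag(p, keep)
                   for g, p in zip(gold, pred))
        return hits / len(gold)

    print(accuracy(gold, pred, keep=5))  # full tagset: 0.5
    print(accuracy(gold, pred, keep=1))  # POS-only reduction: 1.0
    ```

    The toy numbers show the trade-off the paper studies: a coarser tagset raises measured accuracy at the cost of morphosyntactic detail.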

    Building and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus)


    The reliability of statistics in linguistics: marginal notes on the occasion of a dictionary expansion

    Nowadays, thanks to the web, natural-language texts are available in almost unlimited quantities. As a result, linguistic research and the development of language tools rely heavily on language statistics. The question of reliability, however, receives little attention, even though it is a key issue for the usability of mass data. This article examines what objective limits such statistics have, and how their reliability can be estimated.
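    One common way to estimate such reliability, sketched here under the simplifying assumption that token occurrences are independent, is to treat a relative corpus frequency as a binomial proportion and attach a normal-approximation confidence interval. The counts below are illustrative:

    ```python
    import math

    # Rough sketch: 95% normal-approximation confidence interval for a
    # relative frequency observed in a corpus, treating each token as an
    # independent Bernoulli trial (a simplification for real corpora).
    def freq_confidence(hits, corpus_size, z=1.96):
        p = hits / corpus_size
        half = z * math.sqrt(p * (1 - p) / corpus_size)
        return p - half, p + half

    # e.g. a word observed 120 times in a 1-million-token corpus
    lo, hi = freq_confidence(hits=120, corpus_size=1_000_000)
    print(f"relative frequency in [{lo:.6f}, {hi:.6f}]")
    ```

    Real corpus data violates the independence assumption (words cluster by topic and genre), so intervals computed this way understate the true uncertainty, which is part of the point the article makes.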

    Standardization of the formal representation of lexical information for NLP

    A survey of dictionary models and formats is presented, together with an overview of corresponding recent standardisation activities.

    TALC-sef, a tagged corpus of literary translations in Serbian, English and French

    The TALC-sef corpus (TAgged Literary Corpus in Serbian, English, French) is a parallel corpus of literary works in Serbian, English and French, tagged for parts of speech and freely searchable through an online interface. It was built by the University of Arras, in collaboration with the University of Lille 3 and the University of Belgrade, for comparative studies in stylistics and linguistics. The TALC-sef corpus totals more than 2 million words; notably, it includes a manually corrected tagged corpus of 150,000 words for Serbian. In this article, we first present how the parallel corpus as a whole was built, then focus on the construction of the tagged Serbian sub-corpus. We detail the linguistic and technical choices underlying this sub-corpus, which complements the existing resources for Serbian corpus linguistics: to date, the only freely available corpus is a translation of G. Orwell's novel 1984 (100,000 words), whereas we offer a 150,000-word corpus of works originally written in Serbian. Building this sub-corpus allowed us to train tagging models for three part-of-speech taggers, TreeTagger, TnT and BTagger, the last being the most accurate of the three. Finally, we present prospects for the corpus, in terms of richer syntactic annotation (parallel dependency analyses across the three languages), and the benefits such a tagged parallel corpus brings to the linguistics of French.

    Dependency-based translation equivalents for factored machine translation

    Abstract. One of the major concerns of machine translation practitioners is to create good translation models: correctly extracted translation equivalents and a reduced size of the translation table are the most important evaluation criteria. This paper presents a method for extracting translation examples using the dependency linkage of both the source and target sentence. To decompose the source/target sentence into fragments, we identified two types of dependency link structures, super-links and chains, and used these structures to set the translation example borders. The choice of the dependency-linked n-grams approach is based on the assumption that a decomposition of the sentence into coherent segments, with complete syntactic structure and accounting for extra-phrasal syntactic dependency, would guarantee "better" translation examples and would make better use of the storage space. The performance of the dependency-based approach is measured with the BLEU-NIST score and in comparison with a baseline system.
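    As a rough illustration of dependency-based decomposition (not the authors' exact super-link/chain algorithm), the sketch below splits a sentence into fragments consisting of each head together with its direct dependents, which keeps syntactically complete segments together:

    ```python
    # Illustrative sketch: fragment a sentence using its dependency links.
    # `heads[i]` is the index of token i's head; -1 marks the root.
    sentence = ["the", "black", "cat", "sat", "down"]
    heads = [2, 2, 3, -1, 3]

    def fragments(tokens, heads):
        frags = []
        for h in range(len(tokens)):
            # Collect the direct dependents of token h.
            deps = [i for i, hd in enumerate(heads) if hd == h]
            if deps:
                span = sorted(deps + [h])
                frags.append([tokens[i] for i in span])
        return frags

    print(fragments(sentence, heads))
    # [['the', 'black', 'cat'], ['cat', 'sat', 'down']]
    ```

    In the paper's setting, fragment borders like these delimit the translation examples extracted from aligned source and target sentences.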

    MONDILEX – towards the research infrastructure for digital resources in Slavic lexicography


    Slavic corpus and computational linguistics

    In this paper, we focus on corpus-linguistic studies that address theoretical questions and on computational-linguistic work on corpus annotation, which makes corpora useful for linguistic work. First, we discuss why the corpus-linguistic approach was discredited by generative linguists in the second half of the 20th century, how it made a comeback through advances in computing, and how it was adopted by usage-based linguistics at the beginning of the 21st century. Then, we move on to an overview of necessary and common annotation layers and the issues that are encountered when performing automatic annotation, with special emphasis on Slavic languages. Finally, we survey the types of research requiring corpora that Slavic linguists are involved in world-wide, and the resources they have at their disposal.

    Implementation of an efficient system for building, using and evaluating RDR-type lemmatizers

    Lemmatization is the process of determining the canonical form of a word, called the lemma, from its inflectional variants. We have developed a language-independent system, LemmaGen, consisting of a set of tools for automatically learning lemmatizers from lexicons of pre-lemmatized words. The system consists of three modules that can be used independently or sequentially. The input to the first module is a lexicon of lemmatized words, from which it learns Ripple Down Rules that best describe word lemmatization. The next module takes these rules, which are in the form of RDR trees, and produces an efficient structure for fast lemmatization: the actual lemmatizer. In the last step we use the lemmatizer to transform the original input text into a set of lemmatized words. LemmaGen was applied to 14 different Multext and Multext-East lexicons and produced efficient lemmatizers for the corresponding languages. Its evaluation on the 14 lexicons shows that LemmaGen considerably outperforms the lemmatizers generated by the previously developed RDR learning algorithm, both in terms of accuracy and efficiency. We also used lemmatization as a step in the analysis of a corpus of press-agency news, and show improved result interpretation achieved by using LemmaGen in news preprocessing.
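    The suffix-rule idea behind RDR lemmatization can be sketched as an ordered list of suffix-replacement rules, where more specific rules are tried before more general ones; the rules below are invented English examples, not rules learned from a Multext lexicon:

    ```python
    # Minimal sketch of suffix-based lemmatization in the spirit of Ripple
    # Down Rules: each rule maps a word-final suffix to a replacement, and
    # more specific rules (longer suffixes) override more general ones.
    rules = [
        ("ies", "y"),  # most specific first: "studies" -> "study"
        ("es", ""),    # "boxes" -> "box" (refined by the rule above)
        ("s", ""),     # default plural rule: "cats" -> "cat"
    ]

    def lemmatize(word):
        for suffix, replacement in rules:
            if word.endswith(suffix):
                return word[: len(word) - len(suffix)] + replacement
        return word  # no rule fires: the word is its own lemma

    print([lemmatize(w) for w in ["studies", "cats", "dog"]])
    # ['study', 'cat', 'dog']
    ```

    A real RDR lemmatizer learns such rules, with their exception branches, automatically from a pre-lemmatized lexicon rather than hand-coding them.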