
    Error Analysis in Croatian Morphosyntactic Tagging

    In this paper, we provide detailed insight into the properties of errors generated by a stochastic morphosyntactic tagger assigning Multext-East morphosyntactic descriptions to Croatian texts. Tagging the Croatia Weekly newspaper corpus with the CroTag tagger in stochastic mode revealed that approximately 85 percent of all tagging errors occur on nouns, adjectives, pronouns and verbs. Moreover, approximately 50 percent of these are shown to be incorrect assignments of case values. We provide various other distributional properties of errors in assigning morphosyntactic descriptions for these and other parts of speech. On the basis of these properties, we propose rule-based and stochastic strategies which could be integrated into the tagging module, creating a hybrid procedure in order to raise overall tagging accuracy for Croatian.
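    The kind of error breakdown described above can be tallied with a few lines of code. Below is a minimal sketch (not the authors' tooling), assuming parallel lists of gold and predicted Multext-East MSD strings and, purely for illustration, a fixed string position for the case attribute:

```python
# Minimal sketch: tally tagging errors by part of speech and count case-value
# confusions, given parallel gold/predicted MSD tag lists.
from collections import Counter

def error_profile(gold_tags, pred_tags):
    by_pos = Counter()          # errors per part of speech (first char of the MSD)
    case_conf = 0               # errors where the case value also differs
    total_err = 0
    for g, p in zip(gold_tags, pred_tags):
        if g == p:
            continue
        total_err += 1
        by_pos[g[0]] += 1       # MSD tags start with the POS letter, e.g. 'N', 'V'
        # The case value sits at a fixed position for nominal MSDs; position 4 is
        # assumed here purely for illustration -- real Multext-East positions vary by POS.
        if g[0] == p[0] and len(g) > 4 and len(p) > 4 and g[4] != p[4]:
            case_conf += 1
    return total_err, by_pos, case_conf
```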

    External Lexical Information for Multilingual Part-of-Speech Tagging

    Morphosyntactic lexicons and word vector representations have both proven useful for improving the accuracy of statistical part-of-speech taggers. Here we compare the performance of four systems on datasets covering 16 languages, two of these systems being feature-based (MEMMs and CRFs) and two of them being neural-based (bi-LSTMs). We show that, on average, all four approaches perform similarly and reach state-of-the-art results. Yet better performances are obtained with our feature-based models on lexically richer datasets (e.g. for morphologically rich languages), whereas neural-based results are higher on datasets with less lexical variability (e.g. for English). These conclusions hold in particular for the MEMM models relying on our system MElt, which benefited from newly designed features. This shows that, under certain conditions, feature-based approaches enriched with morphosyntactic lexicons are competitive with neural methods.
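    As an illustration of the feature-based side of this comparison, the sketch below shows how external lexical information can be injected into a MEMM/CRF-style feature function. It is a generic example, not the actual MElt feature set, and the lexicon is assumed to be a plain form-to-tags dictionary:

```python
# Illustrative sketch: augment a feature-based tagger's feature set with
# external-lexicon features, i.e. the tags a morphosyntactic lexicon licenses
# for the current word form.
def token_features(sent, i, lexicon):
    """sent: list of word forms; lexicon: dict mapping a form to its possible tags."""
    w = sent[i]
    feats = {
        "lower": w.lower(),
        "suffix3": w[-3:],
        "is_capitalized": w[:1].isupper(),
    }
    # One binary feature per tag the lexicon allows for this form.
    for tag in lexicon.get(w.lower(), ()):
        feats[f"lex_tag={tag}"] = True
    # The ambiguity class as a single feature, a common way to encode lexicon output.
    feats["lex_ambig_class"] = "_".join(sorted(lexicon.get(w.lower(), ("UNK",))))
    return feats
```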

    Tagset Reductions in Morphosyntactic Tagging of Croatian Texts

    Morphosyntactic tagging of Croatian texts is performed with stochastic taggers using a language model built on a manually annotated corpus implementing the Multext-East version 3 specifications for Croatian. Tagging accuracy in this framework is essentially predefined, i.e. proportionally dependent on two things: the size of the training corpus and the number of different morphosyntactic tags encompassed by that corpus. Since the 100 kw Croatia Weekly newspaper corpus by definition makes for a rather small language model in terms of stochastic tagging of free-domain texts, the paper presents an approach dealing with tagset reductions. Several meaningful subsets of the Croatian Multext-East version 3 morphosyntactic tagset specifications are created and applied to Croatian texts with the CroTag stochastic tagger, measuring overall tagging accuracy and F1-measures. The obtained results are discussed in terms of applying different reductions in different natural language processing systems and specific tasks defined by specific user requirements.
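    A minimal sketch of the tagset-reduction idea follows; the reduction schemes shown are illustrative placeholders, not the exact subsets defined in the paper:

```python
# Sketch: project full Multext-East MSD tags onto coarser subsets before
# training/evaluation, then measure accuracy under each reduction scheme.
def reduce_msd(msd, scheme="pos"):
    if scheme == "pos":            # keep only the part of speech, e.g. 'Ncmsn' -> 'N'
        return msd[:1]
    if scheme == "pos+2":          # keep POS plus the next two attribute positions
        return msd[:3]
    return msd                     # 'full': no reduction

def accuracy(gold, pred, scheme="full"):
    pairs = [(reduce_msd(g, scheme), reduce_msd(p, scheme)) for g, p in zip(gold, pred)]
    return sum(g == p for g, p in pairs) / len(pairs)
```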

    Mimicking Word Embeddings using Subword RNNs

    Word embeddings improve generalization over lexical features by placing each word in a lower-dimensional space, using distributional information obtained from unlabeled data. However, the effectiveness of word embeddings for downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which embeddings do not exist. In this paper, we present MIMICK, an approach to generating OOV word embeddings compositionally, by learning a function from spellings to distributional embeddings. Unlike prior work, MIMICK does not require re-training on the original word embedding corpus; instead, learning is performed at the type level. Intrinsic and extrinsic evaluations demonstrate the power of this simple approach. On 23 languages, MIMICK improves performance over a word-based baseline for tagging part-of-speech and morphosyntactic attributes. It is competitive with (and complementary to) a supervised character-based model in low-resource settings. Comment: EMNLP 2017.
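    The core of MIMICK can be sketched as a character bi-LSTM regressing onto pretrained word vectors. The PyTorch code below is an illustrative reimplementation with made-up layer sizes, not the authors' implementation:

```python
# Rough sketch of the MIMICK idea: a character bi-LSTM reads a word's spelling
# and is trained, at the type level, to reproduce that word's pretrained
# embedding; at test time it can then produce vectors for OOV words.
import torch
import torch.nn as nn

class Mimick(nn.Module):
    def __init__(self, n_chars, char_dim=20, hidden=50, emb_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, emb_dim)

    def forward(self, char_ids):             # char_ids: (batch, word_length)
        x = self.char_emb(char_ids)
        _, (h, _) = self.lstm(x)              # final hidden state of each direction
        h = torch.cat([h[0], h[1]], dim=-1)
        return self.out(h)

# Type-level training (schematic; character encoding of words is up to the caller):
# model = Mimick(n_chars=200); loss_fn = nn.MSELoss()
# loss_fn(model(char_ids_of_word), pretrained_vector).backward()
```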

    CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages

    We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, which is based on the Stanza natural language processing pipeline. We describe the main improvements in CLASSLA-Stanza with respect to Stanza, and give a detailed description of the model training process for the latest 2.1 release of the pipeline. We also report performance scores produced by the pipeline for different languages and varieties. CLASSLA-Stanza exhibits consistently high performance across all the supported languages and outperforms or expands upon its parent pipeline Stanza at all the supported tasks. We also present the pipeline's new functionality enabling efficient processing of web data and the reasons that led to its implementation. Comment: 17 pages, 14 tables, 1 figure.
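    A minimal usage sketch follows, assuming the pipeline is exposed through the `classla` Python package with a Stanza-like interface; processor and model names may differ, so the project documentation should be consulted:

```python
# Hypothetical usage of the CLASSLA-Stanza pipeline for Croatian.
import classla

classla.download("hr")                        # download Croatian models (run once)
nlp = classla.Pipeline("hr", processors="tokenize,pos,lemma,depparse")
doc = nlp("CLASSLA-Stanza obrađuje južnoslavenske jezike.")
for word in doc.sentences[0].words:
    print(word.text, word.upos, word.lemma)
```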

    An Insight into the Automatic Extraction of Metaphorical Collocations

    Collocations have been the subject of much scientific research over the years. The focus of this research is on a subset of collocations, namely metaphorical collocations. In metaphorical collocations, a semantic shift has taken place in one of the components, i.e., one of the components takes on a transferred meaning. The main goal of this paper is to review the existing literature and provide a systematic overview of the existing research on collocation extraction, as well as an overview of existing methods, measures, and resources. The existing research is classified according to the approach (statistical, hybrid, and distributional semantics) and presented in three separate sections. The various association measures and the existing ways of evaluating the results of automatic collocation extraction are also described. The insights gained from existing research serve as a first step in exploring the possibility of developing a method for automatic extraction of metaphorical collocations. The methods, tools, and resources that may prove useful for future work are highlighted.
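    As an example of the statistical approach surveyed here, the sketch below scores candidate word pairs with pointwise mutual information (PMI), one of the standard association measures for collocation extraction; the window size and frequency threshold are arbitrary illustrative choices:

```python
# Sketch: rank co-occurring word pairs by PMI over a tokenised corpus.
import math
from collections import Counter

def pmi_scores(tokens, window=2, min_count=5):
    unigrams = Counter(tokens)
    pairs = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:   # co-occurrences within the window
            pairs[(w, v)] += 1
    n = len(tokens)
    scores = {}
    for (w, v), c in pairs.items():
        if c < min_count:                      # skip rare, unreliable pairs
            continue
        scores[(w, v)] = math.log2((c / n) / ((unigrams[w] / n) * (unigrams[v] / n)))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```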

    Corpus Compilation in the Digital Humanities for Languages with Limited Resources: On the Practice of Compiling Thematic Corpora from Digital Media for Serbian, Croatian and Slovenian

    The digital era has unlocked unprecedented possibilities of compiling corpora of social discourse, which has brought corpus linguistic methods into closer interaction with other methods of discourse analysis and the humanities. Even when not using any specific techniques of corpus linguistics, drawing on some sort of corpus is increasingly resorted to for empirically grounded social-scientific analysis (sometimes dubbed 'corpus-assisted discourse analysis' or 'corpus-based critical discourse analysis', cf. Hardt-Mautner 1995; Baker 2016). In the post-Yugoslav space, recent corpus developments have brought table-turning advantages in many areas of discourse research, along with an ongoing proliferation of corpora and tools. Still, for linguists and discourse analysts who embark on collecting specialized corpora for their own research purposes, many questions persist, partly due to the fast-changing background of these issues, but also due to the fact that there is still a gap in the corpus method, and in guidelines for corpus compilation, when applied beyond anglophone contexts. In this paper we aim to discuss some possible solutions to these difficulties by presenting a step-by-step account of a corpus building procedure specifically for Croatian, Serbian and Slovenian, through an example of compiling a thematic corpus from digital media sources (news articles and reader comments). Following an overview of corpus types, uses and advantages in the social sciences and digital humanities, we present the corpus compilation possibilities in the South Slavic language contexts, including data scraping options, permissions and ethical issues, the factors that facilitate or complicate automated collection, and corpus annotation and processing possibilities. The study shows expanding possibilities for work with the given languages, but also some persistently grey areas where researchers need to make decisions based on research expectations. Overall, the paper aims to recapitulate our own corpus compilation experience in the wider context of South Slavic corpus linguistics and corpus linguistic approaches in the humanities more generally.
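    One step of such a workflow, collecting article text from news sites, could look like the sketch below. The URLs, CSS selector and output format are placeholders, and the legal and ethical constraints discussed above still apply:

```python
# Hedged sketch: fetch article pages and store cleaned text plus minimal metadata
# as JSON lines, one document per line.
import json
import time

import requests
from bs4 import BeautifulSoup

def fetch_article(url, selector="div.article-body"):      # selector is site-specific
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    body = soup.select_one(selector)
    return {
        "url": url,
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "text": body.get_text(" ", strip=True) if body else "",
    }

def build_corpus(urls, out_path="corpus.jsonl", delay=1.0):
    with open(out_path, "w", encoding="utf-8") as f:
        for url in urls:
            f.write(json.dumps(fetch_article(url), ensure_ascii=False) + "\n")
            time.sleep(delay)                              # polite crawling
```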

    Data sparsity in highly inflected languages: the case of morphosyntactic tagging in Polish

    In morphologically complex languages, many high-level tasks in natural language processing rely on accurate morphosyntactic analyses of the input. However, in light of the risk of error propagation in present-day pipeline architectures for basic linguistic pre-processing, the state of the art for morphosyntactic tagging is still not satisfactory. The main obstacle here is data sparsity inherent to natural language in general and highly inflected languages in particular. In this work, we investigate whether semi-supervised systems may alleviate the data sparsity problem. Our approach uses word clusters obtained from large amounts of unlabelled text in an unsupervised manner in order to provide a supervised probabilistic tagger with morphologically informed features. Our evaluations on a number of datasets for the Polish language suggest that this simple technique improves tagging accuracy, especially with regard to out-of-vocabulary words. This may prove useful to increase cross-domain performance of taggers, and to alleviate the dependency on large amounts of supervised training data, which is especially important from the perspective of less-resourced languages.
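    The general technique can be sketched as follows; note that this example induces clusters from word2vec vectors with k-means rather than the clustering actually used in the paper, and all names are illustrative:

```python
# Sketch: induce word clusters from unlabelled text and expose the cluster
# identifier of each word as an extra feature for a supervised tagger
# (useful for OOV forms that fall into a known cluster).
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

def induce_clusters(unlabelled_sentences, n_clusters=256):
    """unlabelled_sentences: iterable of token lists from raw, untagged text."""
    w2v = Word2Vec(unlabelled_sentences, vector_size=100, min_count=5, workers=4)
    words = w2v.wv.index_to_key
    labels = KMeans(n_clusters=n_clusters).fit_predict(w2v.wv[words])
    return dict(zip(words, labels))

def cluster_feature(word, clusters):
    # Unknown words get a dedicated value; known words share a feature with their cluster.
    return f"cluster={clusters.get(word.lower(), 'OOV')}"
```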

    Increasing Quality of the Corpus of Frequency Dictionary of Contemporary Polish for Morphosyntactic Tagging of the Polish Language

    The paper is devoted to the correction of errors and ambiguities in the corpus of the Frequency Dictionary of Contemporary Polish (FDCP) and its application to morphosyntactic tagging of the Polish language. Several stages of corpus transformation are presented, and baseline part-of-speech tagging algorithms are evaluated.
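    As a rough illustration of what a baseline evaluation on such a corpus might involve, the sketch below trains and scores a most-frequent-tag (unigram) tagger; it is an illustrative stand-in, not one of the algorithms evaluated in the paper:

```python
# Sketch: most-frequent-tag baseline trained on (word, tag) pairs and scored on held-out data.
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_tokens):                  # [(word, tag), ...]
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word.lower()][tag] += 1
    fallback = Counter(t for _, t in tagged_tokens).most_common(1)[0][0]
    model = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return model, fallback

def evaluate(model, fallback, test_tokens):
    correct = sum(model.get(w.lower(), fallback) == t for w, t in test_tokens)
    return correct / len(test_tokens)
```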