813 research outputs found

    Cross-lingual Dependency Parsing of Related Languages with Rich Morphosyntactic Tagsets

    This paper addresses cross-lingual dependency parsing using rich morphosyntactic tagsets. In our case study, we experiment with three related Slavic languages: Croatian, Serbian and Slovene. Four different dependency treebanks are used for monolingual parsing, direct cross-lingual parsing, and a recently introduced cross-lingual parsing approach that utilizes statistical machine translation and annotation projection. We argue for the benefits of using rich morphosyntactic tagsets in cross-lingual parsing and empirically support the claim by showing large improvements over an impoverished common feature representation in the form of a reduced part-of-speech tagset. In the process, we improve over the previous state-of-the-art scores in dependency parsing for all three languages.

    CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages

    We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, which is based on the Stanza natural language processing pipeline. We describe the main improvements in CLASSLA-Stanza with respect to Stanza, and give a detailed description of the model training process for the latest 2.1 release of the pipeline. We also report performance scores produced by the pipeline for different languages and varieties. CLASSLA-Stanza exhibits consistently high performance across all the supported languages and outperforms or expands its parent pipeline Stanza at all the supported tasks. We also present the pipeline's new functionality enabling efficient processing of web data and the reasons that led to its implementation. Comment: 17 pages, 14 tables, 1 figure

    Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

    We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross-lingually consistent schema across 12 diverse languages. In this paper, we detail the dataset creation and composition of UNER; we also provide initial modeling baselines on both in-language and cross-lingual learning settings. We release the data, code, and fitted models to the public.

    Corpus compilation in the digital humanities in languages with limited resources: on the practice of compiling thematic corpora from digital media for Serbian, Croatian and Slovenian

    The digital era has unlocked unprecedented possibilities of compiling corpora of social discourse, which has brought corpus linguistic methods into closer interaction with other methods of discourse analysis and the humanities. Even when not using any specific techniques of corpus linguistics, researchers increasingly draw on some sort of corpus for empirically grounded social-scientific analysis (sometimes dubbed ‘corpus-assisted discourse analysis’ or ‘corpus-based critical discourse analysis’, cf. Hardt-Mautner 1995; Baker 2016). In the post-Yugoslav space, recent corpus developments have brought table-turning advantages in many areas of discourse research, along with an ongoing proliferation of corpora and tools. Still, for linguists and discourse analysts who embark on collecting specialized corpora for their own research purposes, many questions persist, partly due to the fast-changing background of these issues, but also because there is still a gap in the corpus method, and in guidelines for corpus compilation, when applied beyond the anglophone contexts. In this paper we aim to discuss some possible solutions to these difficulties by presenting a step-by-step account of a corpus building procedure specifically for Croatian, Serbian and Slovenian, through an example of compiling a thematic corpus from digital media sources (news articles and reader comments). Following an overview of corpus types, uses and advantages in social sciences and digital humanities, we present the corpus compilation possibilities in the South Slavic language contexts, including data scraping options, permissions and ethical issues, the factors that facilitate or complicate automated collection, and corpus annotation and processing possibilities.
    The study shows expanding possibilities for work with the given languages, but also some persistently grey areas where researchers need to make decisions based on research expectations. Overall, the paper aims to recapitulate our own corpus compilation experience in the wider context of South Slavic corpus linguistics and corpus linguistic approaches in the humanities more generally.

    FinEst BERT and CroSloEngual BERT: less is more in multilingual models

    Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. Research has mostly focused on the English language, though. While massively multilingual models exist, studies have shown that monolingual models produce much better results. We train two trilingual BERT-like models, one for Finnish, Estonian, and English, the other for Croatian, Slovenian, and English. We evaluate their performance on several downstream tasks (NER, POS-tagging, and dependency parsing), using the multilingual BERT and XLM-R as baselines. The newly created FinEst BERT and CroSloEngual BERT improve the results on all tasks in most monolingual and cross-lingual situations. Comment: 10 pages, accepted at TSD 2020 conference

    Aspect usage in contexts of repeated actions in Croatian, Serbian and Russian

    The article deals with the interaction of temporal quantifiers (adverbs of quantification) with aspect choice in Croatian, and gives some comparisons with Serbian, Russian and some other Slavic languages. The analysis covers rijetko 'seldom', ponekad 'sometimes', često 'often', uvijek 'always', svake godine 'every year'; dva puta / dvaput 'twice', tri puta / triput 'thrice', nekoliko puta 'a few times', više puta 'several times', puno / mnogo puta / nebrojeno puta 'many times / innumerable times', and their counterparts in the languages mentioned.
    The subject of this paper is the choice of verbal aspect in contexts with temporal quantifiers in Croatian, Serbian and Russian. Event quantification is inherent in the relationship between perfective and imperfective verbs if the opposition between a single occurrence of an event and multiple occurrences of an event is taken as one of the factors of their semantic difference. After introductory remarks, the second part of the paper considers some general questions of temporal quantification, that is, event quantification. To examine the interdependence of temporal quantifiers and aspect choice in Croatian, two groups of adverbial expressions indicating repeated actions are observed: the first group ('rijetko', 'ponekad', 'često', 'uvijek') indicates relative quantity, and the second ('dva puta' / 'dvaput', 'tri puta' / 'triput', 'nekoliko puta', 'više puta', 'puno' / 'mnoga puta' / 'nebrojeno puta') indicates absolute quantity. The third part presents a numerical analysis of examples with these expressions (the examples were extracted from the Tridesetmilijunski korpus hrvatskoga jezika, the thirty-million-word corpus of Croatian).
    The analysis shows that a higher percentage of imperfective aspect is not automatically associated with adverbial expressions indicating more regular repetition of an action; that is, an increase in the regularity of repetition is not necessarily accompanied by an increase in the use of the imperfective aspect (for example, the percentage of imperfective verbs in the present tense in contexts with 'uvijek' is lower than the percentages for 'rijetko' and 'često'). The fourth part compares event quantification and aspect in Russian and Croatian, and the fifth analyzes aspect usage in contexts of repeated actions in Croatian and Serbian. The analysis was prompted by corpus examples and by some conclusions of M. Ivić (1985) and S. M. Dickey (2000) on this issue. Special attention is paid to contextual factors that influence the use of the perfective aspect in contexts of repeated actions. Some indications from Dickey's analysis that Croatian and Serbian might differ with respect to the acceptability of the perfective aspect in contexts of repeated actions, i.e. with respect to the prototypical meaning of the perfective aspect, are confirmed: in standard Croatian the perfective aspect is acceptable in many contexts with repeated actions in which, in standard Serbian, the imperfective aspect is much more acceptable or the only option.

    Slavic Psycholinguistics in the 21st Century

    This article provides an update on research in Slavic psycholinguistics since 2000 following my first review (Sekerina 2006), published as a position paper for the workshop The Future of Slavic Linguistics in America (SLING2K). The focus remains on formal experimental psycholinguistics understood in the narrow sense, i.e., experimental studies conducted with monolingual healthy adults. I review five dimensions characteristic of Slavic psycholinguistics (populations, methods, domains, theoretical approaches, and specific languages) and summarize the experimental data from Slavic languages published in general non-Slavic psycholinguistic journals and proceedings from the two leading conferences on Slavic linguistics, FASL and FDSL, since 2000. I argue that the current research trends in Slavic psycholinguistics are (1) a shift from adult monolingual participants to special population groups, such as children, people with aphasia, and bilingual learners, (2) a continuing move in the direction of cognitive neuroscience, with more emphasis on online experimental techniques, such as eye-tracking and neuroimaging, and (3) a focus on Slavic-specific phenomena that contribute to the ongoing debates in general psycholinguistics. The current infrastructural trends are (1) development of psycholinguistic databases and resources for Slavic languages and (2) a rise of psycholinguistic research conducted in Eastern European countries and disseminated in Slavic languages.