813 research outputs found

Cross-lingual Dependency Parsing of Related Languages with Rich Morphosyntactic Tagsets

Author: Agić Željko
Dobrovoljc Kaja
Krek Simon
Merkler Danijela
Moze Sara
Tiedemann Jörg
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2014

This paper addresses cross-lingual dependency parsing using rich morphosyntactic tagsets. In our case study, we experiment with three related Slavic languages: Croatian, Serbian and Slovene. Four different dependency treebanks are used for monolingual parsing, direct cross-lingual parsing, and a recently introduced cross-lingual parsing approach that utilizes statistical machine translation and annotation projection. We argue for the benefits of using rich morphosyntactic tagsets in cross-lingual parsing and empirically support the claim by showing large improvements over an impoverished common feature representation in the form of a reduced part-of-speech tagset. In the process, we improve over the previous state-of-the-art scores in dependency parsing for all three languages.

Wolverhampton Intellectual Repository and E-theses

CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages

Author: Ljubešić Nikola
Terčon Luka
Publication date: 08/08/2023

We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, which is based on the Stanza natural language processing pipeline. We describe the main improvements in CLASSLA-Stanza with respect to Stanza, and give a detailed description of the model training process for the latest 2.1 release of the pipeline. We also report performance scores produced by the pipeline for different languages and varieties. CLASSLA-Stanza exhibits consistently high performance across all the supported languages and outperforms or expands its parent pipeline Stanza at all the supported tasks. We also present the pipeline's new functionality enabling efficient processing of web data and the reasons that led to its implementation.

Comment: 17 pages, 14 tables, 1 figure

arXiv.org e-Print Archive

Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

Author: Blevins Terra
Gonen Hila
Imperial Joseph Marvin
Karlsson Börje F.
Lin Peiqin
Liu Shuheng
Ljubešić Nikola
Mayhew Stephen
Miranda LJ
Pinter Yuval
Plank Barbara
Riabi Arij
Šuppa Marek
Publication date: 15/11/2023

We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross-lingually consistent schema across 12 diverse languages. In this paper, we detail the dataset creation and composition of UNER; we also provide initial modeling baselines on both in-language and cross-lingual learning settings. We release the data, code, and fitted models to the public.

arXiv.org e-Print Archive

Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

Author: Blevins Terra
Gonen Hila
Imperial Joseph Marvin
Karlsson Börje F.
Lin Peiqin
Liu Shuheng
Ljubešić Nikola
Mayhew Stephen
Miranda LJ
Pinter Yuval
Plank Barbara
Riabi Arij
Šuppa Marek
Publication venue: 'Center for Open Science'
Publication date: 15/11/2023

OPUS

Kompiliranje korpusa u digitalnim humanističkim znanostima u jezicima s ograničenim resursima: o praksi kompiliranja tematskih korpusa iz digitalnih medija za srpski, hrvatski i slovenski

Author: Batanović Vuk
Bogetić Ksenija
Ljubešić Nikola
Publication venue: 'Hrvatsko filolosko drustvo (Croatian Philological Society)'
Publication date: 01/01/2022

The digital era has unlocked unprecedented possibilities for compiling corpora of social discourse, which has brought corpus linguistic methods into closer interaction with other methods of discourse analysis and the humanities. Even when no specific corpus linguistic techniques are used, empirically grounded social-scientific analysis increasingly draws on some sort of corpus (sometimes dubbed ‘corpus-assisted discourse analysis’ or ‘corpus-based critical discourse analysis’, cf. Hardt-Mautner 1995; Baker 2016). In the post-Yugoslav space, recent corpus developments have brought table-turning advantages in many areas of discourse research, along with an ongoing proliferation of corpora and tools. Still, for linguists and discourse analysts who embark on collecting specialized corpora for their own research purposes, many questions persist, partly because the background of these issues changes quickly, but also because there is still a gap in the corpus method, and in guidelines for corpus compilation, when applied beyond anglophone contexts.

In this paper we discuss some possible solutions to these difficulties by presenting a step-by-step account of a corpus building procedure specifically for Croatian, Serbian and Slovenian, through an example of compiling a thematic corpus from digital media sources (news articles and reader comments). Following an overview of corpus types, uses and advantages in the social sciences and digital humanities, we present the corpus compilation possibilities in the South Slavic language contexts, including data scraping options, permissions and ethical issues, the factors that facilitate or complicate automated collection, and corpus annotation and processing possibilities. The study shows expanding possibilities for work with the given languages, but also some persistently grey areas where researchers need to make decisions based on research expectations. Overall, the paper aims to recapitulate our own corpus compilation experience in the wider context of South Slavic corpus linguistics and corpus linguistic approaches in the humanities more generally.

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

FinEst BERT and CroSloEngual BERT: less is more in multilingual models

Author: Robnik-Šikonja Marko
Ulčar Matej
Publication date: 14/06/2020

Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. The research has mostly focused on English, though. While massively multilingual models exist, studies have shown that monolingual models produce much better results. We train two trilingual BERT-like models, one for Finnish, Estonian, and English, the other for Croatian, Slovenian, and English. We evaluate their performance on several downstream tasks (NER, POS tagging, and dependency parsing), using multilingual BERT and XLM-R as baselines. The newly created FinEst BERT and CroSloEngual BERT improve the results on all tasks in most monolingual and cross-lingual situations.

Comment: 10 pages, accepted at TSD 2020 conference

arXiv.org e-Print Archive

Vidska uporaba u kontekstima ponavljanih radnji u hrvatskome, srpskome i ruskome

Author: Ljiljana Šarić
Publication venue: 'Croatian Academy of Sciences and Arts'
Publication date: 01/01/2000

The article deals with the interaction of temporal quantifiers (adverbs of quantification) with aspect choice in Croatian, and gives some comparisons with Serbian, Russian and some other Slavic languages. The analysis covers rijetko 'seldom', ponekad 'sometimes', često 'often', uvijek 'always', svake godine 'every year'; dva puta / dvaput 'twice', tri puta / triput 'thrice', nekoliko puta 'a few times', više puta 'several times', puno / mnogo puta / nebrojeno puta 'many times / innumerable times', and their counterparts in the languages mentioned.

The subject of this paper is the choice of verbal aspect in contexts with temporal quantifiers in Croatian, Serbian and Russian. Event quantification is inherent in the relation between perfective and imperfective verbs if the opposition between a single occurrence of an event and multiple occurrences of an event is taken as one of the factors in their semantic difference. After introductory remarks, the second part of the paper considers some general questions of temporal quantification, that is, event quantification. To examine the interdependence of temporal quantifiers and aspect choice in Croatian, two groups of adverbial expressions that indicate repeated actions are considered: the first group ('rijetko', 'ponekad', 'često', 'uvijek') indicates relative quantity, and the second ('dva puta' / 'dvaput', 'tri puta' / 'triput', 'nekoliko puta', 'više puta', 'puno' / 'mnoga puta' / 'nebrojeno puta') indicates absolute quantity. The third part presents a numerical analysis of examples with these expressions (the examples were extracted from the Thirty-Million Corpus of the Croatian Language, Tridesetmilijunski korpus hrvatskoga jezika).

The analysis shows that a higher percentage of imperfective aspect is not automatically associated with adverbial expressions that indicate more regular repetition of an action; that is, an increase in the regularity of repetition is not necessarily accompanied by an increase in the use of the imperfective aspect (for example, the percentage of imperfective verbs in the present tense in contexts with 'uvijek' is lower than the percentage for 'rijetko' and 'često'). The fourth part compares event quantification and aspect in Russian and Croatian, and the fifth analyzes aspect use in contexts of repeated actions in Croatian and Serbian. The analysis was prompted by corpus examples and by some conclusions of M. Ivić (1985) and S. M. Dickey (2000) on this issue. Particular attention is paid to contextual factors that affect the use of the perfective aspect in contexts of repeated actions. Some indications from Dickey's analysis that Croatian and Serbian might differ with respect to the acceptability of the perfective aspect in contexts of repeated actions, i.e., with respect to the prototypical meaning of the perfective aspect, are confirmed: in standard Croatian the perfective aspect is acceptable in many contexts with repeated actions in which the imperfective aspect is a much more acceptable option, or the only option, in standard Serbian.

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

Slavic Psycholinguistics in the 21st Century

Author: Sekerina Irina A.
Publication venue: CUNY Academic Works
Publication date: 01/06/2017

This article provides an update on research in Slavic psycholinguistics since 2000 following my first review (Sekerina 2006), published as a position paper for the workshop The Future of Slavic Linguistics in America (SLING2K). The focus remains on formal experimental psycholinguistics understood in the narrow sense, i.e., experimental studies conducted with monolingual healthy adults. I review five dimensions characteristic of Slavic psycholinguistics (populations, methods, domains, theoretical approaches, and specific languages) and summarize the experimental data from Slavic languages published since 2000 in general non-Slavic psycholinguistic journals and in the proceedings of the two leading conferences on Slavic linguistics, FASL and FDSL. I argue that the current research trends in Slavic psycholinguistics are (1) a shift from adult monolingual participants to special population groups, such as children, people with aphasia, and bilingual learners, (2) a continuing move in the direction of cognitive neuroscience, with more emphasis on online experimental techniques, such as eye-tracking and neuroimaging, and (3) a focus on Slavic-specific phenomena that contribute to the ongoing debates in general psycholinguistics. The current infrastructural trends are (1) development of psycholinguistic databases and resources for Slavic languages and (2) a rise of psycholinguistic research conducted in Eastern European countries and disseminated in Slavic languages.

City University of New York