813 research outputs found
Cross-lingual Dependency Parsing of Related Languages with Rich Morphosyntactic Tagsets
This paper addresses cross-lingual dependency parsing using rich morphosyntactic tagsets. In our case study, we experiment with three related Slavic languages: Croatian, Serbian and Slovene. Four different dependency treebanks are used for monolingual parsing, direct cross-lingual parsing, and a recently introduced cross-lingual parsing approach that utilizes statistical machine translation and annotation projection. We argue for the benefits of using rich morphosyntactic tagsets in cross-lingual parsing and empirically support the claim by showing large improvements over an impoverished common feature representation in the form of a reduced part-of-speech tagset. In the process, we improve over the previous state-of-the-art scores in dependency parsing for all three languages.
Published version
CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages
We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of
the South Slavic languages, which is based on the Stanza natural language
processing pipeline. We describe the main improvements in CLASSLA-Stanza with
respect to Stanza, and give a detailed description of the model training
process for the latest 2.1 release of the pipeline. We also report performance
scores produced by the pipeline for different languages and varieties.
CLASSLA-Stanza exhibits consistently high performance across all the supported
languages and outperforms or expands its parent pipeline Stanza at all the
supported tasks. We also present the pipeline's new functionality enabling
efficient processing of web data and the reasons that led to its
implementation.
Comment: 17 pages, 14 tables, 1 figure
Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark
We introduce Universal NER (UNER), an open, community-driven project to
develop gold-standard NER benchmarks in many languages. The overarching goal of
UNER is to provide high-quality, cross-lingually consistent annotations to
facilitate and standardize multilingual NER research. UNER v1 contains 18
datasets annotated with named entities in a cross-lingually consistent schema
across 12 diverse languages. In this paper, we detail the dataset creation and
composition of UNER; we also provide initial modeling baselines on both
in-language and cross-lingual learning settings. We release the data, code, and
fitted models to the public.
Kompiliranje korpusa u digitalnim humanističkim znanostima u jezicima s ograničenim resursima: o praksi kompiliranja tematskih korpusa iz digitalnih medija za srpski, hrvatski i slovenski
The digital era has unlocked unprecedented possibilities of compiling corpora of social discourse, which has brought corpus linguistic methods into closer interaction with other methods of discourse analysis and the humanities. Even when not using any specific techniques of corpus linguistics, researchers increasingly draw on some sort of corpus for empirically grounded social-scientific analysis (sometimes dubbed "corpus-assisted discourse analysis" or "corpus-based critical discourse analysis", cf. Hardt-Mautner 1995; Baker 2016). In the post-Yugoslav space, recent corpus developments have brought table-turning advantages in many areas of discourse research, along with an ongoing proliferation of corpora and tools. Still, for linguists and discourse analysts who embark on collecting specialized corpora for their own research purposes, many questions persist, partly due to the fast-changing background of these issues, but also due to the fact that there is still a gap in the corpus method, and in guidelines for corpus compilation, when applied beyond the anglophone contexts.
In this paper we aim to discuss some possible solutions to these difficulties by presenting a step-by-step account of a corpus building procedure specifically for Croatian, Serbian and Slovenian, through an example of compiling a thematic corpus from digital media sources (news articles and reader comments). Following an overview of corpus types, uses and advantages in social sciences and digital humanities, we present the corpus compilation possibilities in the South Slavic language contexts, including data scraping options, permissions and ethical issues, the factors that facilitate or complicate automated collection, and corpus annotation and processing possibilities. The study shows expanding possibilities for work with the given languages, but also some persistently grey areas where researchers need to make decisions based on research expectations. Overall, the paper aims to recapitulate our own corpus compilation experience in the wider context of
South-Slavic corpus linguistics and corpus linguistic approaches in the humanities more generally.
The digital age has opened up new possibilities for compiling corpora of social discourse, which has brought corpus-linguistic methods closer to other methods of discourse analysis and to the humanities. Even when no specific corpus-linguistic techniques are used, the use of some kind of corpus is increasingly common today in empirically grounded social-scientific analysis ("corpus-assisted discourse analysis" or "critical corpus analysis", Hardt-Mautner 1995; Baker 2016). In the post-Yugoslav space, recent developments in corpus linguistics have brought advantages in many areas of research. Still, for linguists and discourse analysts who set out to collect specialized corpora for their own research purposes, many questions remain open, partly because the background of corpus linguistics is changing quickly, but also because there is still a gap in knowledge of corpus methods, and of corpus compilation methodology, outside the anglophone context. With this paper we try to narrow this gap by presenting a step-by-step account of a corpus building procedure for Croatian, Serbian and Slovenian, through the example of compiling a thematic corpus from digital media (news articles and reader comments). Following an overview of corpus types, uses and advantages in the social sciences and digital humanities, we present the possibilities for corpus compilation in the South Slavic language contexts, including options for downloading data from the web, permissions and ethical issues, the factors that facilitate or complicate automated collection and annotation of corpora, and processing possibilities. The study reveals growing possibilities for work with the given languages, but also some persistently grey areas in which researchers need to make decisions based on research expectations. Overall, the paper aims to recapitulate our own corpus compilation experience in the wider context of South Slavic corpus linguistics and corpus-linguistic approaches in the humanities more generally.
FinEst BERT and CroSloEngual BERT: less is more in multilingual models
Large pretrained masked language models have become state-of-the-art
solutions for many NLP problems. Research has mostly focused on the
English language, though. While massively multilingual models exist, studies
have shown that monolingual models produce much better results. We train two
trilingual BERT-like models, one for Finnish, Estonian, and English, the other
for Croatian, Slovenian, and English. We evaluate their performance on several
downstream tasks, NER, POS-tagging, and dependency parsing, using the
multilingual BERT and XLM-R as baselines. The newly created FinEst BERT and
CroSloEngual BERT improve the results on all tasks in most monolingual and
cross-lingual situations.
Comment: 10 pages, accepted at TSD 2020 conference
Vidska uporaba u kontekstima ponavljanih radnji u hrvatskome, srpskome i ruskome
The article deals with the interaction of temporal quantifiers (adverbs of quantification) with aspect choice in Croatian, and some comparisons with Serbian, Russian and some other Slavic languages are given. The analysis is given for rijetko 'seldom', ponekad 'sometimes', često 'often', uvijek 'always', svake godine 'every year'; dva puta / dvaput 'twice', tri puta / triput 'thrice', nekoliko puta 'a few times', više puta 'several times', puno / mnogo puta / nebrojeno puta 'many times / innumerable times', and their counterparts in the languages mentioned.
The subject of this paper is the choice of verbal aspect in contexts with temporal quantifiers in Croatian, Serbian and Russian. Event quantification is inherent in the relation between perfective and imperfective verbs if the opposition between a single occurrence of an event and multiple occurrences of an event is taken as one of the factors of their semantic difference. After introductory remarks, the second part of the paper considers some general questions of temporal quantification, that is, event quantification. To examine the interdependence of temporal quantifiers and aspect choice in Croatian, two groups of adverbial expressions that indicate repeated actions are considered: the first group ('rijetko', 'ponekad', 'često', 'uvijek') indicates relative quantity, and the second ('dva puta' / 'dvaput', 'tri puta' / 'triput', 'nekoliko puta', 'više puta', 'puno' / 'mnoga puta' / 'nebrojeno puta') indicates absolute quantity. The third part presents a numerical analysis of examples with these expressions (the examples were extracted from the Tridesetmilijunski korpus hrvatskoga jezika, the "Thirty-Million Corpus of the Croatian Language").
The analysis shows that a higher percentage of imperfective aspect is not automatically associated with adverbial expressions that indicate more regular repetition of an action; that is, an increase in the regularity of repetition is not necessarily accompanied by an increase in the use of the imperfective aspect (for example, the percentage of imperfective verbs in the present tense in contexts with 'uvijek' is lower than the percentage for 'rijetko' and 'često'). The fourth part compares event quantification and aspect in Russian and Croatian, and the fifth analyses aspect usage in contexts of repeated actions in Croatian and Serbian. The analysis was prompted by corpus examples and by some conclusions of M. Ivić (1985) and S. M. Dickey (2000) on the topic. Special attention is paid to contextual factors that affect the use of the perfective aspect in contexts of repeated actions. Some indications from Dickey's analysis that Croatian and Serbian might differ with respect to the acceptability of the perfective aspect in contexts of repeated actions, i.e. with respect to the prototypical meaning of the perfective aspect, are confirmed: in standard Croatian the perfective aspect is acceptable in many contexts with repeated actions in which the imperfective aspect is the much more acceptable or the only option in standard Serbian.
Slavic Psycholinguistics in the 21st Century
This article provides an update on research in Slavic psycholinguistics since 2000 following my first review (Sekerina 2006), published as a position paper for the workshop The Future of Slavic Linguistics in America (SLING2K). The focus remains on formal experimental psycholinguistics understood in the narrow sense, i.e., experimental studies conducted with monolingual healthy adults. I review five dimensions characteristic of Slavic psycholinguistics (populations, methods, domains, theoretical approaches, and specific languages) and summarize the experimental data from Slavic languages published in general non-Slavic psycholinguistic journals and proceedings from the leading two conferences on Slavic linguistics, FASL and FDSL, since 2000. I argue that the current research trends in Slavic psycholinguistics are (1) a shift from adult monolingual participants to special population groups, such as children, people with aphasia, and bilingual learners, (2) a continuing move in the direction of cognitive neuroscience, with more emphasis on online experimental techniques, such as eye-tracking and neuroimaging, and (3) a focus on Slavic-specific phenomena that contribute to the ongoing debates in general psycholinguistics. The current infrastructural trends are (1) development of psycholinguistic databases and resources for Slavic languages and (2) a rise of psycholinguistic research conducted in Eastern European countries and disseminated in Slavic languages.