Error Analysis in Croatian Morphosyntactic Tagging
In this paper, we provide detailed
insight into the properties of errors generated by a
stochastic morphosyntactic tagger assigning
Multext-East morphosyntactic descriptions to
Croatian texts. Tagging the Croatia Weekly
newspaper corpus with the CroTag tagger in
stochastic mode revealed that approximately 85
percent of all tagging errors occur on nouns,
adjectives, pronouns and verbs. Moreover,
approximately 50 percent of these are shown to
be incorrect assignments of case values. We
provide various other distributional properties of
errors in assigning morphosyntactic descriptions
for these and other parts of speech. On the basis
of these properties, we propose rule-based and
stochastic strategies that could be integrated into
the tagging module, creating a hybrid procedure
in order to raise overall tagging accuracy for
Croatian.
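The per-part-of-speech error breakdown described above can be sketched as a simple disagreement count between gold and predicted positional MSD tags, grouped by the part-of-speech position. The tag strings below are invented examples in the MULTEXT-East positional style, not data from the Croatia Weekly corpus.

```python
from collections import Counter

# Toy sketch: count gold/predicted tag disagreements by part of speech,
# taken as the first character of a positional MSD tag. All tags here
# are illustrative examples, not corpus data.
gold = ["Ncmsn", "Afpfsg", "Vmip3s", "Ncmsg"]
pred = ["Ncmsg", "Afpfsg", "Vmip3p", "Ncmsg"]

errors_by_pos = Counter(
    g[0] for g, p in zip(gold, pred) if g != p
)
print(errors_by_pos)  # Counter({'N': 1, 'V': 1})
```

The same counting idea extends to individual attribute positions (e.g. the case slot), which is how the share of case-value errors reported above could be computed.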
External Lexical Information for Multilingual Part-of-Speech Tagging
Morphosyntactic lexicons and word vector representations have both proven
useful for improving the accuracy of statistical part-of-speech taggers. Here
we compare the performances of four systems on datasets covering 16 languages,
two of these systems being feature-based (MEMMs and CRFs) and two of them being
neural-based (bi-LSTMs). We show that, on average, all four approaches perform
similarly and reach state-of-the-art results. Yet better performances are
obtained with our feature-based models on lexically richer datasets (e.g. for
morphologically rich languages), whereas neural-based results are higher on
datasets with less lexical variability (e.g. for English). These conclusions
hold in particular for the MEMM models relying on our system MElt, which
benefited from newly designed features. This shows that, under certain
conditions, feature-based approaches enriched with morphosyntactic lexicons are
competitive with neural methods.
Tagset Reductions in Morphosyntactic Tagging of Croatian Texts
Morphosyntactic tagging of Croatian texts is performed with stochastic taggers using a language model built on a manually annotated corpus implementing the Multext-East version 3 specifications for Croatian. Tagging accuracy in this framework is essentially predefined, i.e. proportionally dependent on two things: the size of the training corpus and the number of different morphosyntactic tags encompassed by that corpus. Since the 100 kw Croatia Weekly newspaper corpus by definition yields a rather small language model for stochastic tagging of free-domain texts, the paper presents an approach based on tagset reductions. Several meaningful subsets of the Croatian Multext-East version 3 morphosyntactic tagset specifications are created and applied to Croatian texts with the CroTag stochastic tagger, measuring overall tagging accuracy and F1-measures. The obtained results are discussed in terms of applying different reductions in different natural language processing systems and specific tasks defined by specific user requirements.
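A tagset reduction of the kind described above can be sketched as a mapping from full positional MSD tags to a subset of their attribute positions. This is a minimal illustration only; the tag strings are examples in the MULTEXT-East positional style, and the paper's actual subsets may be defined differently.

```python
# Minimal sketch of a tagset reduction: keep only the first `keep`
# positions of a positional MSD tag, collapsing finer distinctions.
# Example tags are illustrative, not taken from the actual corpus.

def reduce_tag(msd: str, keep: int) -> str:
    """Truncate a positional MSD tag to its first `keep` positions."""
    return msd[:keep]

full_tags = ["Ncmsn", "Afpfsg", "Vmip3s"]
reduced = [reduce_tag(t, 2) for t in full_tags]
print(reduced)  # ['Nc', 'Af', 'Vm']
```

A smaller tagset means fewer parameters to estimate from the 100 kw training corpus, which is the trade-off the paper measures against the information lost by collapsing tags.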
Mimicking Word Embeddings using Subword RNNs
Word embeddings improve generalization over lexical features by placing each
word in a lower-dimensional space, using distributional information obtained
from unlabeled data. However, the effectiveness of word embeddings for
downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which
embeddings do not exist. In this paper, we present MIMICK, an approach to
generating OOV word embeddings compositionally, by learning a function from
spellings to distributional embeddings. Unlike prior work, MIMICK does not
require re-training on the original word embedding corpus; instead, learning is
performed at the type level. Intrinsic and extrinsic evaluations demonstrate
the power of this simple approach. On 23 languages, MIMICK improves performance
over a word-based baseline for tagging part-of-speech and morphosyntactic
attributes. It is competitive with (and complementary to) a supervised
character-based model in low-resource settings.
Comment: EMNLP 2017
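The core idea of learning a function from spellings to embeddings at the type level can be sketched with a deliberately simplified stand-in: MIMICK itself uses a character bi-LSTM, but a linear map over character-bigram counts illustrates the same training signal. All words, vectors, and dimensions below are toy data.

```python
import numpy as np

# Hedged sketch of the MIMICK training signal: fit a function from a
# word's spelling to its pre-trained embedding, then apply it to OOV
# words. A linear map over character-bigram counts stands in for the
# paper's character bi-LSTM. Toy vocabulary and 2-d "embeddings".

def char_bigrams(word):
    padded = f"#{word}#"  # boundary markers
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

vocab = {"cat": [1.0, 0.0], "cats": [1.0, 0.1], "dog": [0.0, 1.0]}

bigram_index = {}
for w in vocab:
    for b in char_bigrams(w):
        bigram_index.setdefault(b, len(bigram_index))

def featurize(word):
    x = np.zeros(len(bigram_index))
    for b in char_bigrams(word):
        if b in bigram_index:  # unseen bigrams are simply skipped
            x[bigram_index[b]] += 1.0
    return x

X = np.stack([featurize(w) for w in vocab])
Y = np.array([vocab[w] for w in vocab])
W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # fit spelling -> embedding

oov_vec = featurize("dogs") @ W  # embedding for an unseen word, from spelling alone
```

Because training needs only the (word, embedding) pairs themselves, no re-training on the original embedding corpus is required, which is the type-level property the abstract highlights.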
CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages
We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of
the South Slavic languages, which is based on the Stanza natural language
processing pipeline. We describe the main improvements in CLASSLA-Stanza with
respect to Stanza, and give a detailed description of the model training
process for the latest 2.1 release of the pipeline. We also report performance
scores produced by the pipeline for different languages and varieties.
CLASSLA-Stanza exhibits consistently high performance across all the supported
languages and outperforms or expands its parent pipeline Stanza at all the
supported tasks. We also present the pipeline's new functionality enabling
efficient processing of web data and the reasons that led to its
implementation.
Comment: 17 pages, 14 tables, 1 figure
Insight into the Automatic Extraction of Metaphorical Collocations
Collocations have been the subject of much scientific research over the years. The focus of this research is on a subset of collocations, namely metaphorical collocations. In metaphorical collocations, a semantic shift has taken place in one of the components, i.e., one of the components takes on a transferred meaning. The main goal of this paper is to review the existing literature and provide a systematic overview of the existing research on collocation extraction, as well as an overview of existing methods, measures, and resources. The existing research is classified according to the approach (statistical, hybrid, and distributional semantics) and presented in three separate sections. The insights gained from existing research serve as a first step in exploring the possibility of developing a method for automatic extraction of metaphorical collocations. The methods, tools, and resources that may prove useful for future work are highlighted.
Corpus Compilation in Digital Humanities for Less-Resourced Languages: On the Practice of Compiling Thematic Corpora from Digital Media for Serbian, Croatian and Slovenian
The digital era has unlocked unprecedented possibilities of compiling corpora of social discourse, which
has brought corpus linguistic methods into closer interaction with other methods of discourse analysis and the humanities. Even when not using any specific techniques of corpus linguistics, drawing
on some sort of corpus is increasingly resorted to for empirically-grounded social-scientific analysis
(sometimes dubbed "corpus-assisted discourse analysis" or "corpus-based critical discourse analysis",
cf. Hardt-Mautner 1995; Baker 2016). In the post-Yugoslav space, recent corpus developments have
brought table-turning advantages in many areas of discourse research, along with an ongoing proliferation of corpora and tools. Still, for linguists and discourse analysts who embark on collecting specialized corpora for their own research purposes, many questions persist, partly due to the fast-changing
background of these issues, but also due to the fact that there is still a gap in the corpus method, and in
guidelines for corpus compilation, when applied beyond the anglophone contexts.
In this paper we aim to discuss some possible solutions to these difficulties, by presenting one
step-by-step account of a corpus building procedure specifically for Croatian, Serbian and Slovenian, through an example of compiling a thematic corpus from digital media sources (news articles
and reader comments). Following an overview of corpus types, uses and advantages in social sciences and digital humanities, we present the corpus compilation possibilities in the South Slavic
language contexts, including data scraping options, permissions and ethical issues, the factors that
facilitate or complicate automated collection, and corpus annotation and processing possibilities.
The study shows expanding possibilities for work with the given languages, but also some persistently grey areas where researchers need to make decisions based on research expectations. Overall, the paper aims to recapitulate our own corpus compilation experience in the wider context of
South Slavic corpus linguistics and corpus linguistic approaches in the humanities more generally.
Data sparsity in highly inflected languages: the case of morphosyntactic tagging in Polish
In morphologically complex languages, many high-level tasks in natural language processing rely on accurate morphosyntactic analyses of the input. However, in light of the risk of error propagation in present-day pipeline architectures for basic linguistic pre-processing, the state of the art for morphosyntactic tagging is still not satisfactory. The main obstacle here is data sparsity inherent to natural language in general and highly inflected languages in particular.
In this work, we investigate whether semi-supervised systems may alleviate the data sparsity problem. Our approach uses word clusters obtained from large amounts of unlabelled text in an unsupervised manner in order to provide a supervised probabilistic tagger with morphologically informed features. Our evaluations on a number of datasets for the Polish language suggest that this simple technique improves tagging accuracy, especially with regard to out-of-vocabulary words. This may prove useful to increase cross-domain performance of taggers, and to alleviate the dependency on large amounts of supervised training data, which is especially important from the perspective of less-resourced languages.
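The cluster-feature idea described above can be sketched as a feature extractor that falls back on cluster identifiers when the word form itself is out of vocabulary. The word-to-cluster map below is toy data standing in for clusters induced from unlabelled text (e.g. Brown-style bit-string clusters); it is not from the paper's Polish experiments.

```python
# Minimal sketch of cluster-based features for a probabilistic tagger.
# `clusters` stands in for a map learned from unlabelled text; the
# Polish word forms and bit-string ids are invented toy data.
clusters = {"kot": "0110", "kota": "0110", "pies": "0111"}

def features(word, known_vocab):
    feats = {}
    if word in known_vocab:
        feats["word=" + word] = 1.0
    # Cluster features fire even for out-of-vocabulary words, sharing
    # evidence across morphological variants seen only unlabelled.
    cid = clusters.get(word)
    if cid is not None:
        for k in (2, 4):  # cluster-id prefixes at several granularities
            feats[f"cluster[:{k}]=" + cid[:k]] = 1.0
    return feats

known = {"kot"}
print(features("kota", known))  # OOV word still receives cluster features
```

Because the inflected form "kota" shares a cluster with the known form "kot", the tagger can generalize across the paradigm even when only one form appeared in the supervised training data.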
Increasing Quality of the Corpus of Frequency Dictionary of Contemporary Polish for Morphosyntactic Tagging of the Polish Language
The paper is devoted to the issue of correcting the erroneous and ambiguous corpus of the Frequency Dictionary of Contemporary Polish (FDCP) and its application to morphosyntactic tagging of the Polish language. Several stages of corpus transformation are presented, and baseline part-of-speech tagging algorithms are evaluated.